Ralph Melton
United States Pennsylvania

(This is a text-only version of an article that includes tables. I've done the best I can with the tables for the forums, but they look bad. A version with better-looking tables is available online at https://www.icloud.com/pages/000hPZOYsdjE6ywHobWm_gQQ#2000_... . The spreadsheet with my data and statistics is at https://www.icloud.com/numbers/000RmM8_PKiSx8NHaoGvcZfw#200... .)
I have been an enthusiastic fan of Pandemic since I first played it in 2008. I like it because it involves cooperation and careful planning, which makes it a good way to talk to people through a game. For several years, I’ve been playing a lunchtime game once a week with friends. In 2013, Z-Man Games came out with a version of Pandemic for the iPad. It doesn’t have networked multiplayer capabilities, but it works very well as a solitaire game in which one person controls all the players in the game. I’ve played the iPad version extensively, and I started recording my games as a little science project to see if I could use experiments to support our debates about which Roles were strongest. (In many years, I judge science fair projects for the Pennsylvania Junior Academy of Science for the school where my wife taught; I think that investigating a game like this would be a totally legitimate and interesting science fair project.)
From the first 2000 games I’ve recorded, I have discovered that the Roles in Pandemic and its expansion On the Brink are balanced better than I had expected.
Hypotheses
I assumed that changing the number of players or the number of Epidemic cards in the deck might significantly change the game; therefore, I’m going to use the shorthand XPYE to describe a game with X players and Y Epidemic cards in the deck.
I had a few hypotheses I was hoping to prove:
1. Pandemic is easier with fewer players. In particular, 3PYE is easier than 4PYE. We’d formed this conclusion early on in our Pandemic history, and had taken it for granted since – but it was worth validating.
2. The Epidemiologist Role is weaker than most Roles. In particular, it is weaker than the Scientist Role. The Scientist can discover a cure with four city cards of the same color, instead of the usual five. The Epidemiologist can take a card from another player in the same city once per turn as a free action. I thought that the limit on how often the Epidemiologist could take cards and the fact that the Epidemiologist still had the limit of seven cards in hand made the Epidemiologist clearly inferior to the Scientist. (The original version of the Epidemiologist had been even weaker, and we had boosted it with a house rule in the group I usually play with.)
3. The Troubleshooter Role is significantly weaker than most Roles. The Troubleshooter has two abilities; they can fly to a city by simply showing the card for that city (unlike normal players who have to discard the card to make a direct flight), and they can see upcoming Infection cards at the beginning of their turn. I thought that the second power was very weak, because it seemed rare that that would provide information that would change the actions you take. (Even knowing that a city would have an Outbreak only helps if you’re able to move to that city.) In our play group, we had boosted the Troubleshooter with a house rule to look at the upcoming cards on every player’s turn, not just their own.
4. The Field Operative Role is significantly weaker. The Field Operative’s special ability is that it can collect samples of disease cubes as it treats disease, and then it can Discover a Cure with three cubes of one color and three city cards of the same color. But it can only collect a sample once per turn, and that restriction seemed too strict; it seemed that if you made a bad choice about what color to collect early on, it would be hard to catch up.
Methods
I played a bunch of games of Pandemic on the iPad, and recorded which roles I used and whether I won or lost. For each game that I recorded, I chose the number of players and the number of Epidemics beforehand, but used randomly selected roles.
I started recording games on December 4, 2013. On December 13, 2013, the makers of the iPad game released a new version with the Roles and Special Events from the Pandemic: On the Brink expansion. I bought those Roles and Special Events, and I’ve been playing with them ever since.
One critique of my methods is that I have not recorded quite all my games. There are a few reasons that I have not recorded games:
There have been games that I have decided not to record before beginning play. For example, when I was playing games sitting by Lori’s hospital bed, I decided beforehand that I was probably not playing at my best, and therefore chose not to record them. In early December 2015, I didn’t record several games, because I had declared to myself that at 2000 games, I would write up my results, and I didn’t have time to write then. So analyzing this data to draw conclusions about when I have played iPad games of Pandemic would be dubious, but these unrecorded games should not cast doubt on the conclusions about Pandemic itself.
There have been some games that I have decided not to play or record, because the set of Roles has been very similar to that of a game that I’ve just played. For example, if I play one game with Scientist, Researcher, and Troubleshooter, and then my next game starts with Epidemiologist, Researcher, and Troubleshooter, I might decide that I don’t want to play such a group twice in a row and end the game before the first turn. This has happened about a dozen times. This might mean that the distribution of roles in my games is not quite random, but I think that this has happened rarely enough that there should be no strong effect.
There have been a dozen or two games that I have quit in mid-game because I failed to execute my intended plan. The most common way this happens is that I plan to move the Epidemiologist to another player and take a card, do move the Epidemiologist into place, and then forget to take the card before I end the action phase of the Epidemiologist’s turn. If I were playing with my friends, we would fix that error; “You said beforehand that you were moving into position in order to take the card, so here you go.” But the iPad version won’t let you undo once you’ve drawn cards, so there’s no way to fix that error. I think that not recording those games gives data that more accurately represents the results I’d get with my friends, but I still have some qualms about those abandoned games.
Another potential critique is that I might have improved my Pandemic skills over the course of all these games. I started playing the iPad version after five years of playing Pandemic once a week or thereabouts, so I was well past the initial part of the learning curve. However, the Contingency Planner and Quarantine Specialist Roles only became available in 2013, so it is possible that I encountered a learning effect with those Roles. I also only started playing five-player games in September 2014, so it is possible that there was a learning effect for the particular complications of five-player games.
One more critique is that my Pandemic play might not be representative of other players. Perhaps there is some point of strategy that I consistently overlook, or I’m too frugal in my use of Special Events or something like that. I have no controls against such bias in this project, because involving other players would have made this more work.
I also note that I’ve discovered occasional errors in my spreadsheet of results. For example, I’ve noticed occasional games for which I’ve recorded both a victory and a type of failure, an obvious contradiction. I’ve fixed some of these errors when I felt certain of the correct answer, but there have been some errors that I did not have enough data to fix, and presumably some errors that I haven’t noticed at all.
Results
The first thing I learned from this project: it takes a lot more data points to gather meaningful statistics than I had expected. I had thought that with several hundred games, I’d be able to prove hypotheses about single Roles being weak, and start to be able to investigate theories about interactions with multiple Roles. But after 346 games of 4P5E, I had no measurable differences in the performance of any Roles.
So I changed directions and tried to confirm a hypothesis I considered obvious:
Hypothesis: Increasing the number of Epidemics makes the game harder.
It didn’t take many 4PxE games to confirm this:
Success Percentages for 4-Player Games
Game Type                        4P5E    4P6E
Success Percentage               84.3%   59.3%
Confidence Interval              3.0%    18.5%
Minimum of Confidence Interval   81.3%   40.8%
Maximum of Confidence Interval   87.3%   77.8%
(All confidence intervals are at p < 0.05.)
This is not a surprising result; I have a straightforward theoretical model of why increasing the number of Epidemics makes the game harder, and I have anecdotal data to support this. But it’s nice to show that it is possible to confirm the obvious.
(The other thing I learned with this exercise is that I prefer to play with a higher chance of winning. I usually play at a difficulty higher than 5 Epidemics when I play with my friends, but I didn’t enjoy playing 4P6E on the iPad. It may be that I’m more tolerant of failure when playing with my friends, or it may be that our house rules make a significant difference in our success rate. Our house rules: we deal each player two Roles and choose one of those Roles in collaboration with the other players, and we choose collaboratively which player goes first.)
From there, I turned to investigating the hypothesis that 3P5E is easier than 4P5E. I started off with a very strong streak that made it look very likely that the hypothesis would be confirmed. But I wanted to continue until it was confirmed at p < 0.05, and as I kept playing I started losing more often. After more than 500 games, my success percentages are nearly the same, at 84.3% ± 3.0% for 4P5E and 84.5% ± 3.0% for 3P5E.
Success Percentages for 5-Epidemic Games
Game Type                        2P5E    3P5E    4P5E    5P5E
Number of Games                  63      549     579     643
Success Percentage               73.0%   84.5%   84.3%   92.8%
Confidence Interval              11.0%   3.0%    3.0%    2.0%
Minimum of Confidence Interval   62.0%   81.5%   81.3%   90.8%
Maximum of Confidence Interval   84.0%   87.5%   87.3%   94.8%
But you should notice a lexical subterfuge there: I’ve stated a hypothesis about some games being easier, but the data I’ve presented is about what games are more winnable. It includes all the games that I barely win through extensive experience. Does that really measure what’s easier?
The statistics I’ve collected don’t record whether a game was easy, so I tried to define “easy” in terms of the statistics I recorded. I defined a proxy for an easy game by this logic:
A game is hard if I feel like I barely win, i.e., almost lose.
There are two ways to recognize that I almost lost a game:
1. The game is lost if the number of Outbreaks reaches 8. So a game that is threatening to hit that number of Outbreaks feels harrowing. I chose a threshold of 6 Outbreaks as marking a game that was nearly lost through too many Outbreaks.
2. The game is lost if a player needs to draw Player cards and there are not enough cards to draw. When the Player deck gets low, we start to worry and count out how many turns each player will get, in order to make sure that they will get enough turns to find the last Cures. I chose a threshold of 7 cards or fewer to identify a game that was nearly lost, because in a four-player game, that threshold means that the game is won on the last possible turn for some player.
Therefore, a game is easy if it is not nearly lost in either of those ways. (There is a third way to lose, by running out of cubes for one disease. The statistics I’ve recorded do not let me track how close I come to losing in that way. However, I know from experience that running low on cubes is strongly correlated with having many Outbreaks, so the first condition captures almost all of these cases.)
This proxy is inaccurate in both directions; there are easy games which this criterion would consider hard (such as a game with multiple Eradications, no Outbreaks, and a straightforward Cure of the last disease with 7 cards left), and there are hard games that this criterion would consider easy (such as a game that’s almost out of cubes of one color with five Outbreaks). But it is in the ballpark and it is objective.
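The proxy described above can be written down as a short rule. This is only a sketch in Python: the record layout and the names `GameRecord` and `is_easy` are made up for illustration and are not taken from the spreadsheet, and it assumes that a lost game never counts as easy.

```python
from dataclasses import dataclass

# Thresholds taken from the description above.
OUTBREAK_THRESHOLD = 6  # 6 or more Outbreaks: nearly lost through Outbreaks
CARD_THRESHOLD = 7      # 7 or fewer Player cards left: nearly lost through the deck


@dataclass
class GameRecord:
    """Hypothetical per-game record; the field names are illustrative."""
    won: bool
    outbreaks: int
    player_cards_left: int


def is_easy(game: GameRecord) -> bool:
    """Return True if the game was won and was not nearly lost in either way."""
    if not game.won:
        # Assumption: a lost game is never counted as easy.
        return False
    nearly_lost_by_outbreaks = game.outbreaks >= OUTBREAK_THRESHOLD
    nearly_lost_by_cards = game.player_cards_left <= CARD_THRESHOLD
    return not (nearly_lost_by_outbreaks or nearly_lost_by_cards)
```

Under this rule, a win with five Outbreaks and eight cards left counts as easy, while a win with exactly seven cards left does not, which matches the two counterexamples mentioned above.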
Percentages of Easy Games for 5-Epidemic Games
Game Type                        2P5E    3P5E    4P5E    5P5E
Number of Games                  63      549     579     643
Easy Game Percentage             38.1%   51.7%   37.7%   36.9%
Confidence Interval              12.0%   4.2%    3.9%    3.7%
Minimum of Confidence Interval   26.1%   47.5%   33.8%   33.2%
Maximum of Confidence Interval   50.1%   55.9%   41.6%   40.6%
Although my success ratios are nearly equivalent for 3P5E and 4P5E, I am significantly more likely to have an easy game with 3P5E. So perhaps there’s some merit to my early hypothesis about 3P5E being easier.
On the other hand, if this is a trend, it does not seem to extend to 2P5E. I didn’t play quite enough games of 2P5E to get a statistically significant result, but it certainly seems that I am less likely to win 2P5E than 3P5E.
In September 2014, it finally happened that all five of our usual Pandemic group were able to play on the same day. That turned my attention to five-player games. I had been timid about five-player games, because I thought that having five players would make it much harder to collect five cards of a color in a single player’s hand. However, I was wrong; it turns out that I am more likely to win 5P5E than 4P5E. I think there are two factors that explain this:
1. Though cards do get more distributed, the limit on the number of cards in a player’s hand is less of a burden.
2. The player deck includes two Special Event Cards per player, so 5P5E provides a lot of cards to mitigate problems.
At the beginning of my 5P5E play, I thought that the Contingency Planner was going to be exceptionally good in 5P5E play; the Contingency Planner’s special ability is to pick up Special Event cards that have been used in order to use them again, and with ten Special Event cards in the deck, there’s a lot of potential for that to work out well. The first 44 5P5E games with the Contingency Planner were victories, and I thought I was on track to demonstrate superiority there – but the Contingency Planner has now regressed to the mean. I have no significant evidence that any role is better than any other in 5P5E.
I also thought that the Field Operative would be particularly hampered in 5P5E. It takes three turns of sampling for the Field Operative to get a Cure, and if the Field Operative is not the first or second player, he only gets five turns total. But the Field Operative’s performance is in the middle of the pack, not distinguishable from everyone else. It seems that he either gets one Cure easily, or two Cures with a lot of help and support, and that’s enough to do his share.
Another interesting result with 5P5E: I get more Eradications with 5P5E than with fewer players. A lot more: 71% more Eradications than with 4P5E. (I will try to get Eradications even when it will not help win the game, but that’s true in games with fewer players as well.) I believe this is due to a combination of more players having greater mobility to get to the final cubes for Eradication, and having more Special Events to help bring off an Eradication.
Eradications in 5-Epidemic Games
Game Type                        2P5E    3P5E    4P5E    5P5E
Number of Games                  63      549     579     643
Average Number of Eradications   0.48    0.55    0.65    1.11
Confidence Interval              0.19    0.06    0.06    0.07
Minimum of Confidence Interval   0.29    0.49    0.59    1.04
Maximum of Confidence Interval   0.67    0.61    0.71    1.18
An interesting but statistically suspect observation: the greater number of Eradications also shows up at the far end of the distribution, with games in which I manage to Eradicate all the diseases. Eradicating all the diseases is very tricky, because the iPad implementation ends the game immediately when a Cure is discovered for the last disease; in order to Eradicate the last disease, you have to eliminate all the cubes of that color with or before the action that discovers the Cure. Before I started playing 5-player games, I managed to get a complete Eradication only once in 1357 games; with 5P5E, I’ve managed to get complete Eradications 14 times in 643 games.
Conclusions and Further Work
With over 500 games each of 3P5E, 4P5E, and 5P5E, I have found no evidence that any Role outperforms any other. This is a great surprise to me, because I felt certain that some Roles were excellent and some were weak. I salute the designers for balancing the Roles so well.
For future work, I’d like to measure the effects of playing with a Role that was obviously weaker. Consider a “Civilian” Role with no special abilities. This Role would obviously be inferior to any Role that did have special abilities; how many games would it take to prove statistically that it was inferior? I have considered trying to simulate this by marking one player as a Civilian and never using any special powers. But the Medic, Containment Specialist, and Quarantine Specialist have powers enforced by the game, so it would be hard to play one of those as a Civilian.
But in the immediate future, I’m more likely to investigate another path: The creators have just added the Virulent Strain Challenge to the iPad game. I assume that I’m less likely to win with the Virulent Strain Challenge than without, but how much more difficult is it? In particular, how does the difficulty of a five-Epidemic game with Virulent Strain compare to the difficulty of a six-Epidemic game?
One final conclusion: I am still really enjoying playing Pandemic. And I have some numbers that shed a light on why. It comes down to the difference between victorious games and easy games. Even with my years of experience, over half of my games (with 5P5E) land in the zone where I win, but I feel I win only with cleverness and a bit of luck. That is my sweet spot for cooperative games, and Pandemic hits that sweet spot again and again.


Jean-Philippe Thériault
Canada Montreal Quebec

One thing to note is that the Core game has a fixed 5 event cards in the player deck (which makes more players significantly harder) while playing with the additional On the Brink content changes that to (nbPlayers*2) event cards in the deck, which should balance things out a bit (2 Players with OTB being obviously harder and 3+ Players with OTB being obviously easier than the same numbers without OTB).


Richard Harris
United Kingdom

Wow! Thank you for sharing all your data and work.
I am really pleased that your final conclusion is what it is.


Sataranji
United States Twin Cities Minnesota

Excellent work! I need to dig in further but wanted to report that I found what is hopefully a typo (emphasis mine):
Ralph Melton wrote: The Scientist can discover a cure with four city cards of the same color, instead of the usual seven.
I hope you mean five.


Helen Slater
United Kingdom Sheffield South Yorkshire

Wow, that's some extensive work there! Thank you.
Very interesting results because I too thought some roles were "weaker" than others.


Ralph Melton
United States Pennsylvania

sataranji wrote: Excellent work! I need to dig in further but wanted to report that I found what is hopefully a typo (emphasis mine): Ralph Melton wrote: The Scientist can discover a cure with four city cards of the same color, instead of the usual seven. I hope you mean five.
Thank you for the correction. I've edited the original post.


James
United States McDonough Georgia

Excellent data. And Welcome to BGG!


Stephen Sparks
United States Kansas City Missouri

Be careful of small datasets, they can be misleading. Comparing 579 games of one type to 27 games of another is not a good comparison as only playing 27 games still has a high variance. The difference in successes is almost as large as your confidence interval.


Ralph Melton
United States Pennsylvania

spazz451 wrote: Be careful of small datasets, they can be misleading. Comparing 579 games of one type to 27 games of another is not a good comparison as only playing 27 games still has a high variance. The difference in successes is almost as large as your confidence interval.
I'm certainly not a stats expert, so I might have made mistakes in my calculation. I invite you to double-check my spreadsheet. I certainly didn't play enough 6-Epidemic games to have much information about how much the extra Epidemic affects my win chances.
However, for the particular claim that I'm less likely to win with 6 Epidemics than with 5 Epidemics, that matches my intuition and experience enough that I feel pretty confident that that claim will hold.


Byron S
United States Ventura California

Ralph Melton wrote: spazz451 wrote: Be careful of small datasets, they can be misleading. Comparing 579 games of one type to 27 games of another is not a good comparison as only playing 27 games still has a high variance. The difference in successes is almost as large as your confidence interval. I'm certainly not a stats expert, so I might have made mistakes in my calculation. I invite you to double-check my spreadsheet. I certainly didn't play enough 6-Epidemic games to have much information about how much the extra Epidemic affects my win chances. However, for the particular claim that I'm less likely to win with 6 Epidemics than with 5 Epidemics, that matches my intuition and experience enough that I feel pretty confident that that claim will hold.
Ah, perception bias is a powerful, wonderful thing. I'm not saying you're wrong, but there's certainly bias at work there!




Hi Ralph. First off, Welcome to BGG.
This is really awesome. Thanks for all your time and all your games which have been played. I have a few critiques that may come later, although you've critiqued yourself quite well. After just a quick read, I couldn't find anything saying whether you did or didn't play with your house rules, which I was curious about (although it seemed strongly suggested that you did not).
Edit: Hi Ralph, just a brief note before you read on, while the table formatting I've provided below will be very convenient, I happened to catch a pretty significant error in your calculations. Before you reimplement these tables, I would recommend applying the fixes I've suggested below.
Although I'll have more to say later, I just wanted to provide you with tables to make your data more easily read. You can just copy/quick quote my post and copy/paste the table below. Credit for how to do this goes to Tall_Walt. I had to change the size down to 9 to get it to fit into one line though. Maybe someone else can format it better so that the table doesn't look funny when you just change the spacing of the titles.
Success Percentages for 4-Player Games
Game Type                        4P5E    4P6E
Success Percentage               84.3%   59.3%
Confidence Interval              3.0%    18.5%
Minimum of Confidence Interval   81.3%   40.8%
Maximum of Confidence Interval   87.3%   77.8%
Success Percentages for 5-Epidemic Games
Game Type                        2P5E    3P5E    4P5E    5P5E
Number of Games                  63      549     579     643
Success Percentage               73.0%   84.5%   84.3%   92.8%
Confidence Interval              11.0%   3.0%    3.0%    2.0%
Minimum of Confidence Interval   62.0%   81.5%   81.3%   90.8%
Maximum of Confidence Interval   84.0%   87.5%   87.3%   94.8%
Percentages of Easy Games for 5-Epidemic Games
Game Type                        2P5E    3P5E    4P5E    5P5E
Number of Games                  63      549     579     643
Success Percentage               38.1%   51.7%   37.7%   36.9%
Confidence Interval              12.0%   4.2%    3.9%    3.7%
Minimum of Confidence Interval   26.1%   47.5%   33.8%   33.2%
Maximum of Confidence Interval   50.1%   55.9%   41.6%   40.6%
Eradications in 5-Epidemic Games
Game Type                        2P5E    3P5E    4P5E    5P5E
Number of Games                  63      549     579     643
Average Number of Eradications   0.48    0.55    0.65    1.11
Confidence Interval              0.19    0.06    0.06    0.07
Minimum of Confidence Interval   0.29    0.49    0.59    1.04
Maximum of Confidence Interval   0.67    0.61    0.71    1.18
Edit: Got the table to work/stop wrapping text around it with [ clear] command, from the same thread as linked above.
Edit7/8: Went ahead and fixed up the rest of the tables.
SBS.




I am nearly certain that your method for determining your confidence intervals is wrong. I can't figure out why, but you effectively use a 1 for a win and a 0 for a loss and calculate std deviation (not standard error as your table lists) among other things. This however suggests some very strange things. It, for instance, suggests that a 119-1 record is more indicative of the win percentage being accurate (i.e. within 1% as calculated), whereas a 60-60 record (i.e. 50% win rate) has an error of 8.9% inherent to it. This doesn't really intuitively make sense (at least to me). The two different win percentages should carry the same amount of error which would be solely based on the number of games played.
I thought and talked with some of my colleagues and we believe the problem lies in the standard deviation (your 0.484 number) being meaningless, as well as the arbitrary assignment of 1 or 0 to win/loss. The standard deviation says that ~68% of your values lie within one standard deviation in either direction. This is entirely unsurprisingly TRUE, because ~63% of your values are 1, which lie on the upper end of that std dev. And ~95% lie within 2 standard deviations in either direction. This is obviously true too as you hit the rest of your values later.
If instead I provide a 60-60 win rate, we get a standard deviation of 0.5, which is again virtually meaningless. The problem with these is that the on/off nature does not lend itself towards meaning with a population standard deviation which is required for the subsequent confidence interval calculation.
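To make the numbers in this comparison concrete, here is a minimal Python sketch. The function name `win_loss_spread` is made up for illustration, and z = 1.96 is assumed for a 95% interval. It scores each win as 1 and each loss as 0 and takes the population standard deviation, which for 0/1 data works out to sqrt(p(1 − p)). It is only meant to show where figures like 0.5 and 8.9% come from under the normal approximation, not to settle which method is appropriate.

```python
import math
import statistics


def win_loss_spread(wins, losses, z=1.96):
    """Return (win rate, population std dev of the 0/1 outcomes, margin of error).

    The margin is z * sd / sqrt(n), i.e. the normal-approximation half-width.
    """
    outcomes = [1] * wins + [0] * losses  # 1 = win, 0 = loss
    n = len(outcomes)
    p = wins / n
    sd = statistics.pstdev(outcomes)      # equals sqrt(p * (1 - p)) for 0/1 data
    margin = z * sd / math.sqrt(n)
    return p, sd, margin


for wins, losses in [(60, 60), (119, 1)]:
    p, sd, margin = win_loss_spread(wins, losses)
    print(f"{wins}-{losses}: win rate {p:.3f}, std dev {sd:.3f}, margin {margin:.3f}")
```

For 60 wins and 60 losses this gives a standard deviation of 0.5 and a margin of about 8.9%; for 119 wins and 1 loss the margin comes out under 2%. With the same number of games, the margin under this approximation still depends on the win rate, because the standard deviation of 0/1 outcomes does.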
This said, I'm fairly certain that the confidence intervals provided are meaningless. I'm not sure really how to fix this problem, but I'll think on it for a little longer. If anyone in the mean time can provide reasoning one way or the other for this, I'd greatly appreciate that input.
edit: A brief look for 2-variable problems has revealed the z-test. The biggest problem here is going to be determining our null hypothesis. Considering that we are trying to determine if any characters are better than others, I would propose averaging all of their win % and making that your null hypothesis. i.e. if a character's win % is not far enough away from the average character's win %, then it is not statistically better or worse than the average character. I think this method has a couple flaws to it, namely our determination of the null hypothesis value, but this will significantly better represent whether or not some characters are better than others than the current confidence interval provided.
SBS.




Hi,
I've taken your data and created a short example of the z-test. I changed from my decision to use an average for the null hypothesis, and instead decided to compare each pair of roles directly, with the null hypothesis being one of the two characters' win rates, in order to determine whether one role clearly had an advantage over another role.
Although I think this works, I'm a little skeptical about how the number of games should be selected (which role to choose, for instance). However, from what I have read, that choice apparently matters very little statistically as long as the numbers being compared (in this case, numbers of games) are greater than about 30.
According to this z-test, according to your games, the epidemiologist and troubleshooter are weaker than many of the other roles. The generalist is weaker than some of the roles. The Quarantine Specialist is the best role (i.e. it outperforms the most other roles in a manner that is statistically significant according to the z-test).
Full data and calculations can be found and reviewed at this link: https://docs.google.com/spreadsheets/d/14JsVtV8susGALZedl_kZ...
These z-tests were performed on the 4P5E games.
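For readers who want to try a comparison like this themselves, one common form is the pooled two-proportion z-test, sketched below in Python. This is a standard textbook variant and may not be exactly what the linked spreadsheet does (the approach described above uses one role's win rate as the null hypothesis instead of a pooled rate). The function name and the win counts in the example are hypothetical, not taken from the recorded games.

```python
import math


def two_proportion_z(wins_a, games_a, wins_b, games_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a = wins_a / games_a
    p_b = wins_b / games_b
    # Under the null hypothesis both roles share one win rate, estimated by pooling.
    pooled = (wins_a + wins_b) / (games_a + games_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / games_a + 1 / games_b))
    if se == 0:
        raise ValueError("pooled win rate is 0 or 1; the test is undefined")
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical example: role A wins 130 of 150 games, role B wins 110 of 150.
z, p_value = two_proportion_z(130, 150, 110, 150)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

With these made-up counts the difference would be significant at p < 0.05; with real per-role counts the result depends entirely on the data.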
SBS.


Ralph Melton
United States Pennsylvania

Smellybluesocks wrote: Hi Ralph. First off, Welcome to BGG.
This is really awesome. Thanks for all your time and all your games which have been played. I have a few critiques that may come later, although you've critiqued yourself quite well. After just a quick read, I couldn't find anything saying whether you did or didn't play with your house rules, which I was curious about (although it seemed strongly suggested that you did not).
Thank you very much for the reformatted tables. They are indeed much easier to read. I have edited them into the original post. (I have not changed the numbers as you suggested; I'll discuss that in another post.)
I did not play with any of my house rules for these tests. The reason is that I was playing the iPad implementation, which strongly enforces the rules without any option for customization.
(One possible difference between the iPad version and the rules as written: I think that the iPad version chooses the order of players randomly, where the 2nd edition says "The players look at the City cards they have in their hand. The player with the highest City population goes first." I have assumed this difference is negligible. Since the first edition said "the person who was sick most recently goes first", I believe that the designers just intend the population check as a way to randomly choose a starting player.)


Ralph Melton
United States Pennsylvania

Smellybluesocks wrote: I am nearly certain that your method for determining your confidence intervals is wrong. I can't figure out why, but you effectively use a 1 for a win and a 0 for a loss and calculate std deviation (not standard error as your table lists) among other things. This however suggests some very strange things. It, for instance, suggests that a 119-1 record is more indicative of the win percentage being accurate (i.e. within 1% as calculated), whereas a 60-60 record (i.e. 50% win rate) has an error of 8.9% inherent to it. This doesn't really intuitively make sense (at least to me). The two different win percentages should carry the same amount of error which would be solely based on the number of games played.
I thought and talked with some of my colleagues and we believe the problem lies in the standard deviation (your 0.484 number) being meaningless, as well as the arbitrary assignment of 1 or 0 to win/loss. The standard deviation says that ~68% of your values lie within one standard deviation in either direction. This is entirely unsurprisingly TRUE, because ~63% of your values are 1, which lie on the upper end of that std dev. And ~95% lie within 2 standard deviations in either direction. This is obviously true too as you hit the rest of your values later.
If instead I provide a 60-60 win rate, we get a standard deviation of 0.5, which is again virtually meaningless. The problem with these is that the on/off nature does not lend itself towards meaning with a population standard deviation which is required for the subsequent confidence interval calculation.
This said, I'm fairly certain that the confidence intervals provided are meaningless. I'm not sure really how to fix this problem, but I'll think on it for a little longer. If anyone in the mean time can provide reasoning one way or the other for this, I'd greatly appreciate that input.
edit: A brief look for 2-variable problems has revealed the z-test. The biggest problem here is going to be determining our null hypothesis. Considering that we are trying to determine if any characters are better than others, I would propose averaging all of their win % and making that your null hypothesis. i.e. if a character's win % is not far enough away from the average character's win %, then it is not statistically better or worse than the average character. I think this method has a couple flaws to it, namely our determination of the null hypothesis value, but this will significantly better represent whether or not some characters are better than others than the current confidence interval provided.
Smellybluesocks wrote: Hi, I've taken your data and created a short example of the z-test. I changed from my decision to use an average for the null hypothesis, and instead decided to compare each pair of roles directly, with the null hypothesis being one of the two characters' win rates, in order to determine whether one role clearly had an advantage over another role. Although I think this works, I'm a little skeptical about how the number of games should be selected (which role to choose, for instance). However, from what I have read, that choice apparently matters very little statistically as long as the numbers being compared (in this case, numbers of games) are greater than about 30. According to this z-test, according to your games, the epidemiologist and troubleshooter are weaker than many of the other roles. The generalist is weaker than some of the roles. The Quarantine Specialist is the best role (i.e. it outperforms the most other roles in a manner that is statistically significant according to the z-test). Full data and calculations can be found and reviewed at this link: https://docs.google.com/spreadsheets/d/14JsVtV8susGALZedl_kZ... These z-tests were performed on the 4P5E games.
So let's talk about the statistics.
I don't want to claim much statistical expertise. I've been operating off of memories of statistics from high school in 1987 and Wikipedia articles. (I have a friend with a PhD in statistics, but he's been too busy to offer any feedback on my work.)
I feel pretty confident about my calculation of win ratios (though I would never claim that my spreadsheet is error-free). I believe that the unresolved questions are about how to decide which calculations of win ratios are significant.
The source I've been following for calculating confidence intervals is https://en.wikipedia.org/wiki/Binomial_proportion_confidence... (using the "Normal Approximation Interval"). (That page says that other formulas give more precise results, but says "The central limit theorem applies poorly to this distribution with a sample size less than 30 or where the proportion is close to 0 or 1.... A frequently cited rule of thumb is that the normal approximation is a reasonable one as long as np > 5 and n(1 − p) > 5, however even this is unreliable in many cases".)
I've taken a slightly different path in my formulas than that paragraph does, but they should be algebraically equivalent and I've checked that my formulas match that formula for several cases.
However, it is possible that I'm being too strict about looking for confidence intervals that don't overlap at all. https://en.wikipedia.org/wiki/Confidence_interval says "If two confidence intervals overlap, the two means still may be significantly different. Accordingly, and consistent with the Mantel-Haenszel Chi-squared test, is a proposed fix whereby one reduces the error bounds for the two means by multiplying them by the square root of ½ (0.707107) before making the comparison." If I apply that reduction in error bounds, I get the same results of which differences are statistically significant that you do in your spreadsheet using the z-test.
I'd appreciate seeing a discussion of the z-test that you propose using. The discussion of the z-test in https://en.wikipedia.org/wiki/Z-test also depends on calculating the standard deviation and standard error of the sample, so if your claim is correct that the standard deviation is meaningless, that would also rule out using the z-test.
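To illustrate why the shrunken-interval check and a z-test can agree, here is a minimal Python sketch. It assumes an unpooled two-sample z-test for proportions, which may not be exactly what the spreadsheet does, and the win counts (120 of 140 and 104 of 140) are hypothetical, not taken from the data.

```python
import math

Z_95 = 1.96  # two-sided 95% critical value of the normal distribution


def half_width(wins, games, z=Z_95):
    """Half-width of the normal-approximation interval for a win rate."""
    p = wins / games
    return z * math.sqrt(p * (1 - p) / games)


def plain_intervals_differ(w1, n1, w2, n2):
    """Strict check: the two 95% intervals do not overlap at all."""
    gap = abs(w1 / n1 - w2 / n2)
    return gap > half_width(w1, n1) + half_width(w2, n2)


def shrunk_intervals_differ(w1, n1, w2, n2):
    """Same check after multiplying both error bounds by sqrt(1/2)."""
    gap = abs(w1 / n1 - w2 / n2)
    return gap > math.sqrt(0.5) * (half_width(w1, n1) + half_width(w2, n2))


def z_test_differs(w1, n1, w2, n2, z=Z_95):
    """Unpooled two-sample z-test on the difference of two win rates."""
    p1, p2 = w1 / n1, w2 / n2
    se_diff = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return abs(p1 - p2) > z * se_diff


# Hypothetical roles: 120 wins in 140 games vs. 104 wins in 140 games.
print(plain_intervals_differ(120, 140, 104, 140))
print(shrunk_intervals_differ(120, 140, 104, 140))
print(z_test_differs(120, 140, 104, 140))
```

When the two standard errors are equal, the shrunken-interval threshold and the z-test threshold are algebraically identical; when they differ, the shrunken-interval check is slightly more lenient, so the two methods should usually, but not always, agree.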
Now, let me reply point by point.
I do code win or loss as a binary 0-or-1 variable. I don't have a concise citation to back me up, but I think this is fairly standard practice; for example, I've seen that approach used with statistical discussions of possibly-weighted coins, with heads=1 and tails=0.
In your argument that my calculation of the standard deviation is meaningless, you say that "The standard deviation represents says that ~68% of your values lie within one standard deviation in either direction." I disagree with that.
I think that the standard deviation is well-defined for this data. The standard deviation is the square root of the average of the squared deviations from the average value. (https://en.wikipedia.org/wiki/Standard_deviation) That's well-defined for any data set. In particular, in the case of a 50% win rate that you describe, the average is 0.5 and every data point has a deviation of 0.5 or -0.5. So every squared deviation is 0.25, the average squared deviation is 0.25, and the standard deviation is 0.5. That seems perfectly reasonable to me; the standard deviation is supposed to measure how far samples are from the mean, and every sample is exactly 0.5 from the mean.
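The arithmetic in that paragraph can be checked with a few lines of Python. This is only a sketch: the 60-win, 60-loss record is the 50% case under discussion, and the shortcut sqrt(p * (1 - p)) is the standard closed form for the standard deviation of 0-or-1 data.

```python
import math
import statistics

# A 50% win rate coded as 1 = win, 0 = loss (60 wins and 60 losses).
games = [1] * 60 + [0] * 60

mean = statistics.fmean(games)   # average of the 0/1 values, i.e. the win rate
sd = statistics.pstdev(games)    # square root of the average squared deviation

# For 0/1 data the same value comes from the closed form sqrt(p * (1 - p)).
closed_form = math.sqrt(mean * (1 - mean))

print(mean, sd, closed_form)
```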
Where I think you go astray is when you say "The standard deviation represents says that ~68% of your values lie within one standard deviation in either direction." That is only true for a normal distribution, and we are in agreement that when every value is either 0 or 1, the result is not a normal distribution.
I believe that the key that makes the use of the standard deviation useful in calculating these confidence intervals is that although the results of individual games do not follow a normal distribution, the sum of lots of those games does approximate a normal distribution. But I admit that I don't have a deep understanding here; I'm just parroting https://en.wikipedia.org/wiki/Binomial_proportion_confidence....
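A small exact calculation can make this concrete. The sketch below assumes independent games with a fixed win probability, and the function name share_within_one_sd is made up for this illustration. It computes how much of the binomial distribution of total wins lies within one standard deviation of its mean.

```python
import math


def share_within_one_sd(n, p):
    """Exact probability that the number of wins in n games lies within
    one standard deviation of its mean, for a win probability p."""
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))
    return sum(
        math.comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n + 1)
        if abs(k - mean) <= sd
    )


# One game at 50%: both outcomes sit exactly one SD from the mean,
# so the share is 100%, nothing like the normal distribution's ~68%.
print(share_within_one_sd(1, 0.5))

# 120 games at 50%: the share is now close to the ~68% of a normal curve.
print(share_within_one_sd(120, 0.5))
```

So the ~68% rule fails for a single game but roughly holds for the total over many games, which is what the normal approximation interval relies on.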
I do want to acknowledge what you say here:
Quote: This however suggests some very strange things. It, for instance, suggests that 119-1 is more indicative of the win percentage being accurate (i.e. within 1% as calculated), whereas a 60-60 (i.e. 50% win rate) has an error of 8.9% inherent to it. This doesn't really intuitively make sense (at least to me). The two different win percentages should carry the same amount of error which would be solely based on the number of games played.
I do not have the same intuition that the two different win percentages should carry the same amount of error, but I can see why you might. However, I don't have a deep enough statistical background to be confident either way.
I do appreciate your discussion, and I hope that my tone has been one of constructive discussion as well.


Ralph Melton
United States Pennsylvania

Smellybluesocks also raised the question of what the null hypothesis should be for these statistical questions. I've grappled with this myself, and I'm not sure what the answer should be.
For example, let's take my initial hypothesis that the Troubleshooter is kind of weak. We could phrase this in these ways:
1. The probability of a win when the Troubleshooter is playing is lower than the overall probability of a win.
2. The probability of a win when the Troubleshooter is playing is lower than the probability of a win when the Troubleshooter is not playing.
Algebraically, those two statements are equivalent; whatever the actual probabilities are, those statements are both true or both false. But with my data (assuming that my calculations are correct), #2 is statistically significant at p < .05, but #1 is not. So the phrasing matters, and I don't know what it ought to be.
There's another question of whether 95% confidence is really strong enough. With 13 different Roles I'm evaluating in multiple different configurations, it's fairly likely that there would be some false positives: Roles that look better or worse at a 95% confidence level even though they aren't really so different.
For example, according to SBS's calculations, Epidemiologist performed worse than average to a statistically significant degree in 4P5E. But in 3P5E, Epidemiologist performed better than average (87.9% vs. 84.5%, not statistically significant). And in 5P5E, Epidemiologist performed better than average (94.3% vs. 92.8%, not statistically significant).
I have no explanation of why 4P5E would be particularly worse for Epidemiologist than 3P or 5P. So I'm inclined to consider that a fluke instead of a real result, but I wish I had a more rigorous notion of how to weed out those flukes.
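To put a rough number on that worry, here is a sketch that assumes the 13 per-Role tests are independent (they are not quite, since Roles share games) and that no Role truly differs from the rest. The Bonferroni adjustment shown is one common, conservative remedy, not something used in the original analysis.

```python
def chance_of_any_false_positive(num_tests, alpha=0.05):
    """Probability that at least one of num_tests independent tests
    comes out 'significant' at level alpha purely by chance."""
    return 1 - (1 - alpha) ** num_tests


def bonferroni_alpha(num_tests, alpha=0.05):
    """Per-test significance level that keeps the overall chance of any
    false positive at or below alpha (Bonferroni correction)."""
    return alpha / num_tests


# 13 Roles tested in one configuration at the 95% confidence level.
print(round(chance_of_any_false_positive(13), 3))
print(round(bonferroni_alpha(13), 4))
```

Under those assumptions there is nearly a coin-flip chance of at least one spurious result per configuration, and more across several configurations.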
(One possibility for the Epidemiologist specifically: since I'd used house rules for the Epidemiologist (and scorned the Epidemiologist since I thought it was weak), I didn't have as extensive experience with the Epidemiologist before I started these experiments as with other Roles. And I played most of the 4P5E games before playing the 3P5E and 5P5E games. So perhaps I was playing Epidemiologist badly in my early games.)


Ralph Melton
United States Pennsylvania

I have some interesting results to add to this conversation about the Virulent Strain challenge:
I've now played 500 games of 4P5EV. My win rate for 500 games has been 84.0% +/- 3.2%; my win rate for 586 games of 4P5E without Virulent Strain was 84.1% +/- 3.0%. (According to the same calculation by which I came up with the previous confidence intervals.) Sometimes I look at stats and think that with some more games to narrow the confidence intervals, it might develop into a statistically significant difference. This is not one of those times. To my eye, these stats give no basis for any belief that I win less often with Virulent Strain than without.
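These intervals can be reproduced with the normal approximation formula. In the sketch below, the win counts (420 of 500 and 493 of 586) are assumptions inferred from the rounded percentages, not figures read from the spreadsheet, and the z statistic uses an unpooled standard error.

```python
import math


def margin_95(wins, games):
    """Half-width of the 95% normal-approximation interval for a win rate."""
    p = wins / games
    return 1.96 * math.sqrt(p * (1 - p) / games)


def z_for_difference(w1, n1, w2, n2):
    """z statistic for the difference between two win rates (unpooled)."""
    p1, p2 = w1 / n1, w2 / n2
    return (p1 - p2) / math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)


# Assumed win counts: 420/500 with Virulent Strain, 493/586 without.
print(round(100 * margin_95(420, 500), 1))   # margin in percentage points
print(round(100 * margin_95(493, 586), 1))
print(z_for_difference(493, 586, 420, 500))  # far below the 1.96 threshold
```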
But it's obvious that Virulent Strain is harder. (Obvious does not necessarily mean true.) The Virulent Epidemics add ways to fail and don't add ways to succeed.
So maybe Virulent Strain turns easy games into hard games? Not really. Percentage of easy games: 37.5% +/- 3.9% without Virulent Strain vs. 39.2% +/- 4.3% with Virulent Strain.
Maybe Virulent Strain makes it harder to get Eradications? For Eradications per successful game: 0.52 +/- 0.06 for 4P5E vs. 0.83 +/- 0.08 for 4P5EV. That's a statistically significant difference of more Eradications with Virulent Strain.
Down at the nigh-anecdotal end of the scale, I got 4 four-Eradication games in 500 games of 4P5EV, compared to none in 586 games of 4P5E.
When I saw this trend developing, I identified three possible explanations:
1. I've been having a lucky streak, and am likely to get slapped down soon. This seemed the most likely possibility.
2. Over the thousand-plus games I've played since my 4P5E games, I've honed my Pandemic skills a bit.
3. The actual increase in difficulty from Virulent Strain is much smaller than it seems.
The 'lucky streak' hypothesis is less plausible now after 500 games.
The 'I've improved my skills' hypothesis has some appeal. I'd like to think that I could improve my skills incrementally over a thousand games, and I can think of ways that I've improved my skills.
And there's a point in favor that Virulent Strain does cause me to lose some games, so I must be making it up somehow. Here's some additional data about how much difficulty Virulent Strain adds: After the first 100 games, I started recording games in which I felt afterward that I had lost because of Virulent Strain. This is necessarily subjective, and some of the games that I lost because of Virulent Strain would have turned out to be losses even if I had not been playing with Virulent Strain. But with those qualifications, I recorded 21 games out of 400 lost due to Virulent Strain, or about 5.25% of those games. In 3 of those games, I recorded that I thought I would have lost anyway. But still, that suggests that Virulent Strain was the difference between failure and victory in at least 2 or 3% of my games. But the final victory percentage was almost identical, so I must be making up those games, which seems to suggest an improved skill.
But here's the counterargument: suppose we hypothesize that I improved my Pandemic skills while playing 500 games of 3P5E and 700 games of 5P5E. It seems obvious (again, not the same as true) that if I improved while playing 5P5E, I would particularly improve at 5P5E. Although I have not done a rigorous analysis, I did not notice any improvement in my win rate with 5P5E over the course of those games; my overall win rate drifted slightly down as I played.
Conclusion: The data shows pretty clearly that with four players and five Epidemics, I win as often with Virulent Strain as without. It's very perplexing, and I don't have a good explanation.




Ralph Melton wrote:
So let's talk about the statistics.
I don't want to claim much statistical expertise. I've been operating off of memories of statistics from high school in 1987 and Wikipedia articles. (I have a friend with a PhD in statistics, but he's been too busy to offer any feedback on my work.)
I, too, have only Wikipedia to really guide me. I spent much of last Friday reading wiki to try to figure out if I could find anything specific.
Ralph Melton wrote: I feel pretty confident about my calculation of win ratios (though I would never claim that my spreadsheet is error-free). I believe that the unresolved questions are about how to decide which calculations of win ratios are significant. The source I've been following for calculating confidence intervals is https://en.wikipedia.org/wiki/Binomial_proportion_confidence... (using the "Normal Approximation Interval"). (That page says that other formulas give more precise results, but says "The central limit theorem applies poorly to this distribution with a sample size less than 30 or where the proportion is close to 0 or 1.... A frequently cited rule of thumb is that the normal approximation is a reasonable one as long as np > 5 and n(1 − p) > 5, however even this is unreliable in many cases".) I've taken a slightly different path in my formulas than that paragraph does, but they should be algebraically equivalent and I've checked that my formulas match that formula for several cases. However, it is possible that I'm being too strict about looking for confidence intervals that don't overlap at all. https://en.wikipedia.org/wiki/Confidence_interval says "If two confidence intervals overlap, the two means still may be significantly different. Accordingly, and consistent with the Mantel-Haenszel Chi-squared test, is a proposed fix whereby one reduces the error bounds for the two means by multiplying them by the square root of ½ (0.707107) before making the comparison." If I apply that reduction in error bounds, I get the same results of which differences are statistically significant that you do in your spreadsheet using the z-test. I'd appreciate seeing a discussion of the z-test that you propose using.
The discussion of the z-test in https://en.wikipedia.org/wiki/Z-test also depends on calculating the standard deviation and standard error of the sample, so if your claim is correct that the standard deviation is meaningless, that would also rule out using the z-test.
I'd like to say that I have more to offer on this, but at the moment I don't have much to say. I've read a bit, and honestly there's too much to learn. Although I grasp that the z-test is useful for testing two populations of samples against one another, I'm not fully certain of its applicability here. It seems like it may apply, and if so, I think it would be as I applied it, where each population (role) is tested against each other role to determine if there's any significant difference (rather than against the mean of the populations as I initially suggested).
Ralph Melton wrote: Now, let me reply point by point.
I do code win or loss as a binary 0-or-1 variable. I don't have a concise citation to back me up, but I think this is fairly standard practice; for example, I've seen that approach used with statistical discussions of possibly-weighted coins, with heads=1 and tails=0.
Yeah. This is really normal and the first thing I did. You're right as you describe later that this is really not the problem.
Ralph Melton wrote: In your argument that my calculation of the standard deviation is meaningless, you say that "The standard deviation represents says that ~68% of your values lie within one standard deviation in either direction." I disagree with that. I think that the standard deviation is well-defined for this data. The standard deviation is the square root of the average of the squared deviations from the average value. (https://en.wikipedia.org/wiki/Standard_deviation) That's well-defined for any data set. In particular, in the case of a 50% win rate that you describe, the average is 0.5 and every data point has a deviation of 0.5 or -0.5. So every squared deviation is 0.25, the average squared deviation is 0.25, and the standard deviation is 0.5. That seems perfectly reasonable to me; the standard deviation is supposed to measure how far samples are from the mean, and every sample is exactly 0.5 from the mean.
Although I'm not entirely convinced from how you describe it here, I am convinced by the next portion.
Ralph Melton wrote: Where I think you go astray is when you say "The standard deviation represents says that ~68% of your values lie within one standard deviation in either direction." That is only true for a normal distribution, and we are in agreement that when every value is either 0 or 1, the result is not a normal distribution.
This is true, but more important is the next portion that you describe, which says that despite the non-normal distribution, we achieve a normal distribution of probabilities of win rate (not a normal distribution of wins/losses, per se).
Ralph Melton wrote: I believe that the key that makes the use of the standard deviation useful in calculating these confidence intervals is that although the results of individual games do not follow a normal distribution, the sum of lots of those games does approximate a normal distribution. But I admit that I don't have a deep understanding here; I'm just parroting https://en.wikipedia.org/wiki/Binomial_proportion_confidence....
Yup. Again, worth noting, it's not the sum of lots of games, but the ratio achieved by lots of games. (x1 + ... + xn) / n, assuming there is some true probability, will approach that probability given infinite trials. We assume that's true with Pandemic and that a single person who does not get better or worse is playing those games. A bit of a stretch, but probably relatively trivial.
Ralph Melton wrote: I do want to acknowledge what you say here: Smellybluesocks wrote: This however suggests some very strange things. It, for instance, suggests that 119-1 is more indicative of the win percentage being accurate (i.e. within 1% as calculated), whereas a 60-60 (i.e. 50% win rate) has an error of 8.9% inherent to it. This doesn't really intuitively make sense (at least to me). The two different win percentages should carry the same amount of error which would be solely based on the number of games played. I do not have the same intuition that the two different win percentages should carry the same amount of error, but I can see why you might. However, I don't have a deep enough statistical background to be confident either way.
I actually would like to expand briefly on why this seemed so to me. If I flip a coin 100 times and receive 90/10, why is it that the statistics of these coin flips would have a smaller confidence interval (i.e. less error achieved for the 95% CI) than if I received a 50/50 split? How is it that my confidence in the ratio of the coin's ability to flip heads or tails is affected by anything other than the number of times it is flipped and the null hypothesis?
The easy answer to this might be standard deviation but it would have to be the standard deviation of the ratios. Clearly a smaller average difference from the average would explain a smaller CI and although I have calculated these standard deviations and seen your calculations, which are in agreement, I am having trouble accepting that they are meaningfully the same as the std deviation of the ratios, which I believe they are. Nonetheless, I can't quite wrap my head around it and feel conflicted here.
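For what it's worth, the coin example can be worked through numerically. The sketch below uses the normal approximation interval discussed in this thread; the function names are made up for the illustration. The per-flip standard deviation of 0/1 data is sqrt(p * (1 - p)), which is largest at p = 0.5, and the standard error of the observed ratio is that value divided by sqrt(n).

```python
import math


def per_flip_sd(p):
    """Standard deviation of a single 0/1 outcome with probability p of a 1."""
    return math.sqrt(p * (1 - p))


def margin_95(heads, flips):
    """95% normal-approximation margin for the observed ratio of heads:
    1.96 * per-flip SD / sqrt(number of flips)."""
    p = heads / flips
    return 1.96 * per_flip_sd(p) / math.sqrt(flips)


# 100 flips each: a 90/10 split vs. a 50/50 split.
print(per_flip_sd(0.9), per_flip_sd(0.5))
print(margin_95(90, 100), margin_95(50, 100))
```

A lopsided coin produces less scatter from flip to flip, so the same number of flips pins down its ratio more tightly; the number of flips only controls the 1/sqrt(n) factor.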
Ralph Melton wrote: I do appreciate your discussion, and I hope that my tone has been one of constructive discussion as well.
Of course. I like Pandemic a lot and am happy when such a good piece is contributed to the boards (even if I strongly disagree with its conclusions ;P)




Ralph Melton wrote: Smellybluesocks also raised the question of what the null hypothesis should be for these statistical questions. I've grappled with this myself, and I'm not sure what the answer should be.
For example, let's take my initial hypothesis that the Troubleshooter is kind of weak. We could phrase this in these ways: 1. The probability of a win when the Troubleshooter is playing is lower than the overall probability of a win. 2. The probability of a win when the Troubleshooter is playing is lower than the probability of a win when the Troubleshooter is not playing.
Algebraically, those two statements are equivalent; whatever the actual probabilities are, those statements are both true or both false. But with my data (assuming that my calculations are correct), #2 is statistically significant at p < .05, but #1 is not. So the phrasing matters, and I don't know what it ought to be.
I actually disagree that those are algebraically the same. A short proof:
With x as the win rate, statement 1 compares two quantities. The first is the same as in statement 2: the win rate of the Troubleshooter (TS).
TS win rate = x_TS = Wins_TS / GamesPlayed_TS
The second, which it is compared to, is not the same equation as in 2.
Overall win rate = x_Overall = (Wins_TS + Wins_role2 + Wins_role3 + ... + Wins_roleN) / Games_total
2 has the same x_TS, so we'll skip that.
The second equation is different, as it no longer includes the TS in the overall win rate.
Win rate with no TS = x_noTS = (Wins_role2 + Wins_role3 + ... + Wins_roleN) / (Games_total - Games_TS)
Otherwise you are correct. If one of those statements is true, the other must also be true, and if one statement is false, the other must also be false.
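The reason the two statements stand or fall together is that the overall win rate is a weighted average of x_TS and x_noTS, so it always lies between them. A minimal sketch of that check follows; it simplifies by splitting games into those with and without the Troubleshooter, and the counts are hypothetical.

```python
def win_rates(wins_ts, games_ts, wins_no_ts, games_no_ts):
    """Return (x_TS, x_Overall, x_noTS) for games split by whether the
    Troubleshooter was on the team."""
    x_ts = wins_ts / games_ts
    x_no_ts = wins_no_ts / games_no_ts
    # The overall rate is a games-weighted average of the two rates above.
    x_overall = (wins_ts + wins_no_ts) / (games_ts + games_no_ts)
    return x_ts, x_overall, x_no_ts


# Hypothetical counts: one case where TS does worse, one where it does better.
for counts in [(100, 140, 400, 446), (130, 140, 350, 446)]:
    x_ts, x_overall, x_no_ts = win_rates(*counts)
    # Statement 1 (TS below overall) and statement 2 (TS below no-TS) agree.
    print(x_ts < x_overall, x_ts < x_no_ts)
```

The statements agree in truth value, but the gap between x_TS and x_Overall is always smaller than the gap between x_TS and x_noTS, which is one reason the two comparisons can differ in statistical significance.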
Ralph Melton wrote: There's another question of whether 95% confidence is really strong enough. With 13 different Roles I'm evaluating in multiple different configurations, it's fairly likely that there would be some false positives: Roles that look better or worse at a 95% confidence level even though they aren't really so different.
For example, according to SBS's calculations, Epidemiologist performed worse than average to a statistically significant degree in 4P5E. But in 3P5E, Epidemiologist performed better than average (87.9% vs. 84.5%, not statistically significant). And in 5P5E, Epidemiologist performed better than average (94.3% vs. 92.8%, not statistically significant).
I have no explanation of why 4P5E would be particularly worse for Epidemiologist than 3P or 5P. So I'm inclined to consider that a fluke instead of a real result, but I wish I had a more rigorous notion of how to weed out those flukes.
(One possibility for the Epidemiologist specifically: since I'd used house rules for the Epidemiologist (and scorned the Epidemiologist since I thought it was weak), I didn't have as extensive experience with the Epidemiologist before I started these experiments as with other Roles. And I played most of the 4P5E games before playing the 3P5E and 5P5E games. So perhaps I was playing Epidemiologist badly in my early games.)
I won't lament too much on the reasoning other than say that it's important to think of why the data may not be accurate, but to avoid too much assumption that it is if there isn't a good reason to. The epidemiologist gets more uses of her once per turn action in a 3p game, for instance, which might make it more useful, while in a 5p game she may have more people to easily trade with than a 4p game and doesn't have a significantly decreased number of trades going from a 4p to 5p game. (I don't know that these ARE true, but these certainly seem like they could be plausible explanations, as many roles do get significantly better/worse for varying player games).
That aside, I have wondered about the CI purely from a numbers perspective. For instance, is 140 games even enough to be statistically significant in a 4P game? With 13 total roles, there are 12 other possible roles to fill the second slot, 11 the third, and 10 the fourth.
Following C(n,r) = n! / ( r! (n - r)! )
where n is the total possible in the group (13) and r is the subgroup size (4):
There are 715 unique subgroups of roles. With only 579 games played, it is impossible that you have played every single combination. It certainly begs the question of randomness of the roles and makes it significantly harder to suggest more precisely why a role is performing better or worse than the others.
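The count is easy to confirm in Python. The second figure, the number of four-Role teams that contain one particular Role, is an extra illustration rather than something stated in the post.

```python
import math

TOTAL_ROLES = 13
TEAM_SIZE = 4

# Number of distinct four-Role teams that can be drawn from 13 Roles.
teams = math.comb(TOTAL_ROLES, TEAM_SIZE)

# Number of those teams that include one particular Role:
# fix that Role, then choose the other three from the remaining twelve.
teams_with_given_role = math.comb(TOTAL_ROLES - 1, TEAM_SIZE - 1)

print(teams, teams_with_given_role)
```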
A more systematic approach may be necessary to learn more about the statistics of the games. There are a pair of excellent threads by user swatso over in the strategy section of these forums.
SBS.


Bart Rachemoss
United States Silver City New Mexico
The spirit of BGG as expressed by Esgaldil: "We lose nothing by trying to be helpful"

Overall this was a fantastic post. Thank you.
Ralph Melton wrote: The Scientist can discover a cure with four city cards of the same color, instead of the usual five. The Epidemiologist can take a card from another player in the same city once per turn as a free action. I thought that the limit on how often the Epidemiologist could take cards and the fact that the Epidemiologist still had the limit of seven cards in hand made the Epidemiologist clearly inferior to the Scientist.
I agree the Epidemiologist is weaker than the Scientist, mostly due to the hand size limit. (I've used New Assignment to get the Scientist role a number of times but never to get the Epidemiologist; OTOH, maybe this is suboptimal play on my part.) But I think it is closer than what you say might suggest, because cards taken by the Epidemiologist can be used for things other than making cures, while the Scientist's ability only works for cures.
In general (but not always) the power of a role depends on how many more choices it opens up. This is one of the reasons why the Dispatcher is so powerful. It is easy to overlook some of the less obvious choices a role creates and thus underestimate the power of some of the roles.


Snooze Fest
United States Hillsborough North Carolina
We love our pups!! Misu, RIP 28 Nov 2010. Tikka, RIP 11 Aug 2011.

OK, but what are your results with 2P6E?



