
Subject: Let’s Talk Tapestry Data (after >1300 plays)

J Kaemmer
United States
Iowa City
Iowa
Let’s Talk Tapestry Data (after >1300 plays)

Some of you may recall that in early September, around the time the first copies were being delivered in North America, I posted a request to the community to help me collect data on the new Stonemaier boardgame: Tapestry.

The link to the Google Form used to record the info: https://bit.ly/TapestryStats
The link to the Google Sheet containing the results: https://drive.google.com/open?id=11OksMsWiV_g7NApFyaknSkg83A...

This effort was inspired by Stonemaier itself: the rulebook links directly to the official website, where players are encouraged to log their results. I applaud the desire to provide long-term support for a game, including data-based balance fixes. I decided to perform an independent review because I simply LOVE sifting through big piles of data, and I thought the effort could be improved slightly with a marginally more thorough survey.

To that end I shared the initial form before I even received my own copy, and over the past 50+ days you all have helped me log 1351 separate games as of the time I started writing this (more have likely been added since). I wanted to share the results with you!

Firstly, I want to point out that this is a LOT of information. This is not just 1351 games (trials/samples); it's 3128 civilization plays (data points). That volume makes statistical analysis more than possible, even when drilling down to individual player counts. So please, let's get ahead of any criticisms regarding sample sizes. These are hard numbers, not a psych study, not even a bio one with dozens of confounding variables. As a rule of thumb, statistical tests become reliable at around 30 data points, and once we get over 100 we can start treating the sample as substantially representative of the population. This is GREAT data and we can start pulling real conclusions from it, so thank you to everyone who contributed.

(WITHOUT FURTHER ADO, I APOLOGIZE FOR THIS ENORMOUS WRITE-UP, BUT IT'S WORTH THE READ)

--------------------------------------

Player Counts and Civ Choice

First up let’s talk about Player Counts and our Civilization distribution:



Of the plays recorded for my study, the majority were 2-player. I am not surprised that player count is inversely proportional to the number of games recorded. It's easier to get 2 people together than 5, and when you play with 5 it can be difficult to ask everyone to pause the evening of fun so you can punch in the numbers. I AM surprised by the number of solo plays, but I suppose it follows the same line of logic. Interestingly, because higher player counts yield more data points per game, we still have MORE data from multiplayer games than from solo, with 4-player making up the second largest share of data points (2p=940, 4p=796, 3p=653, 5p=384, Solo=355).

Regarding civilizations, we have pretty even sample sizes for each of the civs, with some noteworthy deviations. In a perfect world we would expect to see each civ 1/16th of the time, or 6.25%. Instead, some have recorded relatively more plays than expected, namely the Merrymakers, Leaders, Craftsmen, Militants, and Entertainers.

I would posit this is because they are very straightforward civs that give obvious and automatic benefits (a few reviewers also mentioned some of these directly as good civilizations for new players to learn the game with).

On the low end for selection rate are the Chosen, Mystics, and Traders. I would again posit this is a direct result of these civs having abilities that require a greater base-level understanding of the game to use effectively. Each of these 3 civs interacts with and depends on how other players behave, so they demand a certain level of planning ahead to get the most out of them. I would call these civilizations "advanced" and I am not surprised we received less data on them (there is also the fact that early forums discussed the mechanical matchup problems of Chosen v Futurists ad nauseam).

I will confess that the Mystics and Chosen landed a bit fewer plays than I would like, with some other civs recording over twice as many games, but they still have more than 100 each. That is plenty to work with, just not as robust across the board. On top of that, we DO have every pairwise combination of civs recorded, but we don't yet have every pairing at every player count, or in sufficient quantity to drill too deeply into every matchup (so get working, friends!).

---------------------------------

Distributions

Next up I want to talk about point distributions. I know you readers probably want to jump straight to tier lists, but bear with me. The statistical tests of significance that I plan on using further into this discussion depend on an assumption that score distributions are "Normal," so I wanted to nail that assumption down concretely. Normality refers to distributions that look like the bell curve you've probably seen in your grade/primary-school days. Big surprise incoming… they ARE normal… or at least close enough for government work.



The most effective way to determine normality is mostly to eyeball it (I'm not really kidding), but when looking at the results we do see some long tails and slight lopsidedness, so I made sure to also quantify the "skewness." Using both Pearson's and Galton's equations for skew, we can pretty safely say that the majority of these histograms are symmetrical enough to call normal (we want skew to equal 0.0; absolute values in the range of 0.5-1.0 are considered moderate, and values > 1.0 are definitely skewed).
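For anyone who wants to check my skew numbers, here is a minimal sketch of the two measures I used: Pearson's second skewness coefficient and Galton's (Bowley's) quartile skew. The sample scores below are made up purely to show the calculations; they are not survey data.

```python
import numpy as np

def pearson_skew(scores):
    """Pearson's second skewness coefficient: 3 * (mean - median) / stddev."""
    scores = np.asarray(scores, dtype=float)
    return 3 * (scores.mean() - np.median(scores)) / scores.std(ddof=1)

def galton_skew(scores):
    """Galton (Bowley) quartile skew: ((Q3 - Q2) - (Q2 - Q1)) / (Q3 - Q1)."""
    q1, q2, q3 = np.percentile(scores, [25, 50, 75])
    return ((q3 - q2) - (q2 - q1)) / (q3 - q1)

# Made-up scores, just to exercise the functions (NOT real survey data).
sample = [150, 160, 175, 180, 185, 190, 195, 200, 210, 230, 280]
print(pearson_skew(sample), galton_skew(sample))
```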



Note I said a majority of the civ results, however. There is one specific civilization that exhibits moderate skew to the right, and I'm sure you're shocked to learn it is the Futurists! The numbers (gsk=.26, psk=.51) are still within reasonable parameters, so we don't need to throw the normality assumption out, but this is probably our first indication that there's something weird going on with them. Nomads (gsk=.16, psk=.39) and Architects (gsk=.09, psk=.39) also exhibit some skew tendencies, but not enough to even be called moderate… maybe "slight" would be appropriate.

Other skew calculation methods might indicate that more of the civs have moderate skew, but that has a lot to do with the presence of positive outliers. See the Entertainers, who currently have 14 outliers and a Fisher-Pearson skew of 1.2; that figure should be questioned because they also have an extreme kurtosis (a measure of tail weight relative to the standard deviation) of 2.51. To eliminate the effects of these long tails we could simply trim off the outliers, since high kurtosis values do not necessarily mean that something is not normally distributed, just that it is more dispersed. We will not be doing the analysis on filtered data for this breakdown, but I wanted to mention that at least one method of analyzing skew does indicate at least 2/3rds of the civs have moderate skew. I am not concerned, though, because dispersion and long tails are fairly consistent across the civs and work with our assumptions.
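If you'd rather lean on a library, scipy reports the adjusted Fisher-Pearson skew and excess kurtosis directly. Again, the scores below are fabricated stand-ins, not the Entertainers' actual data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Fabricated stand-in for one civ's scores: mostly normal with a few high outliers.
scores = np.concatenate([rng.normal(190, 55, 180), [350, 370, 400]])

print("skew:", stats.skew(scores, bias=False))      # adjusted Fisher-Pearson coefficient
print("excess kurtosis:", stats.kurtosis(scores))   # 0 for a perfectly normal distribution
```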

On this subject, I want to comment further on the shape of the data distributions. The Tapestry score data exhibits very tight spreads (stddev = 50 to 70) as well as a consistent, if slight, skew to the right, because we have outliers in the form of exceptionally high scores that drag the data out to the right. I posit this is the result of particularly good synergies between Tapestry cards, tiles, bonus civs, etc. that occur only infrequently (i.e. someone got lucky). The one civ that noticeably bucks the trend of right-skewed but otherwise quite normal distributions is the Chosen. Their distribution is mostly symmetrical, but it isn't quite the right shape (negative skew and highly negative kurtosis). In fact, it's almost bimodal (i.e. two peaks), and the lower peak is 'heavier' than the higher one, which has curiously shifted the mean below the median. A deeper investigation into the Chosen may yield interesting results but is outside the scope of this discussion.

I’d also like to share these cool box plots that identify the outliers and compare relative distribution shapes.



Note that outliers were identified using 1.5x the interquartile range (the difference between the 25th and 75th percentiles) added to the upper quartile or subtracted from the lower quartile, which is fairly common practice. This means that a score of 400 for the Chosen would be considered an outlier because it is well beyond the upper fence: 1.5*(209-127)+209 = 332, which is less than 400. I do use a filtered mean (without outliers) when comparing civs for most of this discussion, but I do not use filtered data sets for the statistical tests.
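The 1.5x IQR fence is only a couple of lines of code if you want to reproduce the filtering. This is just a sketch; the Chosen quartiles are the ones quoted above.

```python
import numpy as np

def iqr_fences(scores, k=1.5):
    """Return (lower, upper) Tukey fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(scores, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def drop_outliers(scores, k=1.5):
    """Keep only scores inside the fences (the data used for the 'filtered mean')."""
    scores = np.asarray(scores, dtype=float)
    lo, hi = iqr_fences(scores, k)
    return scores[(scores >= lo) & (scores <= hi)]

# Sanity check against the Chosen example above: Q1 = 127, Q3 = 209.
print(1.5 * (209 - 127) + 209)  # -> 332.0, so a score of 400 lands outside the upper fence
```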

-------------------------------------------

Final Scores

Now let's stop going over nitty-gritty stats validation and get into the part everybody cares about. Here are the results for scores, with outliers excluded.



For reference, the global median and mean are both in the 190s, so we've got a big chunk of the civs pretty well balanced; 6 of the 16 fall very nearly on those values.

BUT we do have some more exciting results. 4 civs score much higher than the mean, and 4 much lower. And when I say much higher or lower, I'm not talking about coincidence either. I'm talking about statistical significance.

At p<.001 (or with >99.9% confidence) the Futurists (230), Craftsmen (214), Heralds (208), and Militants (204) have score distributions that sit above the survey population. If we compare the results of a single civ (let's say the Futurists) to every other data point not from that civ (i.e. the 15 non-Futurist civs), we would reasonably expect the scores in question to be a representative subset of the population. In the Futurists' case that's 170 score samples against the remaining non-Futurist scores out of the 3128 total, and those 170 data points fall so far above that expectation that we can say, with over 99.9% certainty, that they are too high to be representative of the whole. The aforementioned civs all returned results so far outside of expectations that we can say they are definitely not representative of, and are significantly greater than, the average score distribution found in Tapestry.
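A two-sample Welch t-test of one civ's scores against everyone else's is one standard way to make this kind of comparison; take this as a rough sketch of the idea rather than my exact procedure, and note the data frame below is fabricated just to make it runnable.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Fabricated stand-in for the survey sheet (assumed layout: one row per civ play).
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "civ": ["Futurists"] * 170 + ["Other"] * 2958,
    "score": np.concatenate([rng.normal(230, 55, 170), rng.normal(190, 60, 2958)]),
})

def civ_vs_rest(data, civ_name):
    """Welch's t-test of one civ's scores against all other civs' scores."""
    civ = data.loc[data["civ"] == civ_name, "score"]
    rest = data.loc[data["civ"] != civ_name, "score"]
    return stats.ttest_ind(civ, rest, equal_var=False)

res = civ_vs_rest(df, "Futurists")
print(f"t = {res.statistic:.2f}, p = {res.pvalue:.2e}")  # a tiny p mirrors the p < .001 result
```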

Similarly, we found that Traders (161), Alchemists (166), Chosen (167), and Merrymakers (179) fell below the expected distribution (also with p<.001, or 99.9% confidence).

Hovering around significance, but not quite to the same degree as the civs already discussed, are the Architects (181) and Entertainers (180). Their scores were lower, on average, than most other civs, but only to a weaker level of significance (p<.05, or 95% confidence).

I want to editorialize a bit further here. These results do not necessarily mean that Tapestry is an unbalanced game. It is not my goal, nor within the scope of this data, to make that claim. It DOES mean, however, that certain factions perform better or worse than average, with quite significant certainty. In fact, the Futurists in particular are statistically significant to such a degree that they score higher than average not only at the overall level, but even when controlling for player count! In the same test space we can see the Traders also score lower on average when controlling for player count. Other civs show significance at specific player counts, just not completely across the board like the previous two; the Alchemists, Craftsmen, and Chosen all share the distinction of significant differences at several player counts. More data would help expand this portion of the investigation.



To really ram the point home that score results depend heavily on civilization choice, I dumped out the data for an ANOVA test. The ANOVA determined that the p-value for intergroup score variation is 3.4E-43, which is a shockingly strong determination! We would have been comfortable with p<.001, but instead we received a result on the order of 10^-43. This means that, with extreme certainty, we can say the scoring potentials of the civilizations are NOT equal.
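For those following along at home, a one-way ANOVA is a one-liner in scipy. This sketch uses fabricated score groups just to show the call; the real test feeds in all 16 civ groups from the sheet.

```python
import numpy as np
from scipy import stats

# Fabricated scores for a handful of civs, purely to demonstrate the call.
rng = np.random.default_rng(7)
futurists = rng.normal(230, 55, 170)
craftsmen = rng.normal(214, 55, 230)
traders   = rng.normal(161, 55, 150)
others    = rng.normal(190, 60, 800)

f_stat, p_value = stats.f_oneway(futurists, craftsmen, traders, others)
print(f"F = {f_stat:.1f}, p = {p_value:.2e}")  # a tiny p-value means the group means differ
```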

-------------------------------------------

Trends

On the subject of scores, there are two other trends I haven't investigated in much detail, but I thought they might be worth mentioning.

Firstly, as time went on from the initial release of the game, player scores steadily increased on average. This might indicate that Tapestry rewards skilled play to some extent; skill must factor somewhat into performance and allow players to get better at the game. The general trends have also pretty consistently placed the civs in the same relative positions.



Secondly, player count is negatively correlated with average score. Across each civ, as player count goes up, average score goes down. This is to be expected given that as player count goes up, so does competition for landmarks. With fewer landmarks per player, there is reduced scoring potential from the capital mat and there are fewer chances for free resources gained by completing districts.



Regression analysis determined that player count does certainly impact scoring potential overall (p=8.68E-13), but it impacts some civs more than others. Each civ had a negative correlation (meaning that as player count went up, scores went down on average), but the effect was highly variable, with only half of the civs reporting statistically significant correlations. Some civs were majorly affected, like the Chosen (p=.00013), who on average lost almost 15 vp per player added to the game! On the other hand, 8 civs did not report significant results. This may in part be limited by sample sizes, but I believe there is a secondary explanation: most of these civs stand to gain something from increased player counts. The Inventors, Traders, Heralds, and even Militants/Nomads all benefit, at least marginally, from more opponents.
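As a sketch of what the per-civ regression looks like, here is the same kind of fit on fabricated (player count, score) pairs; the slope is the average vp change per additional player. This shows the shape of the analysis, not my exact script.

```python
import numpy as np
from scipy import stats

# Fabricated (player_count, score) pairs for one civ; the real analysis runs this per civ.
rng = np.random.default_rng(3)
players = rng.integers(2, 6, size=150)                     # 2- to 5-player games
scores = 220 - 15 * players + rng.normal(0, 40, size=150)  # roughly 15 vp lost per extra player

fit = stats.linregress(players, scores)
# fit.slope is the average vp change per additional player; fit.pvalue tests slope != 0.
print(f"slope = {fit.slope:.1f} vp/player, p = {fit.pvalue:.2e}")
```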

Perhaps surprisingly, Futurists are one of the big losers when more opponents get involved, likely because they no longer have guaranteed access to 2/3rds of the landmarks!



-------------------------------------------

Win Rates

Looking at win rates also yields some surprising conclusions. Firstly, some civs have obviously higher or lower win rates than expected. When controlling for the number of games played at various player counts, some exceed expectations by quite a bit. Unsurprisingly, the Futurists clean up relative to expectations, as do a few other civs. They win far more often than would be expected in a perfectly balanced scenario. To elaborate: based on the player counts for the games they participated in, we would expect the Futurists to win a little over one third of their games. Instead, they win an incredible 58.5%!!!

It's pretty easy to see that 58.5% is much greater than 34.7% (a difference of over 23 percentage points), but this is not just a spurious result. It is also statistically significant at p<.001, or 99.9% confidence. Other significant results for win rates include the Militants (44.7%) and Craftsmen (41.8%) on the high end, and the Alchemists (22.5%), Merrymakers (20.9%), and Traders (12.8%) on the low end. The Traders in particular are very much below expectations, winning only 12.8% of the games they participate in as opposed to the expected 33.4%!!!
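Here is a sketch of the kind of test behind those win-rate claims: an exact binomial test of observed wins against the win probability implied by the player counts of that civ's games. The Futurist numbers plugged in below are the ones quoted above; I'm not promising this is my exact procedure, just the general idea.

```python
from scipy import stats

# Futurists from the text: 170 plays, 58.5% wins, ~34.7% expected from their player counts.
plays = 170
wins = round(0.585 * plays)        # roughly 99 wins
expected_rate = 0.347

result = stats.binomtest(wins, n=plays, p=expected_rate, alternative="greater")
print(result.pvalue)               # far below .001, matching the significance described above
```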



When controlling for player count we still see significance that again emphasizes the unusual performance of our usual suspects. The Futurists have significantly positive results in 3 out of 4 player counts, while the Merrymakers and Traders have significantly low results in all 4 multiplayer modes.



Perhaps more surprising is that average score does not fully correlate with win rate. It might end up trending in that direction over time, but the results so far are very much dependent on which civs are in the games (matchups). This, combined with the uneven play counts for the civs, means that while we see some obvious winners and losers, there are still some pretty messy results in the middle.

I also ran ANOVA on placement (1st place, 2nd place, 3rd place, etc.) and found that the intergroup variation returned a p-value of 3.46E-17. That seems pitiful compared to the previously discussed score results, but it is still a very strong conclusion that the competitive placement potential of the civs is not equal.
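Since placement is ordinal (1st, 2nd, 3rd…), a rank-based Kruskal-Wallis H-test is a common cross-check alongside ANOVA. The sketch below runs it on fabricated placements purely to show the call; it is an alternative illustration, not the test I actually reported.

```python
import numpy as np
from scipy import stats

# Fabricated finishing places (1 = winner) for three civs, just to demonstrate the call.
rng = np.random.default_rng(11)
futurists = rng.choice([1, 2, 3, 4], size=120, p=[0.55, 0.25, 0.12, 0.08])
traders   = rng.choice([1, 2, 3, 4], size=120, p=[0.12, 0.20, 0.30, 0.38])
militants = rng.choice([1, 2, 3, 4], size=120, p=[0.40, 0.25, 0.20, 0.15])

h_stat, p_value = stats.kruskal(futurists, traders, militants)
print(f"H = {h_stat:.1f}, p = {p_value:.2e}")  # small p: placement distributions differ by civ
```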

-------------------------------------------

Matchups (Pairwise Comparisons)

Regarding those matchups, I performed some pairwise comparisons based on winners as well as score differentials. I'll mostly let the charts speak for themselves. We don't have as many matches between certain civ pairings as we need to reach statistical significance much of the time, but there were still some very strong conclusions to be found for certain civs.

Matchup Counts


Matchup Score Comparisons


Matchup Win Rate Comparisons


The big takeaway here is that the Traders do NOT fare well directly against any other civ. They have losing records across the board (statistically significant against all 15 civs) and statistically significant scoring disparities against 7 of 15 civs. The Alchemists (11 for win rates, 6 for scores) and Merrymakers (8 win, 8 score) also have quite a few significant losing matchups. Conversely, the Futurists have a fairly significant account of winning matchups (11 win, 9 score), as do the Craftsmen (7 win, 10 score).

-------------------------------------------

Justifying Purpose

In regard to the original purpose of this investigation, to provide high-quality data analysis of Tapestry plays based on Stonemaier's idea published in the rulebook, I want to justify the need for this second study. The difference between the two methods is that this study collects scores and civs for every participant, as opposed to just the winners.

Here are the limited, winners-only results. Focusing only on the winners produces very different answers, limited severely by numerous confounding variables such as player count and opposing civs.



Reported wins for my data set show very few Traders (1.5%), but oddly the Futurists (7.4%) poll lower than several other civs, even though we know they have massive performance disparities in the full data set. This is mostly because the Futurists are played less frequently than other high-performing civs like the Craftsmen (9.3%) or Militants (9.6%) that also have strong results. While they win a high proportion of their games, they are played less often, so the results are artificially hidden. On the other hand, the winners-only view magnifies the problem with the Traders, who are seldom picked AND have notable performance issues.

Winning scores DO follow the trend of average scores pretty closely (increased by about 30-ish points, with an average winning score of 232), but a few civs got jumbled up in the rankings. This is likely because when certain civs hit a good stride they perform quite well and the final score is greatly inflated over their average. Somehow, the Traders (208), the civ with the lowest win rate, the lowest average score, and consistently bad matchups against every other civ… are not at the bottom of the score rankings when you look only at data from the winners. This is EXACTLY the blind spot we are trying to avoid.

In the same way, the Inventors (241) and Heralds (254) perform exceptionally well when they win, beating out several other high-powered civs like the Militants (237). I posit these three civs (Traders, Inventors, and Heralds) benefit greatly from synergies established by random elements, namely the explore tiles, the technology offer, and the Tapestry cards, respectively. If the right stuff comes out early, they can snowball into a much greater final score than would normally be expected.

-------------------------------------------

Final Remarks

To close, I suppose I need to address a few potential sources of error or confounding variables for this to be a relatively scholarly write-up. I fully understand this data is not completely without fault. Sources of error are introduced by my leaving out several potentially impactful variables. Player skill, potential rules mistakes, which Tapestry cards are played, secondary civilizations, which tracks were pursued, and turn order all might have not-insignificant effects on the outcome of these games. I cannot say that the entirety of the variability seen in these scores stems from the choice of starting civ…

I CAN say, however, that the choice of starting civ DOES still contribute to the relative success rate and scoring potential, with extreme statistical certainty.

A more robust study could definitely tease out the influence of each of these variables and seek to create a unifying theory to balance them all… but that is not my goal. All I wanted to do was to compare the 16 civilizations in this game and to determine if there was any amount of disparity stemming from their innate abilities…

-------------------------------------------

Theory-crafting Nonsense

This leads me to my final section, one entirely based on theory but structured from these results and my experiences. This is my attempt at explaining WHY these civilizations exhibit the differences they do.

I theorize that the core differences between the civs, even though most operate in very different ways, can be distilled to a single concept I will call "resource advantage."

I posit that every mechanic and benefit present in Tapestry has a quantifiable resource equivalent and can therefore be directly compared. Each civilization therefore brings a different amount of potential and expected resource equivalents to the table, resulting in some that have a "resource advantage."

Here are the mechanics/benefits present in Tapestry and their resource equivalents.

- 1 Resource = 1 Resource (duh, easy)
- 5 vp = 1 Resource (look at the exchange rates on the Guilds Tapestry card, the Merrymakers civ, and the explore track's bonuses)
- Tapestry Card = 1 Resource (second spot on 3 tracks; see also Tapestry cards for vp exchange rates)
- Tech Card = 1 Resource (see the tech track: first space, third space, and bonuses)
- Upgrade = 1-3 Resources, depending on the returns and the timing
- Income Building = 3 Resources (assuming it is pulled in the first era, that's what it pays out; gaining it in later eras reduces its advantage)
- Free advance/Science die with benefits = 3 Resources (assuming it is used to advance on a track at a level that gives max benefits)
- Advancing without benefits/Science die without benefits = 1 Resource (you might not give this a value, but the success of the Futurists should tell you that is a mistake, so I value it just like the Science track does)
- 2 Explore Tiles = 1 Resource (see the explore track and Tapestry cards)
- Explore Action = 2 Resources (most give you at LEAST 1 resource worth of stuff, some give more, and you get vp, so it rounds up to 2 resources of value on average)
- Conquer Action = 1 Resource (while area control CAN get you more vp, and certain tiles with the right dice roll can get you more, most of the time this just returns a single resource or equivalent vp)

Using these exchange rates, plus some interpretation of the circumstances around certain abilities, I was able to rate each civ on its potential resource advantage, its minimum resource gain, and its expected resource advantage in an average game.
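To make the bookkeeping concrete, here is a toy version of how those exchange rates get applied: list a civ's expected benefits, convert each to resource equivalents, and sum. The example benefit counts are illustrative guesses, not my actual worksheet.

```python
# Exchange rates from the list above, expressed as resource equivalents.
RATES = {
    "resource": 1.0,
    "vp": 1.0 / 5,            # 5 vp ~ 1 resource
    "tapestry_card": 1.0,
    "tech_card": 1.0,
    "upgrade": 2.0,           # midpoint of the 1-3 range
    "income_building": 3.0,   # assuming it arrives in the first era
    "free_advance_with_benefit": 3.0,
    "free_advance_no_benefit": 1.0,
    "explore_tile": 0.5,      # 2 tiles ~ 1 resource
    "explore_action": 2.0,
    "conquer_action": 1.0,
}

def resource_advantage(benefits):
    """Sum a civ's expected benefits (counts keyed by benefit type) into resource equivalents."""
    return sum(RATES[kind] * count for kind, count in benefits.items())

# Illustrative only -- NOT an actual civ's ability breakdown from my ratings.
example_civ = {"resource": 2, "tapestry_card": 1, "free_advance_with_benefit": 2, "vp": 10}
print(resource_advantage(example_civ))   # 2 + 1 + 6 + 2 = 11 resource equivalents
```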



After synthesizing these results, something very compelling jumped out at me: the resource-advantage ratings predict the tier list that emerged from the actual play results with incredible accuracy! I truly think the balance in Tapestry can be distilled down to the fact that some civs consistently gain a large resource advantage from the very start. The Futurists are so dominant because they receive an exorbitant 20 resource equivalents automatically at the beginning! Nobody comes close to that, and the Futurists will never receive less than the full benefit. On the other hand, the Traders receive almost nothing guaranteed, and cap out at a miserable 11 resource equivalents assuming perfect situations arise.




This means that the score data from earlier in this study, the matchup results, and the current resource advantage of each civ likely provide our way forward towards a more "balanced" outcome. If we can increase or mitigate the relative resource advantages of the various civilizations, we could potentially eliminate the perception of inherent imbalance in the game (other chance elements like tiles, cards, and dice rolls will remain, but no one will have an inherent advantage or disadvantage from the start).

Personally, I would prefer that any future balance changes stay within the existing mechanics of a civ, or that the buffs/nerfs come as bonuses early in the game, rather than as end-of-game vp handicaps. This is because resource advantage is not likely to scale linearly; it also depends on the time frame in which the resource equivalents are gained.

Here are some crazy balancing ideas to kick off further discussion (though I make no promises of balance, nor do I want you to enter play statistics from variant games):

Alchemists: Need a buff of about 4 resources, I think. How about the first time they bust they get to keep the results; they can keep rolling, but a second bust loses everything. Or maybe they get a resource of their choice regardless of busting?

Architects: already have a very strong top potential but maybe need a little boost of 2 resources or so. So what if they also gain 3 vp each time they complete a perfect district?

Chosen: probably need about 2-4 resources in advantage. How about they draw +2 Tapestry Cards to start and are guaranteed to go first?

Craftsmen: need a nerf of 4-6 resources, so what if they cannot claim landmarks or score their capital mats?

Entertainers: need about 2 more resources, what if they get to take their first move forward in the 1st era instead of the 2nd era?

Futurists: They need to be nerfed 10-ish resources, so how about they take a 10vp penalty each time they are not the first civ to enter an era, plus they should only start with 2 bonus resources of their choice.

Heralds: These guys might need a nerf of 3-5 resources, what if when they copy another civ’s Tapestry card, they need to give that civ a resource?

Historians: No change needed

Isolationists: No change needed

Inventors: No change needed

Leaders: No change needed

Merrymakers: Need a buff of about 2 resources, so similar to Entertainers maybe they should get to move up their track starting in the first era?

Militants: could probably be nerfed 2-4 resources, so what if they just didn’t get the VP from their track?

Mystics: No change needed

Nomads: No change needed

Traders: need a buff of 4-6 resources so maybe double the benefits from placing the cubes?
Would it be possible to use the play-database with identified players to perform an Elo-analysis and figure out how much skill matters?

How much of the variance in score is due to skill-disparity?

EDIT: Saw that there was no player-identification so that this can't be done. Oh well. It would be interesting to figure out how much of the variance is just randomness and how much is due to depth.
Zachary Homrighaus
United States
Clarendon Hills
Illinois
Elusive2 wrote:
Would it be possible to use the play-database with identified players to perform an Elo-analysis and figure out how much skill matters?

How much of the variance in score is due to skill-disparity?

EDIT: Saw that there was no player-identification so that this can't be done. Oh well. It would be interesting to figure out how much of the variance is just randomness and how much is due to depth.

Agree,

I applaud OP and think this is really solid work... especially like the theory-crafting nonsense at the end.

BUT, I am still convinced that a meaningful portion of the data collected is garbage (or at a minimum not meaningful) because Tapestry has a distinct learning curve. If we ignore solo play, we are dealing with people here. One of those people owns the game and almost certainly has more experience with it than those they are playing against... except for the first play, in which no one at the table knows what they are doing. This applies to both strategy and the basic rules of the game. Without some way to indicate player experience, it's hard to say X or Y data point is meaningful unless the goal is to support some theory about which Civs are best for new players. Without doubt, Futurists is a strong Civ and probably is not in balance with the others. I suggest that part of the reason we see this in the data is exactly the reason you stated at the end of your post. Futurists get their 20-resource advantage before the game begins, and zero skill is required from that point forward. Compare that to the Architects or Mystics, where you gain nothing to start and need to play a strategy gained through experience to benefit. Put another way, Futurists have a very high floor that is relatively close to their ceiling. Other Civs have a much wider gap between floor and ceiling, and in the hands of an inexperienced player they are much more likely to be scraping the floor than reaching the ceiling.

Don't get me wrong. I'm not saying the data analysis is flawed or that the aim of the exercise is pointless. I just think there needs to be some attempt to qualify the players before suggesting nerfs/buffs.
Joe Pilkus
United States
South Riding
Virginia
J.,

I am absolutely awed by both the breadth of your post and the fact that you made it eminently accessible to a non-math guy. This seems to underscore what many of us, myself included (who often work off gut instinct... it's a Bureau thing), have known all along, and this level of analysis provides all the information one would want to make the case. The ideas you have presented have no doubt been seen by Jamey. While he doesn't post often, typically limiting himself to answering questions, folks like Dusty, myself, and others certainly convey this level of work back to him, as it certainly helps in driving the next chapter of the game.

Cheers,
Joe
King Maple
Estonia
Tallinn
Harjumaa
All this makes me wonder how much (or little) even successful companies actually analyse their playtesting results. While some of the specifics are hard to define due to much smaller sample size, wouldn't Stonemaier detect the extremes really easily even with much smaller sample size?
J Kaemmer
United States
Iowa City
Iowa
The Professor wrote:
J.,

I am absolutely awed by both the breadth of your post and the fact that you made it eminently accessible to a non-math guy.

Thank you Joe! That is the highest praise I would want to receive.

I try quite hard to make my analyses accessible to non-professionals. In my career, I spend a lot of time in front of non-technical decision-makers helping them make educated decisions, so it's a skill refined over many years of practice.
Jeff Warrender
United States
Averill Park
New York
Slashdoctor wrote:
All this makes me wonder how much (or little) even successful companies actually analyse their playtesting results. While some of the specifics are hard to define due to much smaller sample size, wouldn't Stonemaier detect the extremes really easily even with much smaller sample size?

Well, and there's the related business question that's interesting: it would have been easier to have assembled a more complete data set for, say, 6 civs rather than 16, so it would be interesting to hear why they elected to push for the game to include 16, instead of rolling out some of the civs as expansions. I guess either way, there's no getting around having to assemble a massive amount of data the more possible combinations you allow. But I also suppose that in a scenario in which the base game has only 6 civs, its fan base becomes the potential playtest pool for some of the 'expansion civs' and gives you an easier time of generating those large quantities of data, perhaps.
Steve H
This is an amazing analysis! I love that you were able to pull this together and slice the data in so many different ways. Your plots are all illuminating, with great explanations and takeaways.

You have clearly proven, beyond a shadow of a doubt, that the various civilizations exhibit scoring and win rate differentials, some of them quite large. The caveat, as you mentioned, is that you can't draw a direct line between early performance and balance. Just because a particular civilization is scoring lower or higher than the others one week after release doesn't mean that civilization is unbalanced. There is a difference between early performance and potential.
Y P
United States
Mississippi
Thank you so much for the incredibly detailed and thorough analysis! I'll leave it up to the statistics experts to determine how valid your analysis is, but from a layman's perspective you did an excellent job of explaining it and it certainly seems to make sense.

This confirms in black and white (and blue and red ) what players have been reporting regarding how the factions stack up. We can argue about the definition of "balance" and what makes a game balanced until the cows come home, but the differences between civs are quite obvious even to the statistically-untrained eye.
 
Jamey Stegmaier
United States
St. Louis
Missouri
In case it helps, here are the latest stats from my simple survey:

Matt
zjhomrighaus wrote:
BUT, I am still convinced that a meaningful portion of the data collected is garbage (or at a minimum not meaningful) because Tapestry has a distinct learning curve. If we ignore solo play, we are dealing with people here. One of those people owns the game and almost certainly has more experience with it than those they are playing against... except for the first play in which no one at the table knows what they are doing. This can apply to both strategy and the basic rules of the game.

Right. "Bad data" is way worse than "not enough data". Controlling for player skill, as you point out, is a virtual requirement for a study of game balance.

But it's not just that. We have no idea which games in this sample are played with house rules like "look at two tapestry cards, choose one", which will artificially boost scores. We have no idea how many people have been playing key rules incorrectly.

zjhomrighaus wrote:
Don't get me wrong. I'm not saying the data analysis is flawed or that the aim of the exercise is pointless. I just think there needs to be some attempt to qualify the players before suggesting nerfs/buffs.

Yeah, everything in this post is very interesting, but quantity of data does not compensate for quality of data. It's hard to draw meaningful conclusions without knowing how these games are being played and who's playing them.
J Kaemmer
United States
Iowa City
Iowa
jameystegmaier wrote:
In case it helps, here are the latest stats from my simple survey:


Thanks for the comparison Jamey! (especially with twice as much data)

It looks like your top 4 match mine and validate each other:
-Futurists
-Craftsmen
-Heralds
-Militants
Sean Hagans
United States
Colorado
I love this type of data, but I agree that more information is necessary; for instance, unique data entities (players) as a defining filtering factor. The current analysis is based on the assumption that every data point comes from a unique entity. If we had data that helped us identify what sample of the player population these results represent, I would feel more secure in the analysis performed.
Matt P
Fantastic statistical analysis!

One thought on "quick balance":

Do you think just giving the lower-tiered civs bonus resources would be "good enough" to prevent weaker civs from getting blown out of the water? Say, +1 starting resource per tier (or maybe per two tiers).

So Futurists would get +0 resources, Nomads +1,...., Traders +4.

Alternatively if you had 3 tiers instead of 5, it could be +0, +1, and +2 to avoid "over" balancing the game.

One reason I'd suggest only adding resources is that it isn't as much fun if you reduce the amount of things a player can do (for example, by reducing starting resources).

The obvious issue here, of course, is that each civ may not snowball at the same rate, within the same tier.
 
jameystegmaier wrote:
In case it helps, here are the latest stats from my simple survey:


Jamey, since your data and ISwear's data seem to agree, perhaps an update is in order. I can imagine a sticker sheet included in future printings of Tapestry that players use to buff/nerf their civs. Perhaps for folks who've already purchased Tapestry, this could be made available as a PDF online...
Jamey Stegmaier
United States
St. Louis
Missouri
TheBosun: Indeed! My plan is to post it as text on our website and as a PDF, and then I may include that PDF in future printings and the expansion.
J Chi
Slashdoctor wrote:
All this makes me wonder how much (or little) even successful companies actually analyse their playtesting results. While some of the specifics are hard to define due to much smaller sample size, wouldn't Stonemaier detect the extremes really easily even with much smaller sample size?

There's a variety of factors that can affect and hinder the ability of companies to perform effective analysis. Note, I'm not claiming any of what I'm about to list is something Stonemaier has done; these are just broad issues that come up any time data analysis is done.

1) Getting meaningful data is hard. In a board game, with dozens of components, hundreds of potential matchups, thousands of possible cards and scenarios in play, how do you set up your "tests" to generate meaningful data? There's simply not enough time and resources to test every possible permutation. So this is where the "design of experiment" comes in. This allows you to minimize the number of cases you need to run in order to be able to make determinations about what you are testing (for example, if you're checking to see if a certain card or faction is overpowered). However, many board game designers don't have the math, science or engineering background to do this properly, so a lot of data they gather from playtests is "noise" and not very useful.

2) Interpreting the data is hard. Again, most designers don't have the background and unfortunately human instinct when it comes to statistics is generally poor. Our gut instinct of what is "fair" does not align with what the math says. So without proper training, a person might look at a faction winning often and dismiss it as "oh, well it only happens 5% more often. That's not too bad", but not really understand the significance of what that means.

3) Ego. Designers put a lot of work into their game, and it can be hard to accept that there's a problem. Data can be dismissed with "Oh, well they were just new to the game. If they understood it and played it properly, I'm sure it'll be fine."

4) Bad data. A big problem is what assumptions are being made when testing the game. Let me give a weird but plausible situation. Say you get a group of playtesters and generate thousands of points of data, but one assumption you failed to take into account is how they shuffle cards. If these playtesters do not riffle or mash shuffle, and only perform light overhand shuffles, the cards may stay clumped up the way they were at the end of the last game. Often, when putting the game away, players stack their cards together, so there's a lot of synergy within those groups of cards. Imagine if, for most of your data, the players weren't properly randomizing their cards. They would instead consistently be playing in such a way that the cards were revealed in groups with a lot of synergy, as opposed to being more random. That would change the game considerably and look much different in the data.

This is part of the reason why I'm always a little wary when designers make claims that their games were thoroughly playtested. It isn't just about the amount of data, but the quality, the control of the assumptions, and the background needed to perform a correct analysis and draw a correct conclusion.
Jesse Haulk
United States
Cumming
Georgia
Quote:
The most effective way to determine normality is to mostly eyeball it (I’m not really kidding)

Or you could...you know...use a QQPlot.
Bill Collins
United States
Connecticut
1) Excellent work.
2) Nice presentation. No seriously. Very nice presentation.
3) Clear articulation for the lay person is appreciated.
4) Concur on the resource gathering being the backbone of success in this game.
5) Interesting ideas on tweaking the Civs; maybe you could break them out into a separate thread for comment? (And link back here.)
6) Makes me wonder whether, if you subjected the Tresham Civilization game (Avalon Hill) to such analysis - admittedly with different variables - you could show mathematically that it's harder to play Crete. Which we all know already.

I'm curious. Did you break out the solo play data vs. the groups of humans at all? The reason that I ask is that while in theory the two categories should not matter, there's definitely a lot of anecdotal evidence from these forums that many people - especially myself - have taken time to get the automa rules down. In addition to the standard learning curve issues on the game, there are challenges to grokking* the Automa rules. I note from memory that there are at least 5, maybe up to 8, commonly missed rules in solo play that could be affecting results.

On the other point about only releasing 6 Civs vs. 16, I think that's a move that Stonemaier made to try to put their customers first. They tried to "get it all in" with a truly quality product that would not require anyone to buy something later. Which means that the point is one of "damned if you do; damned if you don't." Had they held back and announced 6 or 8 beginning Civs with more to follow there would have been a geeky hue and cry. Or rather, an additional subset of the complaints out there about this very decent game.

*There are a subset of players who read through the Automa rules, grasped them immediately and perfectly, and who have not had this issue. I applaud these folks and envy their gifts.
J Kaemmer
United States
Iowa City
Iowa
BlueSapphire wrote:
Quote:
The most effective way to determine normality is to mostly eyeball it (I’m not really kidding)

Or you could...you know...use a QQPlot.

I don't normally use those, but isn't that also just a visual comparison technique? (and therefore still eyeballing it?)
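For the curious, here's what that looks like in practice: a Q-Q plot is a couple of lines, and a Shapiro-Wilk test gives you a number instead of an eyeball. The scores below are fabricated just to show the calls, not survey data.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(1)
scores = rng.normal(190, 55, 200)          # fabricated stand-in for one civ's scores

# Q-Q plot: sample quantiles against theoretical normal quantiles (still a visual check).
stats.probplot(scores, dist="norm", plot=plt)
plt.savefig("qq_plot.png")

# Shapiro-Wilk is numeric: p > .05 means "no evidence against normality".
w_stat, p_value = stats.shapiro(scores)
print(f"W = {w_stat:.3f}, p = {p_value:.3f}")
```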
S R
Firstly, thank you for creating that spreadsheet and analysis. The data is very valuable. Also, given the retail release this month, a new thread will let new buyers know of the existence of this spreadsheet.

With 16 civs, I never expected Stonemaier to have vetted every single detail to make all civs "equal" in scoring potential. And your data clearly shows that some civs need fixing. I'm looking forward to an updated analysis in March 2020 with even more data. And I'm convinced that the findings will not be very different from what you have here.

This is already a great analysis, but here are a couple of suggestions if you're interested:
- I'd remove solo games from the overall average, or at least add another "Average (non-solo)" column.
- It might also be valuable to have two more average columns, 2&3p average and 4&5p average, because the board size is different between those two groups of player counts. It might show that certain civs do well at 2&3p while struggling at 4&5p. One theory I have is that the Chosen will do reasonably well at 2&3p, but struggle at 4&5p.

Once again, great job!
J Kaemmer
United States
Iowa City
Iowa
rsr15 wrote:

-I'd remove solo games from the overall average, or atleast add another "Average (non-solo)" column.
- Also, it might be valuable to have two more average columns - 2&3p average, 4&5p average - because the board size is different between those two groups of players. And it might show if certain civs are doing well at 2&3p, while struggling at 4&5p. One theory I have is that Chosen will do "reasonably well" at 2&3p, but struggle at "4&5p".

Those are fair suggestions.

- Filtering solo games is a pretty good idea. They don't mess up the data much right now, but there is a guaranteed response/selection bias there, AND some civs aren't even supposed to be played solo (at least one person oopsed that rule with the Futurists, btw). Weirdly, though, cutting out solo plays makes many average scores decrease even further.

- I hadn't really thought about map differences, but the fact that the center island is farther away probably doesn't help the Chosen (I attributed their struggles solely to competition, but that's probably flawed based on your point), and it does change some dynamics among other civs like the Nomads.
Devin Smith
England
Southampton
Hampshire
One facet of this data that I find surprising is just how low the average scores are compared to what I'd expect: in games played locally, 200 is almost always going to be last place in a 4- or 5-player game. (I'm behind on uploading scores, sorry.)

An interesting breakout would be to control for player skill in the win-rates section. Something like the game's average or median score could be used as a proxy for player skill given the anonymised data; this would work better in 4/5-player games. I would expect certain civs (Mystics, e.g.) to do better in the hands of better players.

________

Re: civ adjustments. There's a huge time-value-of-money factor in this game. +1 Resource in era 1 is... three times (?) as good as in era 4? Some number like that, at least.
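Back-of-the-envelope, and purely illustrative: if a resource compounds at some per-era rate r, then "era 1 is three times as good as era 4" means (1+r)^3 = 3, i.e. roughly 44% growth per era.

```python
# Purely illustrative: solve (1 + r)**3 = 3 for the implied per-era growth rate.
r = 3 ** (1 / 3) - 1
print(f"implied growth per era: {r:.1%}")   # ~44.2%, so one era-1 resource ~ three era-4 resources
```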
Bill Heaton
Canada
Victoria
BC
Woah, only on BGG. You Sir, are the MAN!
Bill Heaton
Canada
Victoria
BC
Slashdoctor wrote:
All this makes me wonder how much (or little) even successful companies actually analyse their playtesting results. While some of the specifics are hard to define due to much smaller sample size, wouldn't Stonemaier detect the extremes really easily even with much smaller sample size?

They (SM) playtest an obscene amount from what I know. Remember, the same people will test this over and over and therefore get better at the game, smoothing out the first-play issues that may come up with specific factions. It's a tough balance:

- you may know first play of certain factions can be tough
- you also know that after multiple plays it's been shown even with others
- however, many people will only ever play the game 2 or 3 times

What to do?