Subject: Rating with confidence - going TOTALLY NUTS with statistics!

Warning: This article is not for the faint-hearted.
You may want to take it one section per day. (I'm a stats geek. I do this for fun!? So sue me!)
Can BGG Rankings Quality Be Improved?
"Ratings" and "ranking" systems in general, and the BGG ratings/rankings in particular, are generally supposed to be an indication of how a community as a whole feels, ie a guideline as to how individuals new to an item may respond. I've been slicing and dicing the ratings numbers every which way, and presenting a range of alternatives over the past year. I've concluded that the BGG ratings do a pretty darn fine job in spite of everything I have to say below. The only "better" systems are sufficiently complex as to be frustratingly opaque to most users (without time and inclination to study how they would work) and give only marginal improvements in their representativeness.
This article explores ideas around several reasons why a game ranking might be "wrong" or "unjustified"...
We Are Not Clones
We are all individuals, unfettered by any requirement to individually agree with any overall game rankings. Which, to put it another way, means there's no reason why the rankings have to agree with you or me. But, hopefully, the BGG rankings measure total general feeling in some way.
So we're not trying to work out my favourites or yours, which anyone can see anyway if they want to. We're trying to work out ours.
Representative Samples - A (Self-)Selected Group Of Players
A game ranking may get an artificial boost if the people who rate it are self-selected for people who would particularly enjoy it. For example, a niche trivia game about stamp collecting in potholes under railway lines may only appeal to 40 people in the world. If the niche market is completely transparent then they're the only people who will ever try the game, and rate it. So the rating will be unrelated to what the general community would feel if they ever opened up that game. Perhaps this doesn't matter, since although the rating may draw attention to the game, the game itself will still be transparently uninteresting to most players. The less clearly niche a game's market is, the wider a variety of people will come to rate it.
Representative Samples - Confidence
Even if all the ratings are clean data and the current "Bayesian average" technique is the most appropriate technique, collecting ratings is like taking a poll. You don't get everyone, just a subset of the players. You hope their opinions are representative of the overall player group. But if the sample group is too small a subset, or if you just happen to get an unusual sample, you end up with a poor quality result. So how reliable is the current formulation? Assuming we are taking an arbitrary poll of people who've played each game, how likely is it that a rating is "wrong"?
I spent a lazy coupla (sick) days working out confidence intervals. This is a pretty messy calculation ... each game winds up with a distribution curve of the likely average rating for the entire player base, given the sample subset of players we have, and then there is some probability that the "real" averages of two games actually cross over from their current positions, causing games to change rankings. A game may swap its way multiple rankings up or down.
Assuming that we've all rated accurately from our hearts, it turns out the confidence intervals around the rankings based on Bayesian averages are pretty tight:
There are NO GAMES where the 99% confidence range stretches (asymmetrically) further than 7.5 ranks AND further than 6% of the current nominal rank. The biggest 99% confidence range of any game in the top 20 is Hannibal: Rome vs. Carthage which (for 99% confidence) may be up to "1.8" ranks worse than current. The biggest stretch for any game past #100 is Manila which might improve by up to 5.4%.
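For the curious, here's roughly the flavour of an interval calculation, as a Python sketch. This is NOT my actual calculation (which works with whole distribution curves and folds in the Bayesian phantom votes); it's just the plain normal-approximation confidence interval on a game's mean rating, run on made-up sample data:

```python
import math

def rating_confidence_interval(ratings, z=2.576):
    """Normal-approximation confidence interval for a game's mean rating.
    z=2.576 is roughly the two-sided 99% level. Illustrative only."""
    n = len(ratings)
    mean = sum(ratings) / n
    var = sum((r - mean) ** 2 for r in ratings) / (n - 1)  # sample variance
    half_width = z * math.sqrt(var / n)
    return mean - half_width, mean + half_width

# 50 made-up ratings clustered around 7.2.
sample = [6, 7, 8, 7, 7, 6, 8, 7, 9, 7] * 5
lo, hi = rating_confidence_interval(sample)
```

The interval tightens as the count of ratings grows, which is why well-rated games end up with such stable ranks.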
So at least one thing works around here... the actual calculation parameters give decently tight results. Now if only I could get users to rate along guidelines!
The Rating Game #1 - Multiple Accounts
The rankings may be adversely affected by people "gaming" the ratings system by adding multiple user accounts. I always find it astounding that fully 20% of the roughly 25,000 BGG user accounts that rate any games rate exactly one game! Perhaps some of those accounts rating only a few games are extra accounts created by people who want to unfairly influence the voting. In which case one supposes there would be some big counts of 9s, 10s, 1s, or 2s in the ratings by accounts giving few ratings.
Imagine we took all the users who only rated a single game, and pooled their ratings. What sort of ratings distribution would they have? How does it compare to BGG as a whole? How does it compare to users who rate exactly two games? Or three? It would be interesting to cook up a comparative chart showing the ratings distributions for single, twin, triple raters etc.
That would take a little while, so here's one I prepared earlier today:
Clearly the accounts rating just one game tend to be quite a bit more, ahem, "exuberant" than general BGG feeling. You can be pretty sure none of those actually get past the BGG ShillSeeker filter to be counted in the BGG game ratings. But curiously there are definite trends...
Percentages of 9s and 10s
Percentages of 1s and 2s
(NB: The 98th percentile for 2s fluctuates wildly around 10%; the 98th percentile for 1s fluctuates wildly around 5%.)
There's evidence of possible ratings b.s. in that spike in percentage of 1s for low-rating-count user accounts, where an extra 0.5% of the ratings by those accounts are visibly unexpected 1s. ie about 50 ratings in about 40 accounts out of the whole of BGG's million ratings on 25,000 accounts. SHAME SHAME SHAME!
It's also possible that the surge of 10s for accounts with few ratings includes more shenanigans. This is less convincing since the behaviour trend fits reasonably well all the way down to one-game-raters as-is, but possibly up to several percent of those 10s are in excess of expectation.
In fact, I found the relationship between the number of ratings users give and the average ratings behaviour of those users was so strongly patterned as to be amazing.
(Chart is animated.)
So now I'm more willing (than I used to be) to believe most of those 6,500 few-ratings accounts are in good faith. And compared to the backdrop of over a million ratings given by BGG users, even a few hundred "spurious" accounts would have pretty much zilch impact. (Especially now, since I doubt they get past the ShillSeeker filter.)
Out of curiosity, here's the same animated chart concept but based on games grouped by ratings-per-game.
(Chart is animated.)
This shows a fairly strong (if somewhat jittery) relationship where the more ratings a game has the higher the average rating. On average.
Nice to know we're not completely foolish.
BTW, on reflection it makes perfect sense that on average users who rate more games have lower rating averages. A player with only a few games under their belt is likely to have mostly been exposed to "quality" games, whereas to get to rate hundreds of games you have to reach a little deeper. I thought to show you a chart of this effect, but it turns out it's only small. Mostly, people who rate lots of games are just pessimistic. (And since they rate sooooo many games, they're bringing down the average.) It's easy to assess what I call a user's "optimism" by taking the average difference between their ratings and the BGG raw averages.
I wrote a geeklist showing some of the outcomes of adjusting for "optimism".
http://www.boardgamegeek.com/geeklist/20084
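If you want to play along at home, the "optimism" number is trivial to compute. A Python sketch (the data structures are just illustrative, not anything BGG actually stores this way):

```python
def user_optimism(user_ratings, game_raw_averages):
    """Average of (user's rating - sitewide raw average) over the games
    the user rated. Positive = optimist, negative = pessimist."""
    diffs = [rating - game_raw_averages[game]
             for game, rating in user_ratings.items()
             if game in game_raw_averages]
    return sum(diffs) / len(diffs) if diffs else 0.0

# A user 2.0 above average on one game and 0.5 below on another
# nets out at +0.75 "optimism".
opt = user_optimism({"Chess": 9, "Risk": 5},
                    {"Chess": 7.0, "Risk": 5.5})
```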
The Rating Game #2 - Overstating Sentiment
The rankings may also be adversely affected by people "gaming" the ratings system by eg giving a "1" when they actually feel "6" because they're "trying to balance" some other crap rating. (Bad! Go to your room! You're just messing it up worse.) Aldie's ShillSeeker filter probably already takes out the crap ratings you're "trying to balance out". If someone goes putting a spurious "1" or "2" because they don't think a game deserves its current "6.5" average, all they're gonna do is end up getting themselves filtered out and losing their say entirely. We'd rather know what each user really thinks... the averages will take care of themselves.
I've said before that the impact of this is much less than people are willing to believe. From the charts above you can see there are so few instances of people giving spurious "1" ratings that they disappear in the averages. Which means that only a tiny handful of individual games might show any measurable effect from this at all. Check the ratings distribution histogram of a game you worry about. Does it look unlikely? And if there seem to be a few extra 1s (or more likely a few extra 4s) calculate how much impact it has.
There have been a couple of geeklists in the past year which go through games with questionable extra ratings at a low or high rating value, and frankly the peaks generally make some degree of sense, being well known games with an enthusiastic fanbase but where most people have a fairly indifferent reaction.
So are there significant numbers of users piling on the 1s or 10s? If a user is giving an "excess" of 1s, or 2s, or 10s, how would we gauge that? Mostly, we'd instinctively judge it by looking at the excess of, say, 1s compared to the number of 2s and 3s they gave, and compared to the overall bulge of their ratings curve in the 4 to 8 range.
ie if the user has two peaks in their ratings curve, they have an "unusual" bias in their ratings. This might be intentional ratings manipulation, or might just be because they like one rating description better than another. Also, there are over a hundred users who habitually ignore some ratings levels, eg only using the even numbers or only using 1s, 5s, and 10s. (For this exercise I don't tally these patterned users... they come under the next section.)
Following is a chart of users with twin ratings peaks. The three primary characteristics are ...
1) How deep is the dip or trough between the two peaks?
2) Which peak might a casual observer wonder if it's "suspicious"?
3) How many "suspicious" or "excess" ratings are there in that peak?
To measure the "trough", I take the lowest rating count between the two peaks and divide by the height of the smaller peak. eg if a user gives 27x6, 33x7, 26x8, 20x9, and 25x10 then there are two peaks (7 and 10) and the ratio here is 20/25 = 80%
To choose which peak is "suspicious" I take the peak further from the user's average.
To calculate an "excess" rating count, I subtract the trough value from the peak value. In the example above this would be 25-20=5 "excess" 10s.
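Putting those three measures together in code looks something like this (a Python sketch; I'm treating "peaks" as local maxima of the rating-count curve, which is one reasonable reading):

```python
def twin_peak_stats(counts):
    """counts maps rating value (1..10) -> how many ratings the user gave
    at that value. Returns (trough_ratio, suspicious_peak, excess) per the
    three measures described above, or None if there aren't two peaks."""
    def get(v):
        return counts.get(v, 0)

    # Peaks = local maxima of the count curve.
    peaks = [v for v in range(1, 11)
             if get(v) > 0 and get(v) > get(v - 1) and get(v) >= get(v + 1)]
    if len(peaks) < 2:
        return None
    # Keep the two tallest peaks, in rating order.
    p1, p2 = sorted(sorted(peaks, key=get, reverse=True)[:2])
    trough = min(get(v) for v in range(p1 + 1, p2))
    trough_ratio = trough / min(get(p1), get(p2))
    avg = sum(v * c for v, c in counts.items()) / sum(counts.values())
    # The "suspicious" peak is the one further from the user's average.
    suspicious = p1 if abs(p1 - avg) > abs(p2 - avg) else p2
    excess = get(suspicious) - trough
    return trough_ratio, suspicious, excess

# The worked example: 27x6, 33x7, 26x8, 20x9, 25x10.
stats = twin_peak_stats({6: 27, 7: 33, 8: 26, 9: 20, 10: 25})
```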
(Chart is animated.)
Bearing in mind there are over a million ratings on BGG, the chart above shows a tiny impact, with a total of under 5,000 ratings possibly being "overstated".
It's interesting how many "excess" 4 ratings there are. Does this include people trying to downplay games but not get busted by the ShillSeeker filter?
Aldie is watching you.
Finally, here's the curve of ALL ratings on BGG:
That's so textbook it's beautiful. Honestly, if there were significant numbers of voters tossing extra 1s or 2s or 10s about, it'd show up here too. It's not that it never happens, but that the effect is trivial.
So I could strip out all the twin peak raters and recalculate the Bayesian ratings. I compared the effect of doing this to the effect of stripping a similar count of random users (with a similar number of ratings between them).
Percentile    Strip Random Users    Strip Twin Peaks
              (Rank Change)         (Rank Change)
Avg           4.5%                  4.2%
90th          10.1%                 9.7%
95th          13.0%                 12.5%
98th          17.7%                 16.5%
99th          22.4%                 18.9%
So the final lowdown is that the actual impact of the twin peaks raters collectively is not materially different from "other" raters.
The Rating Game #3 - Users With Artificial Ratings Curves
The rankings may be adversely affected by people forcing their ratings into a set pattern or preconceived distribution. I recently ran a pair of analyses which demonstrated this is counterproductive. Each gamer's quality-of-game distribution is different, especially given the difference of opinion on how many times you should play a game before you rate it. Some users will give almost no low ratings, because they won't play a bad game more than once and they won't rate a game from one play. Others are either more likely to give a bad game several goes, or else they're happy to rate from one bad play. Which means each user's ratings distribution should be a bit different. If you artificially force a precise rating curve on yourself, your ratings will not carry the same meaning as anyone else's. (Your ratings may even get completely filtered out by the ShillSeeker.)
Most users follow the guidelines, and as a result show a fairly similar pattern of ratings curve shape. The four big parameters of ratings curves are:
- the number of games rated
- the average rating given by the user
- the standard deviation, which measures how spread out the ratings are
- the skew, which measures how lopsided the curve is one way or the other.
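For the record, those four parameters are all bog-standard summary statistics. A Python sketch (I'm using the population standard deviation and the simple moment-based skew; other estimators would do just as well):

```python
import math

def curve_parameters(ratings):
    """(count, mean, standard deviation, skew) of one user's ratings."""
    n = len(ratings)
    mean = sum(ratings) / n
    var = sum((r - mean) ** 2 for r in ratings) / n
    std = math.sqrt(var)
    skew = (sum((r - mean) ** 3 for r in ratings) / (n * std ** 3)
            if std else 0.0)
    return n, mean, std, skew

# A symmetric rater has skew 0; one harsh outlier pulls skew negative.
symmetric = curve_parameters([5, 6, 7])
lopsided = curve_parameters([1, 9, 9, 9])
```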
We've already seen a little bit about averages vs number of ratings, but the following density charts tell more of the story.
(The "Number of Ratings" scales are all logarithmic. Charts 2 & 3 only show users with at least 9 ratings.)
These charts show how most people's ratings curves fit in fairly normal ranges, but a few odd bods are outside the norm. I mentioned that some are deliberately using a subset of the ratings values, but others just have very skewed curves, or very low or high averages.
Avg vs Number of Ratings: Tells more about how the optimism/pessimism kicks in. At any number of games rated some users have averages down to 5.5, but the maximum average slowly declines the more games are rated. Fair enough.
Standard Deviation vs Average: Shows that in a very general way the lower your average rating the more likely you are to have your ratings more spread out. You can also see there are some folks with averages of "10"! These users all give fewer than ten 10s, and I'm pretty sure they all get nuked by the ShillSeeker filter.
Avg vs Skew: My favourite. Interesting is the little medium-dense spatter sticking out on the left at average 5 to 5.5, skew roughly zero. These are largely the users forcing their ratings into a "balanced" distribution, ie giving roughly the same number of 1s as 10s, the same number of 2s as 9s and so on, often in a triangle. Alas, by doing this such people are making their ratings values mean something different from most people's, so the values getting merged into averages are now apples and oranges. Oops. Fortunately, this chart shows there aren't very many.
To give you an idea how the "corners" (circled in red) of that common rating space might actually look as a ratings curve, here's the averages at each corner:
(I also included the most common user style when considering all of Average, Standard deviation, and Skew.)
Only the "Avg 5.5 Skew 0" chart bothers me here, but the total ratings count is fairly small. Even smaller is the count of users who give roughly "uniform" ratings distribution (ie they rate roughly the same number of games at each of 1 to 10)... there's maybe only a dozen of these folks at all, which is just as well.
Since each user's quality-of-games actually rated will be different, we can only get quality rankings from averages if most of us rate each game independently... without considering any of our other ratings.
Methodology 101 - The Ratings Calculation May Miss The Point
Maybe an entirely different calculation is in order.
What are we trying to say by "the community likes game G 3rd best"?
- Do we mean that of all the people who played G they liked it 3rd best? (Which means if only ten people ever played it the game could still be number 3 overall.)
- Do we mean that this game is the third most played game? (But would that be by time or by session counts or by player counts? How about all those party games where one session might be fifteen people? Or all those family-accessible games that get more play time than the games we really like?)
- Do we mean a fuzzy "total gaming fun"? For example, just totalling up the values of (Rating - 5.5) for each game, or maybe counting the number of ratings over 6, rather than getting the game averages.
The current system does a pretty decent job of combining a rather nebulous "people who play this game like it best" with "lots of people are playing this game"... by virtue of the Bayesian system slightly favouring more widespread games.
I recently wrote a geeklist totalling "goodwill" for all games, ie for each game a total of (EachUserVoteForThisGame - EachUserOverallAverage)
http://www.boardgamegeek.com/geeklist/20420
It's surprisingly effective in achieving its objective.
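The goodwill total is a simple two-pass calculation. A Python sketch with illustrative data structures (nothing like BGG's actual storage):

```python
from collections import defaultdict

def goodwill_totals(ratings):
    """ratings: a list of (user, game, value) triples. For each game,
    totals (EachUserVoteForThisGame - EachUserOverallAverage)."""
    user_sum = defaultdict(float)
    user_n = defaultdict(int)
    for user, _, value in ratings:
        user_sum[user] += value
        user_n[user] += 1
    user_avg = {u: user_sum[u] / user_n[u] for u in user_sum}
    goodwill = defaultdict(float)
    for user, game, value in ratings:
        goodwill[game] += value - user_avg[user]
    return dict(goodwill)

# Toy data: user "a" averages 6, user "b" averages 7.
totals = goodwill_totals([("a", "G", 8), ("a", "H", 4),
                          ("b", "G", 9), ("b", "H", 7), ("b", "K", 5)])
```

Note how subtracting each user's own average automatically cancels out the optimism/pessimism effect discussed earlier.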
There's a range of possible other choices for how to construct rankings.
Methodology 101.1 - Make The Rankings Match The Most Assertions
There's a little over a million ratings in the BGG database saying "how good are games A,B,C...". 20% of users are one-rating wonders, while some rate hundreds or (even thousands!) of games. If someone rates four games (A,B,C,D) they're making fuzzy individual statements about each game, but very firm comparative statements about whether A is better than C or D is better than A. A four-game rater is making only four fuzzy individual game statements, but they're making six very firm comparative statements... AvB, AvC, AvD, BvC, BvD, CvD.
There's over sixty million (!) of these "A is better than B" statements captured in the BGG data. The average user's ranking assertions agree about 70% with BGG rankings. Which is to say, pick any two games you rate differently and there's a 70% chance the BGG rankings agree with your statement of which game is the better of the two. (This is, of course, assuming we have all rated honestly.)
One possible ranking technique is to make as many of these assertions true as possible.
Out of the 62 million assertions of "game A is better than B", of course we don't all agree with each other. In fact 15.2 million assertions (A > B) are in direct opposition to 15.2 million others (B > A). No matter what rankings BGG sets, at least 24.6% of our assertions would agree and at least 24.6% would disagree. The ranking mechanism only has an opportunity to optimise the other 49.8% of our assertions.
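Counting how many of a user's assertions a given ranking satisfies is straightforward. A Python sketch (ties are skipped, as in the tallies above; the rating and ranking dicts are illustrative):

```python
from itertools import combinations

def assertion_agreement(user_ratings, ranking):
    """Count how many of one user's pairwise "A is better than B"
    assertions a ranking agrees with. ranking[g] is g's rank (1 = best)."""
    agree = total = 0
    for a, b in combinations(user_ratings, 2):
        if user_ratings[a] == user_ratings[b]:
            continue  # an "A=B" assertion - not counted here
        total += 1
        user_prefers_a = user_ratings[a] > user_ratings[b]
        rank_prefers_a = ranking[a] < ranking[b]
        if user_prefers_a == rank_prefers_a:
            agree += 1
    return agree, total

# A four-game rater makes six comparative statements; one pair here is
# a tie, so five assertions get checked, and the ranking matches four.
result = assertion_agreement({"A": 9, "B": 7, "C": 7, "D": 5},
                             {"A": 2, "B": 1, "C": 3, "D": 4})
```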
Benefits: At first blush, you can't get much better than saying that the maximum possible number of user preference assertions are made true by the rankings. Also, there's much less scope or impact from trying to game the ratings, since eg giving Caylus a 1 isn't more significant than just rating it a little lower than its competitors.
Problems: It's not all clear cut. If there's any debatable point about a rankings method then, well, it can be debated. In this case especially there are a few things:
1) What to do with all the assertions of "A=B"?
As noted there are about 62 million assertions "A>B". But there's also 13 million "A=B". The rankings don't admit that A can equal B... and also "A=B" is a heck of a lot more precise than "A>B". For the calculation above I simply ignored all the "A=B" but I'd rather not.
2) Giving more ratings is exponentially more influential.
(Well, actually it's an n-squared relationship: a user rating n games makes n(n-1)/2 pairwise assertions.)
By the time you get to those handful of users who give 1000+ game ratings, they're making 300,000 "A>B" assertions each. I'm not convinced that's an appropriate ratio of influence... 1000 games rated being 1000 times as influential as 33 games rated, or 100 games rated being 100 times as influential as 10 games rated.
It works that way from the games' perspective too. The ten most rated games all have over 4,000 ratings. Assertions involving those ten games (comparing to all games) make up 7.1% of all the comparative assertions, ie about 4.5 million. Should 0.3% of the games have 7% of the influence on all rankings?
There are ways to adjust this, such as instead of each assertion being worth 1 in optimising make each assertion worth (1 / CountOfThatUsersRatings). But any adjustment like this feels like a kludge to me.
3) Additionally... it's obtuse.
It's hard to visualise how it all works out, so users may have a feeling of the ranking system being a statistical black box.
4) It's computationally intense.
As the database grows the amount of computing power required to calculate this grows exponentially. It's so painful to calculate out fully I can't even give you what the top ten would look like.
So how well do the current BGG rankings match our individual assertions?
In total, the BGG rankings agree with about 44.8 million assertions we make.
But as noted, 15.2 million are assured to agree while 15.2 million are assured to disagree.
Taking out those 30.4 million from the total set of assertions leaves 31.5 million up for grabs (ie about 49.8% of all our assertions)
Of those, 29.5 million agree with BGG, or about 93.7%
That's a pretty good matching rate for such a simple rating strategy!
Some of that 6% gap is "deliberately" introduced by the Bayesian adjustment. Carcassonne has a wide audience so it's ranked higher than A Victory Lost: Crisis in Ukraine 1942-1943 even though over 80% of the people rating both games reckon Carc is not as good. If we removed that Bayesian adjustment and merely took all the raw averages, the "available user assertions" match rate would improve by 0.7 million to 95.8%, but at a "cost" of putting niche games higher in the rankings than is deemed desirable by those who set the calculations.
I haven't yet done the analysis to try to optimise this last few % (ie 1.3 million assertions). That'll probably be a geeklist for another time. (FOLLOWUP: As expected, the assertions are most matched with no Bayesian adjustment applied at all. The adjustment that has the least impact is to add ratings of 6.5, which as expected is somewhere between the average game average of 6.2 and the average user rating given of 6.8. Interestingly, the impact of the Bayesian adjustment is not much on how many assertions get matched... it costs us 1.2% of total matches.)
Not all of the assertions gap may be resolvable at all. Indeed, almost certainly it isn't. Suppose there are three game raters:
User (1) says A > B and doesn't mention C
User (2) says B > C and doesn't mention A
User (3) says C > A and doesn't mention B
These three assertions don't directly oppose each other, but at most two of them can be true. As such I'm surprised we can get as high a match rate as we do.
Methodology 101.2 - "Condorcet" or "Pair-Wise" Ranking
Technique 101.1 is close to the Condorcet technique. In Condorcet, you take every pair of games A and B, check all the people who rate both, and count how many prefer A and how many prefer B to determine a single assertion "A>B" or "B>A". Then you try to optimise the rankings so this shorter list of assertions is true. If you keep the 30-ratings cutoff, then with 3500 games to rank there would only be a maximum of about six million assertions to optimise. In reality not every pair of games has any user who ranks them both, so the number of game vs game aggregate assertions to optimise is currently much lower, at about 4.6 million.
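The pairwise aggregation step might look like this in Python (a sketch with toy data; the real thing obviously runs over the whole ratings database):

```python
from itertools import combinations
from collections import Counter

def condorcet_assertions(user_ratings):
    """user_ratings: {user: {game: rating}}. For every pair of games,
    count the users preferring each side, and reduce the pair to one
    aggregate assertion: the winning game, or None for a dead heat."""
    wins = Counter()
    games = sorted({g for r in user_ratings.values() for g in r})
    for ratings in user_ratings.values():
        for a, b in combinations(games, 2):
            if a in ratings and b in ratings:
                if ratings[a] > ratings[b]:
                    wins[(a, b)] += 1
                elif ratings[b] > ratings[a]:
                    wins[(b, a)] += 1
    result = {}
    for a, b in combinations(games, 2):
        if wins[(a, b)] > wins[(b, a)]:
            result[(a, b)] = a
        elif wins[(b, a)] > wins[(a, b)]:
            result[(a, b)] = b
        elif wins[(a, b)]:
            result[(a, b)] = None  # head-to-head tie
    return result

# Three toy users; the aggregate says A beats B, A beats C, C beats B.
prefs = condorcet_assertions({"u1": {"A": 8, "B": 6, "C": 7},
                              "u2": {"A": 7, "B": 8},
                              "u3": {"A": 9, "B": 5, "C": 9}})
```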
Benefits: Another reasonably intuitive idea for those of you who have read this far. If BGG users collectively say "A>B" and "B>C", the rankings do their best to show "A>B>C". And this is another technique where there's little room to mess with the ratings.
Note: As before, not all assertions are resolvable since there will be loops where "A>B", "B>C", "C>A".
Problems: This technique can allow obscure games to fill the top ten. For example, only 30 people may ever play game W and because it's a niche game they all think it's the bee's knees. They collectively put this at the top of their ratings. So "W>EverythingElse" will optimise the rankings. But, honestly, nobody else wants to know about W. Or the other niche games that might float into the top ten list.
So how well do the current BGG rankings match our headtohead collective assertions?
Out of the 4,550,866 game vs game assertions for Condorcet, 74% match the current BGG game rankings. There's also 8% which assert "A=B", which as before is both a much more precise statement than "A>B" and also isn't represented in a sequential ranking system.
Methodology 101.3 - Standardise
Another technique might be to standardise everyone's ratings before taking averages, eg make ratings uniform or give everyone the same average and standard deviation. The problem here as I mentioned before is that each user may genuinely rate different quality of games, either because our exposure is different or because some of us won't rate a bad game after one play (and won't play a bad game twice).
For a case in point as to how standardisation can go wrong, here are two actual users. Each rates five games:
A) Chutes and Ladders (1), Monopoly (2), Risk (6), Carcassonne (7), Chess (9)
B) The Princes of Florence (7), Ticket to Ride (7.5), Ra (8), Puerto Rico (9), Tigris & Euphrates (10)
These two users each rate fairly much in line with BGG expectations. It's technically possible that A actually has the same total enjoyment from his five games as B has from hers, but it seems unlikely. Yet any standardisation of these two users is going to state that "Chutes and Ladders = The Princes of Florence", "Monopoly = Ticket to Ride" etc., which to my mind just feels plain wrong... not because I personally may disagree with that assertion of the relative quality of these ten games, but because I'm pretty sure these two users would disagree with it. So to draw that conclusion from their ratings would be woeful.
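To see the problem concretely, here's a quick Python sketch that z-score standardises both users' ratings. Both users' lowest-rated games land on nearly the same standardised value, which is exactly the equivalence I'm objecting to:

```python
import math

def standardise(user_ratings):
    """Z-score one user's ratings: subtract their mean, divide by
    their (population) standard deviation."""
    vals = list(user_ratings.values())
    mean = sum(vals) / len(vals)
    std = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
    return {g: (v - mean) / std for g, v in user_ratings.items()}

a = standardise({"Chutes and Ladders": 1, "Monopoly": 2, "Risk": 6,
                 "Carcassonne": 7, "Chess": 9})
b = standardise({"The Princes of Florence": 7, "Ticket to Ride": 7.5,
                 "Ra": 8, "Puerto Rico": 9, "Tigris & Euphrates": 10})
# a["Chutes and Ladders"] and b["The Princes of Florence"] both come out
# around -1.2 to -1.3, as if the two games were peers.
```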
I recently did a couple of geeklists which, to my mind, show this isn't a terribly good idea:
Uniform: http://www.boardgamegeek.com/geeklist/19960
Normal: http://www.boardgamegeek.com/geeklist/20115
Methodology 102 - The Ratings Calculation May Be Off
EDIT: Oops! In an opus this size for pure curiosity, in spite of all my cross checking there was bound to be a mistake somewhere. All the sensitivity numbers were actually about four times what was shown in the original charts. I've uploaded a corrected chart.
Even if the current ratings formulation method is the right approach, the specific calculation to get an "average" rating result may be inappropriate. Maybe the ShillSeeker filter is biased. Maybe the Bayesian parameters are not the best. So how sensitive are the rankings to the parameters chosen?
Currently there's somewhere around 100 phantom votes of 5.5 added to each game before calculating its average rating in order to get its ranking. (We can't tell exactly because an arbitrary set of ratings are filtered out by the ShillSeeker.) The charts below show how sensitive the rankings are to the number of phantom votes and the rating value of those votes: the average rank change for different Bayesian parameter values, relative to the current 100x5.5.
The current average of all rating values on BGG is about 6.8. Above you can see that the choice between 5.5 (the middle of the rating range) and 6.8 (the average rating actually given by users) makes quite a big difference to the ranking results.
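For reference, the phantom-votes calculation itself is tiny. A Python sketch (100 and 5.5 are the approximate values discussed above; BGG's exact parameters aren't published):

```python
def bayesian_average(ratings, phantom_count=100, phantom_value=5.5):
    """Pool phantom votes with a game's real ratings before averaging."""
    total = sum(ratings) + phantom_count * phantom_value
    return total / (len(ratings) + phantom_count)

# A niche darling with 30 ratings of 9.0 gets pulled hard toward 5.5,
# while a widely-rated 7.5 game barely moves.
niche = bayesian_average([9.0] * 30)      # about 6.31
popular = bayesian_average([7.5] * 2000)  # about 7.40
```

You can see immediately why the choice of phantom value matters: every game with fewer ratings is dragged toward it.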
Epilogue
Well I had a fair bit of brain burning this past few weeks putting this all together. To those who read this far, thank you for your attention and I'm stunned and amazed you made it!
I can't say much has come of all this, interesting though it's been... although the Condorcet style ideas appeal to me personally, it seems the current system already gets close to the same conclusions. And for a lot less computing effort or user confusion.
So to answer my opening question: "not materially". It'd be nice if users stopped worrying about the result and just voted their gut feelings. My second biggest concern is that the system is fairly sensitive to the choice of Bayesian parameters, which highlights there's an element of arbitrariness. But the current system is close to agreeing with the maximum possible user assertions of "game A is better than game B" so there's not much room for complaint.
I hope you found something interesting in my ramblings. Now if you'll excuse me I need to go and rediscover sleep and sunshine.
Cheers
Joe

 Last edited Wed Aug 15, 2007 11:42 pm (Total Number of Edits: 10)
 Posted Thu Mar 22, 2007 4:44 pm


I'm a fellow stats geek (I worked as a statistician in Citibank R&D for a year), so I can appreciate this. The problem, as you address, is the "garbage in, garbage out" issue; no matter how you crunch the data, if the ratings aren't truly honest and reflective, the aggregates will be problematic.

jgrundy wrote: It'd be nice if users stopped worrying about the result and just voted their gut feelings.

This is something I've been fighting with myself for the last few weeks. I started a thread over at
http://boardgamegeek.com/article/1386022#1386022
that really helped me crystallize my issues and get some advice from the community. Based upon that thread, I'm redoing all of my ratings (I'm down to the E's), and the numbers are certainly downward trending. A few designers have contacted me, wondering why my rating dropped for their game.
I'd like to second Joe's request for people to rethink their ratings. Don't rate for anyone else but yourself. While I struggle with the BGG definitions because they define both quality and replayability, I find using them as subjective guidelines (focusing much more on my replay interest) to be a great help. I'm certainly giving many more 4s and 3s using that scale.
Geoffrey Engelstein (Bridgewater, New Jersey, United States) - Ludology Host and Dice Tower Contributor
Very interesting - and I did make it all the way to the end.
You need to get sick more often.

Wow, this looks interesting! I've just had a quick read through, and I'm going to have another read again later.
Nice work!

Just amazing!

That is very impressive! I'm definitely bookmarking this thread for future reading.

That was a very interesting read. Thanks for writing that up, Joe.
There are some search algorithms which should be able to find a reasonable ranking of games, optimizing for pairwise comparisons, within a short period of time. Simulated annealing and genetic algorithms are the ones which I understand best, but it would probably be possible to apply other techniques like branch and bound or BEAM, if you can come up with an appropriate way to look at the possible solutions. SA and GA in particular are excellent for highly dimensional problems like this one.
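Something like this, for instance (a rough Python sketch of simulated annealing over adjacent swaps; all the parameter choices here are arbitrary):

```python
import math
import random

def anneal_ranking(games, assertions, steps=20000, seed=1):
    """Order `games` to satisfy as many (better, worse) pairs in
    `assertions` as possible. Rescoring from scratch each step is
    O(len(assertions)) - fine for a toy, hopeless for 62 million."""
    rng = random.Random(seed)
    order = list(games)
    pos = {g: i for i, g in enumerate(order)}

    def score():
        return sum(1 for a, b in assertions if pos[a] < pos[b])

    current = score()
    for step in range(steps):
        temp = max(0.01, 1.0 - step / steps)  # simple linear cooling
        i = rng.randrange(len(order) - 1)
        # Propose swapping two adjacent games.
        order[i], order[i + 1] = order[i + 1], order[i]
        pos[order[i]], pos[order[i + 1]] = i, i + 1
        proposed = score()
        delta = proposed - current
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current = proposed  # accept the move
        else:
            # Reject: undo the swap.
            order[i], order[i + 1] = order[i + 1], order[i]
            pos[order[i]], pos[order[i + 1]] = i, i + 1
    return order, current
```

With a consistent set of assertions this settles on the fully sorted order; with the cyclic assertions Joe describes, it settles on one of the best compromises instead.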

Great work!
My big thing is that people get all huffy about what a "9" or a "3" really means. As it turns out, and I think the data backs it up, it doesn't matter. As long as everyone rates games in a way that's consistent with themselves (using whatever metric they use, it need not be the sample one that is printed on the game page), in aggregate the effect washes out and produces nice textbook results like the one you point to.
Again, stellar performance. Bravo!

snicholson wrote: I'm redoing all of my ratings (I'm down to the E's), and the numbers are certainly downward trending. A few designers have contacted me, wondering why my rating dropped for their game.

That ... kinda creeps me out. "Is there a problem here? Cuz, you know, I don't like problems. And it looks to me like we got a problem."
While I don't mind my ratings being publiclyviewable for the purpose of endorsement/warning other gamers based on my gaming experience... I'd feel a little uneasy if a change in personal opinion would trigger a direct interaction/consultation/confrontation/inquisition from a game designer or publisher. While some users would probably take it all in stride and, for some, probably be flattered that they've caught the eye of the designer community  some users may not be so keen on the thought of their opinions being under surveillance. The Widderich Affair wasn't very long ago and I'm sure that left a lot of users a little uncomfortable about designers 'taking personal opinions personally'.

My head's gonna explode. What a plethora of information. I definitely need some time to read and digest this further.
Great thread.

That's a lot of info. I'll need some time to digest it all, but in the meantime, here are a few questions and remarks:
How did you compute the confidence intervals in the very first graphs? That requires an assumption about the underlying distribution, which is decidedly not Gaussian. The general rating distribution looks like a reversed Poisson, but there are more of these skewed curves around (for those who are interested, http://www.itl.nist.gov/div898/handbook/eda/section3/eda366.... has a collection of about 20). Did you create your own distribution, or did you adapt an existing one?
I also had Matlab compute my average and skewness: with values of 5.15 and 0.18 I just fall into the group of 'bothersome' balancers. I would like to have it on public record that I do not balance my ratings; they just turned out that way. If I did the microbadge thing, I certainly would qualify for a 'brutal reviewer' specimen, though.
Speaking of brutal reviewers, I'll bet that the dot at (~4.8, ~0.5) is Fawkes. It's cool to be able to spot people on that density graph.
Now all we need to do is combine whatshisname's work (Matthew Gray?) on the game classifications with this, and then we don't need to do any statistics anymore. But your efforts are much appreciated.

How exactly does this help me choose between Lost Cities and Twilight Imperium: III?
Just kidding.
Awesome job! Very interesting read. You should get a golden stats-geek microbadge for your efforts!

Well, you certainly live up to:

Great work! I would love to see a geeklist with the rankings as they would look if the average were used as the Bayesian baseline (which seems like it would make more sense).

Wow. So, uh... what numbers are you planning to crunch next?
So far, the bulk of the analysis has been of game ratings on a per-player basis, which is an obvious first place to start. Personally, however, I'm most interested in ownership figures, game weights, per-publisher analysis, per-designer analysis, and other comparisons that perhaps lie a little further off the beaten path. Are any of you hardcore number-crunchers interested in tackling that?
Excellent work, by the way.
Mendon Dornbrook (Phoenix, AZ, United States)
"The ticket to the future is always open."
This is absolutely the most important post about the ratings system on BGG. As I was reading it, I was tempted to create several other accounts so that I could give you more thumbs for this. It throws down the gauntlet to those who complain about the system as is, or who purposefully rate games based on their own contrived system.
Thanks so much for this fabulous post.

cymric wrote: "How did you compute the confidence intervals in the very first graphs? That involves a statement of the underlying distribution which is decidedly not Gaussian."

Maarten, it turns out that no matter what the underlying ratings distribution is, the average of a random sample will vary with a Normal distribution about the "true" average of the entire population. (Specifically, it approaches a Normal distribution as the sample size gets bigger.) Conveniently, my stats bible suggests that "usually a sample size greater than 30 will ensure that the distribution of mean estimator can be approximated by a normal distribution." So using the sampled average to estimate the "true" average allows the analysis to make use of Normal assumptions.
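A quick way to see this in action is a toy simulation: draw a deliberately skewed, made-up "population" of ratings (not real BGG data), then watch the means of samples of size 40 cluster tightly around the true mean despite the population's shape.

```python
import random
import statistics

random.seed(42)

# A deliberately skewed, non-Normal "population" of ratings
# (a hypothetical stand-in for one game's true rating distribution).
population = [random.triangular(1, 10, 9) for _ in range(100_000)]
true_mean = statistics.fmean(population)

# Means of samples larger than ~30 cluster Normally around the true
# mean, regardless of the population's own shape.
sample_means = [statistics.fmean(random.sample(population, 40))
                for _ in range(2000)]

# The sample-mean estimator is centred on the true mean,
# with far smaller spread than the raw ratings.
print(abs(statistics.fmean(sample_means) - true_mean) < 0.05)   # True
print(statistics.stdev(sample_means) < statistics.stdev(population))
```

The spread of those sample means (roughly the population standard deviation divided by the square root of the sample size) is what the Normal-based confidence intervals are built from.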

Any assumption that correlates this population to 'normal' has to be wrong.

btw I just want to thank everyone for the huge positive response... comments, thumbs, gg, and personal messages. Wow!
I admit I was half expecting to get caned for being so over-the-top.
Thanks everyone!

Quote: "If someone goes putting a spurious "1" or "2" because they don't think a game deserves its current "6.5" average, all they're gonna do is end up getting themselves filtered out and losing their say entirely. We'd rather know what each user really thinks... the averages will take care of themselves."

These people would probably be better off simply writing a review and some session reports, and they can ask players to give the game another play and rethink their ratings.

jgrundy wrote: "Maarten, it turns out that no matter what the underlying ratings distribution is, the average of a random sample will vary with Normal distribution about the "true" average of the entire population. (Or specifically, it approaches Normal distribution as the sample size gets bigger.) Conveniently, my stats bible suggests "usually a sample size greater than 30 will ensure that the distribution of mean estimator can be approximated by a normal distribution." So using the sampled average to estimate the "true" average allows the analysis to make use of Normal assumptions."

Ack, it's the Central Limit Theorem (http://mathworld.wolfram.com/CentralLimitTheorem.html) again. That is such a counterintuitive piece of mathematics that I always forget its existence. It's too much like having your cake and eating it too, I guess. On the other hand, while looking up more info on the CLT I did learn that there are pathological cases where the theorem doesn't hold, but I think we can safely assume that those conditions do not apply here. For those who are interested, see here: http://www.statisticalengineering.com/central_limit_theorem_....
Randy Cox (Clemson, South Carolina, United States)
"Missing old BGG"
Very interesting. You even made me turn on animation for the first time in a looooong time (I'll be turning it off again in a minute).
Thanks.