Steve Haas
United States
Mountain View
CA
Original post copied from Season 20 thread.

----

TLDR: I did some looking at ratings. Using only recent games for the later iterations doesn't help, but I found something that does appear to. It also makes ratings look quite a bit different at the top (and bottom) of the board.

So, pursuant to our discussion from a couple of pages back about the merits of the rating system, I decided to do some experimentation. I took Juho's rating evaluation starter kit, fetched all the game summaries from the site, and wrote an evaluation framework that, for each month since the site began, uses all data that existed at the start of the month to project outcomes for that month's games, then sums the scores across months. This gives us a nice long window of projections to look at, and balances how well the system settles in on the correct rating against how long it takes to get there.
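In rough Python terms, the evaluation loop looks like this (just a sketch - train_ratings and score_month are stand-ins for the actual framework code, not real function names from it):

def evaluate(games_by_month, train_ratings, score_month):
    # games_by_month: dict mapping month -> list of game results.
    # train_ratings(history) fits the rating model on all prior games;
    # score_month(ratings, games) sums the log-probabilities assigned
    # to the outcomes that actually occurred that month.
    total_score = 0.0
    history = []
    for month in sorted(games_by_month):
        if history:
            ratings = train_ratings(history)
            total_score += score_month(ratings, games_by_month[month])
        history.extend(games_by_month[month])  # month's games join the training set
    return total_score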

To score the predictions, I summed the log of the probabilities assigned to the outcomes that actually occurred - this is thus (the log of) the probability the model assigned to all the games coming out the way they did. Clearly, the higher this number is, the better.

That said, because we are projecting nearly 400k different pairwise matchups, the probability that we get them all right is very low. Hence, all the scores we generate will be large negative numbers. For instance, the score for "no system at all" - that is, giving every matchup a 50-50 chance of ending either way - is -238923. Hopefully, by applying Elo, we can improve on this number (i.e., make it less negative).
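Concretely, the scoring is nothing more than this (sketch; predictions holds the probability the model assigned to the outcome that actually occurred, one entry per pairwise matchup):

import math

def score(predictions):
    # Sum of log-probabilities = log of the joint probability assigned
    # to all observed outcomes. Higher (less negative) is better.
    return sum(math.log(p) for p in predictions)

# "No system at all" assigns 0.5 to every matchup, so its score is just
# N * ln(0.5) for the N matchups evaluated - hence large negative
# numbers like the -238923 above.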

I began by assessing the iterated Elo system the site actually uses, plus a couple of other variants considered in the rating-evaluation GitHub repo. For example, we can look at the effect of multiple iterations through the data:

Score Parameters
-217524 iters = 1
-211670 iters = 2
-210873 iters = 3
-210539 iters = 5


From this, it looks like the current system is a pretty solid option - you can do a little better by performing more iterations to further smooth the data, but the gains are pretty small and the extra computation nontrivial, so we seem to be at a good balance point.

However, while working on this, the notion of pot size came to my attention. Pot size is the number of points by which the players' ratings can change in one pass through the system. Currently, the pot size used is 16 for the first pass, and 1/n^2 times that for the nth iteration. Both Juho's analysis and my own confirm that this works better than a larger or smaller pot.
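In code terms, the iteration schedule amounts to something like this (sketch; elo_pass stands in for one full Elo pass over the game history):

def iterated_elo(games, elo_pass, iters=3, base_pot=16):
    # elo_pass(ratings, games, pot) runs one Elo pass in place.
    ratings = {}
    for n in range(1, iters + 1):
        pot = base_pot / (n * n)  # 16, then 4, then 16/9, ...
        elo_pass(ratings, games, pot)
    return ratings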

However, for purposes of that analysis, there was an implicit assumption that the player ratings and the faction ratings used pots of the same size. So while, after a while, finishing first in a game can "only" boost your rating by maybe 6 or 8 points... the rating of any faction can change by that much based on the outcome of every game. And if you look at the faction ratings over time, they do seem to bounce around quite a bit. In just the past 24 hours, Darklings have dropped from 1096 to 1088 and Witches from 1044 to 1034. Now, those aren't huge changes... but for 1 day of data vs 4 years of prior results, they seem far more likely to be the result of random variation in player success than an actual meaningful adjustment to the strength of factions. And that noise is going to reduce the accuracy of projected outcomes. So it seemed to me that it might make sense to use smaller faction pots than player pots, to account for the fact that players change in skill more rapidly than factions change in strength.

To test this, I began by tweaking parameters (also provided by Juho) that allow one to turn off either the player or the faction component and evaluate the ability to train the other alone, and then experimented with different pot sizes. If we train faction ratings alone, without any adjustment for player skill, all with 3 iterations:

Score Parameters
-235973 pot_size=16
-235680 pot_size=8
-235536 pot_size=4
-235483 pot_size=2
-235496 pot_size=1
-235529 pot_size=.5


Obviously this is a weak predictor - the predictions aren't that much better than random (and, in fact, with fewer than 3 iterations, some of them are *worse* than random). But nevertheless, the optimal pot size (at least for 3 iterations) is down around 2.

And if we train player ratings alone without any adjustment for faction strength with 3 iterations:

Score Parameters
-217120 pot_size=8
-214107 pot_size=16
-213068 pot_size=24
-212688 pot_size=32
-212613 pot_size=40


We find that pot sizes significantly larger than those currently in use work better - they adjust to changes in skill (and/or new players much stronger or weaker than 1000 rating) much faster, with the later iterations still damping out the variation enough to keep the noise down.

Hence: what we find is that players actually would prefer a bigger pot than 16, and factions a smaller pot than 16 - the reason 16 works reasonably well is that it's a healthy balance between the two. But there's no particular reason we can't let them play with different pot sizes - so I implemented exactly this, and then tried a bunch of different combinations. Ultimately the best one was the one implied by the numbers above - faction_pot_size=2, pot_size=40 - which gave a score of -208754, about 2000 points better than the current rating system.
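Mechanically, the change is tiny - the player and faction updates just draw from different pots. Here's a sketch of a single pairwise update, under the simplifying assumption that a side's strength is its player rating plus its faction rating:

def expected(r_a, r_b):
    # Standard Elo win probability for side A over side B.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update_pair(players, factions, pa, fa, pb, fb, result,
                pot_size=40, faction_pot_size=2):
    # result: 1 if player pa beat pb, 0 if the reverse, 0.5 for a tie.
    e = expected(players[pa] + factions[fa], players[pb] + factions[fb])
    delta = result - e
    players[pa] += pot_size * delta           # players: big pot, fast tracking
    players[pb] -= pot_size * delta
    factions[fa] += faction_pot_size * delta  # factions: small pot, slow drift
    factions[fb] -= faction_pot_size * delta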

Over the following few days I tried a *bunch* of different variants - changing the sizes of the later iterations relative to the original, trying different numbers of iterations for the factions and the players, and so on and so forth... but ultimately, none of them was significantly better.

And at some point, yes, I tried dropping older data from the later iterations. And the conclusion is... it doesn't help. At all. Every variation of it I tried was worse than just running everything on all the data, and the more data you omit from subsequent iterations, the worse it does. So while it was an interesting idea that was worth considering, it doesn't seem to measure up.

So is this a proposal for a new rating system based on pot_size 40 and faction_pot_size 2? I don't know. On the one hand, a 2000-point increase in the fit statistic isn't nothing - it's larger than the gain from adding a 3rd iteration to the current system - but in the grand scheme of things, it makes less than half a percent difference in the average projected win probability for the winners of games. And it does generate, um, significantly different ratings, as evidenced by the new top 10 of the ratings board:

Player Rating
ttchong 1703
Xevoc 1691
KorKronus 1678
enkidu 1674
Alex 1640
resnick 1621
mikaeljt 1605
Mihas 1589
Fujiwara 1582
Greenraingw 1578


Ratings at the top end are about 100 points higher across the board - it thinks the gap between starting players and the best players (and from starting players to the worst players) is larger than the current rating system projects. It's also much higher on players who haven't played many games but win a lot; for instance, FakirsOnly - currently rated 1318, 136th on the board - moves up to 32nd place with a rating of 1504. That's probably not a bad thing - FakirsOnly quite possibly is a top-50 player in actuality - but it does definitely give a different look to the ratings, and I'm not sure how people feel about making such a big across-the-board change.
 
Steve Haas
United States
Mountain View
CA
Re: Rating System Discussion (Warning: Contains multiple-thousand-line posts)
Following is the top 500 under the proposed 40/2 rating system. Numbers for the top 10 are slightly different from before because I corrected a minor bug. I've also included the current rating and current rank from the existing system - these numbers do not 100% agree with what's posted because a) they're a snapshot from about a day prior to this posting and b) some games are processed in a slightly different order. That said, they should be accurate to within a point or two of what the ratings were at the time I grabbed the data, and they use the same evaluation logic - save pot sizes - as the 40/2 run.

Also included: the rating for each faction under each system.

Faction New Old
darklings 1102 1088
cultists 1059 1068
mermaids 1056 1057
engineers 1048 1051
chaosmagicians 1037 1033
witches 1033 1034
nomads 1032 1045
riverwalkers 1022 1017
halflings 1020 1024
swarmlings 1007 1015
dragonlords 1000 998
icemaidens 989 983
dwarves 983 994
auren 966 949
giants 964 954
acolytes 948 940
yetis 946 949
alchemists 933 953
fakirs 926 913
shapeshifters 921 926


Rank Player Rate CurRate CurRank
1 ttchong 1700 1593 2
2 Xevoc 1689 1611 1
3 KorKronus 1676 1442 17
4 enkidu 1672 1504 6
5 Alex 1639 1506 5
6 resnick 1619 1511 4
7 mikaeljt 1604 1515 3
8 Mihas 1588 1441 18
9 Fujiwara 1582 1494 8
10 Greenraingw 1578 1496 7
11 shanarkoh 1573 1474 9
12 Pattern 1561 1426 23
13 ashallue 1554 1305 158
14 retardedonkey 1551 1464 11
15 Maccit 1547 1429 22
16 CTKShadow 1546 1462 12
17 pyhuang 1546 1468 10
18 SpaceTrucker 1545 1455 14
19 theDarklingKnight 1545 1421 24
20 bjolletz 1540 1456 13
21 Toshimoko 1537 1444 15
22 lrd 1531 1375 59
23 Kelume 1530 1368 69
24 eunck 1525 1415 27
25 sprockitz 1520 1437 19
26 toine 1518 1443 16
27 LesserFactions 1514 1283 207
28 Elsuprimo 1513 1370 66
29 demiurgsage 1512 1431 21
30 khoj 1509 1432 20
31 zander 1504 1417 26
32 FakirsOnly 1504 1318 136
33 ScottyMcSock 1502 1329 116
34 koleman2 1497 1409 31
35 Dhrun 1496 1413 30
36 sam20011336 1493 1403 36
37 iainuki 1493 1374 61
38 Halskov 1491 1413 29
39 6tevidim 1490 1407 33
40 le_asmo 1490 1400 40
41 steve496 1486 1393 44
42 rainbowjoe 1486 1418 25
43 adamyedidia 1485 1358 82
44 Burning 1483 1321 129
45 MattTheLesser 1482 1414 28
46 Rungan 1479 1406 34
47 bellosaurus 1478 1407 32
48 c.love 1478 1392 45
49 benoni 1478 1401 38
50 csander1 1477 1383 51
51 Thrar 1476 1391 46
52 ansonkit 1476 1405 35
53 dgrazian 1474 1398 42
54 CaliforniaLuke 1472 1400 39
55 Romma 1472 1399 41
56 Yvon 1471 1381 54
57 yenn 1469 1394 43
58 dimos 1468 1402 37
59 ktk 1468 1390 47
60 Pelly 1466 1357 83
61 Thargor 1465 1369 67
62 lscrock 1465 1388 48
63 lmk12 1464 1280 217
64 dingo 1464 1316 139
65 Pesto 1462 1362 75
66 marksickau 1459 1388 49
67 Mb 1459 1384 50
68 allstar64 1458 1333 107
69 Riku 1458 1301 166
70 Pokermarc89 1454 1382 52
71 Mito 1453 1368 68
72 Skedge 1451 1312 146
73 Stones 1450 1381 53
74 jeyallen 1449 1352 85
75 Cornholio 1449 1266 265
76 BlueSteel 1448 1377 56
77 Zorastre 1448 1332 109
78 -sjk- 1448 1375 57
79 mattpdecker 1447 1377 55
80 shaka 1447 1293 181
81 jsnell 1447 1361 77
82 fiziqs 1444 1374 62
83 trainsat 1443 1356 84
84 TerraTurtle 1443 1370 65
85 crimsonFox 1442 1367 70
86 Glenw 1442 1375 58
87 Machiavelli 1441 1374 63
88 swoogiemonster 1440 1362 76
89 akio 1439 1366 73
90 Konovalov 1439 1358 79
91 Quelvacho 1438 1322 125
92 Martin123 1438 1375 60
93 DoctorGonzo 1438 1296 177
94 Isobanaani 1436 1372 64
95 dasala 1434 1301 167
96 shellstorm 1432 1360 78
97 jimh 1431 1282 212
98 yamamoto 1431 1358 81
99 jackevils 1430 1275 232
100 harq 1429 1367 71
101 AutumnTail 1429 1351 86
102 Songo 1428 1340 95
103 ararar 1427 1342 94
104 Gr3n 1422 1349 89
105 jimboydy 1422 1336 101
106 OrionHegenomy 1420 1364 74
107 eliegel 1418 1367 72
108 Stefan 1417 1339 96
109 koishi 1414 1350 87
110 eephus 1413 1358 80
111 Fatcat 1411 1335 104
112 Grovast 1410 1342 93
113 loike 1409 1348 90
114 RazorPop 1409 1333 108
115 Chmol 1408 1324 122
116 OptionalBowser 1407 1222 412
117 The_Oak_Tree 1407 1271 246
118 thejordan 1405 1333 106
119 jbrier 1405 1332 111
120 pikeman 1404 1346 91
121 koko615615 1404 1337 100
122 krk 1404 1321 128
123 dzheng 1403 1298 172
124 Tawy 1403 1349 88
125 UOP_Champ 1402 1343 92
126 Kin 1402 1337 99
127 bpt 1402 1338 98
128 konush 1402 1338 97
129 BLEE 1402 1329 114
130 Maestro 1401 1326 118
131 ghostofmars 1401 1335 102
132 Jingking 1399 1329 115
133 rabbit19 1399 1286 198
134 StanPSmith 1398 1323 123
135 Yohek 1398 1293 182
136 Bibr 1397 1306 155
137 VHOORLYS 1396 1332 110
138 Brains-on-ice 1395 1308 152
139 buka 1395 1329 113
140 mrfister69 1395 1325 121
141 frenzy 1394 1326 119
142 thomasfermi 1393 1320 131
143 fellow 1392 1327 117
144 rodrrd 1392 1325 120
145 itzamna 1391 1316 140
146 gremar 1391 1272 243
147 kikoho 1388 1333 105
148 akira 1387 1245 329
149 Rafael.Ramus 1387 1259 290
150 JamesWolfpacker 1387 1321 127
151 kruppy 1386 1313 145
152 arthur 1386 1277 224
153 Isawa 1386 1335 103
154 Jai 1386 1314 142
155 capuso 1384 1280 218
156 Lobsen 1384 1245 330
157 Fenistil 1384 1318 134
158 Loon 1383 1322 124
159 Cephalofair 1383 1297 175
160 salamander 1383 1318 135
161 Agggron 1382 1302 164
162 Kiljen 1381 1317 138
163 surpriz3 1380 1317 137
164 Nomisbo 1379 1330 112
165 watersilence 1379 1309 149
166 mtanzer 1379 1215 442
167 Jari 1376 1275 231
168 kinoko 1376 1319 132
169 Orni 1375 1217 435
170 veenickz 1375 1289 191
171 Eldzik 1375 1285 199
172 lek 1375 1307 154
173 brunnerj 1374 1318 133
174 GoddamnBF 1374 1309 151
175 Ladinek 1373 1306 156
176 HippityHobbity 1372 1261 284
177 Olli 1372 1236 363
178 Trovoes420 1372 1314 144
179 SiberianRanger 1371 1316 141
180 AgricolaZ 1371 1310 147
181 weehoo 1371 1270 250
182 Given 1371 1304 160
183 onesones 1370 1320 130
184 blastoch 1370 1309 148
185 testrun 1368 1298 173
186 poptasticboy 1368 1302 165
187 gnpaolo 1367 1304 161
188 Ignipes 1366 1288 192
189 Artwo 1365 1305 159
190 ThePants 1364 1279 219
191 watt89 1364 1305 157
192 dingbat 1364 1279 220
193 TockTock 1363 1293 183
194 LoloHelico 1363 1314 143
195 rogerc 1363 1225 403
196 noobslice420 1362 1309 150
197 DsnowMan 1362 1203 494
198 jesseg 1362 1303 163
199 JJape 1361 1285 201
200 bjcowen 1360 1290 187
201 Pericles 1360 1275 230
202 InnerCitySumo 1359 1298 174
203 rajeevbat 1358 1283 210
204 MangeUnDindon 1358 1321 126
205 eMps 1357 1288 194
206 PerOlander 1356 1299 171
207 wArLoRd 1355 1270 248
208 sasai 1355 1201 504
209 Salokin 1355 1268 258
210 Marshall 1354 1276 229
211 phank 1354 1285 202
212 Marah 1354 1241 341
213 Devak 1353 1281 216
214 melendor 1353 1284 205
215 dil.kk 1352 1290 189
216 Namie_terra 1352 1295 178
217 Timmi 1351 1292 184
218 Trantor 1351 1270 249
219 Morat 1351 1288 195
220 Chili 1350 1289 190
221 ayr 1350 1308 153
222 Xpin 1349 1232 372
223 victori 1349 1274 235
224 bartdevos 1349 1282 214
225 leesa 1348 1285 203
226 Mirror 1348 1210 471
227 Spigoethe 1348 1287 196
228 sirius 1348 1300 169
229 firexed 1348 1268 255
230 Meijk 1348 1186 580
231 Roestertaube 1347 1303 162
232 prodigalr 1346 1257 299
233 kotoha 1346 1267 260
234 Fortranmatze 1346 1301 168
235 stardotstar 1345 1278 222
236 xavier0730 1344 1277 227
237 awoldow 1343 1252 313
238 Tormentum 1343 1299 170
239 bwallace722 1343 1232 377
240 rainy1103 1343 1294 179
241 schloetterer 1342 1277 228
242 rasberry 1342 1290 188
243 Yhtzee 1342 1277 223
244 JonasD 1342 1291 186
245 Izuu32 1341 1201 505
246 zhenzhm4 1340 1239 352
247 incognito 1340 1179 615
248 Harrakis 1340 1277 225
249 Salmur 1340 1293 180
250 smallhunter 1339 1284 204
251 bleys 1338 1219 423
252 CptPugwash 1338 1283 211
253 JakO 1338 1272 244
254 LudwigVan 1338 1288 193
255 masaki 1338 1282 213
256 sunny 1337 1285 200
257 jst526 1337 1196 527
258 Benhoffer 1337 1262 276
259 rewar 1336 1287 197
260 Ranior 1336 1254 307
261 Daeral 1336 1281 215
262 Grohm 1335 1283 209
263 TauRus 1335 1283 206
264 CatMac 1335 1234 370
265 Nikitiwe 1335 1283 208
266 BravePawn 1335 1274 236
267 Tehol 1334 1258 296
268 WOH_G64 1334 1265 267
269 Cal 1333 1221 417
270 lesy 1333 1268 254
271 evitagen 1333 1275 234
272 drsP 1333 1270 247
273 Nocturne 1333 1272 242
274 Freatla17 1333 1279 221
275 bassano 1332 1256 301
276 sam_daisuke 1332 1268 253
277 M3lchior 1331 1264 271
278 neonelephant 1330 1189 570
279 ProfHydra 1330 1192 554
280 ZornsLemon 1330 1193 549
281 w109809811 1330 1239 353
282 fox30 1330 1292 185
283 Razon 1330 1265 269
284 mochi 1330 1263 273
285 Hubbles 1329 1264 270
286 Shardan 1327 1266 264
287 cemzar 1326 1255 306
288 majicalways 1326 1267 262
289 Fortuna 1326 1263 272
290 j.madrjas 1325 1218 433
291 yoyo987 1325 1259 289
292 kraftyguy 1325 1262 279
293 Antique_Chair 1324 1197 517
294 MadDog 1324 1262 278
295 kyoflare 1324 1277 226
296 Aragos 1323 1243 338
297 grecocha 1323 1271 245
298 AlanWu 1323 1296 176
299 GrandAt 1323 1248 320
300 mbsp 1322 1272 240
301 queeerkopf 1322 1239 347
302 terrarist 1322 1261 282
303 Guiyom 1322 1261 281
304 axw 1322 1262 280
305 tori 1322 1273 239
306 Saetin 1321 1250 316
307 ryansl 1321 1267 259
308 RoyalGambetto 1321 1268 256
309 mav 1320 1272 241
310 landlord 1320 1265 266
311 Mexx 1320 1257 297
312 Cortzas 1320 1257 298
313 Sheldon_Cooper 1320 1157 762
314 mirumoto 1320 1261 285
315 PeteC 1319 1260 286
316 gambrinus 1319 1243 337
317 Raccon_ninja 1318 1237 359
318 mesterlars 1318 1252 311
319 Dreadlord 1318 1258 292
320 twosheep 1318 1275 233
321 rathstar 1317 1258 294
322 kamiyuernst 1317 1253 308
323 Socks_Wielder 1317 1202 497
324 wolfox 1317 1267 261
325 lsy03 1317 1243 335
326 toto520 1317 1258 291
327 rafalimaz 1316 1266 263
328 Nap 1316 1270 251
329 ajisuke 1316 1235 365
330 keeefir 1316 1263 274
331 Mox 1315 1203 493
332 telipych 1315 1203 496
333 DePlof 1314 1256 302
334 ShevekDelphy 1314 1257 300
335 sitatunga 1313 1243 334
336 Gargamel 1313 1160 737
337 Toma 1313 1263 275
338 koloradomice 1312 1171 664
339 m03188024 1312 1274 238
340 Jeska 1312 1219 427
341 TerraDon 1312 1256 303
342 wimastyle 1311 1274 237
343 Brent 1311 1255 305
344 Jayne 1310 1230 385
345 kahy 1310 1251 314
346 asyouare 1309 1246 323
347 Heaven 1309 1246 324
348 asymptotech 1309 1250 317
349 Eldzik_v2 1308 1249 319
350 Silberklinge 1308 1240 346
351 lcg3092 1308 1255 304
352 BattleToads 1307 1269 252
353 entaku_p 1307 1268 257
354 Brokastis.Sampionis 1306 1258 295
355 ftm 1303 1241 345
356 sasin 1303 1252 312
357 Dirrtymagic 1303 1262 277
358 hussai 1303 1242 340
359 wackforce 1302 1238 356
360 AlanR 1302 1161 731
361 Ixlyth 1302 1178 622
362 Crowbarcool 1302 1235 364
363 FallenAngel 1301 1250 315
364 Zaus 1301 1230 382
365 zlorfik 1301 1186 577
366 Jamew 1300 1149 805
367 Davy 1300 1191 558
368 Sarquer 1300 1244 332
369 weichieh1982 1299 1249 318
370 HermesDeluxe 1298 1258 293
371 shutzy 1298 1245 328
372 prodigaldax 1298 1234 368
373 magnusrk 1298 1208 478
374 Deus-ex-machine 1298 1243 336
375 roaminroman4 1297 1227 393
376 dobryden 1297 1226 399
377 Devastating_D 1296 1241 342
378 chukchuk 1296 1216 438
379 nk 1296 1238 357
380 Narxsar 1296 1244 331
381 Schummel 1296 1260 287
382 Fiitsch 1296 1259 288
383 chippy 1296 1173 647
384 Milphi 1295 1232 373
385 lazypig 1295 1253 310
386 kitty 1294 1246 326
387 qinghuangzhugening 1294 1229 386
388 Kesterer 1294 1247 321
389 Qatol 1294 1261 283
390 gpchurchill 1294 1247 322
391 WuD 1294 1165 705
392 Raveniticus 1294 1197 521
393 chuberto 1293 1212 463
394 redking 1293 1143 851
395 Lifey 1293 1253 309
396 hiukim 1293 1237 362
397 tjoc 1292 1155 770
398 Azeotrope 1292 1181 604
399 Strongfold 1292 1239 351
400 Stavros 1292 1212 460
401 qqzm 1292 1222 414
402 AnythingForMarcello 1291 1238 355
403 Zauberer 1291 1222 415
404 FatherTorque 1291 1206 489
405 antontm 1290 1203 495
406 Kazami_Meiku 1290 1222 413
407 Grucha 1290 1231 379
408 Roland 1290 1191 561
409 Alexmi 1290 1218 432
410 BillyMirmidon 1289 1244 333
411 itcouldbewirfs 1289 1245 327
412 Arvid 1289 1239 349
413 610154 1288 1231 380
414 no_dice 1288 1239 350
415 Malika27 1288 1194 541
416 sqrat 1288 1265 268
417 verdab 1287 1237 361
418 foolio 1287 1212 461
419 obelix 1287 1137 889
420 diophantus 1286 1171 661
421 frankchow 1286 1237 360
422 MiaojiBomB 1286 1228 391
423 lizzy 1285 1229 389
424 friedpetable 1285 1223 409
425 Musti 1284 1225 404
426 RagingKozak 1284 1231 378
427 belette7 1284 1215 444
428 Fluxx 1284 1227 397
429 Vazoun 1283 1230 381
430 hochun0809 1283 1169 675
431 p0618168 1283 1211 465
432 TravisB 1283 1241 343
433 Scout920 1283 1232 375
434 laidbackluuk 1283 1175 640
435 enchant 1283 1209 476
436 ogodob 1282 1230 384
437 howlingmadbenji 1282 1186 584
438 tanshu 1282 1229 387
439 MissMaytree 1282 1232 374
440 Xeil 1282 1177 624
441 vengeful 1282 1214 450
442 NEWgreenBIRD 1280 1216 440
443 beeny 1280 1224 405
444 valenski 1280 1215 447
445 suncoursing 1279 1172 657
446 w00ps 1278 1198 515
447 winston11tw 1278 1232 376
448 mjmassi 1278 1176 633
449 rlyTHuk 1277 1226 400
450 moira 1277 1246 325
451 linsson 1277 1226 402
452 cmoney 1277 1156 765
453 steven 1277 1211 464
454 WillyTest 1276 1235 366
455 Quetzal513 1276 1226 401
456 coyotte508 1275 1197 519
457 zm0219 1275 1125 973
458 Davidking 1275 1241 344
459 TheFlying 1275 1213 455
460 tomo_kallang 1275 1226 398
461 froggy 1275 1238 358
462 toucan13 1274 1195 530
463 MaximusB 1274 1239 354
464 oilcan 1273 1163 718
465 Dinandino 1273 1202 499
466 Pjotr 1273 1213 456
467 ColoredSands 1273 1218 430
468 jl_cc_xuehao 1273 1228 392
469 CKHO 1273 1152 794
470 Packby 1273 1220 421
471 Andrea 1273 1150 800
472 chenguanxi 1273 1213 453
473 Khal 1272 1218 431
474 Stads 1272 1171 660
475 Sir_Gawain 1272 1223 408
476 Deryl 1272 1217 437
477 19sebbe90 1272 1182 599
478 kaitie 1272 1192 555
479 d_torch 1272 1229 388
480 LilBuck 1272 1219 422
481 nightlark7 1271 1227 395
482 Dashmudtz 1271 1213 451
483 Noah 1270 1218 429
484 twgeldon 1270 1212 458
485 Niccolo 1270 1227 396
486 Nj 1270 1162 720
487 Naaram 1270 1220 419
488 Fragsworth 1269 1229 390
489 Bryson 1269 1199 512
490 jawolopingjeff 1269 1156 768
491 atreyu26 1269 1212 459
492 slapp 1268 1167 690
493 Chio 1268 1219 428
494 Styrx 1268 1213 452
495 LaQuille 1268 1219 426
496 fleischwunde 1267 1200 509
497 gabrielcdc 1267 1201 502
498 paschlewwer 1266 1239 348
499 yucendulang 1266 1221 418
500 armaddon 1265 1215 445

Steve Haas
United States
Mountain View
CA
Re: Rating System Discussion (Warning: Contains multiple-thousand-line posts)
Spreadsheet comparing ratings with different pot sizes.

Contains a full rating table for the following settings:

Fit Settings
-210873 current (pot=16, faction_pot=16)
-210410 pot=16, faction_pot=2
-209299 pot=24, faction_pot=2
-208868 pot=32, faction_pot=2
-208754 pot=40, faction_pot=2
 
Juho Snellman
Switzerland
Zurich
Zurich
Thanks for running the experiments!

Steve496 wrote:

However, for purposes of that analysis, there was an implicit assumption that the player ratings and the faction ratings used pots of the same size.

It was not an implicit assumption! :) That's what the whole "faction weight" variable is about. At least at the time those experiments were run, the difference between the weights being equal vs. the factions being 90%/80%/50% lower was just insignificant noise. (This is experiment H in [geekurl=https://www.snellman.net/blog/archive/2015-11-18-rating-system-for-asymmetric-multiplayer-games/]the blog post[/geekurl]).

Quote:
But there's no particular reason we can't let them play with different pot sizes - so I implemented exactly this, and then tried a bunch of different combinations. Ultimately the best one was the one implied by the numbers above - faction_pot_size=2, pot_size=40 - which gave a score of -208754, about 2000 points better than the current rating system.

I just tried roughly those settings on the previous data, and again it's a totally imperceptible change.

There are some possible explanations for why we're seeing different results:

- The nature of the results changed somehow in the last couple of years. Don't particularly believe this.

- There's some curve fitting going on here. None of the algorithm's parameters were actually derived experimentally; they were educated guesses. Three years down the line they turned out to be pretty decent guesses, but if there had been minor adjustments that showed minor improvements, I would not have used those. But the changes you're proposing are pretty major, so this doesn't seem very likely either.

- There's a bug in my code.

- It's down to the different test methodologies. That is, a single training vs. evaluation data set vs. continuous training + evaluation. Your setup obviously matches reality better. This seems like the best explanation: with single training/evaluation sets, a single bad rating from the training set will affect every single game of that player in the evaluation set. With your system, it'll probably affect just a month's worth of games before reverting to the mean. So there's an implicit extra penalty for volatility in my test setup.

It might make sense to reduce the faction weight just on aesthetic grounds, no matter what. People seem to find rapid changes in the faction ratings kind of disturbing. At the time it was just left as-is since the exact value made so little difference in my testing.

I'm not really keen on increasing the volatility of player ratings in general, but maybe there's scope for some kind of trend detection (a player is doing substantially better/worse than predicted over a period of 20 games => that probably means we're badly calibrated on them, and should have higher sensitivity). Rating systems that track uncertainty would in theory take care of this automatically, but I didn't get great results out of them.
Steve Haas
United States
Mountain View
CA
If I'm reading your code correctly (I'm not particularly comfortable in Perl): faction weight is actually doing two different things. It does reduce the pot size for updating faction ratings; but it also reduces the impact of the trained faction ratings for purposes of making projections. That is, setting faction_weight to .1 would effectively reduce the faction pot size from 16 to 1.6, but it also reduces the 200-point difference between Darklings and Fakirs to only 20 for purposes of predicting games. What I'm proposing is giving it its full weight in terms of how it changes the projected outcome, but reducing the rate at which it updates.
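To illustrate the difference with made-up numbers (this is my reading of the code, not a quote of it):

# faction_weight = 0.1 in the existing code, as I read it:
current_faction_pot = 16 * 0.1     # faction updates shrink to 1.6 per pass...
current_effective_gap = 200 * 0.1  # ...but a 200-point Darklings-vs-Fakirs gap
                                   # also only counts as 20 points in predictions

# what I'm proposing instead:
proposed_faction_pot = 2           # faction ratings update slowly...
proposed_effective_gap = 200       # ...but count at full value in predictions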

In terms of how it compares on the original data set: the first thing I did when I started working on this was to write an evaluation function to compare your original output files. So here are the scores I generated for your initial data set and the default outputs of generate-win-probabilities.sh:

-18713 elo-separate-maps-k16-nd.csv
-18720 elo-separate-maps-k16-fw-0.2.csv
-18722 elo-separate-maps-k16-fw-0.5.csv
-18744 elo-separate-maps-k16-pfw0.5.csv
-18753 elo-separate-maps-k16-batch.csv
-18755 elo-separate-maps-k16.csv
-18755 elo-separate-maps-k16-pfw0.csv
-18785 trueskill-separate-maps-fw-0.5.csv
-18802 trueskill-separate-maps.csv
-18809 elo-current-iter5.csv
-18810 trueskill-separate-maps-fw-0.2.csv
-18815 elo-separate-maps-k16-fw-0.1.csv
-18823 trueskill-separate-maps-nd.csv
-18826 elo-separate-maps-k16-pfw1.csv
-18833 elo-current-k16.csv
-18836 elo-current-k24.csv
-18849 elo-separate-maps-k8-nd.csv
-18862 elo-current-k16-min1.csv
-18877 trueskill-default.csv
-18883 elo-separate-maps-k16-fw-2.csv
-18886 elo-separate-maps-k8.csv
-18894 trueskill-nd.csv
-18920 elo-current-iter2.csv
-18958 elo-current-k8.csv
-18964 trueskill-separate-maps-fw-1.5.csv
-19111 elo-iter-k24-min5.csv
-19140 elo-iter-k16-min5.csv
-19271 trueskill-no-factions.csv
-19286 elo-iter-k8-min5.csv
-19309 elo-original-k16-min5.csv
-19315 elo-original-k16.csv
-19366 elo-current-iter1.csv
-19390 elo-original-k8.csv
-19396 elo-original-k8-min5.csv
-19399 elo-original-k24-min5.csv
-19417 elo-original-k24.csv
-19531 elo-iter-k4-min5.csv
-19562 elo-original-k32-min5.csv
-19591 elo-original-k32.csv
-19993 whr-iters-5.csv
-20011 whr-iters-10.csv
-20052 whr-iters-2.csv
-20053 whr-iters-20.csv
-20100 whr-iters-1.csv
-20132 whr-iters-50.csv
-20791 elo-none.csv


And here's some fits generated by my proposed methodology:

-18751 faction_pot=1 pot_size=24
-18752 faction_pot=2 pot_size=24
-18757 faction_pot=4 pot_size=24
-18752 faction_pot=1 pot_size=32
-18753 faction_pot=2 pot_size=32
-18759 faction_pot=4 pot_size=32
-18769 faction_pot=1 pot_size=40
-18770 faction_pot=2 pot_size=40
-18775 faction_pot=4 pot_size=40


So: the original data set does favor somewhat smaller pot sizes relative to the extended training set that I used, and the net improvement is a little smaller (~100 points out of 20k vs 2000 points out of 200k) - but it's still bumping up into the range that previously only map-specific ratings achieved. Whether that's actually worth it... dunno. Kinda up to you.

Edit: Realized I missed a paragraph in your response, rewriting the end of this to reflect that.

I basically agree with your theory re: training sets, and would additionally note that because your training/eval setup only trains once, it has less incentive to train quickly. If it can get to the right number in a year of training data, that's good enough (save for those players in the middle of improving dramatically, whom we can safely presume are rare). So not only is it over-costing noise, it's also under-costing slowness of update.

That said: I also understand the desire to not have ratings be *too* volatile, even if that works out well theoretically; not sure what the right balance is. Perhaps there's some way to account for "rating churn" in my evaluation logic - I'll have to think about that.
 
Steve Haas
United States
Mountain View
CA
So, I decided to investigate the volatility of the current system relative to my proposal. To do so, I computed each player's rating at the end of each month and graphed it. Volatility shows up as ratings bouncing up and down; the fewer times we switch from "rating going up" to "rating going down", the better.
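As a crude numeric version of that, one can just count sign flips in the month-to-month deltas - something like this sketch:

def direction_switches(monthly_ratings):
    # Count how often the month-end rating series changes direction.
    deltas = [b - a for a, b in zip(monthly_ratings, monthly_ratings[1:])]
    return sum(1 for d1, d2 in zip(deltas, deltas[1:]) if d1 * d2 < 0)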

As guinea pigs, I picked myself and CTKShadow (as our ratings were already discussed some in the previous thread), plus Xevoc, as an example of someone who's maintained a relatively steady rating for an extended period of time.

[Graphs: month-end ratings over time for steve496, CTKShadow, and Xevoc under the current and proposed systems]
The proposed changes do seem to consistently overshoot a bit on the initial rating surge, but after that, they're kind of amazingly similar to the current system. The magnitude of variation is maybe 50% bigger (for instance, Xevoc's rating over the past year varies within a range of 27 instead of 19), but in general it's moving in the same direction at the same time with only marginally larger magnitude.

I don't know for sure why this is the case; intuitively, one would think that if the pot is 2.5 times larger, the variation should be 2.5 times larger as well. I assume this has to do with the higher spread in ratings combined with the multiple iterations through the data, but I'm having a hard time coming up with a coherent explanation for exactly how and why this happens.
Dhrun
I'm happy you guys do the math here; just wanted to echo Greenraingw's and Juho's opinion that ratings should not become more volatile without any very good reason.

 
Dhrun
Dhrun wrote:
I haven't thought much about any of this yet, but if you want to touch player pot size, it might be reasonable to vary it based on the type of game:

Idea being that some players play many casual-yet-ranked games carelessly or to experiment, and only focus all their brilliance/knowledge/careful play on games they deem more relevant, aka the Tour.
While others instead focus on improving their rating with every game: thinking very hard in all situations, not agreeing to have people take back turns, not taking risks, etc. (which obviously is not wrong).

To reflect this, we could e.g. use
ps 40 for tour
ps 30-40 for upcoming official events like WTC or FI tour (depending on significance people want to assign it)
ps 20 for any open game
ps 10 for any private game

I'm not sure this is a cool idea, just sayin'..


This of course means higher volatility for higher-"ranked" games, which is a side effect of the intended effect of giving them more weight relative to that player's "ordinary" games. Note it does not put players only playing "ordinary" games at a disadvantage, except that as starting players they would need longer to reach their equilibrium (at which, OTOH, they might stay more consistently).
 
Dhrun
Looking at Steve's table, I think a pot size of 40 is too much with the rest of the parameters currently used (and therefore I would e.g. halve my graded ps suggestions from above).

E.g. in the top 32 I spontaneously see 2 people with 12-15 matches who are probably brilliant but IMHO should not be right at the top yet, because they might have been a bit lucky not to get 1-3 more defeats so far (and have had few top opponents yet). Needing a few more games to reach the top might even add to their own sense of achievement, if they think as oddly as me :cool:
 
George Sprockitz
United States
Pennsylvania
Steve496 wrote:

The proposed changes do seem to consistently overshoot a bit on the initial rating surge, but after that, they're kind of amazingly similar to the current system. The magnitude of variation is maybe 50% bigger (for instance, Xevoc's rating over the past year varies within a range of 27 instead of 19), but in general it's moving in the same direction at the same time with only marginally larger magnitude.

I don't know for sure why this is the case; intuitively, one would think that if the pot is 2.5 times larger, the variation should be 2.5 times larger as well. I assume this has to do with the higher spread in ratings combined with the multiple iterations through the data, but I'm having a hard time coming up with a coherent explanation for exactly how and why this happens.


Are both ratings doing the same number of iterations? It wasn't really clear what's being used in each, but it sounded like you used 3 and the original 2. The weighted pot size of the final iteration dictates volatility and would be close in a 40/3^2 versus 16/2^2 system.
 
Steve Haas
United States
Mountain View
CA
Dhrun wrote:
Looking at Steve's table, I think a pot size of 40 is too much with the rest of the parameters currently used (and therefore I would e.g. halve my graded ps suggestions from above).

E.g. in the top 32 I spontaneously see 2 people with 12-15 matches who are probably brilliant but IMHO should not be right at the top yet, because they might have been a bit lucky not to get 1-3 more defeats so far (and have had few top opponents yet). Needing a few more games to reach the top might even add to their own sense of achievement, if they think as oddly as me :cool:


I feel like there are kind of two separate points here, and it fundamentally comes down to the intent of the rating system. The analysis I did initially - optimizing the predicted probability of the observed outcomes - is based on the notion that it's designed to be purely predictive. If, instead, the goal is to make a high rating an achievement to be worked towards, then there's a different set of questions to be asked.

If the goal is optimizing predictive ability for matchmaking purposes, the only real question is "do we believe FakirsOnly, LesserFactions, etc. are actually top 50 players?" Personally, I was of the opinion that FakirsOnly was a smurf account for a top-50-type player after 9 games, and his two seasons in the tournament have only made me more confident in that. So I don't find the rating system's assertion that he's actually a 1500+ player even remotely surprising or worrisome.

More generally: when it comes to playing a 1300-rated player, given a choice of playing an arbitrary player with 100 games, or LesserFactions with his 12, who do you think you're more likely to beat? And assuming that you think LesserFactions is the stronger opponent, isn't that an indication that they should be rated more highly?

On the other hand, if we're thinking about rating as an achievement - something to be earned - then the question becomes: how long *should* it take to achieve a top rating? If Xevoc started a smurf account today, how many games against comparably-rated-at-the-time opposition would we want to see before we dubbed it the account of a top 200/50/10/2 player? I don't have an answer to that, but would be interested to see people's opinions on the matter - it's not something we can optimize for until we know where we're aiming.

gmg159 wrote:
Steve496 wrote:

The proposed changes do seem to consistently overshoot a bit on the initial rating surge, but after that, they're kind of amazingly similar to the current system. The magnitude of variation is maybe 50% bigger (for instance, Xevoc's rating over the past year varies within a range of 27 instead of 19), but in general it's moving in the same direction at the same time with only marginally larger magnitude.

I don't know for sure why this is the case; intuitively, one would think that if the pot is 2.5 times larger, the variation should be 2.5 times larger as well. I assume this has to do with the higher spread in ratings combined with the multiple iterations through the data, but I'm having a hard time coming up with a coherent explanation for exactly how and why this happens.


Are both ratings doing the same number of iterations? It wasn't really clear what's being used in each, but it sounded like you used 3 and the original 2. The weighted pot size of the final iteration dictates volatility and would be close in a 40/3^2 versus 16/2^2 system.


Pretty much all data I'm posting is for 3 iterations. I explored using more and less, and the conclusion, ultimately, was that more iterations basically always helps, but each iteration helps less than the last. Juho found 3 to be a reasonable place to stop, and I haven't seen any reason to challenge that assertion. So the final iteration in the current system is 16/9, and the final iteration in the proposed system is 40/9 for players and 2/9 for factions.
Dhrun
Steve496 wrote:
Dhrun wrote:
Needing a few more games to reach the top might even add to their own sense of achievement, if they think as oddly as me :cool:


I feel like there are kind of two separate points here, and it fundamentally comes down to the intent of the rating system.

Steve, I think a ranking can actually cover both predictive power and achievement reasonably, though not perfectly, at the same time.
Of course what both goals actually mean can be interpreted differently.

But maybe I should have foregone my last sentence above, let's concentrate on the rest first :-)


So if you want to use the ranking to derive an expected value for a future game without context, your train of thought sounds very good.

But
- while a player who won his first 12 games against random opponents could be a better player than someone who played his last 12 games in D1 with barely average results, common sense and statistics tell us that in all likelihood he isn't. More likely than not he just got (at least a tiny bit) lucky.
- There is other context (e.g. people could use narrow game setups to grind) which also cannot be modelled easily (the suggestion to gauge the relevance of games using different factors might help a bit).

=> Just to counter this to some extent, one might shift the ratings a bit in the direction of "achievement".


Note I'm not in favor of using the ratings to express some loyalty bonus. But don't underestimate that people might have long streaks of successful/less successful play which do not represent their "true" strength as a player. Both because of random volatility and volatility in the style of play.
And currently I'm just unsure how much I'd like to see somewhat current strength with rather heavy blur or overall strength with only a slight emphasis on recent play and less statistical volatility.
:snore:
 
James Wolfpacker
United States
North Carolina
Steve496 wrote:
If the goal is optimizing predictive ability for matchmaking purposes, the only real question is "do we believe FakirsOnly, LesserFactions, etc. are actually top 50 players?" Personally, I was of the opinion that FakirsOnly was a smurf account for a top-50-type player after 9 games, and his two seasons in the tournament have only made me more confident in that. So I don't find the rating system's assertion that he's actually a 1500+ player even remotely surprising or worrisome.


I'll confirm that these 2 accounts are actually aliases and much better than top 50. I'm not telling exactly who they are though.
 
Luke J
I am quite certain that each player rated above me on the new table is a more skilled player than me, so I give it my blessing
 
Steve Haas
United States
Mountain View
CA
Dhrun wrote:
Steve496 wrote:
Dhrun wrote:
Needing a few more games to reach the top might even add to their own sense of achievement, if they think as oddly as me :cool:


I feel like there are kind of two separate points here, and it fundamentally comes down to the intent of the rating system.

Steve, I think a ranking can actually cover both predictive power and achievement reasonably, though not perfectly, at the same time.
Of course what both goals actually mean can be interpreted differently.

But maybe I should have foregone my last sentence above, let's concentrate on the rest first :-)


So if you want to use the ranking to derive an expected value for a future game without context, your train of thought sounds very good.

But
- while a player who won his first 12 games against random opponents could be a better player than someone who played his last 12 games in D1 with barely average results, common sense and statistics tell us that in all likelihood he isn't. More likely than not he just got (at least a tiny bit) lucky.
- There is other context (e.g. people could use narrow game setups to grind) which also cannot be modelled easily (the suggestion to gauge the relevance of games using different factors might help a bit).

=> Just to counter this to some extent, one might shift the ratings a bit in the direction of "achievement".


Note I'm not in favor of using the ratings to express some loyalty bonus. But don't underestimate that people might have long streaks of successful/less successful play which do not represent their "true" strength as a player. Both because of random volatility and volatility in the style of play.
And currently I'm just unsure how much I'd like to see somewhat current strength with rather heavy blur or overall strength with only a slight emphasis on recent play and less statistical volatility.
snore


I agree that it's possible to cover both to an extent; my point is that the analysis I've provided so far is focused solely on the predictive aspect. So if the question is just how well this reflects actual skill levels, that's a thing we already have data on. It's absolutely true that one can go on winning or losing streaks that do not reflect one's true skill. And, as humans, it's hard to know whether 15 wins in a row with a weak faction against middling opposition is more or less impressive than a sustained run of middling results in D1. But we also don't really need to speculate on how to handle this - we can try different options with different parameters and see what works best in practice. And at the moment, the thing I've found that works best is pretty aggressive about assuming that players with long win streaks with a weak faction, even against mediocre opposition, are actually good players. That doesn't mean it's always correct to do so, but it tends to yield better predictions about future outcomes than doing otherwise.

This doesn't mean that that's the right way to set up the rating system, though. If we have criteria beyond raw predictive ability - whether that's "needing to prove oneself in high-level games to get a high rating" or "making it hard to level smurf accounts to make cheesing the system harder" or whatever it is, that's fine - my assertion is just that it would be helpful to define those criteria so we can optimize around them and factor in all requirements when assessing various options, rather than optimizing around the single parameter of prediction quality and then nixing what it comes up with because we don't like the side effects it generates.

JamesWolfpacker wrote:
Steve496 wrote:
If the goal is optimizing predictive ability for matchmaking purposes, the only real question is "do we believe FakirsOnly, LesserFactions, etc. are actually top 50 players?" Personally, I was of the opinion that FakirsOnly was a smurf account for a top-50-type player after 9 games, and his two seasons in the tournament have only made me more confident in that. So I don't find the rating system's assertion that he's actually a 1500+ player even remotely surprising or worrisome.


I'll confirm that these 2 accounts are actually aliases and much better than top 50. I'm not telling exactly who they are though.


Would I be correct in positing that the players who own these accounts are rated at least as highly as their smurf accounts are, even in the proposed new system?
 
Sam
JamesWolfpacker wrote:
Steve496 wrote:
If the goal is optimizing predictive ability for matchmaking purposes, the only real question is "do we believe FakirsOnly, LesserFactions, etc. are actually top 50 players?" Personally, I was of the opinion that FakirsOnly was a smurf account for a top-50-type player after 9 games, and his two seasons in the tournament have only made me more confident in that. So I don't find the rating system's assertion that he's actually a 1500+ player even remotely surprising or worrisome.


I'll confirm that these 2 accounts are actually aliases and much better than top 50. I'm not telling exactly who they are though.


What is with all this talk of "smurf" accounts? Do people really care so much about their ratings that they are unwilling to play weak factions or experimental strategies? This creates another problem. When a strong player plays with an account that is rated lower than they are, they deflate the ratings of their opponents. The real issue that we probably need to solve is sorting the wheat from the chaff. We need to more easily separate rated and unrated games. This would obviate the need for some players to create "smurf" accounts, and would allow the rest of us to play experimental games without harming our ratings.

As an aside, having played against LesserFactions, I am somewhat annoyed not knowing who it actually is.
 
James Wolfpacker
United States
North Carolina
Well, the main accounts of the alias accounts are rated higher. Of course, these alias accounts can't currently sign up for those 1350+ (or higher) rated games, so that could keep their ratings down.

Also keep in mind that there are alias accounts to play in the tournament. You'll notice that there are some conspicuous players that haven't really "played in TM Tour" and there are some in TM Tour that have just played tournament games.
 
Steve Haas
United States
Mountain View
CA
I mean, setting aside the motivations for why people do so, I'd argue it's clear that some number of alt accounts exist. Even setting aside James's assertions, I don't buy that players as strong as some of these accounts we've been discussing only play online with obvious gimmick accounts.

Now, whether allowing people to more easily play unrated games would fix that is less clear. If it's just a desire to protect their rating, then yes; but it could also be a desire to practice against some weaker competition before trying their strategies against other top players, or a desire to garner notoriety for the gimmick itself - there's no doubt that we talk about FakirsOnly more than we would a top-20 player who happened to play a couple unrated Fakirs games a month. So I'm not convinced easier unrated games would eliminate such alt/alias/smurf/whatever accounts.

That said: given that such accounts do exist, an interesting question is how they should be rated. My instinct would be that getting them to their appropriate rating more quickly is probably a good thing, because then at least their opponents know what they're getting into; not sure how other people feel about that.
 
Steinar Nerhus
Norway
An option is to do what they do in chess: new players' ratings change faster.
 
Steve Haas
United States
Mountain View
CA
The iterative application of Elo does roughly that - your rating changes a lot faster as long as you're always winning or always losing, and slows down once you start to find your level. That said: there's still a question about how fast is too fast. Is it reasonable to be ranked as a top-50 player after a dozen games, provided you win them all? Or should it take longer than that?
 
Greg W
United States
California
Steve496 wrote:
Is it reasonable to be ranked as a top-50 player after a dozen games, provided you win them all? Or should it take longer than that?


I think it should take longer, not because I subscribe to any nostalgic idea that people should "earn" their place, but because unlike with Chess matches, TM games do not provide players the opportunity for perfectly controlled relative positioning. To quote my previous point:

"TM does a fantastic job among Euro-style board games of eliminating random chance as a factor - this is part of what drew me to the game in the first place - but no interactive multi-player game like this can truly give players full control of their own destiny, because we are always making choices [that] help/hurt one or another of our opponents."

Put differently, I'd guess that it is within reasonable statistical probability for a new player to benefit over 12 games from opponents' choices, particularly if we're talking about choices on the margins in just a few games. For me, the question of how "swingy" you want the rankings to be comes down to the downsides: I'd rather have a system that takes longer to accommodate improved player skill than one that is littered at the top with players who won 12-20 games and then quit the site while they were ahead.

A fair retort would be "okay then what is the magical number at which we fully trust the results of games?" And unfortunately, I can't really say what that number is or should be - just that 12 feels small. Ultimately, I suspect, 8 different people would draw the line in 8 different places, so it becomes necessary to rely on those who created the system to make some snap judgments. For me, the system Juho set up feels like it allows for rating changes at a speed that makes sense, but I also acknowledge that there's a familiarity bias affecting my feelings there.
Robert
Germany
Bocholt
greenraingw wrote:
A fair retort would be "okay then what is the magical number at which we fully trust the results of games?" And unfortunately, I can't really say what that number is or should be - just that 12 feels small.
I agree that 12 seems small. And not just because there are 14 factions in the game. My intuitive thought was 42, but I guess that's a bit too much, so I'd go for 25 or 30.

Another thought: I know the current system doesn't honor faction variability (cf. FakirsOnly with just Fakirs and some Engineers), but it would be nice if you couldn't become a top ranked player by sticking to just a few factions.
 
Sam
Steve496 wrote:
The iterative application of Elo does roughly that - your rating changes a lot faster as long as you're always winning or always losing, and slows down once you start to find your level. That said: there's still a question about how fast is too fast. Is it reasonable to be ranked as a top-50 player after a dozen games, provided you win them all? Or should it take longer than that?


Everyone benefits from getting players to their actual rating faster. When you have very good new players, but then force them to work their way up, this deflates the ratings of those they are playing against. While I see the conundrum, how meaningful is a rating that is premised only on wins against weaker opponents?

Chess rating schemes solve these problems with a couple of simple mechanisms. First, new players have increased volatility until they reach a certain number of games played. Second, new players have their initial rating set based on their initial results. Third, until new players have enough games, their ratings are considered provisional.

What I'm saying is that we shouldn't structure ratings to force new players to grind their way up to their actual rating solely because some might game the system to try and create a highly rated account. Instead we should try to create the most accurate rating system possible, but perhaps mark new accounts' ratings as provisional until they have a sufficient number of games played.
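For what it's worth, the chess-style mechanism is simple to sketch (the cutoff and pot values below are purely illustrative, not a concrete proposal):

def player_pot(games_played, base_pot=16, provisional_pot=40,
               provisional_games=20):
    # New accounts move fast until they have enough games, then settle
    # to the normal pot; below the cutoff the rating could be displayed
    # as "provisional".
    return provisional_pot if games_played < provisional_games else base_pot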
James Wolfpacker
United States
North Carolina
DocCool wrote:
(cf. FakirsOnly with just Fakirs and some Engineers)

To be fair... someone picked Nomads ahead of him in a tournament game. :laugh:
 
Robert
Germany
Bocholt
JamesWolfpacker wrote:
DocCool wrote:
(cf. FakirsOnly with just Fakirs and some Engineers)

To be fair... someone picked Nomads ahead of him in a tournament game. :laugh:
I know - if he can, he'll pick Fakirs.

Regarding the discussion in this thread: imagine the frustration of players doing surprisingly well with Fakirs if their result happens to come in just after FakirsOnly scored a victory and thus made the rating system believe Fakirs aren't all that bad. That would largely go away if faction pot size was reduced to a fraction of player pot size.
 