How do you calculate error bars on elo rating?

Question

In season 14 of TCEC, Stockfish defeated Leela 50.5-49.5 in a 100-game match (the closest possible winning margin). Using an Elo calculator such as this one, we can see that the Elo difference between the two engines was +3.
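For reference, the +3 figure matches what the standard logistic Elo formula gives for a 50.5% score. A minimal Python sketch, assuming the calculator uses that formula (the function name `elo_diff_from_score` is made up for illustration):

```python
import math

def elo_diff_from_score(score_fraction: float) -> float:
    """Elo difference implied by an average score under the logistic Elo model.

    Assumption: expected score = 1 / (1 + 10 ** (-diff / 400)).
    """
    if not 0.0 < score_fraction < 1.0:
        raise ValueError("score fraction must be strictly between 0 and 1")
    return 400.0 * math.log10(score_fraction / (1.0 - score_fraction))

# 50.5 points out of 100 games, as in the match described above
match_diff = elo_diff_from_score(50.5 / 100)
print(round(match_diff, 2))
```

This comes out at a little under 3.5 Elo, which a calculator could reasonably display as +3.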

However, this surely isn't a statistically significant result: the match was only 100 games, and it's the smallest possible winning margin. The error bars on this calculation should therefore be at least 3. The calculator doesn't give error bars, though, and I can't find any equations for calculating them. How do you calculate the error on this estimate?

Related: How do you calculate elo?

Edit: Looks like 3dkingdoms have improved their calculator, which is now capable of calculating an error margin. I'd still be interested in seeing how that error margin is calculated, though.

I'm voting to close this question as off-topic because it is a maths question, not a chess question. — Brian Towers, May 23 '19 at 09:09
LOL. People ask questions about shogi and things like that all the time — David, May 23 '19 at 20:38
I'd put this on Math.SE, but then first I'd have to explain what elo is, how it translates to expected performance, etc. — Allure, May 23 '19 at 21:29
@Allure I think CrossValidated (Stats.SE) can handle this if Chess.SE won't allow it. You could probably reword it to "What is the uncertainty on the rating estimate in the Elo model?" in order to make it suitable here. – SecretAgentMan, Feb 13 '23 at 13:08
Voting to reopen. If the math involved here isn't sufficiently chess related, then neither is almost any other "chess question". — Charles Rockafellor, Feb 13 '23 at 16:20
The error bars for an Elo estimate depend on the number of games played and how well-matched the players are. The more games played and the greater the difference in skill between the players, the smaller the error bars. There isn't a set formula to calculate the error bars, but statistical methods like the binomial distribution can be used to get an estimate. The Elo rating system is based on a player's performance against other players and gets updated after each game based on the result and the player's rating compared to their opponent's. — Error404, Feb 14 '23 at 14:57
Elo has no error bars by definition. It's calculated from results. For hypothetical problems, generate 1000 random results in a spreadsheet using the Elo formula. – djnavas, Feb 16 '23 at 17:47
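The simulation idea in the last comment can be sketched in Python instead of a spreadsheet. This is only an illustration under stated assumptions: each game is an independent draw from fixed win/draw probabilities, the "true" score is taken to be the observed 50.5%, and the names `simulate_match_elo`, `p_win` and `p_draw` are hypothetical.

```python
import math
import random

def elo_diff(score_fraction):
    """Elo difference implied by a score fraction (logistic Elo model)."""
    return 400.0 * math.log10(score_fraction / (1.0 - score_fraction))

def simulate_match_elo(p_win, p_draw, n_games, n_matches=1000, seed=0):
    """Simulate n_matches matches and return the sorted Elo estimates."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_matches):
        points = 0.0
        for _ in range(n_games):
            r = rng.random()
            if r < p_win:
                points += 1.0            # win
            elif r < p_win + p_draw:
                points += 0.5            # draw
        # Clamp so a 0% or 100% score does not give an infinite Elo value.
        frac = min(max(points / n_games, 0.5 / n_games), 1.0 - 0.5 / n_games)
        estimates.append(elo_diff(frac))
    estimates.sort()
    return estimates

# Assumption: no draws, win probability equal to the observed score fraction.
elos = simulate_match_elo(p_win=0.505, p_draw=0.0, n_games=100)
lo = elos[int(0.025 * len(elos))]        # about the 2.5th percentile
hi = elos[int(0.975 * len(elos)) - 1]    # about the 97.5th percentile
print(f"central 95% of simulated Elo estimates: {lo:.0f} to {hi:.0f}")
```

Raising `p_draw` (and lowering `p_win` to keep the same expected score) narrows the spread, since draws reduce the per-game variance.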

score 1 · Answer 1 · David · 2023-04-21T06:09:15.810

Elo calculations are not a statistical estimate of an unknown parameter; they are just an aggregate count of actual results. Since all those results are known, the "error bar" would have size 0.

It is possible to perform a test of statistical significance over match results to determine if there's enough evidence to conclude that one engine/player is stronger than the other, but that's a separate problem from Elo computation.

edited Apr 21 '23 at 06:09

answered Apr 14 '23 at 08:56

David


"It is possible to perform a test of statistical significance over the match results to determine if there's enough evidence to conclude that one engine/player is stronger than the other." Do you know how to do this test, e.g. with the match in the OP as an example? – Allure Apr 14 '23 at 09:07

@Allure With the information we have, the best choice would probably be a proportions test (https://www.statology.org/one-proportion-z-test/). Let p be the average score Stockfish would get against Leela, and take p=0.5 as the null hypothesis. However, since chess is a game with three possible results, it'd be more accurate to use the win/draw/loss counts rather than just the final score. Then we can check whether the number of wins is significantly bigger than the number of losses, instead of testing only the final score. – David Apr 14 '23 at 09:12
@David To get to the spirit of the question, would it be valid to just go to https://www.statology.org/one-proportion-z-test-calculator/ and fill in p=0.505 and n=100, look at the 95% confidence intervals, and plug them into the FIDE fractional-score-to-rating table to convert them to a ratings difference? Or does the 3 result possibility throw this calculation off? (If I do that, I get 95% C.I. = [0.4070, 0.6030] which gives me about -67 to +74.) – D M Apr 15 '23 at 01:11
@DM In this case it'd be the same since even if it was 1 win and 99 draws you wouldn't get a significant difference. The proportions test approximation is good enough. One could also try a t-test for normal distributions as a reasonable approximation – David Apr 15 '23 at 11:08
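The procedure discussed in these comments can be written out as a rough Python sketch. It uses the normal approximation from the linked one-proportion z-test and, as an assumption, the logistic Elo formula in place of the FIDE lookup table. The function names are made up for illustration, and `wdl_score_se` is one possible way to use win/draw/loss counts, not something taken from the thread.

```python
import math

def elo_diff(score_fraction):
    """Elo difference implied by a score fraction (logistic Elo model)."""
    return 400.0 * math.log10(score_fraction / (1.0 - score_fraction))

def one_proportion_z(score_fraction, n_games, p_null=0.5):
    """z statistic for H0: true score fraction equals p_null."""
    se_null = math.sqrt(p_null * (1.0 - p_null) / n_games)
    return (score_fraction - p_null) / se_null

def score_ci(score_fraction, n_games, z=1.96):
    """Normal-approximation confidence interval for the score fraction."""
    se = math.sqrt(score_fraction * (1.0 - score_fraction) / n_games)
    return score_fraction - z * se, score_fraction + z * se

def wdl_score_se(wins, draws, losses):
    """Standard error of the mean score using win/draw/loss counts."""
    n = wins + draws + losses
    mean = (wins + 0.5 * draws) / n
    var = (wins * (1.0 - mean) ** 2
           + draws * (0.5 - mean) ** 2
           + losses * (0.0 - mean) ** 2) / n
    return math.sqrt(var / n)

# Match from the question: 50.5 points out of 100 games.
z_stat = one_proportion_z(0.505, 100)
ci_low, ci_high = score_ci(0.505, 100)
elo_low, elo_high = elo_diff(ci_low), elo_diff(ci_high)
print(f"z = {z_stat:.2f}")
print(f"95% CI on score: [{ci_low:.4f}, {ci_high:.4f}]")
print(f"95% CI on Elo difference: [{elo_low:.0f}, {elo_high:.0f}]")
```

With these inputs z is about 0.1, far below the usual 1.96 threshold. The score interval is near [0.407, 0.603], which the logistic formula maps to roughly -65 to +73 Elo, close to the -67 to +74 read off the FIDE table. Passing real win/draw/loss counts to `wdl_score_se` would give a smaller standard error whenever there are draws, so the binomial interval is on the conservative side.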

score 0 · Answer 2 · answered Apr 21 '23 at 08:05

Here is a practical answer: since, as already said, the calculation is "exact", look instead at the rating fluctuations of players. By any model, the rating of a (middle-aged) player is constant, modulo local form fluctuations, aging and whatnot. These random fluctuations should be a far better proxy for the "real" sigma of the Elo model than any contrived math. (Only don't take my charts :-)
