Article

Stochastic Extensions of the Elo Rating System

by Gonzalo Gómez-Abejón 1,* and J. Tinguaro Rodríguez 2
1 Department of Computational and Applied Mathematics and Operations Research (CMOR), Rice University, Houston, TX 77005, USA
2 Department of Statistics and Operations Research, Complutense University of Madrid, 28040 Madrid, Spain
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(17), 8023; https://doi.org/10.3390/app14178023
Submission received: 20 July 2024 / Revised: 12 August 2024 / Accepted: 28 August 2024 / Published: 8 September 2024
(This article belongs to the Special Issue Soft Computing Methods and Applications for Decision Making)

Abstract:
This work studies how the Elo rating system can be applied to score-based sports, where it is gaining popularity, and in particular for predicting the result at any point of a game, extending its statistical basis to stochastic processes. We derive some new theoretical results for this model and use them to implement Elo ratings for basketball and soccer leagues, where the assumptions of our model are tested and found to be mostly accurate. We showcase several metrics for comparing the performance of different rating systems and determine whether adding a feature has a statistically significant impact. Finally, we propose an Elo model based on a discrete process for the score that allows us to obtain draw probabilities for soccer matches and has a performance competitive with alternatives like SPI ratings.

1. Introduction

Rating systems track and predict the performance of competitors in pairwise zero-sum games. They were initially developed to objectively measure the strength of chess players, with the first successful system proposed by Arpad Elo in 1960 and adopted by the United States Chess Federation (USCF) to replace the more problematic Harkness rating system. The International Chess Federation (FIDE) also started publishing ratings of its players in 1970 and continues to publish them today. The Elo rating system, or a variant of it, is also used by most chess websites, where users are automatically matched against other players of similar strength, as well as by federations of Go and Scrabble players, and several video games.
The motivation, mathematical basis, and problems of the Elo rating system are explained in detail in Elo’s book [1], and the academic literature is summarized in [2]. The system gives each player a real parameter called their rating and defines an algorithm to update it with the result of each new game. Winning a game increases the player’s rating, and losing decreases it, but the size of the change depends on the rating of the rival. This allows the ratings to change as the strength of the players does. More sophisticated systems like Glicko, proposed by Mark Glickman [3], include two parameters for each player, representing the strength and its variability, to allow for some players gaining strength more quickly than others.
In his thesis, Glickman also extends Elo ratings to sports like American football to make predictions about the score (not just the result), and more recently, the idea of applying Elo ratings to team sports has gained popularity. For instance, https://fivethirtyeight.com/features/how-we-calculate-nba-elo-ratings/ (accessed on 19 July 2024) has been computing the ratings of NBA teams since 2014, and other websites like www.eloratings.net (accessed on 19 July 2024) do the same for national soccer teams.
There is an extensive literature on modeling the score of sports with stochastic processes, either Brownian motion [4] or other continuous processes such as the gamma process [5]. In order to account for ties, discrete processes can be used, typically Poisson processes [6,7], at the expense of more convoluted prediction and update formulas. Logistic models have also been used to predict the score during the game [8,9].
Our contributions are a more formal study of the model underlying the Elo system, with a few results regarding unbiased estimators for the ratings that are used in our computational analysis of league games. We derive how the natural extension of the Elo model for games involving a scoreboard can be used to predict the result during the game and develop a systematic setup for evaluating different systems.
Finally, we show how a discrete stochastic process can be used to model the score and integrated into a novel type of Elo rating system. For comparison, we implement the rating system behind Nate Silver’s soccer-specific SPI ratings [10] and show that our system has similar predicting power for the result of the games, despite only tracking one parameter for each team.
The remainder of this paper is organized as follows: Section 2 recalls the main aspects of the (static) Elo rating system. The proposed stochastic extensions of the Elo system are then introduced in Section 3. Their performance at predicting the results of team sports, namely, basketball and soccer, is analyzed through the computational study presented in Section 4. Finally, Section 5 presents conclusions and possible lines of future work.

2. Preliminaries

Although the Elo system has been extensively studied, it is worth going through its mathematical and statistical basis in order to see how its assumptions can be extended to stochastic processes later. We also go through some basic accuracy metrics that can be used to assess its performance, and a statistical estimator to obtain ratings for players without a preexisting rating.

2.1. The Elo Rating System

Elo’s rating system assumes that in a game between two players A and B, the result can be expressed by giving points $p_A, p_B \in [0,1]$ to A and B, respectively, so that $p_A + p_B = 1$. The simplest case is a game with two results, in which $p_A$ is one if A wins and zero if B wins, but in chess, a draw is represented by $p_A = p_B = \frac{1}{2}$, since in traditional chess tournaments, the player with the higher number of wins plus half the number of draws wins. Then, if A and B have Elo ratings $r_A$ and $r_B$, respectively, the updated ratings after the game are given by
$$r_A' := r_A + K\,(p_A - E[p_A \mid r_A, r_B]) \quad\text{and}\quad r_B' := r_B + K\,(p_B - E[p_B \mid r_B, r_A])$$ (1)
where $E[p_A \mid r_A, r_B]$ is the expected score of player A in a game against B, given by
$$E[p_A \mid r_A, r_B] = F_X(r_A - r_B)$$ (2)
Here, K is a positive constant, and $F_X$ is the distribution function of some random variable X with mean zero, as depicted in Figure 1. The distribution function of X must also be symmetrical around zero so that $E[p_A \mid r_A, r_B] + E[p_B \mid r_A, r_B] = 1$. X was originally a normal variable, but it was later changed to follow a logistic distribution in the FIDE’s ratings. This has been argued to produce better predictions [1] (Chapter 8.41), and it also leads to an explicit and more meaningful expression for $E[p_A]$. In fact, sometimes, the update rule is given simply as
$$r_A' := r_A + K\left(p_A - \frac{1}{1+\exp(-c\,(r_A - r_B))}\right) = r_A + K\left(p_A - \frac{a^{r_A}}{a^{r_A} + a^{r_B}}\right)$$ (3)
for some $c = \log(a) \in \mathbb{R}^+$.
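As a minimal sketch, the logistic update rule above can be implemented as follows; the base-10 and scale-400 constants follow the FIDE convention mentioned later in the text, and the function names are our own.

```python
def expected_score(r_a, r_b, scale=400.0):
    """Logistic expectancy F_X(r_a - r_b) with base 10 and the FIDE scale."""
    return 1.0 / (1.0 + 10.0 ** (-(r_a - r_b) / scale))

def elo_update(r_a, r_b, p_a, k=10.0, scale=400.0):
    """One application of the update rule; returns the new ratings of A and B."""
    e_a = expected_score(r_a, r_b, scale)
    # p_B = 1 - p_A and E[p_B] = 1 - E[p_A], so B's correction is the negative of A's,
    # which is why the sum of ratings is preserved.
    delta = k * (p_a - e_a)
    return r_a + delta, r_b - delta
```

For instance, two 1500-rated players have an expectancy of 0.5 each, so a win transfers exactly $K/2$ rating points.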
It is also possible to update the ratings with the results of a set of matches (for instance, the results of a tournament). Given a sample $M = \{m_k = (a_k, b_k, p_k^a, p_k^b)\}$ of matches with n intervening players $i = 1, \ldots, n$, where the kth match is between players $a_k$ and $b_k$ and ends with scores $p_k^a$ for $a_k$ and $p_k^b = 1 - p_k^a$ for $b_k$, and given initial ratings $R := (r_1, r_2, \ldots, r_n)$, we similarly compute:
$$r_i' := r_i + K \sum_{k \,\mid\, a_k = i} \left(p_k^a - E[p_k^a \mid R]\right) + K \sum_{k \,\mid\, b_k = i} \left(p_k^b - E[p_k^b \mid R]\right)$$ (4)
Obviously, if a player outperforms their expected score, their rating increases, and otherwise it decreases, but the total sum of ratings does not change, as we will see later. However, before discussing the update rule, we need to review the model where the expected score originally comes from, which we refer to as the static Elo model.

2.2. Basis of the Static Elo Model

The difficulty in rating chess players versus, for instance, Olympic runners, is that in the latter sport there is a magnitude, the finish time, which does not depend on the rivals. For instance, taking a weighted mean of the latest times of each athlete would be enough to compare them, and objectively establish which are better and by how much.
However, Elo considers [1] (Chapter 8.23) what happens if we do not know the finish times and can only compare athletes by looking at who finishes first in head-to-head races. If the finish times of runners A and B follow distributions $T_A \sim N(\mu_A, \sigma^2)$ and $T_B \sim N(\mu_B, \sigma^2)$ and are independent, then
$$P[A \text{ wins}] = P[T_A < T_B] = P[T_A - T_B < 0] = \Phi\left(\frac{\mu_B - \mu_A}{\sqrt{2\sigma^2}}\right)$$ (5)
In general, if the times follow the same continuous distribution Y with different means, i.e., $T_A \sim Y + \mu_A$ and $T_B \sim Y + \mu_B$, we can define $X = T_A - T_B + \mu_B - \mu_A$ and
$$P[A \text{ wins}] = P[T_A - T_B < 0] = F_{T_A - T_B}(0) = F_X(\mu_B - \mu_A)$$ (6)
which is exactly Equation (2) for $r_i = -\mu_i$ (note that the lower the time, the better the athlete, and the higher the rating should be). Also, by definition, X has mean zero and follows a symmetrical distribution. Therefore, Elo’s system simply tries to estimate those underlying means $\mu_A, \mu_B$ from several match results and a given $F_X$.
It also follows that $F_X(\mu_A - \mu_B) = F_X((\mu_A + h) - (\mu_B + h))$, so the $\mu_i$ can only be determined up to the addition of a constant (we can only estimate their pairwise differences). Finally, $F_{\lambda X}(\lambda\mu_A - \lambda\mu_B) = F_X(\mu_A - \mu_B)$, so scaling X does not affect the model either (it only scales the ratings). This means that both the average of the ratings and the variance of X are an arbitrary choice, and only the shape of $F_X$ affects the behavior of the model.
Traditionally, 1500 is chosen as the average rating [2] (p. 621), although Elo’s initial choice was 2000 as the average rating and $\sigma_X = 200\sqrt{2} \approx 282.84$ as the standard deviation of X [1] (Chapter 8.21). This was replaced by the logistic distribution $F_X(x) = (1 + 10^{-x/400})^{-1}$, which has a similar standard deviation of $\frac{400\pi}{\log(10)\sqrt{3}} \approx 315.09$.

2.3. Statistical Inference in the Static Model

For the model presented above, the next logical step is to estimate the ratings of the players given a sample of matches (for instance, a tournament). Again, we denote the sample by $M = \{m_1, \ldots, m_{|M|}\}$ with $m_k = (a_k, b_k, p_k^a, p_k^b)$, and if player i plays match k, we use the shorthand $p_{i,k} = p_k^a$ if $i = a_k$ and $p_{i,k} = p_k^b$ if $i = b_k$. We also denote the indices of the games played by j with $M_j := \{1 \le k \le |M| : j = a_k \text{ or } j = b_k\}$. Then, an interesting estimator $\hat R = (\hat r_1, \ldots, \hat r_n)$ is the one that reduces to zero the errors $\varepsilon_j$ defined by
$$\varepsilon_j(M, \hat R) := \sum_{k \mid a_k = j} \left(p_k^a - E[p_k^a \mid \hat R]\right) + \sum_{k \mid b_k = j} \left(p_k^b - E[p_k^b \mid \hat R]\right) =: \sum_{k \in M_j} \left(p_{j,k} - E[p_{j,k} \mid \hat R]\right)$$ (7)
As we see later, this estimator is related to the adjustment formula in Equation (4), and in particular, if it exists, it is a fixed point when adjusting with that same sample M:
$$\hat r_j = \hat r_j' = \hat r_j + K \sum_{k \in M_j} \left(p_{j,k} - E[p_{j,k} \mid \hat R]\right) \iff 0 = \sum_{k \in M_j} \left(p_{j,k} - E[p_{j,k} \mid \hat R]\right) = \varepsilon_j$$ (8)
In fact, given a sample with n players, there are $n-1$ errors to minimize (since they add up to zero) and $n-1$ ratings to adjust (since the sum does not matter), so we should expect the estimator with zero error to be unique, and thus the only fixed point of the adjustment formula. In Appendix A, we characterize its existence and prove uniqueness in the case that it exists. Convergence is easy to see for the case of two players i and j, because if the results of a game follow Elo’s model, i.e., Equation (2), and $\hat r_j^m, \hat r_i^m$ are the estimators produced by the sample of the first m games, then by the law of large numbers
$$\varepsilon_j = 0 \iff F_X(\hat r_j^m - \hat r_i^m) = \frac{1}{m}\sum_{k=1}^m F_X(\hat r_j^m - \hat r_i^m) = \frac{1}{m}\sum_{k=1}^m p_{j,k} \xrightarrow[m \to \infty]{a.s.} E[p_{j,k}] = F_X(r_j - r_i)$$ (9)
and since $F_X$ is continuous and has an inverse, $\hat r_j^m - \hat r_i^m$ converges to $r_j - r_i$. Note that if the average estimated rating is the same at every step m, then $\hat r_j^m \xrightarrow{a.s.} r_j + C$ for every j. Therefore, when we look for numerical methods to compute it, it makes sense to keep the average $\frac{1}{n}\sum_{j=1}^n \hat r_j$ constant, and only worry about the convergence of the pairwise differences.
Finally, there is another natural way to estimate the ratings for a sample, if X is a logistic variable and there are only two results in the game ($p_A \in \{0,1\}$). In that case, the sample can be encoded as a set of independent variables $X_1, \ldots, X_{|M|}$, where $X_k^j = 1$ if $j = a_k$, $X_k^j = -1$ if $j = b_k$, and otherwise (if $k \notin M_j$), $X_k^j = 0$. The static model then gives
$$P(p_k^a = 1) = E[p_k^a] = F_X(r_{a_k} - r_{b_k}) = F_X\left(\sum_{j=1}^n X_k^j r_j\right) = \frac{1}{1 + e^{-\lambda \sum_{j=1}^n r_j X_k^j}}$$ (10)
which is precisely the formula of the logistic regression of $Y := (p_k^a : 1 \le k \le |M|)$ against X. Since $\lambda$ only depends on the variance of the distribution, and the choice of variance does not affect the model, we can just estimate $r_1, \ldots, r_n$ by performing a logistic regression on the $X_k$.

2.4. Basis of the Elo Adjustment Formula

So far, we have worked assuming Equation (2) holds and each player has a constant “true” rating r j . However, in practice, the strength of chess players or soccer teams changes over time, and hence the need for a system that updates the ratings with each new game and not with the whole history.
This is the reasoning behind the update formula in Equation (1): if we consider only one match between A and B, with estimated ratings $\hat r_A, \hat r_B$ and real ratings $r_A, r_B$ (and for simplicity, we pick $\hat r_A + \hat r_B = r_A + r_B = 0$), then using Taylor’s expansion, there exists $\xi \in \mathbb{R}$ such that
$$E[\hat r_A' - \hat r_A] = K \cdot E[p_A - F_X(\hat r_A - \hat r_B)] = K \cdot F_X(2 r_A) - K \cdot F_X(2 \hat r_A) = 2K (r_A - \hat r_A) f_X(\xi)$$ (11)
Therefore, for a small enough K, in particular for $K < (2 \max_{x \in \mathbb{R}} f_X(x))^{-1}$, this gives
$$|E[\hat r_A' - r_A]| = |E[\hat r_A' - \hat r_A] + \hat r_A - r_A| = |1 - 2K f_X(\xi)| \cdot |\hat r_A - r_A| < |\hat r_A - r_A|$$ (12)
and thus, if $r_A$ is fixed, $E[\hat r_A]$ converges to $r_A$ as $\hat r_A$ is updated over an increasing number of games. In practice, K is chosen much smaller than $(2 \max f_X)^{-1}$: for instance, the FIDE uses $K = 10$ for players with ratings above 2400, and in that case, $F_X(x) = (1 + 10^{-x/400})^{-1}$, so a better K for accelerating the convergence of the expectancy would be $\frac{1}{2 f_X(0)} = \frac{4 \cdot 400}{2\log(10)} \approx 347$.
The reason for the difference is that faster convergence comes at the expense of a much higher variance of $\hat r_A$ (the variance is proportional to $K^2$). If we apply the update formula over an infinite number of games, even if the true ratings of the model stay constant, the computed ratings do not converge but approach a limiting distribution [11], the variance of which we would like to minimize.
Therefore, the choice of K depends on the dynamic properties of the strength of the players. If we expect this strength to change significantly from one game to the next, K should be bigger, and if we expect the skill of the players to be relatively stable (for instance, if many games are played in a short lapse of time), we choose a lower K. The FIDE uses $K = 40$ for younger players, who may improve quickly, and, as said above, $K = 10$ for players with a rating above 2400, who are usually grandmasters and do not improve their play that fast.

2.5. Asymmetric Games

Until now, we have assumed that for two players of equal strength, each has an expected score of $1/2$, but in practice, this is not true: in chess, for instance, the player with the white pieces makes the first move and has a slight advantage. In the team sports we consider later, it is well known that a team playing on its home field also has an advantage over its rival. To incorporate this in the Elo rating system, when A has a systematic advantage against B, we can use the following expectancy formula [3] (p. 36):
$$E[p_A] = F_X(r_A - r_B + L) = 1 - F_X(r_B - r_A - L) = 1 - E[p_B]$$ (13)
Here, L is a positive constant, which should be bigger for a bigger advantage of A, since $E[p_A]$ is increasing in L, and $L = 0$ reduces to our symmetric model. We can estimate it given a sample of matches, provided of course that we know which player has the systematic advantage in each match. From now on, we suppose that in the kth match $m_k = (a_k, b_k, p_k^a, p_k^b)$ of a sample M, $a_k$ has the advantage. In particular, if X is logistic, L can be estimated as the independent term in the regression formula in Equation (10).

2.6. Accuracy Metrics for Elo Rating Systems

In order to determine the predictive accuracy of a rating system, we use several tools. The first one, provided by Arpad Elo [1] (Chapter 2.6), is a statistical test of normality for a sample in which the players have preexisting ratings R and each player plays m games. Originally, the sample was a chess tournament, or a set of tournaments with the same m [1] (Chapter 2.7).
Elo proposes to represent the values of the residues $\varepsilon_j(R)$ in a histogram and to compare them with the frequencies of a normal variable with mean zero and the sample variance. Since $\varepsilon_j(R) = \sum_{k \in M_j} (p_{j,k} - E[p_{j,k} \mid R])$ is the sum of m variables, if m is high enough and the rating differences are not too high, by the CLT, each $\varepsilon_j$ should approximately follow a normal distribution, with variance $\sum_{k \in M_j} \mathrm{Var}(p_{j,k} \mid R) \le \frac{1}{4} m$. More sophisticated normality tests, like the normal probability plot, can be carried out for the same variables.
In order to compare the accuracy of different systems on the same sample, we propose looking at the average of the squared residues across all matches, i.e., the mean squared error of p:
$$MSE := \frac{1}{|M|} \sum_{k=1}^{|M|} \left(p_{a_k,k} - E[p_{a_k,k} \mid R]\right)^2$$ (14)
We can decompose this sum using the known expression of the mean squared error, which is the square of the bias plus the variance:
$$E[(X - a)^2] = (E[X] - a)^2 + \mathrm{Var}(X)$$ (15)
In particular, when applied to the results of the games, we obtain:
$$MSE = \frac{1}{|M|} \sum_{k=1}^{|M|} \left(E[p_{a_k,k}] - E[p_{a_k,k} \mid R]\right)^2 + \frac{1}{|M|} \sum_{k=1}^{|M|} \mathrm{Var}(p_{a_k,k})$$ (16)
The advantage of this measure is that we do not require any of the assumptions of a normality test. The sample can be extended in time (we can update the ratings between games using Equation (1)), and we do not need every player to play the same number of games. Also, if we apply two different rating systems to the same set of games, we can look at the variance reduction and obtain p-values to determine if one system is better than another.
However, unlike in a linear regression model, the results $p_k^a$ do not have a constant variance, so the expression $MRV := \frac{1}{|M|} \sum_{k=1}^{|M|} \mathrm{Var}(p_{a_k,k})$ (mean result variance) can only be understood as the expected variance of a game, i.e., $E[\mathrm{Var}(p_k^a \mid r_{a_k} - r_{b_k})]$, dependent on some distribution of the rating differences $r_{a_k} - r_{b_k}$. We can only compare results from different samples if we assume this average variance to be similar, but this is not the case in general (for instance, MRV decreases as the dispersion of the ratings increases).
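Computing the MSE of Equation (14) from observed results and model expectancies is straightforward; a minimal sketch, with names of our own choosing:

```python
def mse_of_predictions(results, predictions):
    """Mean squared error of the predicted expected scores.

    `results` are the observed scores p_{a_k,k} and `predictions` the model
    expectancies E[p_{a_k,k} | R]; lower is better, and the irreducible part
    of this error is the mean result variance (MRV).
    """
    assert len(results) == len(predictions)
    return sum((p - e) ** 2 for p, e in zip(results, predictions)) / len(results)
```

For example, predicting 0.5 for two decisive games (one win, one loss) yields an MSE of 0.25, the worst possible value for a calibrated coin-flip prediction.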
If we assume the results are binary ($p_k^a \in \{0,1\}$), we can use other approaches. For instance, if $F_X$ is logistic, as we saw in Equation (10), the model is equivalent to that of a logistic regression. This allows us to compute statistics like the deviance D, which is approximately distributed as a chi-squared variable for large samples, by Wilks’ theorem [12]:
$$D = -2 \cdot \ln\frac{\mathrm{likelihood}_{Elo}}{\mathrm{likelihood}_{saturated}} \xrightarrow[|M| \to \infty]{distr.} \chi^2_n$$ (17)
where, as before, n is the number of players.
Finally, the effectiveness of the model can be gauged more visually by plotting the receiver operating characteristic (ROC) curve. Again, this does not work with other possible results besides a win or a loss, such as a draw ($p_k^a = \frac{1}{2}$), unless the non-decisive results are removed or imputed (for instance, to zero and one with probabilities $1 - p_k^a$ and $p_k^a$).

3. Stochastic Elo Models

In many competitive team sports, we do not just observe the result of the game, but also a stochastic process $S_t$, which determines the result at time T (we assume T is a positive constant, although it could be any stopping time of $S_t$). In practice, $S_t$ represents the score difference at time t, positive when the home team is winning, so the home team wins when $S_T > 0$, while the visiting team wins when $S_T < 0$.
Our goal is to extend the Elo model to incorporate $S_t$, so that the expected result of the game at time $t = 0$ matches the static model. In other words, we want to obtain an expression for $E[p_A \mid r_A, r_B, S_t] = g(r_A, r_B, S_t, t)$ such that $g(r_A, r_B, 0, 0) = F_X(r_A - r_B + L) = E[p_A \mid r_A, r_B]$, i.e., our initial guess matches that of the original Elo model.
We studied several possible models, but there are some properties they all verified. First, since (in the sports we chose) the past states of the scoreboard are not relevant to the game, and only the final score determines the result, $S_t$ should have the Markov property. As a consequence, both $X_t := E[p_A \mid S_t] = g(r_A, r_B, S_t, t)$ and $Y_t := E[S_T \mid S_t]$ should be martingales, by the Tower Property ($E[X] = E[E[X \mid Y]]$) [13] (p. 46).
Furthermore, for $s > t$, since $S_s$ gives us as much information as $S_s$ and $S_t$ together, our estimation at time s should be better than the one at time t, and therefore
$$E[\mathrm{Var}(p_A \mid S_s)] < E[\mathrm{Var}(p_A \mid S_t)] \quad\text{and}\quad E[\mathrm{Var}(S_T \mid S_s)] < E[\mathrm{Var}(S_T \mid S_t)]$$ (18)

3.1. Continuous Models

Recall that in the original Elo model, team A wins if $X + r_A - r_B + L > 0$, and that should match the odds of $S_T$ being positive, so we can assume $S_T$ is $X + r_A - r_B + L$ times some constant. Since the incentives of the teams are the same at every point of the game (score the most points and have the opponent score the least), we can also assume the increments $S_{t+h} - S_t$ are independent (for disjoint intervals) and identically distributed.
For this to make sense, $S_T = S_{T/n} + (S_{2T/n} - S_{T/n}) + \cdots + (S_T - S_{T - T/n})$ must be infinitely divisible. Although the logistic distribution is infinitely divisible [14], the increments are not logistic, so we suppose that X and $S_T$ are normal, and in that case, the increments must also follow the normal distribution by Cramér’s decomposition theorem [15], so $S_t$ has i.i.d. Gaussian increments, and it is a Brownian motion with drift:
$$S_t = \mu t + \sigma B_t \implies E[p_A] = P[N(\mu T, \sigma^2 T) > 0] = \Phi\left(\frac{\mu T}{\sigma \sqrt{T}}\right) = \Phi\left(\frac{\mu \sqrt{T}}{\sigma}\right)$$ (19)
where $B_t$ is the standard Brownian motion, i.e., $B_0 = 0$, and B has independent and normal increments $B_t - B_s \sim N(0, |t - s|)$. Under this model, used by Stern in [4], if we want this expectation to match the one given by the static model, we must have
$$\Phi\left(\frac{\mu \sqrt{T}}{\sigma}\right) = F_X(r_A - r_B + L) = \Phi\left(\frac{r_A - r_B + L}{\sigma_X}\right) \iff \frac{\sqrt{T}\,\mu}{\sigma} = \frac{r_A - r_B + L}{\sigma_X}$$ (20)
and from this, we immediately obtain the expected score conditioned on $S_t = 0$:
$$E[p_A \mid S_t = 0] = P\left[N(\mu(T-t), \sigma^2(T-t)) > 0\right] = \Phi\left(\frac{\mu \sqrt{T-t}}{\sigma}\right) = \Phi\left(\frac{r_A - r_B + L}{\sigma_X} \sqrt{\frac{T-t}{T}}\right) = F_X\left((r_A - r_B + L)\sqrt{\frac{T-t}{T}}\right)$$ (21)
Similarly, if we condition on an arbitrary score difference at time $t < T$, we obtain
$$E[p_A \mid S_t = S] = P\left[S + N(\mu(T-t), \sigma^2(T-t)) > 0\right] = \Phi\left(\frac{\mu\sqrt{T-t}}{\sigma} + \frac{S}{\sigma\sqrt{T-t}}\right) = F_X\left((r_A - r_B + L)\sqrt{\frac{T-t}{T}} + \frac{\sigma_X}{\sigma} \cdot \frac{S}{\sqrt{T-t}}\right)$$ (22)
Denoting by $u_t := (T-t)/T$ the fraction of the time that remains, and by C the constant coefficient $\frac{\sigma_X}{\sigma\sqrt{T}}$, we have:
$$E[p_A \mid S_t = S] = F_X\left((r_A - r_B + L)\sqrt{u_t} + \frac{C\,S}{\sqrt{u_t}}\right)$$ (23)
In the end, we obtain an expression where the term $(r_A - r_B + L)\sqrt{u_t}$ depends on the ratings and decreases to zero as the time t approaches T, while the term $C\,S/\sqrt{u_t}$ depends on the score and becomes larger in absolute value as the game nears its end, as we would want (unless $S = 0$). In other words, the behavior of $E[p_A \mid S_t]$ as $t \to T$ is exactly what we would expect.
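The closed-form in-game expectancy can be implemented directly. In the sketch below, X is assumed normal (so $F_X = \Phi$ with standard deviation $\sigma_X$); the logistic choice could be substituted, and the value of C would have to be estimated from data as discussed next.

```python
from math import erf, sqrt

def in_game_expectancy(r_a, r_b, score_diff, frac_elapsed, C, L=0.0, sigma_x=315.09):
    """E[p_A | S_t = S] under the Brownian-motion model.

    `frac_elapsed` is t/T and `score_diff` the current S_t. F_X is taken to
    be a normal cdf with standard deviation sigma_x. At t = 0 with S = 0,
    this reduces to the static expectancy F_X(r_a - r_b + L).
    """
    u = 1.0 - frac_elapsed  # u_t, fraction of the game time remaining
    if u <= 0.0:            # game over: only the sign of S matters
        return 1.0 if score_diff > 0 else (0.5 if score_diff == 0 else 0.0)
    x = (r_a - r_b + L) * sqrt(u) + C * score_diff / sqrt(u)
    return 0.5 * (1.0 + erf(x / (sigma_x * sqrt(2.0))))  # Phi(x / sigma_x)
```

Between equally rated teams, a lead of S points at half-time pushes the expectancy above 1/2, and symmetric leads for either side produce probabilities that sum to one, as Equation (13) requires.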
Note that setting $t = 0 \implies u_t = 1$, and assuming a game starts with a score difference of S points, the expected score is $F_X(r_A - r_B + L + C S)$. This allows us to interpret the constant C as the handicap (measured in rating points) that each goal or point entails at the start of the game. We could in fact use this to estimate C from game data, although we do not need to pick $t = 0$. For any $t \in (0, T)$, C should minimize the mean squared error
$$MSE(t) := \frac{1}{|M|} \sum_{k \in M} \left(p_{a_k,k} - E[p_{a_k,k} \mid S_{t,k}]\right)^2 = \frac{1}{|M|} \sum_{k \in M} \left(p_{a_k,k} - F_X\left(\sqrt{u_t}\,(r_{a_k} - r_{b_k} + L) + \frac{C\,S_{t,k}}{\sqrt{u_t}}\right)\right)^2$$ (24)
as long as the model is correct, and the minimum should be the same for every t. Conversely, if the function above is minimized for C = C * for every t, our model is optimal among a certain class, as the following result showcases.
Lemma 1.
If there are functions $f: \mathbb{R} \to \mathbb{R} \setminus \{0\}$ and $V: \mathbb{R}^2 \to \mathbb{R}$, and there exists a $C^* \in \mathbb{R}$ such that $V(x, C^* f(x)) \le V(x, C f(x))$ for every $x, C \in \mathbb{R}$, then for every $g: \mathbb{R} \to \mathbb{R}$,
$$V(x, C^* f(x)) \le V(x, g(x)) \quad \forall x \in \mathbb{R}$$ (25)
Proof.
$$V(x, g(x)) = V\left(x, \frac{g(x)}{f(x)} f(x)\right) \ge V(x, C^* f(x)) \qquad \square$$
In particular, for $V(t, g) = \frac{1}{|M|} \sum_{k \in M} \left(p_{a_k,k} - F_X\left(\sqrt{u_t}\,(r_A - r_B + L) + g\,S_{t,k}\right)\right)^2$ and the function $f: t \mapsto 1/\sqrt{u_t}$, this means that our model is optimal among the ones with a prediction of the form $E[p_A] = F_X(\sqrt{u_t}\,(r_A - r_B + L) + S\,g(t))$. If we further suppose that the true expectancy has the form $F_X(f(t)(r_A - r_B + L) + S\,g(t))$, we can estimate f by looking at the times in the sample where $S_{t,k} = 0$, and if $f: t \mapsto \sqrt{u_t}$ minimizes the mean squared error over that restricted sample, our model is optimal among this wider class.
Finally, assuming we already know or have estimated $C = \frac{\sigma_X}{\sigma\sqrt{T}} \iff \sigma = \frac{\sigma_X}{C\sqrt{T}}$, we can obtain the drift $\mu = \frac{\sigma}{\sqrt{T}} \cdot \frac{r_A - r_B + L}{\sigma_X}$ and express $S_t$ in terms of known quantities as
$$S_t = \mu t + \sigma B_t = \frac{r_A - r_B + L}{C} \cdot \frac{t}{T} + \frac{\sigma_X}{C \sqrt{T}} B_t$$ (26)
Notice also that $E[S_T] = \frac{r_A - r_B + L}{C}$, which gives another way to estimate C.
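This moment condition suggests a simple estimator: regressing final score differences on rating gaps through the origin gives a slope of $1/C$. A minimal sketch, with the function name and inputs being our own illustrative choices:

```python
def estimate_C(final_scores, rating_gaps):
    """Moment estimator for C from E[S_T] = (r_A - r_B + L) / C.

    Least-squares regression through the origin of the final score
    differences S_T on the rating gaps r_A - r_B + L estimates the slope
    1/C, so C is the reciprocal of that slope.
    """
    num = sum(g * s for g, s in zip(rating_gaps, final_scores))  # sum of g * S_T
    den = sum(g * g for g in rating_gaps)                        # sum of g^2
    return den / num if num else float("inf")
```

For instance, if final score differences were exactly one hundredth of the rating gaps, the estimator would return C = 100, i.e., each goal worth 100 rating points at kickoff.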

3.2. Accuracy Metrics for the Stochastic Model

If we take $t = T$ in Equation (26), we obtain
$$C \cdot S_T = r_A - r_B + L + \frac{\sigma_X}{\sqrt{T}} N(0, T) \implies C \cdot S_T - (r_A - r_B + L) \sim N(0, \sigma_X^2) \sim X$$
Therefore, the residuals $C \cdot S_T - (r_A - r_B + L)$ follow the same law as X (the variable used for the static model), and we can perform a normality test on them, for instance through a QQ plot or by comparing a histogram with the predicted frequencies for X. We could in principle use Equation (23) with any distribution of X, and in that case, $C \cdot S_T$ minus the rating difference follows the law of X, and the same test works.
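The QQ comparison can be sketched as follows, assuming NumPy is available and X normal with standard deviation $\sigma_X$; points far from the diagonal would indicate a misspecified $F_X$ or C.

```python
import numpy as np
from statistics import NormalDist

def qq_points(score_diffs, rating_gaps, C, sigma_x=315.09):
    """(theoretical, empirical) quantile pairs for a QQ plot of the residuals.

    Under the model, the residuals C * S_T - (r_A - r_B + L) follow the law
    of X, assumed here to be N(0, sigma_x^2); each sorted residual is paired
    with the corresponding plotting-position quantile of that normal.
    """
    res = np.sort(C * np.asarray(score_diffs, float)
                  - np.asarray(rating_gaps, float))
    n = len(res)
    norm = NormalDist(0.0, sigma_x)
    theo = np.array([norm.inv_cdf((i + 0.5) / n) for i in range(n)])
    return theo, res
```

Plotting `theo` against `res` (e.g., with matplotlib) then gives the QQ plot; the median theoretical quantile is zero, reflecting the zero mean of X.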
From Equation (26), we can also reconstruct the standard Brownian motion $B_t$ in the model:
$$\frac{1}{\sqrt{T}} B_t = \frac{1}{\sigma_X}\left(C \cdot S_t - \frac{t}{T} \cdot (r_A - r_B + L)\right)$$
and since $\frac{1}{\sqrt{a}} B_{at}$ is a standard Brownian motion if and only if $B_t$ is, taking $s_t = 1 - u_t = t/T$, we obtain that $\tilde B_{s_t} := \frac{1}{\sqrt{T}} B_{T s_t} = \frac{1}{\sigma_X}\left(C \cdot S_{T s_t} - s_t (r_A - r_B + L)\right)$ is a standard Brownian motion in $[0,1]$ as a function of $s_t$, which is just the fraction of the game time elapsed at t.
From Equation (26), we can also deduce
$$\mathrm{Var}(S_T \mid S_t) = \mathrm{Var}(S_T - S_t) = \mathrm{Var}(\mu(T-t) + \sigma(B_T - B_t)) = \sigma^2 (T-t) = \frac{\sigma_X^2}{C^2} \cdot u_t$$
That is, if we compute the mean squared error of $S_T$ instead of $p_A$ at each time t, we should obtain a linear function in t, and we can test this hypothesis via linear regression.
Recall that we are assuming the increments of $S_t$ are independent and identically distributed, and in particular, the variance of $S_{t+h} - S_t$ only depends on h. We can test this directly by looking at the points scored by either team in each interval between 0 and T, or, if $S_t$ can increase or decrease by amounts other than one, we can see if the sample variance $\frac{1}{|M|-1} \sum_{k \in M} (S_{t+1,k} - S_{t,k})^2 \xrightarrow[|M| \to \infty]{a.s.} E[(S_{t+1} - S_t)^2]$ depends on t.
Finally, this continuous model implies expressions for E [ p A | S t ] and E [ S T | S t ] for each t, and we can check if they behave like martingales in a sample of games.

3.3. Non-Homogeneous Process

In practice, $S_t$ does not always behave like a process homogeneous in time, that is, more points may be scored in some sub-intervals of $[0, T]$ than in others. In that case, if $\mathrm{Var}(S_t) = m(t)$ for some increasing function m, we can model the process as
$$S_t = \int_0^t \mu\, dm(s) + \int_0^t \sigma\sqrt{m'(s)}\, dB_s = \mu \cdot m(t) + \sigma B_{m(t)} = \tilde S_{m(t)}$$
where $\tilde S$ behaves like the score in the first model. Since this is a time change of the process that we were using before, we can use the homogeneous model to compute
$$E[p_A \mid S_t = S] = E[p_A \mid \tilde S_{m(t)} = S] = P[\tilde S_{m(T)} > 0 \mid \tilde S_{m(t)} = S] = F_X\left(\sqrt{\frac{m(T) - m(t)}{m(T)}}\,(r_A - r_B + L) + C \sqrt{\frac{m(T)}{m(T) - m(t)}}\, S\right)$$
which is exactly the same as before, for $u_t = \frac{m(T) - m(t)}{m(T)}$.
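In practice, m can be estimated empirically from per-interval scoring counts across a sample of games; the sketch below, assuming NumPy is available, takes m proportional to the cumulative count, a hypothetical but natural choice when increments within each interval are i.i.d.

```python
import numpy as np

def remaining_fraction(per_interval_counts):
    """Empirical u_t = (m(T) - m(t)) / m(T) from scoring counts per interval.

    `per_interval_counts[i]` is the total number of points/goals scored in
    the i-th time interval across a sample of games; m(t) is taken to be
    proportional to the cumulative count up to t. Returns u evaluated at
    the end of each interval.
    """
    m = np.cumsum(np.asarray(per_interval_counts, float))
    return (m[-1] - m) / m[-1]
```

With a flat scoring profile, this recovers the homogeneous $u_t = (T - t)/T$; front- or back-loaded scoring bends the clock accordingly.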

3.4. A Discrete Model

The main difference between this model and the actual score difference is that the former is a continuous process, but the latter is discrete in value ($S_t \in \mathbb{Z}$) and continuous in time. The process that fulfills this condition best is a Skellam process, as described in [6], i.e., the difference of two Poisson processes with different rates, which model the score of each team or player during the game.
If $P_1$ and $P_2$ are two independent Poisson distributions with means $\mu_1$ and $\mu_2$, then their difference is said to follow a Skellam distribution, $P_1 - P_2 \sim Sk(\mu_1, \mu_2)$. If $N_1(t)$ and $N_2(t)$ are independent Poisson processes with rates $\lambda_1$ and $\lambda_2$, the process
$$Z_t := N_1(t) - N_2(t) \sim P(t\lambda_1) - P(t\lambda_2) = Sk(t\lambda_1, t\lambda_2)$$
has i.i.d. increments as well.
However, in this case, modeling the score with $Z_t$ is not so straightforward. For starters, if we multiply $Z_t$ by a constant, we do not obtain another Skellam process. Adding a real-valued function of t also changes the domain of the process from $\mathbb{Z}$ to $\mathbb{R}$. Therefore, the only option is to assume that
$$S_t \sim Z_t(\mu_1(r_A, r_B), \mu_2(r_A, r_B)) \implies S_T \sim Sk(T\mu_1(r_A, r_B), T\mu_2(r_A, r_B))$$
This in turn means that we cannot assume homoscedasticity of the process, because a Skellam distribution with parameters $\mu_1$ and $\mu_2$ has mean $\mu_1 - \mu_2$ and variance $\mu_1 + \mu_2$. Since $\mu_1$ and $\mu_2$ are non-negative, the absolute value of the mean is less than or equal to the variance, and if we fixed $\mu_1 + \mu_2 = \sigma$, we would put a bound on the expected final score, $|E[S_T]| = |\mu_1 - \mu_2| \le \sigma$, which is not realistic.
However, the distribution of $S_T$, which should have the same shape as X, now depends on two parameters, and not just on the rating difference $r_A - r_B$. We can remove this degree of freedom by fixing a relation between $\mu_1$ and $\mu_2$, and for simplicity’s sake, since the probability that $Sk(\mu_1, \mu_2)$ takes the value k is
$$P[Sk(\mu_1, \mu_2) = k] = e^{-(\mu_1 + \mu_2)} \left(\frac{\mu_1}{\mu_2}\right)^{k/2} I_k\left(2\sqrt{\mu_1 \mu_2}\right)$$
we can fix $2\sqrt{\mu_1 \mu_2} = H$ to compute $I_k$ only once for each $k \in \mathbb{Z}$. Here, $I_k(\cdot)$ is the modified Bessel function of the first kind [16], which does not have a closed-form expression.
To determine how μ 1 and μ 2 depend on the ratings, we can take E [ S T ] = μ 1 μ 2 to be linear in the rating gap Δ r = r A r B + L as in Equation (26), and then, if we pick σ X so that C = 1 , from the product and difference of μ 1 and μ 2 , we obtain a second-degree equation
( x + μ 1 ) ( x μ 2 ) = x 2 + Δ r x H 2 4
μ 1 = 1 2 Δ r + Δ r 2 + H 2 and μ 2 = 1 2 Δ r + Δ r 2 + H 2
from which we finally obtain not just a prediction for p A , but also a probability for each possible final score, and in this case, the probability that S T = 0 is positive, so we can include draws in our model. In particular, if we assign a result of p A = p B = 1 / 2 for draws,
$E[p_A] = P[S_T > 0] + \frac{1}{2} P[S_T = 0] = P[Sk(\mu_1, \mu_2) > 0] + \frac{1}{2} P[Sk(\mu_1, \mu_2) = 0]$
Similarly, if we consider that expression as a function of $\Delta r$, it can be shown that it is increasing (see Appendix B), with limits 0 at $-\infty$ and 1 at $+\infty$. Therefore, $E[p_A] = F_X(r_A - r_B + L)$ for some random variable X, and this model is thus also an extension of a static Elo model.
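The derivation above translates into a few lines of code. The sketch below is an illustration (with `H` passed in as a free parameter rather than the value fitted later): it recovers $\mu_1$, $\mu_2$ from the rating gap and evaluates $E[p_A]$:

```python
import numpy as np
from scipy.stats import skellam

def skellam_rates(delta_r, H):
    """Solve mu1 - mu2 = delta_r subject to 2*sqrt(mu1*mu2) = H (the quadratic above)."""
    root = np.sqrt(delta_r ** 2 + H ** 2)
    return (delta_r + root) / 2, (-delta_r + root) / 2

def expected_result(delta_r, H):
    """E[p_A] = P[S_T > 0] + (1/2) P[S_T = 0] under S_T ~ Sk(mu1, mu2)."""
    mu1, mu2 = skellam_rates(delta_r, H)
    return 1 - skellam.cdf(0, mu1, mu2) + 0.5 * skellam.pmf(0, mu1, mu2)
```

By construction $\mu_1 \mu_2 = H^2/4$ for every rating gap, and at $\Delta r = 0$ the expected result is exactly $1/2$ by symmetry.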

4. Computational Study

Next, we evaluated the performance of the proposed stochastic extensions of the Elo system in a computational study implemented in Python. All the results discussed can be replicated with the code available at https://github.com/gonzalogomezabejon/StochasticElo (accessed on 19 July 2024) using real data from reference databases for each sport (see Data Availability Statement), which are also included for download through the previous link. In the case of soccer, a comparison with the more complex methodology of the Soccer Power Index (SPI) developed by Nate Silver [7] was also carried out.

4.1. Experimental Setup

In order to evaluate a rating model with any of the accuracy metrics presented in Section 2.6, a sample of matches M in which both players have a rating is needed. Furthermore, the update formula in Equation (1) requires preexisting ratings, so we used some of the games in our dataset to estimate the starting ratings of each team, using the estimator $\hat{R}$ defined in Section 2.3.
On the other hand, the proposed stochastic models were designed for sports or games where the result depends on a numerical score (S) and the match has a fixed duration (T), such as soccer, basketball, or ice hockey, although we omitted the latter in the experimental results for brevity. Moreover, we focused on league competitions, where each team plays more games and we could use a bigger sample to obtain R ^ . For instance, in the Premier League, 20 teams play 380 matches, while in the World Cup, 32 teams play 64 matches.
Most national leagues have a similar format, consisting of yearly seasons in which most of the participating teams remain the same from one year to the next, and all teams play the same number of games each week. We designed an algorithm to extract rated games from any of these sports as follows:
  • The expected result was determined by Equation (13) with constant L.
  • During each season, we denoted by “entering teams” the teams that did not play the previous season (during the first season, they are all entering teams).
  • We divided each season into two parts (I and II), the former comprising the games that started before every entering team had played at least m games.
  • During part I, in each match between two non-entering teams, we updated their ratings (from the previous season) according to Equation (1) using a fixed factor K.
  • When part I ended, we computed the rating estimator R ^ for each entering team using the games in part I and the current ratings of non-entering teams.
  • During part II, since we had ratings for all teams, we could just update them every match using Equation (1) with the same K factor.
In this way, we had a rating before the start of the game for all matches between non-entering teams in part I of each season and for all matches in part II, and we could use these to evaluate the rating system.
This algorithm has (hyper-)parameters m, K, and L: the length of part I of each season, the sensitivity of the Elo system to new results, and the home field advantage, respectively. We fixed m and estimated K and L by minimizing the mean squared error defined in Equation (14).
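A minimal sketch of the update step and the part I filtering might look as follows; the match-tuple layout, the dictionary-based ratings, and the helper names are illustrative assumptions, not the study's actual code:

```python
import math

def F_X(x, sigma=200.0):
    """Distribution function of X ~ N(0, sigma), the choice used in the study."""
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2))))

def elo_update(ratings, a, b, p_a, K, L):
    """One application of the update formula (Equation (1)); a is the first-listed team."""
    e_a = F_X(ratings[a] - ratings[b] + L)
    ratings[a] += K * (p_a - e_a)
    ratings[b] -= K * (p_a - e_a)
    return e_a

def run_part_one(matches, ratings, K, L):
    """Part I: rate and update only matches where both teams already carry a rating."""
    rated = []
    for a, b, p_a in matches:
        if a in ratings and b in ratings:          # skip games involving entering teams
            rated.append((a, b, p_a, F_X(ratings[a] - ratings[b] + L)))
            elo_update(ratings, a, b, p_a, K, L)
    return rated
```

The symmetric update conserves the sum of ratings, so the league's average rating is invariant across a season, which is what makes ratings from different weeks comparable.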

4.2. Basketball Results

Basketball has several advantages regarding the Elo system. First, it only admits two results (win or loss), so the result follows a Bernoulli distribution and the logistic model in Equation (10) can be used. In addition, the score variable $S_t$ has a relatively large range (each team scores around 100 points) and changes quickly (by 1, 2, or 3 points each time), so we do not lose too much by approximating it by a continuous variable.
On the other hand, an NBA basketball match lasts 48 min divided into four 12 min quarters, but if the scoreboard is even at the end of that time, additional 5 min quarters are played until one team is ahead. We ignored these extra quarters for the sake of simplicity, since we assumed T was fixed, and our prediction at the end of the game would be slightly wrong when $S_T = 0$.
Our dataset for basketball consisted of the NBA league games between the seasons 2000–2001 and 2023–2024, obtained from www.basketball-reference.com (accessed on 19 July 2024), which registers every change in the score and its time (in seconds).

4.2.1. Static Elo

We implemented the algorithm described above for m = 20 and $F_X$ corresponding to a normal variable $X \sim N(0, 200)$, on a training set of seasons 2000–2001 to 2009–2010, in four different ways, to test the effectiveness of the Elo model and the significance of its parameters:
  • Without Elo, simply assuming $E[p_{k,a_k}] = \overline{p_{home}} := |M|^{-1} \sum_{k=1}^{|M|} p_{k,a_k} =: F_X(L)$.
  • Fixing K = 0 and minimizing M S E as a function of L (no change in strength).
  • Fixing L = 0 and minimizing M S E as a function of K (no home advantage).
  • Minimizing M S E as a function of K and L (standard Elo system).
To avoid overfitting in K and L, we also ran the algorithm with the optimal values $K^*$ and $L^*$ on a validation set consisting of seasons 2010–2011 to 2023–2024. The mean squared errors in the training and testing sets are summarized in Table 1.
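The fitting procedure amounts to replaying a sample of matches for candidate values of (K, L) and minimizing the resulting mean squared error. The sketch below is a toy version under stated assumptions: synthetic matches, the normal $F_X$ from the text, and `scipy.optimize.minimize` as a stand-in for whatever optimizer the study used:

```python
import math
import numpy as np
from scipy.optimize import minimize

def F_X(x, sigma=200.0):
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2))))

def season_mse(params, matches, start_ratings):
    """Replay the matches with parameters (K, L) and return the mean squared error."""
    K, L = params
    r = dict(start_ratings)
    sq = []
    for a, b, p_a in matches:
        e = F_X(r[a] - r[b] + L)
        sq.append((p_a - e) ** 2)
        r[a] += K * (p_a - e)    # standard symmetric Elo update
        r[b] -= K * (p_a - e)
    return float(np.mean(sq))

# toy fit on synthetic matches (illustrative data, not the NBA sample)
matches = [("A", "B", 1.0), ("B", "C", 1.0), ("A", "C", 1.0)] * 10
start = {"A": 1500.0, "B": 1500.0, "C": 1500.0}
fit = minimize(season_mse, x0=np.array([10.0, 0.0]), args=(matches, start),
               method="Nelder-Mead")
```

Nelder–Mead is a reasonable default here because the replayed MSE is piecewise smooth in (K, L) but its gradient is inconvenient to write down.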
We can see straightaway that the Elo system explains a significant part of the variance (around 11%), and both parameters are useful, since setting either of them to zero increased the mean squared error. The reduction was similar in the training and testing sets, so cross-validation confirms the robustness of the algorithm in Section 4.1.
Notice also that the K-factor and home field advantage are similar to those of chess, where $|L^*| \approx 50$ [1] (Chapter 8.93). The sign of $L^*$ only reflects that in basketball (and American sports in general) the visiting team is listed first.
We note that setting K = 0 (which amounts to leaving the rating of a team unchanged until it is relegated) increased the unexplained variance, but the ratings still had predictive power, as evidenced by the ROC curves shown in Figure 2.
Finally, setting L = 0 barely changed the curve, because the map from $F_X(z)$ to $F_X(z + L)$ is a monotone increasing bijection, so the ordering of the expected results under the two models is the same for equal $K^*$ (and indeed the K factors were very similar).
After fitting our model and obtaining the sample, we could expect that, for a subsample of matches with rating difference $r_A - r_B \approx r$, the average result would be close to $F_X(r + L)$, but this was not the case, as evidenced by Figure 3.
We observe that the normal distribution function $F_X$ overestimates the effect of the rating difference on the expected result of a match. We can obtain a better predictor if we multiply $r_A - r_B$ by a "dampening" constant $D \le 1$, as in Figure 3 of [17].
When we minimized the MSE over K, L, and D (using D only for the calculation of the mean squared error, since otherwise we would just be scaling X), we obtained optimal parameters $K^* \approx 19.61$, $L^* \approx 58.9$, and $D^* \approx 0.786$, reducing the MSE to 0.21064 in the test set.
This seemed a small improvement in MSE, but we could test the significance of the reduction by looking at the variables $X := (p_A - F_X(L + r_A - r_B))^2$ and $Y := (p_A - F_X(L + D(r_A - r_B)))^2$ for a random game.
In order to check whether $Z := X - Y$ had a positive mean, we sampled the differences $z_k = (p_k^a - F_X(L + r_{a_k} - r_{b_k}))^2 - (p_k^a - F_X(L + D(r_{a_k} - r_{b_k})))^2$ and argued that, by the central limit theorem, the sample mean of Z approximately follows a normal distribution with variance equal to the sample variance of Z divided by n. In our testing data, Z had a positive mean with $p \approx 1.2 \times 10^{-6}$, so D significantly improved our predictions.
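This CLT-based paired test is straightforward to sketch; the helper name below is hypothetical, not the study's code:

```python
import numpy as np
from scipy.stats import norm

def paired_improvement_test(z):
    """One-sided CLT test for E[Z] > 0, where z holds paired squared-error differences."""
    z = np.asarray(z, dtype=float)
    n = z.size
    stat = z.mean() / (z.std(ddof=1) / np.sqrt(n))   # approximately N(0, 1) under E[Z] = 0
    return stat, norm.sf(stat)                        # p-value for the alternative E[Z] > 0
```

Because the two squared errors are computed on the same game, pairing removes the (large) between-game variance and makes even a small average improvement detectable.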

4.2.2. Stochastic Elo

Our algorithm left us with 12,500 rated games out of the 12,933 games in the training set, and 17,437 out of the 17,820 originally in the testing set. These 17,437 matches were the ones used for the analysis of our stochastic model.
We started by estimating the only parameter of the stochastic model, C. As we showed in Lemma 1, if the model is correct, the optimal value $C^*$ should minimize the error MSE(t) at every time t, so we minimized the average of that function at different times for a more robust estimation of $C^*$. In the training set, we obtained $C^* \approx 10.91$, and we could compare the function MSE(t) for higher and lower values, as shown in Figure 4. Since every curve lay above the curve for C = 10.91, Lemma 1 suggested that our formula for the expected result was optimal among the family $E[p_A] = F_X(\Delta r\, u_t + g(t) S_t)$. We also noted that the error did not reach zero at the final time t = T of 2880 s, since a tied score at that time results in extra time being played. Our model assumed a fixed duration, so in order to predict the result beyond minute 48, we would need to model another (shorter) match.
Next, Figure 5 shows the result of checking the normality of the score process by computing the normal residues $C \cdot S_T - L - D(r_A - r_B)$ and comparing them with the quantiles of a normal distribution. Here, again, $S_T$ is the score after the first 48 min, not the final score of the game. As expected from Equation (27), the residuals roughly followed a normal distribution, but the mean was not zero, meaning the home team scored better (in terms of $S_T$) than the Elo model predicted. The standard deviation was also smaller than the value expected by the model, namely $\sigma_X = 200$. From the first plot, we can also tell that the residuals were somewhat correlated with the Elo difference, but not to an extent that would invalidate the model.
In this type of figure, the black line is a linear regression fit to the actual data, which correspond to the blue dots. Finally, Figure 6 checks the linear relationship between the score variance and time, as described in Equation (29), by computing the quantities
$MSE_S(t) := \frac{1}{|M|} \sum_{i=1}^{|M|} (S_T - E[S_T | S_t])^2 \xrightarrow[\ |M| \to \infty\ ]{a.s.} E[(S_T - E[S_T | S_t])^2] = \mathrm{Var}_t(S_T) = \frac{\sigma_X^2}{C^2}\, u_t$
The resulting function was close to a line, but the variance still dropped more quickly at the end of the match than at the beginning, suggesting $S_t$ was not completely homogeneous in time. Figure 7 plots the average number of points scored in each 10 s interval of the game by either team, in order to approximate the variance of that interval (they are in fact equal if we assume the score is the difference of two independent Poisson processes, as in the discrete model). However, Figure 7 shows that the process was very homogeneous in time, except for brief periods before the end of each quarter.
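The convergence in the display above can be illustrated on synthetic data. In the sketch below, a symmetric ±1 random walk stands in for the score process (an assumption made purely for illustration), so $E[S_T | S_t] = S_t$ and the conditional MSE should match the remaining time $T - t$:

```python
import numpy as np

# toy check of the linear-variance relation: for a driftless walk with unit steps,
# Var_t(S_T) = Var(S_T - S_t) = T - t, i.e. linear in u_t = (T - t) / T
rng = np.random.default_rng(1)
T, n_games = 2880, 4000
paths = np.cumsum(rng.choice([-1, 1], size=(n_games, T)), axis=1)
S_T = paths[:, -1]

# empirical conditional MSE at a few times t (S_t is paths[:, t - 1])
mse_s = {t: float(np.mean((S_T - paths[:, t - 1]) ** 2)) for t in (720, 1440, 2160)}
```

With 4000 simulated paths, the empirical values land within a few percent of the theoretical line, mirroring the near-linear curve reported for the basketball data.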

4.3. Soccer Results

Soccer was somewhat more challenging than basketball when it came to applying Elo models. First, since our database consisted of league games, there was no extra time; instead, draws ($p_A = p_B = \frac{1}{2}$) were possible if the score was tied at the end of the match. We should note that in both the Premier League and the Spanish First Division, teams are awarded three points for a victory, one for a tie, and zero for a loss, so the competition is not entirely a zero-sum game, whereas in the Elo model, $p_A + p_B = 1$.
Although there is no extra time, time is added at the end of each half, but our database only kept track of the minute in which each goal was scored, so every goal scored in added time appears at minute 45 or 90. Finally, $S_t$ was relatively small (usually $|S_T| < 5$), so approximating it by a continuous variable was problematic.
Our dataset for soccer consisted of the seasons 2003–2004 to 2023–2024 of the Spanish and English first division leagues, both counting 5320 games. We used the former for training and the latter for testing. The data were obtained from https://fbref.com/en/ (accessed on 19 July 2024).

4.3.1. Static Elo

Since the league structure was similar to that of the NBA, we used the same algorithm as in Section 4.1 to obtain rated games, this time with a sample of m = 12 for part I and the same $X \sim N(0, 200)$, now also adding the dampened model to the table along with the optimal D. The corresponding results are presented in Table 2.
As we saw for basketball, both K and L were clearly significant, but the reduction in variance from adding D was much smaller. Using the statistical test described for basketball, we see that D is significant with $p \approx 0.0013$, and the optimum $D^*$ was closer to one.

4.3.2. Stochastic Elo

After implementing the static Elo system, we were left with 7176 matches out of the 7980 in our testing set, and we used these for the analysis of the stochastic model.
However, in this case, the irregularity of the process with respect to the recorded time was stronger. As shown in Figure 8, the number of goals scored by each team per minute increased as the game neared its end and had two spikes in the added time of each half, which makes the simple model we used for basketball less effective.
If we estimate C by minimizing the average of $MSE_S(t)$ at evenly spaced points t, with linear $u_t = \frac{T - t}{T}$, we obtain the functions $MSE(t)$ and $MSE_S(t) \approx \mathrm{Var}(S_T | S_t)$ depicted in Figure 9 and Figure 10, respectively.
To tackle this problem, we modeled the score as a non-homogeneous process, with $u_t$ equal to the fraction of goals in our sample scored after time t. To show the results with this modification, Figure 11 and Figure 12 replace the match time on the x axis with the modified time $1 - u_t$. The last plot showcases that the score variance (conditioned on $S_t$) was almost exactly proportional to $u_t$, and that our assumptions on the process $S_t$ were reasonable.
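The modified clock is simple to construct from the data; a sketch, with an assumed list of goal minutes in place of the real sample, follows:

```python
import numpy as np

def empirical_clock(goal_minutes):
    """u_t = fraction of the sample's goals scored strictly after time t."""
    goal_minutes = np.asarray(goal_minutes, dtype=float)
    return lambda t: float(np.mean(goal_minutes > t))

# hypothetical goal times (in minutes) standing in for the training sample
u = empirical_clock([10, 30, 50, 70, 89])
```

By construction $u_0 = 1$, $u_T = 0$, and the function is non-increasing, which is exactly what the time change in the stochastic model requires.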
Finally, Figure 13 plots the residues $C \cdot S_T - L - D(r_A - r_B)$ and compares them with the standard normal quantiles and with their respective rating differences $L + D(r_A - r_B)$. The distribution of the residues was close to normal, except for slightly thicker tails. The correlation between the residues and the rating differences was virtually zero compared with the one we saw for basketball.

4.3.3. Discrete Elo Model

Lastly, we implemented the discrete score model based on the Skellam process. As we showed, this is a particular case of an Elo model, and thus the same algorithm can be used to estimate the parameters K , L , and D. In this case, we used m = 12 and the function F X defined in Equation (37). To estimate the only non-Elo parameter of the model,
$H = 2\sqrt{\mu_1 \mu_2} = 2\sqrt{E[goals_{home}] \cdot E[goals_{away}]} = 2\sqrt{E[goals_{home} \cdot goals_{away}]}$
we used the sample mean of the product of the goals in our training sample, obtaining H 2.578 . The corresponding results are presented in Table 3.
Despite the distribution function being different, the mean squared error was very close to that of the standard Elo model with normal $F_X$. However, this model also predicts probabilities for the three possible results, allowing us to compute metrics like the log-loss [18].
For comparison, we implemented the rating system used by the Soccer Power Index (SPI) developed by Nate Silver [7], which uses two rating parameters for each team: one for their offense (their capacity to score goals) and another for their defense. A concise but complete explanation of the SPI system is given in Appendix C.
The SPI system achieved an MSE of 0.1518 versus 0.1533 for the discrete Elo system, and its squared errors were significantly lower, with a p-value of $5 \times 10^{-5}$. Its average log-loss at time zero was also better, with 1.411 beating the 1.432 of the discrete Elo model (p-value of $8 \times 10^{-6}$). However, both systems also give probabilities for the result being a victory, draw, or loss at any time t during the game, namely,
$P[p_A = 1] = P[Sk(\mu_1 u_t, \mu_2 u_t) + S_t > 0]$
$P[p_A = 1/2] = P[Sk(\mu_1 u_t, \mu_2 u_t) + S_t = 0]$
$P[p_A = 0] = P[Sk(\mu_1 u_t, \mu_2 u_t) + S_t < 0]$
where μ 1 and μ 2 are the expected goals by the home team (A) and the visiting team, respectively. Computing these for every time t, we obtained the results shown in Figure 14.
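These three probabilities map directly onto `scipy.stats.skellam`; the following sketch (with a hypothetical function name) evaluates them for a given remaining clock and current margin:

```python
from scipy.stats import skellam

def midgame_probs(mu1, mu2, u_t, s_t):
    """Win/draw/loss probabilities for the home side, given the remaining clock u_t
    and the current goal difference s_t, as in the display above."""
    rem = skellam(mu1 * u_t, mu2 * u_t)   # goals still to come, home minus away
    p_draw = rem.pmf(-s_t)                # the remaining difference cancels the lead
    p_win = 1.0 - rem.cdf(-s_t)           # S_t + remainder > 0
    p_loss = rem.cdf(-s_t) - p_draw       # S_t + remainder < 0
    return p_win, p_draw, p_loss
```

Note that at $s_t = 0$ and equal rates the win and loss probabilities coincide by symmetry, and that the three values always sum to one.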
Note that in the log-loss curve, the Elo system beat SPI in the second half. For instance, at t = 72 min, the log-losses were 0.857 for SPI versus 0.841 for the discrete Elo model, with a p-value of $2 \times 10^{-7}$ showing that the Elo system achieved a significantly smaller loss. This suggests that, despite being a one-parameter model of relative simplicity, the discrete Elo system is on par with SPI in terms of predictive power when it comes to mid-game predictions (conditioned on $S_t$).

5. Conclusions

In this work, it was shown how the model underlying the Elo system had a natural extension for fixed-duration sports with which it was possible to model the score of each team. The proposed stochastic Elo system can be easily implemented for league competitions using the algorithm in Section 4.1, which in turn uses the estimator defined in Section 2.3, and supported by the results proved in Appendix A.
A discrete model for soccer was also proposed, which was also an Elo system at time t = 0 and inherited its robustness and simple update formula, but modeled the goal difference as a discrete process, thus providing probabilities for each possible outcome of a match. Moreover, we saw that this system had accuracy comparable with that of a biparametric rating system like SPI across the duration of a soccer game. This suggests that the simplicity of the original Elo system may favor it as an alternative to sport-specific models, and its use will continue to increase in this context.

Future Work

These results certainly apply to games or sports with a scoreboard and a fixed duration in time, but they could be replicated for other sports with different scoring systems, like table tennis, where the game ends when the first player reaches a score of 21, regardless of how long it takes. It is possible that an Elo system can be derived from these types of "race" processes, as we did for our Skellam process.
The concept of predicting the result of a match mid-game could also be applied to chess, where there is no objective score S t but it is common to obtain an “evaluation” from the state of the board, using chess engines. Some chess websites store extremely large databases of computer-evaluated games in which a study of this kind could be performed.
Another question we did not consider when optimizing the parameters of the Elo system, such as K or L, is whether the distribution function F X can also be inferred from a large enough sample of games. Obviously, the space of distribution functions is not finite-dimensional, but a method for choosing F X has not been proposed, even among a finite-dimensional family of distributions.
Finally, there are games (such as checkers) for which there is a concept of perfect play, i.e., a strategy that guarantees a victory or a draw. We can easily see that an expected result above 1 / 2 implies an upper bound in ratings from the original model, but the Elo system does not have an upper bound built in. However, some models derived from race processes not only produce odds for a win, draw, or loss but also have a maximum or minimum rating associated with perfect play.
For instance, given ratings $r_1, r_2 \in \mathbb{R}_{\ge 0}$, we consider Poisson processes $P_1(t)$ and $P_2(t)$ with rates $r_1$ and $r_2$, and arrival times $T_1(n)$ and $T_2(n)$. Suppose player 1 wins when $T_2(k) < \min\{T, T_1(k)\}$, player 2 wins when $T_1(k) < \min\{T, T_2(k)\}$, and they tie when $\min\{T_1(k), T_2(k)\} > T$, for fixed $k \in \mathbb{Z}_{>0}$ and $T \in \mathbb{R}_{>0}$. Then, the strength of a player decreases with their rate r, but if $r_1 = 0$, player 1 will never lose, and zero is a lower bound for ratings. To the best of our knowledge, these types of systems have not been studied computationally at all.
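Such a race is easy to simulate. In the sketch below (an illustration only), arrival times are drawn as Gamma variables, since the k-th arrival of a rate-r Poisson process is Gamma(k, 1/r) distributed, and a rating of zero indeed never loses:

```python
import numpy as np

def race_outcome(r1, r2, k, T, rng):
    """Simulate one 'race': the k-th arrival times of two Poisson processes with
    rates r1, r2; reaching k arrivals first (before the horizon T) *loses*."""
    t1 = rng.gamma(k, 1.0 / r1) if r1 > 0 else np.inf   # k-th arrival ~ Gamma(k, 1/r)
    t2 = rng.gamma(k, 1.0 / r2) if r2 > 0 else np.inf
    if t2 < min(T, t1):
        return 1        # player 1 wins
    if t1 < min(T, t2):
        return -1       # player 2 wins
    return 0            # tie: neither process reaches k arrivals before T

rng = np.random.default_rng(0)
outcomes = [race_outcome(0.0, 2.0, 3, 1.0, rng) for _ in range(500)]
```

With $r_1 = 0$, the first process never produces arrivals, so the simulated sample contains wins and ties for player 1 but no losses, matching the lower-bound argument above.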

Author Contributions

Conceptualization, G.G.-A.; methodology, G.G.-A. and J.T.R.; software, G.G.-A.; validation, G.G.-A.; formal analysis, G.G.-A.; investigation, G.G.-A.; resources, G.G.-A.; data curation, G.G.-A.; writing—original draft preparation, G.G.-A.; writing—review and editing, J.T.R.; visualization, G.G.-A.; supervision, J.T.R.; project administration, J.T.R.; funding acquisition, J.T.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Government of Spain grant number PID2021-122905NB-C21.

Data Availability Statement

The raw data of soccer and basketball games was obtained from https://www.sports-reference.com (accessed on 19 July 2024) according to the data use policy in https://www.sports-reference.com/data_use.html (accessed on 19 July 2024). The full computer code for our computational study, as well as the files containing the basketball and soccer data used, can be downloaded at https://github.com/gonzalogomezabejon/StochasticElo (accessed on 19 July 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Proofs Regarding the Static Rating Estimators

In this appendix, we characterize the existence of rating estimators $\hat R$ that give zero errors $\varepsilon_i(\hat R, M)$ for a sample of matches M (assuming $F_X$ is continuous), and prove the uniqueness of $\hat R$ (up to adding a constant). We also prove that the vector $\varepsilon(R, M)$ is a descent direction for the problem $\min\{\|\varepsilon(R, M)\|_2^2 : R \in \mathbb{R}^n\}$.
Recall that our sample is given by $M = \{(a_k, b_k, p_k^a, p_k^b) : k = 1, \dots, |M|\}$, where $a_k \in \{1, \dots, n\}$ is the "home" player or team, $b_k \in \{1, \dots, n\}$ is the "visiting" player, and $p_k^a = p_{a_k,k}$ is the score of player $a_k$, that is, one if $a_k$ wins and zero if $a_k$ loses, or an intermediate value like $\frac{1}{2}$ if there is a draw. Conversely, the score of $b_k$ is $p_k^b = 1 - p_k^a$.
We want to characterize the existence of the static estimator, that is, a vector of ratings $\hat R = (\hat r_1, \hat r_2, \dots, \hat r_n)$ that verifies, for every player j,
$0 = \varepsilon_j(\hat R, M) := \sum_{k : j = a_k} \big( p_k^a - E[p_k^a | \hat R] \big) + \sum_{k : j = b_k} \big( p_k^b - E[p_k^b | \hat R] \big) = \sum_{k \in M_j} \big( p_{j,k} - E[p_{j,k} | \hat R] \big)$
where $M_j := \{1 \le k \le |M| : j = a_k \text{ or } j = b_k\}$ and $p_{a_k,k} = p_k^a$, $p_{b_k,k} = p_k^b$. For the expectation, we use $E[p_k^a | R] = F_X(r_{a_k} - r_{b_k} + L)$, where L is a non-negative constant (L = 0 is the standard symmetric case). Finally, we define a digraph $G_M = (\{1, \dots, n\}, A_M)$, where $i \to j \in A_M$ if and only if, for some k, $\{i, j\} = \{a_k, b_k\}$ and $p_{i,k} > 0$. We want to show the following:
Theorem A1
(Existence). For any sample M with $G_M$ weakly connected (connected as an undirected graph), there exists an estimator $\hat R$ with $\varepsilon(\hat R, M) = 0$ if and only if $G_M$ is strongly connected (that is, for every pair of players i, j, there is a directed (i,j)-path).
Theorem A2
(Uniqueness). If there are two estimators $R = (r_1, \dots, r_n)$ and $R' = (r'_1, \dots, r'_n)$ such that $\varepsilon_j(M, R) = \varepsilon_j(M, R') = 0$ for all $j = 1, \dots, n$, then $r_1 - r'_1 = r_2 - r'_2 = \dots = r_n - r'_n$.
Lemma A1.
$\sum_{j=1}^n \varepsilon_j(R, M) = 0$
Proof. 
$\sum_{j=1}^n \varepsilon_j(R, M) = \sum_{j=1}^n \sum_{k \in M_j} \big( p_{j,k} - E[p_{j,k} | R] \big) = \sum_{k=1}^{|M|} \big( p_{a_k,k} + p_{b_k,k} - E[p_{a_k,k} | R] - E[p_{b_k,k} | R] \big) = \sum_{k=1}^{|M|} (1 - 1) = 0$
Lemma A2.
If the players are partitioned into two sets $\{1, \dots, n\} = I \cup J$ with $I \cap J = \emptyset$, then
$\sum_{i \in I} \varepsilon_i(R, M) - \sum_{j \in J} \varepsilon_j(R, M) = 2 \sum_{a_k \in I,\, b_k \in J} \big( p_{a_k,k} - E[p_{a_k,k} | R] \big) - 2 \sum_{b_k \in I,\, a_k \in J} \big( p_{a_k,k} - E[p_{a_k,k} | R] \big)$
Proof. 
We sort the games of M into the sets $M = M_{II} \cup M_{IJ} \cup M_{JI} \cup M_{JJ}$, where the set $M_{PQ} := \{1 \le k \le |M| : a_k \in P, b_k \in Q\}$ contains the indices of the matches where the first player is in the set P and the second in Q. Then, by Lemma A1,
$\sum_{i \in I} \varepsilon_i(R, M) - \sum_{j \in J} \varepsilon_j(R, M) = 2 \sum_{i \in I} \varepsilon_i(R, M) = 2 \sum_{i \in I} \sum_{k \in M_i} \big( p_{i,k} - E[p_{i,k} | R] \big) =$
$= 2 \sum_{i \in I} \sum_{k \in M_i \cap M_{II}} \big( p_{i,k} - E[p_{i,k} | R] \big) + 2 \sum_{i \in I} \sum_{k \in M_i \cap M_{IJ}} \big( p_{i,k} - E[p_{i,k} | R] \big) + 2 \sum_{i \in I} \sum_{k \in M_i \cap M_{JI}} \big( p_{i,k} - E[p_{i,k} | R] \big)$
but since $M_{II}$ only contains games between players in I, again by Lemma A1,
$\sum_{i \in I} \sum_{k \in M_i \cap M_{II}} \big( p_{i,k} - E[p_{i,k} | R] \big) = \sum_{i \in I} \varepsilon_i(R, M_{II}) = 0$
and finally, since every game in M I J has exactly one player in I,
$\sum_{i \in I} \sum_{k \in M_i \cap M_{IJ}} \big( p_{i,k} - E[p_{i,k} | R] \big) = \sum_{k \in M_{IJ}} \big( p_{a_k,k} - E[p_{a_k,k} | R] \big) = \sum_{a_k \in I,\, b_k \in J} \big( p_{a_k,k} - E[p_{a_k,k} | R] \big)$
and we use the same argument in M J I to obtain
$\sum_{i \in I} \sum_{k \in M_i \cap M_{JI}} \big( p_{i,k} - E[p_{i,k} | R] \big) = \sum_{k \in M_{JI}} \big( p_{b_k,k} - E[p_{b_k,k} | R] \big) = -\sum_{b_k \in I,\, a_k \in J} \big( p_{a_k,k} - E[p_{a_k,k} | R] \big)$
as we wanted (since $p_{a_k,k} + p_{b_k,k} = 1 = E[p_{a_k,k} | R] + E[p_{b_k,k} | R]$).   □
Proof of Theorem A1.
(⇒) If the static estimator $\hat R$ exists but $G_M$ is not strongly connected, then there are players i and j such that there is no (i,j)-path in $G_M$. In that case, we define $I := \{1 \le p \le n : \exists\ (i,p)\text{-path in } G_M\}$ as the set of players reachable from i in $G_M$, and $J := \{1, \dots, n\} \setminus I$ as the set of players not reachable from i. By construction, in a game with $a_k \in I$ and $b_k \in J$, if $p_{a_k,k} > 0$, we would have an edge from $a_k \in I$ to $b_k \in J$, which contradicts the definition of I and J, so $p_{a_k,k} = 0$.
Note that $i \in I$ and $j \in J$, so the sets are not empty, and since $G_M$ is weakly connected, there must be a game between a player in I and another in J; let us assume w.l.o.g. that $(a_l, b_l) \in I \times J$. In that case, by Lemma A2,
$0 = \sum_{i \in I} \varepsilon_i(\hat R, M) - \sum_{j \in J} \varepsilon_j(\hat R, M) = 2 \sum_{a_k \in I,\, b_k \in J} \big( p_{a_k,k} - E[p_{a_k,k} | \hat R] \big) - 2 \sum_{b_k \in I,\, a_k \in J} \big( p_{a_k,k} - E[p_{a_k,k} | \hat R] \big) =$
$= 2 \sum_{a_k \in I,\, b_k \in J} \big( 0 - F_X(\hat r_{a_k} - \hat r_{b_k} + L) \big) - 2 \sum_{b_k \in I,\, a_k \in J} \big( 1 - F_X(\hat r_{a_k} - \hat r_{b_k} + L) \big) \le 2 \big( 0 - F_X(\hat r_{a_l} - \hat r_{b_l} + L) \big) < 0$
since $F_X(x) \in (0, 1)$ for any $x \in \mathbb{R}$, which is not possible.
(⇐) On the other hand, suppose $G_M$ is strongly connected. Then, we show that $\hat R$ exists by proving a stronger proposition: if we fix $r_1, r_2, \dots, r_{n-m}$, there are $\hat r_{n-m+1}, \dots, \hat r_n \in \mathbb{R}$ such that for every $j \in \{n-m+1, \dots, n\}$, we have $\varepsilon_j((r_1, \dots, \hat r_n), M) = 0$. To do this, we show by induction on m that there exist continuous functions $\phi_m : \mathbb{R}^{n-m} \to \mathbb{R}^m$ for $1 \le m \le n-1$ such that $\varepsilon_h((r, \phi_m(r)), M) = 0$ for any $r \in \mathbb{R}^{n-m}$ and $h \ge n-m+1$, and also that every component of $\phi_m(\cdot)$ is non-decreasing in each of its arguments (that is, $r \le r' \Rightarrow \phi_m(r) \le \phi_m(r')$). For brevity, we write $\varepsilon_i(r)$ instead of $\varepsilon_i(r, M)$.
First, we note that since $G_M$ is weakly connected, $\varepsilon_j((r_1, r_2, \dots, r_n))$ is strictly decreasing in $r_j$, but non-decreasing in any other rating $r_i$ for $i \ne j$:
$r_j > r'_j \Rightarrow \varepsilon_j((r_1, \dots, r_j, \dots, r_n)) = \sum_{a_k = j} \big( p_k^a - F_X(r_j - r_{b_k} + L) \big) + \sum_{b_k = j} \big( p_k^b - F_X(r_j - r_{a_k} - L) \big) <$
$< \sum_{a_k = j} \big( p_k^a - F_X(r'_j - r_{b_k} + L) \big) + \sum_{b_k = j} \big( p_k^b - F_X(r'_j - r_{a_k} - L) \big) = \varepsilon_j((r_1, \dots, r'_j, \dots, r_n))$
since one of the sums is non-empty and $F_X$ is strictly increasing. From the same expression, we can infer that $r_i \ge r'_i \Rightarrow \varepsilon_j((r_1, \dots, r_i, \dots, r_n)) \ge \varepsilon_j((r_1, \dots, r'_i, \dots, r_n))$.
For the base case of the induction, m = 1, we can show that the decreasing function $f : x \mapsto \varepsilon_n((r_1, \dots, r_{n-1}, x))$ has a unique zero, because
$f(x) = \sum_{k : a_k = n} \big( p_k^a - F_X(x - r_{b_k} + L) \big) + \sum_{k : b_k = n} \big( p_k^b - F_X(x - r_{a_k} - L) \big) \xrightarrow{x \to \infty} \sum_{k \in M_n} (p_{n,k} - 1) < 0$
(since $G_M$ is strongly connected, player n scores less than one in some match k) and similarly,
$f(x) = \sum_{k : a_k = n} \big( p_k^a - F_X(x - r_{b_k} + L) \big) + \sum_{k : b_k = n} \big( p_k^b - F_X(x - r_{a_k} - L) \big) \xrightarrow{x \to -\infty} \sum_{k \in M_n} p_{n,k} > 0$
and f is obviously continuous ($F_X$ is continuous), so applying the intermediate value theorem allows us to define $\phi_1((r_1, \dots, r_{n-1}))$ as the unique zero of f. To see that $\phi_1$ is non-decreasing in every coordinate of its argument, we check that
$r \ge r' = (r'_1, \dots, r'_{n-1}) \Rightarrow \varepsilon_n(r, \phi_1(r)) = 0 = \varepsilon_n(r', \phi_1(r')) \le \varepsilon_n(r, \phi_1(r')) \Rightarrow \phi_1(r) \ge \phi_1(r')$
and the last implication is a consequence of ε n being strictly decreasing in r n .
To see that $\phi_1$ is continuous, we note that for any convergent sequence $\{r^k\}_{k \in \mathbb{N}}$ with $r^* = \lim_k r^k$, we have
$\lim_k \varepsilon_n(r^*, \phi_1(r^k)) = \lim_k \varepsilon_n(r^k, \phi_1(r^k)) = \lim_k 0 = 0 = \varepsilon_n(r^*, \phi_1(r^*))$
and since $\varepsilon_n((r^*, \cdot)) = f(\cdot)$ is strictly monotone and continuous, and therefore bijective onto its image, $\phi_1(r^*) = \lim_k \phi_1(r^k)$.
Now, for the induction step: suppose $\phi_{m-1}$ exists and is non-decreasing with respect to every coordinate, and $m < n$. For a fixed $r = (r_1, r_2, \dots, r_{n-m})$, we show that the continuous function $f : x \mapsto \varepsilon_{n-m+1}((r, x, \phi_{m-1}((r, x))))$, which inherits continuity from $F_X$ and $\phi_{m-1}$, is strictly decreasing.
Suppose $x > x'$. Then, by our induction hypothesis, $\phi_{m-1}((r, x)) \ge \phi_{m-1}((r, x'))$. Let us define $I := \{i : (r, x, \phi_{m-1}(r, x))_i > (r, x', \phi_{m-1}(r, x'))_i\}$ and $J := \{1, \dots, n\} \setminus I = \{j : (r, x, \phi_{m-1}(r, x))_j = (r, x', \phi_{m-1}(r, x'))_j\}$, and note that $1 \in J$ since $\dim(r) = n - m \ge 1$, and $n - m + 1 \in I$. Note also that $i \in I$ implies $i = n - m + 1$ or $i > n - m + 1$, and in the latter case $\varepsilon_i(r, x, \phi_{m-1}(r, x)) = 0$ by definition of $\phi_{m-1}$. Using that fact, and then Lemmas A1 and A2,
$f(x) = \varepsilon_{n-m+1}(r, x, \phi_{m-1}(r, x)) = \sum_{i \in I} \varepsilon_i(r, x, \phi_{m-1}(r, x)) =$
$= \frac{1}{2} \Big( \sum_{i \in I} \varepsilon_i(R) - \sum_{j \in J} \varepsilon_j(R) \Big) = \sum_{a_k \in I,\, b_k \in J} \big( p_{a_k,k} - E[p_{a_k,k} | R] \big) - \sum_{a_k \in J,\, b_k \in I} \big( p_{a_k,k} - E[p_{a_k,k} | R] \big)$
where $R = (r, x, \phi_{m-1}(r, x))$ and, analogously, $R' = (r, x', \phi_{m-1}(r, x'))$. Therefore,
$f(x) - f(x') = \sum_{a_k \in I,\, b_k \in J} \big( E[p_{a_k,k} | R] - E[p_{a_k,k} | R'] \big) - \sum_{a_k \in J,\, b_k \in I} \big( E[p_{a_k,k} | R] - E[p_{a_k,k} | R'] \big) < 0$
Wait: for the first sum, $R_i > R'_i$ and $R_j = R'_j$ would make each term positive; the inequality instead follows by regrouping, since at least one of the sums is non-empty ($G_M$ is connected) and, for any game between $i \in I$ and $j \in J$, $F_X(R'_i - R'_j \pm L) - F_X(R_i - R_j \pm L) = F_X(R'_i - R_j \pm L) - F_X(R_i - R_j \pm L) < 0$, as $R'_j = R_j$ and $R'_i < R_i$.
We also check that f(x) approaches a negative number as x goes to infinity. Since $\phi_{m-1}(r, x)$ is non-decreasing in x, its coordinates either have a finite limit as x goes to infinity or they also go to infinity (by the monotone convergence theorem), and we can define
$I := \{i : \lim_{x \to \infty} (r, x, \phi_{m-1}(r, x))_i = \infty\} \quad \text{and} \quad J := \{j : \lim_{x \to \infty} (r, x, \phi_{m-1}(r, x))_j \in \mathbb{R}\}$
for which, again, n m + 1 I and 1 J , and as before,
$f(x) = \sum_{a_k \in I,\, b_k \in J} \big( p_{a_k,k} - E[p_{a_k,k} | R] \big) - \sum_{a_k \in J,\, b_k \in I} \big( p_{a_k,k} - E[p_{a_k,k} | R] \big) \xrightarrow{x \to \infty} \sum_{a_k \in I,\, b_k \in J} (p_{a_k,k} - 1) - \sum_{a_k \in J,\, b_k \in I} (p_{a_k,k} - 0) < 0$
since $G_M$ is connected and therefore one of the sums is non-empty. An analogous argument shows that $\lim_{x \to -\infty} f(x) > 0$, and thus, for any given r, f(x) has a unique zero, which we denote by $g(r)$. With that, we can define $\phi_m := (g(r), \phi_{m-1}((r, g(r))))$.
Now, we can show that $g(\cdot)$ is non-decreasing, because if $r \ge r'$, then by the induction hypothesis, $\phi_{m-1}(r, g(r')) \ge \phi_{m-1}(r', g(r'))$, and hence
$\varepsilon_{n-m+1}(r, g(r), \phi_{m-1}(r, g(r))) = 0 = \varepsilon_{n-m+1}(r', g(r'), \phi_{m-1}(r', g(r'))) \le \varepsilon_{n-m+1}(r, g(r'), \phi_{m-1}(r, g(r')))$
which implies (by the fact that $x \mapsto \varepsilon_{n-m+1}((r, x, \phi_{m-1}((r, x))))$ is strictly decreasing and continuous, i.e., bijective onto its image) that $g(r) \ge g(r')$.
This in turn implies that $\phi_m$ is non-decreasing, by composition of g and $\phi_{m-1}$, and to conclude the induction, we only need to check that g is continuous. This is similar to the base case, using that for any convergent sequence $\{r^k\}_{k \in \mathbb{N}}$ with $r^* = \lim_k r^k$, we have
$\lim_k \varepsilon_{n-m+1}(r^*, g(r^k), \phi_{m-1}(r^*, g(r^k))) = \lim_k \varepsilon_{n-m+1}(r^k, g(r^k), \phi_{m-1}(r^k, g(r^k))) = 0 = \varepsilon_{n-m+1}(r^*, g(r^*), \phi_{m-1}(r^*, g(r^*)))$
and invoking again that $f^* : x \mapsto \varepsilon_{n-m+1}((r^*, x, \phi_{m-1}((r^*, x))))$ is bijective onto its image.
Our induction works for $m = 1, \dots, n-1$, and to complete the proof (for m = n), we consider $\hat R = (0, \phi_{n-1}(0))$. By definition of $\phi_{n-1}$, we know that $\varepsilon_2(\hat R) = \varepsilon_3(\hat R) = \dots = \varepsilon_n(\hat R) = 0$, and by Lemma A1, $\varepsilon_1(\hat R) = -\sum_{i=2}^n \varepsilon_i(\hat R) = 0$, as we wanted.   □
Proof of Theorem A2 (uniqueness).
Let us suppose there are two rating vectors $R = (r_1, \dots, r_n)$ and $Q = (q_1, \dots, q_n)$ such that $\varepsilon(R, M) = \varepsilon(Q, M) = 0$. The residues $\varepsilon$ are the same for the normalized ratings with mean zero, $R - \frac{1}{n}\sum_{i=1}^n r_i$ and $Q - \frac{1}{n}\sum_{i=1}^n q_i$, so we may assume $\sum_{i=1}^n r_i = \sum_{i=1}^n q_i = 0$. Therefore, if $R - Q$ is not constant, $R \ne Q$ after this normalization, and there are $i, j \in \{1, \dots, n\}$ such that $r_i < q_i$ and $r_j > q_j$. In that case, let $I := \{1 \le i \le n : r_i < q_i\}$ and $J := \{1 \le j \le n : r_j \ge q_j\}$.
$$0 = \sum_{i \in I} \varepsilon_i(R, M) - \sum_{j \in J} \varepsilon_j(R, M) - \sum_{i \in I} \varepsilon_i(Q, M) + \sum_{j \in J} \varepsilon_j(Q, M) =$$
$$= 2\sum_{k \in M_{I \to J}} \big[(p_{a_k,k} - E[p_{a_k,k}|R]) - (p_{a_k,k} - E[p_{a_k,k}|Q])\big] - 2\sum_{k \in M_{J \to I}} \big[(p_{a_k,k} - E[p_{a_k,k}|R]) - (p_{a_k,k} - E[p_{a_k,k}|Q])\big] =$$
$$= 2\sum_{k \in M_{I \to J}} \big[F_X(q_{a_k} - q_{b_k} + L) - F_X(r_{a_k} - r_{b_k} + L)\big] + 2\sum_{k \in M_{J \to I}} \big[F_X(r_{a_k} - r_{b_k} + L) - F_X(q_{a_k} - q_{b_k} + L)\big]$$
but every term in the sums is positive, because for any players $(i, j) \in I \times J$, we have $r_i - q_i < 0 \le r_j - q_j \Rightarrow r_i - r_j < q_i - q_j \Rightarrow F_X(r_i - r_j + L) < F_X(q_i - q_j + L)$, and similarly, $F_X(r_j - r_i + L) > F_X(q_j - q_i + L)$. At least one of the sums is non-empty (otherwise $I$ and $J$ would be disconnected in the undirected $G_M$), so one of the two sums is strictly positive and the other is non-negative, which is a contradiction. Therefore, $\varepsilon(R, M) = 0 = \varepsilon(Q, M)$ forces the normalized vectors to coincide, and $R$ must be equal to $Q$ plus a constant.   □
Theorem A3.
$\varepsilon(R, M)$ is a descent direction for the problem $\min_{R \in \mathbb{R}^n} \|\varepsilon(R, M)\|_2^2$.
Proof. 
We want to show that $\nabla\big(\|\varepsilon(R, M)\|_2^2\big)^T \varepsilon(R, M) < 0$ for $\varepsilon \ne 0$. First, the partial derivatives of $\varepsilon_i$ are, for $i \ne j$,
$$\frac{\partial \varepsilon_i(R, M)}{\partial r_j} = \frac{\partial}{\partial r_j}\left[\sum_{k: a_k = i} \big(p_{i,k} - F_X(r_i - r_{b_k} + L)\big) + \sum_{k: b_k = i} \big(p_{i,k} - F_X(r_i - r_{a_k} - L)\big)\right] =$$
$$= \sum_{k: a_k = i,\, b_k = j} f_X(r_{a_k} - r_{b_k} + L) + \sum_{k: a_k = j,\, b_k = i} f_X(r_{b_k} - r_{a_k} - L) = \sum_{k \in M_i \cap M_j} f_X(r_{a_k} - r_{b_k} + L)$$
since f X is symmetric around zero, and for i = j ,
$$\frac{\partial \varepsilon_j(R, M)}{\partial r_j} = \frac{\partial}{\partial r_j}\left[\sum_{k: a_k = j} \big(p_{j,k} - F_X(r_j - r_{b_k} + L)\big) + \sum_{k: b_k = j} \big(p_{j,k} - F_X(r_j - r_{a_k} - L)\big)\right] = -\sum_{k \in M_j} f_X(r_{a_k} - r_{b_k} + L)$$
so the scalar product we want is
$$\nabla\big(\|\varepsilon\|_2^2\big)^T \varepsilon = \sum_{i=1}^n \varepsilon_i(R, M) \cdot \frac{\partial}{\partial r_i} \|\varepsilon(R, M)\|_2^2 = \sum_{i=1}^n \sum_{j=1}^n 2\,\varepsilon_i(R, M)\,\varepsilon_j(R, M)\,\frac{\partial \varepsilon_j(R, M)}{\partial r_i} =$$
$$= -2\sum_{i=1}^n \varepsilon_i(R, M)^2 \sum_{k \in M_i} f_X(r_{a_k} - r_{b_k} + L) + 4\sum_{i < j} \varepsilon_i(R, M)\,\varepsilon_j(R, M) \sum_{k \in M_i \cap M_j} f_X(r_{a_k} - r_{b_k} + L) =$$
$$= 2\sum_{k=1}^{|M|} \big(2\,\varepsilon_{a_k}(R, M)\,\varepsilon_{b_k}(R, M) - \varepsilon_{a_k}(R, M)^2 - \varepsilon_{b_k}(R, M)^2\big)\, f_X(r_{a_k} - r_{b_k} + L) =$$
$$= -2\sum_{k=1}^{|M|} \big(\varepsilon_{a_k}(R, M) - \varepsilon_{b_k}(R, M)\big)^2 f_X(r_{a_k} - r_{b_k} + L) \le 0$$
If $\varepsilon(R, M) \ne 0$, by Lemma A1 its components sum to zero, so there must be strictly positive and strictly negative components of $\varepsilon(R, M)$; by weak connectedness of $G_M$, some path of games joins two such components, so at least one pairing $(a_k, b_k)$ of $M$ has $\varepsilon_{a_k} \ne \varepsilon_{b_k}$. The corresponding term $-2(\varepsilon_{a_k} - \varepsilon_{b_k})^2 f_X(r_{a_k} - r_{b_k} + L)$ is strictly negative, giving $\nabla\big(\|\varepsilon\|_2^2\big)^T \varepsilon < 0$.   □
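Theorem A3 justifies fitting the ratings numerically by simply stepping along the residual vector itself. The following is a minimal sketch of this idea; the logistic choice of $F_X$, the fixed step size, and the toy match list are illustrative assumptions of ours, not taken from the paper.

```python
import math

def F_X(x):
    """Logistic CDF: one common, symmetric choice of expected-score curve."""
    return 1.0 / (1.0 + math.exp(-x))

def residuals(R, matches, L=0.0):
    """eps_i = sum over i's games of (observed score - expected score)."""
    eps = [0.0] * len(R)
    for a, b, p_a in matches:          # a = home index, b = away index
        e_a = F_X(R[a] - R[b] + L)     # expected score of the home side
        eps[a] += p_a - e_a
        eps[b] += (1.0 - p_a) - (1.0 - e_a)
    return eps

def fit_ratings(n, matches, step=0.5, iters=2000):
    """Step along eps(R, M), a descent direction for ||eps||_2^2 (Theorem A3)."""
    R = [0.0] * n
    for _ in range(iters):
        R = [r + step * e for r, e in zip(R, residuals(R, matches))]
    return R

# Toy league with mixed results so that a zero of eps exists:
# (home, away, score of the home team in [0, 1])
games = [(0, 1, 1.0), (1, 2, 0.5), (0, 2, 0.5), (2, 0, 1.0)]
R = fit_ratings(3, games)
eps = residuals(R, games)  # numerically zero at the fixed point
```

Because each game contributes opposite residuals to its two players, the mean rating is preserved along the iteration, which pins down the additive constant left free by the uniqueness theorem.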

Appendix B. Proof That the Discrete Model Is an Elo Model

Here, we prove that the discrete model proposed in Section 3.4 is a particular type of Elo model, or in other words, that if $E[p_A] = P[S(\mu_1, \mu_2) > 0] + \frac{1}{2} P[S(\mu_1, \mu_2) = 0]$, where $S(a, b)$ denotes the difference of two independent Poisson variables with means $a$ and $b$, for
$$\mu_1 = \frac{1}{2}\left(\Delta r + \sqrt{\Delta r^2 + H_2}\right) \quad \text{and} \quad \mu_2 = \frac{1}{2}\left(-\Delta r + \sqrt{\Delta r^2 + H_2}\right)$$
there exists a continuous distribution $X$ with domain $\mathbb{R}$ for which $E[p_A] = F_X(\Delta r)$.
Proof. 
We denote
$$F(x) := P[S(\mu_1(x), \mu_2(x)) > 0] + \tfrac{1}{2} P[S(\mu_1(x), \mu_2(x)) = 0] =$$
$$= P\left[S\left(\frac{x + \sqrt{x^2 + H_2}}{2}, \frac{-x + \sqrt{x^2 + H_2}}{2}\right) > 0\right] + \frac{1}{2} P\left[S\left(\frac{x + \sqrt{x^2 + H_2}}{2}, \frac{-x + \sqrt{x^2 + H_2}}{2}\right) = 0\right]$$
Here, we assume $H_2 > 0$, since $\frac{1}{4} H_2 = \mu_1 \mu_2$ is the expectation of a non-negative quantity (the product of the goals scored by each team). This means that $\mu_1$ is strictly increasing in $\Delta r$, and $\mu_2 = \frac{H_2}{4 \mu_1}$ is strictly decreasing in $\Delta r$.
On the other hand, $g_k(a, b) := P[S(a, b) \ge k]$ is increasing in $a$ and decreasing in $b$, because for any $\epsilon > 0$, we have
$$g_k(a + \epsilon, b) = P[S(a + \epsilon, b) \ge k] = P[S(a, b) + \mathcal{P}(\epsilon) \ge k] > P[S(a, b) \ge k] = g_k(a, b)$$
$$g_k(a, b) = P[S(a, b) \ge k] > P[S(a, b) - \mathcal{P}(\epsilon) \ge k] = g_k(a, b + \epsilon)$$
where $\mathcal{P}(\epsilon)$ is an independent Poisson variable with mean $\epsilon$.
These two propositions imply that $F(\Delta r) = \frac{1}{2} P[S(\mu_1, \mu_2) \ge 1] + \frac{1}{2} P[S(\mu_1, \mu_2) \ge 0] = \frac{g_1(\mu_1, \mu_2) + g_0(\mu_1, \mu_2)}{2}$ is strictly increasing in $\mu_1$ and decreasing in $\mu_2$, and therefore increasing as a function of $\Delta r$. To finish, we only need to show that $E[p_A]$ goes to 1 as $\Delta r$ approaches $+\infty$, and to 0 when $\Delta r$ goes to $-\infty$, since any monotone function with that asymptotic behavior is a distribution function.
As Δ r goes to infinity, μ 1 goes to infinity and μ 2 goes to zero, so
$$F(\Delta r) = P[\mathcal{P}(\mu_1) > \mathcal{P}(\mu_2)] + \tfrac{1}{2} P[\mathcal{P}(\mu_1) = \mathcal{P}(\mu_2)] \ge P[\mathcal{P}(\mu_1) > 0 = \mathcal{P}(\mu_2)] =$$
$$= (1 - e^{-\mu_1})\, e^{-\mu_2} \xrightarrow[\Delta r \to \infty]{} (1 - 0) \cdot e^{0} = 1$$
and of course, $F(\Delta r) = \frac{1}{2}\big(P[S(\mu_1, \mu_2) \ge 1] + P[S(\mu_1, \mu_2) \ge 0]\big) \le 1$, so by the sandwich rule,
$$F(\Delta r) \xrightarrow[\Delta r \to \infty]{} 1$$
A similar argument shows that when Δ r goes to ,
$$F(\Delta r) \le P[\mathcal{P}(\mu_1) > 0] + \tfrac{1}{2} P[\mathcal{P}(\mu_2) = 0] + \tfrac{1}{2} P[\mathcal{P}(\mu_1) > 0] \le$$
$$\le \tfrac{3}{2}(1 - e^{-\mu_1}) + \tfrac{1}{2} e^{-\mu_2} \xrightarrow[\Delta r \to -\infty]{} \tfrac{3}{2}(1 - e^{0}) + \tfrac{1}{2} \cdot 0 = 0$$
and F is bounded below by zero, so both limits hold. Therefore, F is the distribution function of some variable X.   □
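The distribution function $F$ above has no simple closed form, but it can be evaluated and checked numerically against the monotonicity and limit properties just proved. A sketch using truncated Poisson sums; the value of $H_2$ and the truncation bound are illustrative choices of ours.

```python
import math

def poisson_pmf(lam, k):
    return math.exp(-lam) * lam ** k / math.factorial(k)

def skellam_probs(mu1, mu2, trunc=60):
    """P[S > 0] and P[S = 0] for S = P(mu1) - P(mu2), by truncated sums."""
    p1 = [poisson_pmf(mu1, k) for k in range(trunc)]
    p2 = [poisson_pmf(mu2, k) for k in range(trunc)]
    gt = sum(p1[i] * p2[j] for i in range(trunc) for j in range(i))
    eq = sum(p1[i] * p2[i] for i in range(trunc))
    return gt, eq

def expected_score(dr, H2=2.0):
    """F(dr) = P[S(mu1, mu2) > 0] + P[S(mu1, mu2) = 0] / 2, with the means
    mu1 = (dr + sqrt(dr^2 + H2)) / 2 and mu2 = (-dr + sqrt(dr^2 + H2)) / 2,
    so that mu1 - mu2 = dr and mu1 * mu2 = H2 / 4."""
    root = math.sqrt(dr * dr + H2)
    gt, eq = skellam_probs((dr + root) / 2, (-dr + root) / 2)
    return gt + 0.5 * eq

# F(0) = 1/2 by symmetry; F is increasing and tends to 0 and 1 in the tails
vals = [expected_score(dr) for dr in (-6.0, -2.0, 0.0, 2.0, 6.0)]
```

Evaluating `expected_score` on a grid of rating differences reproduces the S-shaped expected-score curve of a standard Elo model.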

Appendix C. Implementation of the Rating System from SPI

The SPI rating system that we used as a benchmark for our discrete model is described in [7,10]. The system uses two rating parameters for each team instead of only one, namely $OFF_A$, which is higher the more goals team A is likely to score, and $DEF_A$, which is higher the more goals A tends to allow its rivals to score.
In a match between A and B, in which A scores G A goals and B scores G B , we define the Adjusted Goals Scored of A as
$$AGS_A := (G_A - DEF_B)\,\frac{0.424 \cdot AVG_{BASE} + 0.548}{\max(0.25,\ 0.424 \cdot DEF_B + 0.548)} + AVG_{BASE}$$
where $AVG_{BASE}$ is the average number of goals scored by each team in a game. Similarly, the Adjusted Goals Allowed of A is defined as
$$AGA_A := (G_B - OFF_B)\,\frac{0.424 \cdot AVG_{BASE} + 0.548}{\max(0.25,\ 0.424 \cdot OFF_B + 0.548)} + AVG_{BASE}$$
and both quantities are similarly defined for B. The model then states that E [ A G S A ] = O F F A and E [ A G A A ] = D E F A . From this, given some starting values for both parameters for both teams, we can infer A G S A , A G A A , A G S B , and A G A B for each game, and correct the parameters by
$$OFF_A' = \lambda\, AGS_A + (1 - \lambda)\, OFF_A \qquad DEF_A' = \lambda\, AGA_A + (1 - \lambda)\, DEF_A$$
$$OFF_B' = \lambda\, AGS_B + (1 - \lambda)\, OFF_B \qquad DEF_B' = \lambda\, AGA_B + (1 - \lambda)\, DEF_B \tag{A6}$$
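One SPI update for a single match can be sketched as follows. The function and variable names, the sample ratings, and the league average are ours; the constants are those quoted above.

```python
def adjusted_goals(goals, opp_rating, avg_base):
    """Adjusted goals: opp_rating is DEF_B when adjusting goals scored
    (giving AGS_A) and OFF_B when adjusting goals allowed (giving AGA_A)."""
    scale = (0.424 * avg_base + 0.548) / max(0.25, 0.424 * opp_rating + 0.548)
    return (goals - opp_rating) * scale + avg_base

def spi_update(off_rating, def_rating, ags, aga, lam):
    """One application of the update formulas (A6) for one team."""
    return lam * ags + (1 - lam) * off_rating, lam * aga + (1 - lam) * def_rating

# Hypothetical match: A beats B 2-1; ratings and league average are made up
AVG_BASE = 1.3
OFF_A, DEF_A, OFF_B, DEF_B = 1.4, 1.1, 1.2, 1.0
ags_a = adjusted_goals(2, DEF_B, AVG_BASE)  # adjust with B's defensive rating
aga_a = adjusted_goals(1, OFF_B, AVG_BASE)  # adjust with B's offensive rating
OFF_A, DEF_A = spi_update(OFF_A, DEF_A, ags_a, aga_a, lam=0.02)
```

Scoring more (adjusted) goals than the current $OFF_A$ nudges the offensive rating up, and conceding fewer than $DEF_A$ nudges the defensive rating down, exactly as $K$ moves a single Elo rating.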
Since $AGA_A$ and $AGS_A$ are linear in $G_B$ and $G_A$, respectively, we can obtain the expectations of $G_A$ and $G_B$ from them as follows:
$$E[G_A] = f_{A,B} := (OFF_A - AVG_{BASE})\,\frac{\max(0.25,\ 0.424 \cdot DEF_B + 0.548)}{0.424 \cdot AVG_{BASE} + 0.548} + DEF_B$$
$$E[G_B] = g_{B,A} := (DEF_A - AVG_{BASE})\,\frac{\max(0.25,\ 0.424 \cdot OFF_B + 0.548)}{0.424 \cdot AVG_{BASE} + 0.548} + OFF_B$$
There are two issues with this: First, $g_{A,B}$ and $f_{B,A}$ should also match $E[G_A]$ and $E[G_B]$, respectively, but the two estimates do not necessarily coincide; in practice, we used $E[G_A] = \frac{f_{A,B} + g_{A,B}}{2}$ (and analogously for $E[G_B]$) when estimating the win and draw odds in our code. The second issue is that the resulting value of $E[G_A]$ or $E[G_B]$ could be negative, and in that case, we impute $10^{-6}$ as the expected goals of that team.
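The inversion formulas and the two fixes just described can be sketched as follows. The function names are ours, and the $10^{-6}$ floor is the imputation described above; one can also check algebraically that, in the unadjusted formulas, the two estimates coincide whenever the $\max(0.25, \cdot)$ floor is inactive, so the averaging only matters when the floor binds.

```python
def f_goals(off_team, def_opp, avg_base):
    """f_{A,B}: expected goals from the scorer's OFF and the opponent's DEF."""
    scale = max(0.25, 0.424 * def_opp + 0.548) / (0.424 * avg_base + 0.548)
    return (off_team - avg_base) * scale + def_opp

def g_goals(def_opp, off_team, avg_base):
    """g_{A,B}: the same expectation, obtained by inverting the AGA formula."""
    scale = max(0.25, 0.424 * off_team + 0.548) / (0.424 * avg_base + 0.548)
    return (def_opp - avg_base) * scale + off_team

def expected_goals(off_team, def_opp, avg_base):
    """Average the two estimates and floor negative values at 1e-6."""
    fa = f_goals(off_team, def_opp, avg_base)
    ga = g_goals(def_opp, off_team, avg_base)
    return max(1e-6, 0.5 * (fa + ga))

AVG_BASE = 1.3                        # hypothetical league average
fa = f_goals(1.4, 1.0, AVG_BASE)
ga = g_goals(1.0, 1.4, AVG_BASE)      # coincides with fa: floor inactive here
low = expected_goals(-5.0, 0.0, AVG_BASE)  # artificial inputs hit the floor
```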
Finally, in order to account for the home field advantage, supposing A and B play on A’s field, we modify the formulas above as follows:
$$AGS_A := (G_A - DEF_B)\,\frac{0.424 \cdot AVG_{BASE} + 0.548}{\max(0.25,\ 0.424 \cdot DEF_B + 0.548)} + AVG_{AWAY}$$
$$AGA_A := (G_B - OFF_B)\,\frac{0.424 \cdot AVG_{BASE} + 0.548}{\max(0.25,\ 0.424 \cdot OFF_B + 0.548)} + AVG_{HOME}$$
where $AVG_{HOME}$ is the average number of goals scored by home teams in our dataset, and $AVG_{AWAY}$ is the average number of goals scored by visiting teams.
The formulas for $f_{A,B} = (OFF_A - AVG_{AWAY})\,\frac{\max(0.25,\ 0.424 \cdot DEF_B + 0.548)}{0.424 \cdot AVG_{BASE} + 0.548} + DEF_B$, as well as $f_{B,A}$, $g_{A,B}$, and $g_{B,A}$, are modified accordingly.
Finally, in order to use this model on the same dataset as the Elo system, we used the same algorithm as in Section 4.1, replacing the unbiased estimator for ratings by 50 iterations of the update formulas (A6) from the arbitrary starting point $OFF_i = DEF_i = 1$, which quickly converged to a fixed point.
The only parameter of the model is the update sensitivity $\lambda$ in (A6), which plays a similar role to $K$ in the Elo system; we also introduced a dampening parameter $D$ so that the result predictions were calculated with $D \cdot OFF_A$, $D \cdot DEF_A$, $D \cdot OFF_B$, and $D \cdot DEF_B$ instead of the original parameters. In our training dataset for soccer, we found the minimum MSE at $\lambda^* \approx 0.02$ and $D \approx 0.9$.

References

  1. Elo, A.E. The Rating of Chessplayers, Past and Present, 2nd ed.; Arco Publishing Inc.: New York, NY, USA, 1986. [Google Scholar]
  2. Aldous, D. Elo Ratings and the Sports Model: A Neglected Topic in Applied Probability? Stat. Sci. 2017, 32, 616–629. [Google Scholar] [CrossRef]
  3. Glickman, M.E. Paired Comparison Models with Time-Varying Parameters. Ph.D. Thesis, Harvard University, Cambridge, MA, USA, 1993. [Google Scholar]
  4. Stern, H.S. A brownian motion model for the progress of sports scores. J. Am. Stat. Assoc. 1994, 89, 1128–1134. [Google Scholar] [CrossRef]
  5. Song, K.; Shi, J. A gamma process based in-play prediction model for National Basketball Association games. Eur. J. Oper. Res. 2020, 283, 706–713. [Google Scholar] [CrossRef]
  6. Karlis, D.; Ntzoufras, I. Analysis of sports data by using bivariate Poisson models. Statistician 2003, 52 Part 3, 381–393. [Google Scholar] [CrossRef]
  7. A Guide to ESPN’s SPI Ratings. Available online: https://www.espn.com/world-cup/story/_/id/4447078/ce/us/guide-espn-spi-ratings (accessed on 14 July 2024).
  8. Cattelan, M.; Varin, C.; Firth, D. Dynamic Bradley–Terry modelling of sports tournaments. J. R. Stat. Soc. Ser. Appl. Stat. 2013, 62, 135–150. [Google Scholar] [CrossRef]
  9. Ryall, R. Predicting Outcomes in Australian Rules Football. Ph.D. Thesis, Royal Melbourne Institute of Technology University, Melbourne, Australia, 2011. [Google Scholar]
  10. How Our Club Soccer Predictions Work. Available online: https://fivethirtyeight.com/methodology/how-our-club-soccer-predictions-work/ (accessed on 14 July 2024).
  11. Jabin, P.E.; Junca, S. A Continuous Model For Ratings. SIAM J. Appl. Math. 2015, 75, 420–442. [Google Scholar] [CrossRef]
  12. Wilks, S.S. The large-sample distribution of the likelihood ratio for testing composite hypotheses. Ann. Math. Stat. 1938, 9, 60–62. [Google Scholar] [CrossRef]
  13. Steele, J.M. Stochastic Calculus and Financial Applications, 1st ed.; Springer: New York, NY, USA, 2010. [Google Scholar]
  14. Bondesson, L. Generalized Gamma Convolutions and Related Classes of Distributions and Densities; Lecture Notes in Stat. 76; Springer: New York, NY, USA, 1992. [Google Scholar]
  15. Cramér, H. Über eine Eigenschaft der normalen Verteilungsfunktion. Math. Z. 1936, 41, 405–414. [Google Scholar] [CrossRef]
  16. Skellam, J.G. The frequency distribution of the difference between two Poisson variates belonging to different populations. J. R. Stat. Soc. 1946, 109, 296. [Google Scholar] [CrossRef]
  17. Glickman, M.E.; Jones, A.C. Rating the Chess Rating System. Chance 1999, 12, 21–28. [Google Scholar]
  18. Ferri, C.; Hernandez-Orallo, J.; Modroiu, R. An Experimental Comparison of Performance Measures for Classification. Pattern Recognit. Lett. 2009, 30, 27–38. [Google Scholar] [CrossRef]
Figure 1. Expected score versus rating difference.
Figure 2. ROC curve for each of the Elo system implementations in the training set (a) and test set (b).
Figure 3. Observed expectation of $p_A$ vs. $r_A - r_B$, versus the predicted value $F_X(D(r_A - r_B) + L)$.
Figure 4. Mean squared error as a function of time for $C > C^*$ (a) and $C < C^*$ (b).
Figure 5. Tests for the model residues $C \cdot S_T - L - D(r_A - r_B)$, plotted against the Elo difference (a) and the standard normal quantiles (b).
Figure 6. Score prediction variance $MSE_S(t)$ vs. match time.
Figure 7. Average increase in combined score vs. match time.
Figure 8. Average number of goals each minute.
Figure 9. Mean squared error as a function of time for $C > C^*$ (a) and $C < C^*$ (b).
Figure 10. Score prediction variance $MSE_S(t)$ vs. match time.
Figure 11. Mean squared error as a function of $1 - u^t$ for $C > C^*$ (a) and $C < C^*$ (b).
Figure 12. Score prediction variance $MSE_S(t)$ vs. modified time $1 - u^t$.
Figure 13. Tests for the model residues $C \cdot S_T - L - D(r_A - r_B)$, plotted against the Elo difference (a) and the standard normal quantiles (b).
Figure 14. Evolution of model mean squared error (a) and log-loss (b) vs. modified time $1 - u^t$.
Table 1. Mean squared error of different Elo models vs. sample variance.

| $K^*$ | $L^*$ | MSE | MSE test |
|---|---|---|---|
| No Elo | −53.78 | 0.23876 | 0.24394 |
| 0 | −60.89 | 0.26498 | 0.29590 |
| 14.71 | 0 | 0.22333 | 0.22308 |
| 16.19 | −60.74 | 0.21179 | 0.21744 |
Table 2. Mean squared error of different Elo models vs. sample variance.

| $K^*$ | $L^*$ | $D^*$ | MSE | MSE test |
|---|---|---|---|---|
| No Elo | 48.11 | – | 0.17881 | 0.18188 |
| 0 | 52.12 | 1 | 0.17248 | 0.19252 |
| 9.90 | 0 | 1 | 0.16346 | 0.16136 |
| 10.80 | 52.68 | 1 | 0.15420 | 0.15396 |
| 11.82 | 52.50 | 0.874 | 0.15387 | 0.15341 |
Table 3. Mean squared error of restricted Elo models vs. sample variance.

| $K^*$ | $L^*$ | $D^*$ | MSE | MSE test |
|---|---|---|---|---|
| No Elo | 0.56115 | – | 0.17881 | 0.18188 |
| 0 | 0.61525 | 1 | 0.17256 | 0.19258 |
| 0.11702 | 0 | 1 | 0.16348 | 0.16141 |
| 0.12888 | 0.61560 | 1 | 0.15424 | 0.15397 |
| 0.14781 | 0.62004 | 0.86536 | 0.15389 | 0.15334 |