**5. Measuring Goodness of Fit**

Following Chmura et al. [9], we use a quadratic scoring rule to assess the goodness of fit of each learning model.<sup>11</sup> This rule, given in Equation (13), measures how close the predicted choice is to the observed choice.<sup>12</sup>

$$q\_i(t) = 2p\_i(t) - p\_i(t)^2 - (1 - p\_i(t))^2 \tag{13}$$

The quadratic score, *q<sub>i</sub>*(*t*), is a function of the predicted probability, *p<sub>i</sub>*(*t*), that action *i* is chosen in period *t*, where the prediction is derived from the fitted parameters of each learning model. The score equals 1 minus the sum of squared deviations between the predicted probabilities and the actual choice across the two actions.

The range of *q<sub>i</sub>*(*t*) is [−1, 1]. If a learning model predicts the data perfectly, then *p<sub>i</sub>*(*t*) = 1, which implies *q<sub>i</sub>*(*t*) = 1. By contrast, a completely uninformative learning model would, in our two-action setting, be right half the time, so *p<sub>i</sub>*(*t*) = 0.5, which implies *q<sub>i</sub>*(*t*) = 0.5.
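Equation (13) and the endpoints above can be checked directly. A minimal Python sketch (the function name is ours, for illustration):

```python
def quadratic_score(p: float) -> float:
    """Quadratic score of Equation (13): q_i(t) = 2p - p^2 - (1 - p)^2,
    where p is the predicted probability of the action actually chosen."""
    return 2 * p - p**2 - (1 - p) ** 2

# Endpoints discussed in the text:
print(quadratic_score(1.0))  # perfect prediction -> 1.0
print(quadratic_score(0.5))  # uninformative model -> 0.5
print(quadratic_score(0.0))  # worst case -> -1.0
```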

We employ the quadratic scoring rule to assess the goodness of fit of each learning model in two ways. First, we estimate parameters on the entire playing history of all subjects, use the best-fitting parameters to compute predicted probabilities across that history, and calculate the mean quadratic score for each learning model. Second, we employ a rolling-forward out-of-sample procedure: we fit each model on the first *X*% of the data, use the fitted parameters to predict choices in the holdout sample of the remaining (100 − *X*)% of the data, and calculate the mean quadratic score over those out-of-sample observations. We repeat this for *X* = 40, 50, 60, 70, and 80, predicting choice on the remaining 60%, 50%, 40%, 30%, and 20% of the data, respectively. The in-sample method is a standard way to judge goodness of fit: it asks how well a model explains individual choice over the whole dataset. The out-of-sample method guards against over-fitting, although, to be valid, it assumes that parameters are stationary across the in-sample and holdout data. Because over-fitting is a concern with any learning model, the out-of-sample procedure is our preferred benchmark for choosing which model explains the data best.
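The rolling-forward procedure can be sketched as follows. Here `fit_model` and `prob_of_choice` are hypothetical interfaces standing in for whichever learning model is being evaluated, and the toy frequency model at the bottom is our illustration, not one of the paper's learning models:

```python
def quadratic_score(p):
    """Quadratic score (Equation (13)) for the predicted probability p
    of the action actually chosen."""
    return 2 * p - p**2 - (1 - p) ** 2

def mean_oos_score(choices, fit_model, prob_of_choice, train_frac):
    """Fit on the first train_frac of the data, then return the mean
    quadratic score over the holdout sample."""
    split = int(len(choices) * train_frac)
    params = fit_model(choices[:split])
    scores = [quadratic_score(prob_of_choice(params, c))
              for c in choices[split:]]
    return sum(scores) / len(scores)

# Toy stand-in model: predict a constant probability equal to the
# training-sample frequency of action 1 (illustrative only).
fit = lambda train: sum(train) / len(train)
prob = lambda freq, choice: freq if choice == 1 else 1 - freq

choices = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
for frac in (0.4, 0.5, 0.6, 0.7, 0.8):
    print(frac, round(mean_oos_score(choices, fit, prob, frac), 3))
```

The same holdout loop applies to each learning model; only the fitting and prediction interfaces change.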

In estimating CBL, we use the information available to subjects to define the problem set P. In our main specification, the information (i.e., problem) vector has two elements. The first element is the round of the game (i.e., time). The round plays the role of recency, or forgetting, in other learning models: cases distant in the past are less similar to present circumstances than cases that happened more recently. The second element is opponents' play in the game. We account for other players' actions using a four-period moving average of past play, treating all opponents as a representative player, just as we do for the surprise index in self-tuning EWA. For example, a row player would use the moving average of how often their opponents played Left as a component of similarity; as opponents trend toward different frequencies of playing Left, CBL puts less weight on those cases, C. There are many possible ways to construct these information vectors, and we explore alternatives in Appendix B (Table A1). We find that these choices do not have a large effect on the performance of CBL.
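To make the two-element problem vector concrete, the following sketch shows one way round distance and the opponents' moving average could be combined into similarity weights. The Gaussian kernel and the bandwidths `h_time` and `h_play` are assumptions of this illustration, not the paper's estimated similarity function:

```python
import math

MEMORY = 15     # cases more than 15 periods old are dropped (Section 7.1)
MA_WINDOW = 4   # four-period moving average of opponents' play

def moving_average(opp_play, window=MA_WINDOW):
    """Share of the last `window` periods in which the pooled
    (representative) opponent played, e.g., Left."""
    recent = opp_play[-window:]
    return sum(recent) / len(recent)

def similarity_weights(current_round, current_ma, cases,
                       h_time=5.0, h_play=0.2):
    """Illustrative similarity over the two-element problem vector.
    Each case is a (round, opponent_moving_average) pair; weights decay
    with distance in both elements and are normalized to sum to 1."""
    in_memory = [(r, ma) for r, ma in cases
                 if 0 < current_round - r <= MEMORY]
    weights = [math.exp(-(((current_round - r) / h_time) ** 2
                          + ((current_ma - ma) / h_play) ** 2))
               for r, ma in in_memory]
    total = sum(weights)
    return [w / total for w in weights]
```

Under this sketch, a case from two rounds ago with a matching moving average receives more weight than one from ten rounds ago, and cases outside the 15-period memory window drop out entirely.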

We include cases as much as 15 periods in the past in memory (we explore the sensitivity of this assumption in Section 7.1).

<sup>11</sup> The quadratic scoring rule was introduced by Brier [38] to measure performance in weather forecasting. This scoring rule is also described in Selten [39].

<sup>12</sup> Other measures of goodness of fit generally yield the same qualitative results, but the ordering of preferred learning models can be reversed by log-likelihood when model fits are relatively close. We prefer the quadratic scoring rule and use it throughout.
