*3.2. Reinforcement Learning*

Reinforcement learning (RL) has its origins in psychology and artificial intelligence and is used in many fields, including neuroscience [34,35]. The formulation we use here is as follows:

Consider a vector of attractions $\{RLA_i^j\}_{j=1}^{m_i}$. If strategy $s_i^j$ is chosen, the payoff experienced by agent $i$ is added to $RLA_i^j$. In this way, strategies that turn out well (yield a high payoff) have their attraction increased, so they are more likely to be played in the future. After strategy profile $s(t)$ is chosen at time $t$ and payoffs are awarded, the new vector of attractions is:



$$RLA_{i}^{j}(t) = \phi\, RLA_{i}^{j}(t-1) + I\left(s_{i}^{j}, s_{i}(t)\right) \cdot \pi_{i}\left(s_{i}^{j}, s_{-i}(t)\right) \qquad \forall j = 1, \ldots, m_{i} \tag{5}$$

This is the same basic model of accumulated attractions as proposed by Harley [36] and Roth and Erev [37]. The first term on the right-hand side, $\phi RLA_i^j(t-1)$, captures the waning influence of past attractions. $I(s_i^j, s_i(t))$ denotes the indicator function, which equals 1 when $s_i^j = s_i(t)$ and 0 otherwise. All attractions other than the one corresponding to the selected strategy $s_i(t)$ therefore tend toward zero, assuming that the single global decay factor $\phi$ is not too large (i.e., $\phi < 1$). The one countervailing force is the payoff $\pi_i$, which plays a role only for the selected strategy (as indicated by the indicator function). This version of reinforcement learning is a cumulative weighted RL, since payoffs accumulate in the attractions of chosen strategies. In a simplified setting where payoffs are weakly positive, the process can be thought of as a set of leaky buckets (with leak rate $1-\phi$), one for each strategy, in which water is poured into the bucket corresponding to the chosen strategy in proportion to the payoff received.<sup>8</sup> A strategy is then chosen with probability proportional to the amount of water in its bucket.
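To make the mechanics concrete, the following is a minimal Python sketch of the update in Equation (5) together with the proportional ("bucket") choice rule, assuming weakly positive payoffs; the function and variable names are illustrative and not taken from any estimation code used in the paper.

```python
import numpy as np

def update_attractions(attractions, chosen_index, payoff, phi):
    """One step of the cumulative weighted RL update in Equation (5):
    every attraction decays by the factor phi, and only the chosen
    strategy's attraction is reinforced by the realised payoff."""
    attractions = phi * attractions          # waning influence of past attractions
    attractions[chosen_index] += payoff      # indicator: only the chosen strategy is reinforced
    return attractions

def choice_probabilities(attractions):
    """Proportional ('bucket') choice rule: each strategy is chosen with
    probability equal to its share of total attraction (assumes weakly
    positive attractions, as in the leaky-bucket analogy)."""
    total = attractions.sum()
    if total <= 0:
        return np.full(attractions.size, 1.0 / attractions.size)
    return attractions / total
```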

There are simpler forms of RL, in which $\phi$ is fixed at 1 rather than estimated. Chmura et al. [9] estimate this simpler model, and it performs worse than our modified model in explaining the data. We fit Equation (5) to the data in order to make conservative comparisons with the CBL model.
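For illustration, the simpler model corresponds to calling the hypothetical update function sketched above with $\phi = 1$ (no decay), whereas Equation (5) allows $\phi$ to be estimated:

```python
attractions = np.ones(3)   # an agent with m_i = 3 strategies, uniform initial attractions

# Simpler cumulative RL: phi fixed at 1, so past payoffs never decay.
attractions = update_attractions(attractions, chosen_index=1, payoff=2.0, phi=1.0)

# Equation (5) with an estimated decay factor, e.g. phi = 0.9.
attractions = update_attractions(attractions, chosen_index=1, payoff=2.0, phi=0.9)
probs = choice_probabilities(attractions)   # choice probabilities for the next round
```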
