1. Introduction
In many situations, the ability to learn becomes important for survival: how fast and effectively individuals explore their surroundings and learn how to react to them determines their survival chances. Some environments and situations might require specific skills or rapid adaptations, whereas others might just require general awareness. Learning and adaptation are remarkable features of life in natural environments that are prone to frequent changes. The process of learning can be considered as a continuous multidimensional process where several skills can be learned in parallel. However, reducing the time step to a very small interval, we focus on learning only one skill at a time. In this study, we introduce a model of multidimensional stepwise learning by assuming that the learning process can be subdivided into smaller time steps during which we only learn one skill.
Evolutionary games were proposed to answer questions of the effects of natural selection on populations. Here, a game happens at a higher level: the game is not played directly by individuals but rather by strategies or skills competing for survival [1,2]. In such settings, learning the best and most effective strategy first becomes critical for individuals, especially under resource limitation [3]. Learning in evolutionary settings often assumes that learning evolves in parallel with the evolution itself [4]. In this paper, we consider the case when the system is perturbed after reaching its stable equilibrium. Finding the most efficient learning path to return to the previous equilibrium is a challenge. We suggest looking at this problem from the evolutionary perspective, where an optimal learning path is dictated by the chance of survival. This is achieved through maximisation of fitness depending on the learning path an individual takes.
Behavioural stochasticity is a popular object of study for game theorists. First, the concept of “trembling hands” [5] was suggested as an approach to model players’ mistakes, made with some small probability, during the execution of their strategies. In addition, mutations can be interpreted as reactions to environmental changes (see, for instance, [6,7,8] for studies on mutations). In artificial intelligence and economics this idea is of particular interest, since individuals can be imperfect and incapable of executing their strategies (examples of such studies are [9,10,11]). However, we do not restrict our analysis to mistakes that are random noise but, instead, allow for biased mistakes and assume that skills can be learnt at different rates. For this, we utilise the notion of incompetence, which means that individuals make mistakes with certain probabilities; it was first introduced in [12,13] and extended in [14]. Hence, learning is defined as a process of improving competence and reducing the probabilities of mistakes. Here, we propose the notion of prioritised learning, in which the priority in the order is determined by the skills’ relative advantages. That is, we are interested in the mechanism of discovering an optimal order according to which skills should be learned. This mechanism relies on the interplay between relative fitness and learning advantages and on the notion of prioritised learning, which aims to balance out these differences between the strategies.
We focus on the question of the order in which species should learn their skills to have an evolutionary advantage. The process of re-learning the effective strategy can be considered as a continuous multidimensional problem. However, reducing the time step to a very small interval, we focus on only one skill. We consider the case of the coexistence of two strategies: to cooperate and to defect. Mathematically, this can be described by the Snowdrift (also known as Chicken or Hawk-Dove) game, where two strategies interact and both appear in the behavioural traits of individuals [15].
The question of how cooperation evolves in communities is a rich field of study for mathematicians and biologists. Enforcement, punishment and reciprocity are mechanisms that can sustain stable cooperation [16,17,18,19]. The Prisoners’ Dilemma is a well-known example of the problem of cooperation [20,21]. It is a strict game where sustaining cooperation becomes a challenge. However, it can be relaxed if we assume that the benefit gained from cooperation can be received by both players and the costs shared (see [22] for a review). The resulting game has the form of the Snowdrift game, where a stable mixed equilibrium exists [23]. We point out that the focus of our study is not the evolution of cooperative mechanisms as such. Learning to cooperate is a fruitful field offering many interesting results [24,25,26]. However, here we are more interested in understanding how the system should optimally evolve back to an equilibrium state once it has been disturbed. Specifically, we shall focus on a Snowdrift-type game in order to determine which of the strategies (cooperation or defection) should be learnt first.
When speaking of the optimality of a learning path, we expect that the benefit from this strategy has an impact on the overall population’s fitness [27]. We note that the learning and replicator timescales are decoupled. In fact, we require that the learning timescale is much slower than the reproduction timescale. This can be seen in view of a behavioural adaptation that might take a longer time to happen. In classic settings of the replicator dynamics, the dynamics’ time scale is equal to the reproduction time scale. That is, every time step is the length of one generation. However, when modelling the behaviour of individuals, this reproduction time scale represents an interaction time scale. Therefore, a behavioural change might take several interactions to be achieved. Hence, naturally, we assume that individuals learn more slowly than they interact. For example, in the incompetent version of the Snowdrift game, individuals go through multiple interactions which may change their behaviour from cooperating to defecting.
We show that measuring the fitness over the learning path depends on the order of learning along the path and the extent to which strategies are learnt. Our results demonstrate that it might be preferable to learn a skill prone to higher probabilities of mistakes first and leverage its learning advantage. This suggests that, under environmental or other changes, those strategies that are most disrupted must be adapted more quickly if they are to survive at all. If two skills are equivalent in their relative strategic advantages, we show that both skills are also equivalent with respect to the order of learning. Counter-intuitively, we show that these relative advantages can still be identical even if the relative fitness advantages of the skills are significantly different, suggesting that evolution is trying to balance out mistakes. We conjecture that not only the fitness of the skill has to be taken into account but also the degree of incompetence.
In the following section we set up the model and define the notions of relative fitness, learning and strategic advantages. Then, in Section 3, we define a fitness-over-learning objective function measuring how the fitness of the population improves over the learning path taken. After that, we proceed to two cases: (a) the case when skills are identical in their strategic advantage and no prioritised learning is needed, in Section 4, and (b) the case when skills are distinct in their advantages, in Section 5. These two cases demonstrate the notion of hierarchy in the learning order and how mistakes affect the evolution.
2. Learning Setup
Typically, a population of species acquires a set of skills that they are required to learn either while being young or while adapting to new environmental conditions. Let us suppose that this set consists of only two essential skills, both of which are needed for survival (or for stable coexistence in the population). Hence, we consider two specific strategies that need to be learnt: cooperation and defection. In the evolutionary context, defection can be interpreted as aggression to capture (rather than share) a resource such as food or territory. We utilise the notion of replicator dynamics in order to describe the evolution of interactions among individuals in the population [28,29]. Let $p(t)$ be the frequency of cooperators at (evolutionary) time $t$. Then, the dynamics can be expressed as
$$\dot{p} = p\,(f_C - \phi),$$
where $f_C$ is the fitness of cooperators, $f_D$ is the fitness of defectors and $\phi$ is the mean fitness function defined as $\phi = p f_C + (1-p) f_D$, since $1-p$ is the fraction of defectors. For the purpose of this paper, we use a linear form of the fitness functions, as in [14], which simplifies to
$$\dot{p} = p\,(1-p)\,(f_C - f_D).$$
We note that if both cooperation and defection strategies are required to some extent, then both strategies might be required to coexist at equilibrium. Such an equilibrium usually characterises the Snowdrift game, which is also referred to as an anti-coordination game. Since each strategy is the best response to the opposite strategy, both strategies coexist at equilibrium, hence securing some stable level of cooperation. We construct a reward matrix $R$ for such a game as
$$R = \begin{pmatrix} B - \frac{C}{2} & B - C \\ B & 0 \end{pmatrix},$$
where $B$ is the benefit and $C$ is the cost of cooperation. In order to simplify our analysis, we apply the linear transformation from [30] and subtract the diagonal elements from the corresponding columns, as this does not affect the behaviour of the dynamics. Then, we can consider a reduced form of the matrix given by
$$R = \begin{pmatrix} 0 & a \\ b & 0 \end{pmatrix},$$
where $a = B - C$ and $b = C/2$. As we want our game to be a Snowdrift game, we assume that $a, b > 0$, because then both strategies stably coexist at the equilibrium $p^*$ and their frequencies are given by
$$p^* = \frac{a}{a+b}, \qquad 1 - p^* = \frac{b}{a+b}.$$
Then, in the context of learning a set of skills, both skills will be required to be learned. The question we address is: which one should be learnt first?
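The coexistence claim can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes the reduced payoff matrix has a zero diagonal, with off-diagonal entries `a` (a cooperator meeting a defector) and `b` (a defector meeting a cooperator), and integrates the two-strategy replicator equation with a forward-Euler scheme. The function names, the step size `dt` and the values of `a` and `b` are illustrative assumptions.

```python
def replicator_step(p, a, b, dt):
    """One forward-Euler step of dp/dt = p (1 - p) (f_C - f_D)."""
    f_c = a * (1.0 - p)  # cooperators earn a only against defectors
    f_d = b * p          # defectors earn b only against cooperators
    return p + dt * p * (1.0 - p) * (f_c - f_d)


def simulate(p0, a, b, dt=0.01, steps=20000):
    """Iterate the Euler step from the initial cooperator frequency p0."""
    p = p0
    for _ in range(steps):
        p = replicator_step(p, a, b, dt)
    return p


def interior_equilibrium(a, b):
    """Frequency at which both strategies earn the same payoff."""
    return a / (a + b)


if __name__ == "__main__":
    a, b = 3.0, 1.0  # illustrative values with a, b > 0
    for p0 in (0.05, 0.5, 0.95):
        print(p0, simulate(p0, a, b), interior_equilibrium(a, b))
```

Under these assumptions, trajectories started from different interior frequencies should settle near `a / (a + b)`, which is the sense in which both skills remain necessary.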
In order to model learning these skills in the game, we utilise the notion of incompetence. This concept was first introduced for classic games in 2012 [12] and extended to evolutionary games in 2018 [14]. Here, individuals choose a strategy to play but have non-zero probabilities of executing a different strategy. Such mistakes are a manifestation of the incompetence of players. Mathematically, it is described by the matrix of incompetence, $Q$, that evolves as individuals are learning. This matrix consists of conditional probabilities $q_{ij}$ determining the probability of executing strategy $j$ given that strategy $i$ was chosen. The matrix has the form
$$Q = \begin{pmatrix} q_{11} & q_{12} \\ q_{21} & q_{22} \end{pmatrix}.$$
A schematic representation of the interaction under incompetence can be found in Figure 1A. As a measure of learning we use parameters $x$ and $y$ for each strategy. Thus, each strategy can be learned at a different pace, which sets this work apart from the existing literature (see [12,31]). Then, the matrix of incompetence parameters has the form
      
      where 
. We assume that strategies can be learnt and, hence, the incompetence can decrease from some initial level of propensity to make mistakes. Thus, the learning process is described by the equation
      
      where 
S represents the starting level of incompetence. If 
, then 
 and
      
Full competence corresponds to the case $x = y = 1$, where $Q(1,1) = I$, the identity matrix. Each strategy has its own measure of incompetence level and its own time needed for it to be mastered. Then, the new incompetent game reward matrix is defined as
      
Given the new reward matrix, we require some basic assumptions on the parameters in (3)–(5) to be satisfied in order for the game to avoid bifurcations in the replicator dynamics. Specifically:
If the parameters of the game do not meet these conditions, then there will be some values of the incompetence parameters for which the system (1) undergoes a bifurcation. This would lead to situations where one of the skills is not beneficial to learn because it is dominated. Hence, learning the beneficial skill would be an obvious answer to the question of which skill to learn first. However, we are interested in the case when the optimal learning path involves both skills.
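To make the role of the incompetence parameters concrete, the sketch below builds an incompetence matrix and the induced payoffs, then checks whether coexistence survives. It rests on assumptions that are not confirmed by the text above: the learning rule is taken to be $Q(x,y) = (I - \Lambda)S + \Lambda$ with $\Lambda = \mathrm{diag}(x, y)$, and the incompetent reward matrix is taken to be $Q R Q^{T}$. All function names and the numerical values of `R` and `S` are hypothetical.

```python
def incompetence_matrix(x, y, S):
    """Assumed learning rule Q(x, y) = (I - L) S + L with L = diag(x, y)."""
    (s11, s12), (s21, s22) = S
    return [
        [(1.0 - x) * s11 + x, (1.0 - x) * s12],
        [(1.0 - y) * s21, (1.0 - y) * s22 + y],
    ]


def matmul(A, B):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def transpose(A):
    """Transpose of a 2x2 matrix given as nested lists."""
    return [[A[j][i] for j in range(2)] for i in range(2)]


def incompetent_reward(x, y, S, R):
    """Assumed incompetent-game payoffs Q R Q^T."""
    Q = incompetence_matrix(x, y, S)
    return matmul(matmul(Q, R), transpose(Q))


def stable_coexistence(R):
    """Cooperator frequency of the stable interior fixed point, or None."""
    (r11, r12), (r21, r22) = R
    gain_c = r12 - r22  # payoff gain from cooperating against a defector
    gain_d = r21 - r11  # payoff gain from defecting against a cooperator
    if gain_c <= 0 or gain_d <= 0:
        return None  # no stable mixed state: the game left the Snowdrift regime
    return gain_c / (gain_c + gain_d)


if __name__ == "__main__":
    R = [[0.0, 3.0], [1.0, 0.0]]  # illustrative reduced Snowdrift payoffs
    S = [[0.9, 0.1], [0.2, 0.8]]  # illustrative starting incompetence
    grid = [i / 10 for i in range(11)]
    lost = [(x, y) for x in grid for y in grid
            if stable_coexistence(incompetent_reward(x, y, S, R)) is None]
    print("parameter pairs without coexistence:", lost)
```

Scanning a grid of $(x, y)$ values in this way is one simple check that a chosen pair of matrices keeps both strategies in play along every candidate learning path.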
3. Maximising Fitness over Learning
The optimality of a learning path can be determined in many ways. In the replicator dynamics with symmetric payoff matrices, interactions follow the evolutionary path that maximises the overall population’s fitness [32]. We focus on the fitness function at the equilibrium state, which implies that the population attains a steady state faster than the incompetence parameters change. Technically speaking, we assume either very long timescales of learning or else very fast convergence to steady state, or both. Hence, it is sufficient to consider the mean fitness function [29] of the Snowdrift game, which has the following form
      
	  Then, in our new incompetent game the mean fitness 
 can be shown to be
      
We shall note that in fact any nontrivial learning path will improve the mean fitness at the equilibrium. This follows from the fact that for any vector 
 with entries in 
 the following holds
      
      where 
 is a Hessian of 
. Since we would like to find an optimal learning path that maximises the fitness, it is convenient to consider a new re-scaled fitness function. For that, we choose
      
      where new parameters 
 and 
 are
      
An important note for further analysis is the understanding of fitness and learning advantages. Given that the relative fitness of each strategy is positive, that is, , we say that the strategy with a higher relative fitness obtains a fitness advantage. In addition, we define a learning advantage of a strategy in a very special way. This advantage arises when the strategy is accompanied by a higher probability of making a mistake in its execution. In such a case reducing the corresponding level of incompetence offers a greater opportunity for improvement, thereby constituting a learning advantage.
Next, note that the parameters 
 and 
 introduced in (8) are closely connected with the above definitions of fitness and learning advantages. Indeed, their difference allows us to capture the relative tradeoffs between these two types of advantages. Hence, we shall say that the new parameter 
 measures the relative strategic advantage of cooperators against defectors. That is, if 
, cooperators have an advantage. We illustrate these concepts in 
Table 1. Next, we shall demonstrate how this notion affects the optimal learning path with respect to the population’s mean fitness.
However, we need to take into account possible limitations of the learning pace. We assume that only one skill can be learnt to some degree and two skills cannot be learnt simultaneously. Thus, learning is achieved in a discrete manner (see 
Figure 1B). The question is what is the optimal learning order and what are the switching points. This will be determined by defining the learning path as a curve 
 on the learning space 
 such that it starts at 
 and ends at 
.
Definition 1. Define the learning space 
of an incompetence game  as the domain of the incompetence matrices  from (5) given by the set of all 2 × 2 stochastic matrices.  Definition 2. Define a learning path for an incompetence game as a curve  on the learning space  such that  and .
 A learning path can be a smooth curve or a stepwise path, depending on whether learning is continuous or discrete. We shall consider only stepwise learning paths. This is a natural restriction because, at a small time-scale, only one skill can be learned at any given time. However, we will also study the case when the number of steps $n \to \infty$, which approaches a smooth learning curve.
Definition 3. A stepwise learning path of order n, , is a stepwise curve in the learning space , connecting the n points , that satisfies
        
- (a)
- ,
             
- (b)
- ,
             
- (c)
- ,
             
- (d)
- .
             
 Conditions  and  imply  for . The path segment from  to  could consist of the following sequence , , , in which case we say that the x direction was taken. The alternative path segment is , , , indicating that the y direction was chosen first.
Definition 4. The set of all (alternating) stepwise curves of order n that satisfy conditions (a)–(d) is called the learning set of order n and is denoted by .
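The geometry of an alternating stepwise path can be sketched in a few lines. This is an illustration only: conditions (a)–(d) are not reproduced here, so the code assumes that the breakpoints are non-decreasing in each coordinate and run from 0 to 1 in the learning parameters $x$ and $y$. The names `stepwise_path` and `uniform_breakpoints` are hypothetical, and the uniform spacing is chosen for readability, not because it is optimal.

```python
def stepwise_path(xs, ys, first="x"):
    """Vertices of an alternating staircase through the points (xs[i], ys[i]).

    first="x": each segment moves in x, then in y.
    first="y": each segment moves in y, then in x.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("xs and ys need the same length, at least 2")
    if first not in ("x", "y"):
        raise ValueError("first must be 'x' or 'y'")
    for seq in (xs, ys):
        if seq[0] != 0.0 or seq[-1] != 1.0:
            raise ValueError("breakpoints are assumed to run from 0 to 1")
        if any(u > v for u, v in zip(seq, seq[1:])):
            raise ValueError("breakpoints are assumed to be non-decreasing")
    vertices = [(xs[0], ys[0])]
    for i in range(len(xs) - 1):
        if first == "x":
            corner = (xs[i + 1], ys[i])  # learn along x first
        else:
            corner = (xs[i], ys[i + 1])  # learn along y first
        vertices.append(corner)
        vertices.append((xs[i + 1], ys[i + 1]))
    return vertices


def uniform_breakpoints(n):
    """Equally spaced breakpoints 0, 1/n, ..., 1 (illustrative only)."""
    return [i / n for i in range(n + 1)]


if __name__ == "__main__":
    pts = uniform_breakpoints(2)
    print(stepwise_path(pts, pts, first="x"))
    print(stepwise_path(pts, pts, first="y"))
```

The two calls differ only in which corner of each rectangle is visited, which mirrors the statement that the first direction determines the remaining path.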
 Here, we focus on alternating stepwise paths where the first direction determines the remaining path. Consequently, an 
n-step learning path is described by the points 
 satisfying (a)–(d) above, resulting in two possible learning paths corresponding to the direction of the first step. Let 
 be the fitness-over-learning in the 
x direction for the learning path 
 given by
      
Let 
 be the fitness-over-learning in the 
y direction for the learning path 
 defined as
      
That is, for a given , we have two objective functions:  and . Finding their maxima separately yields an optimal learning path with the optimal direction. In what follows, we define the optimal learning paths in the x and y directions and the overall optimal learning path that maximises the population’s measure of fitness over learning.
Definition 5. The optimal learning path  with respect to the population’s mean fitness function  is such that it satisfies the equationwhere ,  are the optimal paths in the x and y direction, respectively. When , both  and  are optimal paths, where .  The superscript (respectively, subscript) indicates the direction of the path. That is, if the x direction is optimal, that is , then . Otherwise, if , then . When , both directions x and y are optimal and it does not matter which one we take. Hence, we can define prioritised learning in these settings.
Definition 6. We say that there exists prioritised learning for Φ among stepwise learning paths of order n, if there exists  such that one of the directions is preferable over the other. That is, .
 Given the structure of the fitness Functions (
7) and (
9)–(
10) we can explicitly derive 
 and 
 for the learning paths in the 
x and 
y directions. However, we will show that the optimal direction of learning can be determined simply by the sign of the relative strategic advantage, 
.
4. No Strategic Advantages
Throughout this section, we assume that no strategy has a relative strategic advantage ($\delta = 0$). We first note that in this case the two objective functions, for the x and the y direction, exhibit a symmetry relation.
Theorem 1. If $\delta = 0$, then there is no difference in the direction of optimal learning, that is, both directions attain the same optimal fitness-over-learning.
 Hence, if there is no relative strategic advantage in the game ($\delta = 0$), the order of learning does not affect the fitness of the population. It is therefore sufficient to calculate only one path that maximises the fitness-over-learning of the population. In the following proposition we show that this learning path has a remarkably simple form.
Proposition 1. If , then the unique optimal stepwise learning path of order n in the x direction,  is given by  See Mathematical 
Appendix A for the proofs of Theorem 1 and Proposition 1. Interestingly, the optimal solution for the 
x direction yields 
 such that 
. That is, each step in the 
y direction, 
 is greater than the step in the 
x direction, 
.
To analyse how increasing the number of steps is changing the objective function, we consider the rate of change of the fitness over learning function at the optimal solution, that is
      
After substituting the optimal solution (11) into (12) we obtain
      
	  Due to symmetry, when 
, it follows that 
.
Therefore, the smaller the learning steps, the greater the benefit. However, the marginal increases tend to 0 as n becomes large. Arguably, this illustrates the “law of diminishing returns” of stepwise learning.
Let us consider two following games, which will be called Examples 1 and 2: 
In terms of fitnesses, these examples are different: the fixed points of the replicator equations are 
 and 
, respectively. Hence, fitness of strategy 1 is higher in the second game. However, setting 
 and 
, we have relative strategic advantages in examples 1 and 2 both equal to 0 (
). Then, 
 and 
 equalise the strategies. The high probability of mistakes for strategy 2 in 
 signals that it is more disrupted by incompetence. This makes the optimal learning paths for these two games identical (see 
Figure 2A).
Next we shall consider the case when $\delta \neq 0$. In this case, one of the strategies has a relative strategic advantage, depending on the sign of $\delta$. We show that the order of the learning path is now important and influences the value of the fitness-over-learning function, which we call prioritised learning.
5. Prioritised Learning
First, we recall the notion of prioritised learning used henceforth. By Definition 6, there exists prioritised learning for Φ among stepwise learning paths of order n if one of the directions is preferable over the other along such a path. We shall next characterise an optimal solution in the x direction.
Proposition 2. Let . If n is such that , then the x direction is preferred and the optimal learning path of order n is given by  We refer the reader to Mathematical 
Appendix A for more details. Next, we provide conditions for which the 
y direction defines the optimal learning path.
Proposition 3. Let . If n is such that , then the y direction is preferred and the optimal learning path of order n is given by  Hence, depending on the value of 
, the optimal learning path has different directions. The threshold for 
 is equal to 
, which for sufficiently large number of steps is nearly 0. However, in the proof of Theorem 2 (see Mathematical 
Appendix A), we show that it is the sign of 
 that determines the direction of learning.
Theorem 2. The direction of the optimal learning path is determined by the sign of $\delta$: for $\delta > 0$ the y direction is optimal and for $\delta < 0$ the x direction is optimal.
 For 
 the optimal learning path represents 
 equally distributed steps and the direction of the first step does not affect the fitness-over-learning. Note that if 
, the first and last steps for each direction are
      
Thus, the first step of the learning path aims to adjust fitness and learning advantages between the two strategies. However, in the interior of the learning space, the path still takes 
 equally distributed steps. In 
Figure 3 we displayed the objective function 
 for different positive values of 
 and different number of steps 
n. The images below the colormap enhance the changes in 
 as the number of steps 
n varies. The difference in the values of 
 with respect to 
n is marginal, hence, below we zoom into several values of 
. For 
, 
 increases in 
n, suggesting that taking more steps is beneficial. However, 
 decreases for 
, suggesting one-step learning of the skill. We show this in the next result that follows immediately from Propositions 2, 3, and Theorem 2.
Corollary 1. For, we obtain two cases:
- (i) 
- If, then the optimal learning curve is a one-step function in the y direction.
             
- (ii) 
- If, then the optimal learning curve is a one-step function in the x direction. 
 If 
, then the optimal solution from (
13) and (
14) is only feasible for 
. However, if 
 is positive and less than 1, the greatest fitness-over-learning is achieved for a smooth learning path along the line 
. This can be seen as a consequence of allowing 
n to approach infinity.
Corollary 2. For, 
the optimal step-wise learning pathin the y direction as, 
follows the relation  The same relation between 
 and 
 can be obtained for the optimal solution in the 
x direction when 
 and 
. In that case, when we initiate learning in the 
x direction, we start with 
 for 
 and continue to follow the relationship 
 for 
. We demonstrate the effect of the relative strategic advantage on the optimal learning path, by considering the game with the fitness matrix 
 and the incompetence matrix 
, referred to as Example 3:
Strategy 2 obtains a learning advantage (
). We give strategy 1 a fitness advantage by selecting three values for 
a: 5, 7 and 9, which result in 
 and 0, respectively. Then, smaller values of 
a result in the larger first step (see 
Figure 2B).
The next natural question to ask would be what is the influence of the number of steps we make. Computing the exact form of 
 and 
 at the optimal solutions and taking their rate of change in 
n yields that
      
      which are positive for any 
. For 
 we can observe negative 
 identifying that the 
x direction is no longer preferred. This indicates that the bigger the difference between skills, the lower potential fitness-over-learning that can be gained. In this sense, the skill with higher incompetence might reduce the fitness as it requires some investments for the skills to be learnt.
Our learning scheme allows for an adjustment of individuals’ behaviour in case of any disruption leading to behavioural mistakes. The number of steps required for such an adjustment can be as low as 2, allowing for a quick reaction to the system’s uncertainty. Moreover, such an adjustment does not require the system to stop interactions for learning. Individuals can continue interacting with their group-mates while their behavioural mistakes are reduced and fitness is maximised.
6. Conclusions
In this paper, we considered the evolutionary game where two skills coexist in a mixed equilibrium, and hence are both required. This is a key assumption as we aimed to answer the question: If both strategies are important, then how do we learn them in an optimal way? We introduced a fitness-over-learning function which measures the improvement in fitness of the population over the learning path that was taken. This function relies on both performance of the strategy and its rate of mistakes.
The naive suggestion would be that the most advantageous skill in terms of fitness has to be learnt first. However, the strategy with the lower relative strategic advantage is learnt first in the optimal learning path. We conjecture that this adjusts the difference between the skills and then, once they are comparable, optimal learning suggests learning both skills at equal rates. These findings indicate that, once disrupted, selection tries to recover the most affected strategies first even if their fitness is not the highest. Nonetheless, if the fitness difference is high enough to overcome the effect of incompetence, then optimal learning will demand that the better strategy is learned first. Another possible interpretation would be to consider the mixed equilibrium as mixed strategies used by players. Then, by learning the less advantageous strategy, individuals are reaching the nearest optimal mixed strategy.
Importantly, we parametrised the notion of strategic advantage of cooperation versus defection with a single quantity denoted by $\delta$, which captures the tradeoffs between fitness and propensity to make execution errors in these two modes of behaviour. Interestingly, we showed that this quantity has a critical threshold absolute value of 1. Namely, if $|\delta|$ is below this threshold, then our Corollary 1 implies that learning by many small steps is preferable to learning by fewer large steps. Arguably, this captures the belief that complex skills are best learned incrementally. However, if $|\delta|$ lies beyond this threshold, then Corollary 1 shows that coexistence is preserved by one of only two possible learning paths: (a) full learning first in the x direction, followed by full learning in the y direction; or (b) the other way around. This suggests that a sufficiently large strategic advantage of cooperation over defection (or the converse) eliminates the luxury of incremental learning.
In addition, the number of steps in the learning path maximising the fitness is not bounded. Indeed, taking many small learning steps improves the fitness we observe. However, as demonstrated in Figure 3, there may exist a number of steps after which the increase in the objective function seems insignificant. Hence, we can determine a sufficient number of steps to achieve a target level of the fitness-over-learning function in applications.
Overall, the learning scheme proposed in this paper can be used to correct behavioural uncertainty when the system already reached its equilibrium but was disrupted. However, our formulation has its limitations. The main limitation is that we allow for only one skill to be learnt at a time. However, restrictiveness of this assumption decreases with increasing number of steps. The second limitation of our scheme is that the direction of the learning path can only be chosen at the very beginning and cannot be changed while individuals are learning. While it may be more natural to permit the learning direction to change at each step, it would also require more resources to be spent on learning. Therefore, the cost of learning would need to be taken into account. Such extensions can be studied in future research.