*Article* **A Novel Discrete Differential Evolution with Varying Variables for the Deficiency Number of Mahjong Hand**

**Xueqing Yan <sup>1</sup> and Yongming Li <sup>2,\*</sup>**


**Abstract:** The deficiency number of one hand, i.e., the number of tiles that need to be changed in order to win, is an important quantity in the game Mahjong, and plays a significant role in the development of artificial intelligence (AI) for Mahjong. However, it is often difficult to compute due to the large number of possible tile combinations. In this paper, a novel discrete differential evolution (DE) algorithm is presented to calculate the deficiency number of the tiles. In detail, to decrease the difficulty of computing the deficiency number, some pretreatment mechanisms are first put forward to convert it into a simple combinatorial optimization problem with varying variables by changing its search space. Subsequently, by means of the superior framework of DE, a novel discrete DE algorithm is specially developed for the simplified problem through devising proper initialization, a mapping solution method, a repairing solution technique, a fitness evaluation approach, and mutation and crossover operations. Finally, several experiments are designed and conducted to evaluate the performance of the proposed algorithm by comparing it with the tree search algorithm and three other kinds of metaheuristic methods on a large number of various test cases. Experimental results indicate that the proposed algorithm is efficient and promising.

**Keywords:** Mahjong; differential evolution; deficiency number; combinatorial optimization

**MSC:** 68T20; 90C27

#### **1. Introduction**

Mahjong is a traditional Chinese tile-based game with a long history and is often played by four people [1,2]. In this game, each player is devoted to devising a proper strategy to win as soon as possible, with luck also playing an important role in this process. Due to its imperfect information, Mahjong has become a popular testbed for artificial intelligence (AI) research [3–9]. Nowadays, various variants of Mahjong with different rules for computing paybacks and/or the legality of actions have emerged due to the individual cultures of different regions and countries. However, they all have similar processes and can easily be derived by extending the basic version of Mahjong. Therefore, without loss of generality, only the basic version of Mahjong is considered in the following.

In the basic version of Mahjong, four players are involved, and 108 tiles are used, consisting of 36 tiles of bamboo type, 36 tiles of character type, and 36 tiles of dot type. At the start of the game, each player in turn draws thirteen tiles from the tile wall; then, the players in turn each take one tile from the tile wall and discard one tile from their hand. When one player obtains a winning hand or there are no tiles left in the tile wall, the game ends. Thus, all players need to continuously change the tiles in their hands in order to win quickly, and this process involves a number of evaluations of the quality of various sets of tiles, i.e., calculating the minimum number of tiles that need to be changed to win, which is called the deficiency number [10]. Thereby, computing the deficiency number plays an important role in Mahjong and can promote AI development for the game.

Moreover, the winning hand closely depends on the combinations of the tiles, and these combinations include a sequence of three or four identical tiles (called a pong or kong, respectively), a sequence of three consecutive tiles of the same type (called a chow), and a pair of identical tiles (called a pair or an eye). A pong or a chow is also called a meld, and a pseudomeld (abbr. pmeld) refers to a pair of tiles that can be completed into a meld. Therefore, computing the deficiency number is in fact a combinatorial optimization problem, which is often not easily solved due to its large search space.

As far as we know, there are currently very few studies on the calculation of the deficiency number [10–12]. Specifically, in [10], Li and Yan presented a recursive method and a tree-based method for calculating the deficiency number. In the recursive method, for one hand with 14 tiles, all cases of 14 tiles with deficiency *k* must be known in advance before the deficiency number of one hand can be determined to be *k* + 1. In the tree-based method, all pseudo-decompositions of the tiles must be found and evaluated, and the deficiency number is then obtained as the minimum cost of these pseudo-decompositions. Herein, a pseudo-decomposition is a sequence *π* of five subsequences, *π*[0], *π*[1],..., *π*[4], where *π*[*i*] for 0 ≤ *i* ≤ 3 can be a meld, a pmeld, a single tile, or empty, and *π*[4] can be a pair, a single tile, or empty. The cost of each pseudo-decomposition is the number of missing tiles compared to the current hand. Wang et al. [11] constructed a theoretical model of a weighted restarting automaton to compute the deficiency number of 14 tiles, where the tiles are combined into melds and eyes during the process of simplification, and the number of changed tiles is counted using its weight function. Recently, Wang et al. [12] further proposed an efficient algorithmic approach, in which the tiles in the player's hand are first divided into several groups, such as melds, pseudomelds, and isolated tiles, and the deficiency number is then computed based on the numbers of pseudomelds and isolated tiles. Although these approaches can calculate the deficiency number of a Mahjong hand, they all implicitly perform a full search, which often takes too much computational time and thus cannot meet the response-time demands of an actual game. Therefore, it is necessary to develop a more promising search approach for the deficiency number.

As is well known, because they place fewer requirements on the problem to be solved, heuristic intelligent search/optimization methods have been regarded as some of the most important tools for solving real problems, especially complicated ones. Various metaheuristic algorithms have been presented by simulating natural phenomena, animal behaviors, human activities, or physical criteria, such as the genetic algorithm (GA) [13], differential evolution (DE) [14], particle swarm optimization (PSO) [15], the artificial bee colony algorithm [16], the water cycle algorithm (WCA) [17], the squirrel search algorithm [18], the gravitational search algorithm (GSA) [19], teaching–learning-based optimization (TLBO) [20], the gaining–sharing knowledge-based algorithm [21], and so on. More metaheuristic algorithms can be found in [22]; they have been widely researched and successfully applied in many scientific fields and practical problems, including data clustering [23,24], stock portfolios [25], knapsack optimization problems [26], multitask optimization [27], and multimodal optimization [28].

As pointed out in [22], among the existing intelligent approaches, differential evolution (DE) [14] is one of the most popular population-based stochastic optimizers and has proven to be efficient and robust on various problems. Due to its simplicity and ease of implementation, DE continues to attract attention from researchers, and numerous DE variants have been put forward to strengthen its performance [29–34] and/or solve special practical problems [35–39]. For example, by dividing a population into multiple swarms and randomly selecting the solutions with better fitness values in each swarm to conduct mutations, Wang et al. [29] developed a novel adaptive DE algorithm. By making full use of the information of the best neighbor, one randomly selected neighbor, and the current individual to predict the promising region for each individual, Yan and Tian [30] presented an enhanced adaptive DE variant, while, by adaptively combining the benefits of multiple operators, Yi et al. [34] developed a novel approach for continuous optimization problems. Moreover, by designing a Taper-shaped transfer function based on power functions to transform a real vector representing the individual encoding into a binary vector, He et al. [35] proposed a novel binary differential evolution algorithm for binary optimization problems. Meanwhile, for traveling salesman problems, Ali et al. [38] proposed an improved discrete DE version, in which an enhanced mapping mechanism is devised to map continuous variables to discrete ones, a k-means clustering-based repairing method is devised to enhance the solutions in the initial population, and an ensemble of mutation strategies is used to maintain its exploration capability. In particular, a large number of numerical experiments were conducted in these papers, and the numerical results validated their effectiveness and superiority. DE thus has great potential for a promising search performance. Accordingly, we adopt the framework of DE to calculate the deficiency number of the tiles in this paper, so as to obtain a promising performance. More detailed research on the improvements and applications of DE can be found in [40].

In this paper, a more efficient method for calculating the deficiency number of a Mahjong hand is researched, and the framework of DE is adopted to achieve this, based on its intrinsic advantages. In detail, in order to reduce the difficulty of computing the deficiency number, some pretreatment mechanisms are first devised to convert the original problem into a simple combinatorial optimization problem, where the dimension of the search space is reduced to five, and each feasible solution may have a different length. Noticeably, this makes the resulting problem significantly different from previously studied discrete and/or combinatorial optimization problems, in which the size of every solution is fixed and consistent. Moreover, for this converted new problem, based on the basic framework of DE, a novel discrete DE (NDDE) algorithm is then specially presented by devising proper initialization, a mapping solution method, a repairing solution technique, a fitness evaluation approach, and mutation and crossover operations, which are designed to handle the discrete search space and the varying sizes of different individuals. The proposed NDDE algorithm is thus capable of computing the deficiency number of one hand efficiently and effectively. Finally, a large number of experiments are designed and conducted to verify the performance of the NDDE algorithm. Specifically, three representative data sets are employed and tested, including 118,800 hands with one type, 100,000 hands with two types, and 100,000 hands with three types, and the NDDE algorithm is compared with the tree search method in [10] and three other kinds of metaheuristic methods. The sensitivity of the parameters involved in the NDDE algorithm is also investigated. The experimental results show that the proposed algorithm has a promising performance for the deficiency number of the hand.

For clarity, compared with the existing works, the main contributions and novelties of this paper are described as follows.


(4) A large number of experiments are designed and conducted to verify the performance of the NDDE algorithm, where three representative data sets are employed and tested, including 118,800 hands with one type, 100,000 hands with two types, and 100,000 hands with three types. Moreover, the NDDE algorithm is compared with the tree search method in [10] and three other kinds of metaheuristic methods, and the sensitivity of the parameters involved in the NDDE algorithm is also investigated. Experimental results indicate that the NDDE algorithm is a more promising technique for the deficiency number of the hand.

Finally, it should also be mentioned that the framework of DE is chosen to design the solver in this paper solely because much of the research reported in the literature has proven its superiority on various problems; other metaheuristic algorithms could also be adopted here, which we will study further in our future work.

The remainder of this paper is organized as follows. Some related works and basic notions on Mahjong are presented in Section 2. The proposed algorithm for computing the deficiency number of a Mahjong hand is introduced in Section 3, and the experimental tests are conducted and analyzed in Section 4. Finally, conclusions are drawn in Section 5.

#### **2. Preliminaries**

In this section, the related concepts and research on Mahjong and the classical DE algorithm are described.

#### *2.1. Related Concepts on Mahjong*

In this part, some related concepts in Mahjong are introduced to provide a foundation for the descriptions below. For simplicity, the basic version of Mahjong, denoted as *M*0, is considered, which consists of tiles of bamboo type, character type, and dot type. Specifically, the tiles of bamboo, character, and dot type are represented as *B*1, *B*2, ... , *B*9, *C*1, *C*2, ... , *C*9, and *D*1, *D*2, ... , *D*9, respectively.

Moreover, in *M*0, each of the above tiles has four identical copies; thus, there are 108 tiles in *M*0 in total. Subsequently, some basic notions on Mahjong are provided, which mainly refer to the literature [10,12].

**Definition 1.** *A* pong *is a sequence of three identical tiles; that is, a pong has the form $X_iX_iX_i$ for $X \in \{B, C, D\}$ and $1 \le i \le 9$. A* chow *is a sequence of three consecutive tiles of the same type; that is, a chow has the form $X_iX_{i+1}X_{i+2}$ for $X \in \{B, C, D\}$ and $1 \le i \le 7$. An* eye (*or* pair) *is a pair of identical tiles; that is, an eye has the form $X_iX_i$ for $X \in \{B, C, D\}$ and $1 \le i \le 9$. A* meld *is either a pong or a chow.*

**Definition 2.** *A* pseudochow (pchow) *is a pair of tiles $X_iX_j$ of the same type with $1 \le |j - i| \le 2$ and $X \in \{B, C, D\}$, which can become a chow if an appropriate tile of the same type is added to it. A* pseudomeld (pmeld) *is a pchow or a pair. We say a tile $c$* completes *a pmeld $(ab)$ if $(abc)$ (after reordering) is a meld. Similarly, a pair is completed from a single tile $t$ by adding a tile identical to $t$.*

For example, (*B*1*B*1*B*1) is a pong, (*C*1*C*2*C*3) is a chow, (*D*1*D*1) is a pair, and (*B*2*B*3) and (*C*3*C*5) are two pchows.
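To make these definitions concrete, the following small Python sketch checks the forms defined above; the encoding of a tile as a (suit, rank) pair and the function names are ours, not from the paper.

```python
# A tile is a (suit, rank) pair, e.g., ("B", 1) for B1; suits are "B", "C", "D".

def is_pong(ts):
    # a pong is three identical tiles: XiXiXi
    return len(ts) == 3 and len(set(ts)) == 1

def is_chow(ts):
    # a chow is three consecutive tiles of the same suit: XiXi+1Xi+2
    suits = {s for s, _ in ts}
    ranks = sorted(r for _, r in ts)
    return (len(ts) == 3 and len(suits) == 1
            and ranks == [ranks[0], ranks[0] + 1, ranks[0] + 2])

def is_pair(ts):
    # a pair (eye) is two identical tiles: XiXi
    return len(ts) == 2 and ts[0] == ts[1]

def is_pchow(ts):
    # a pchow is two same-suit tiles XiXj with 1 <= |j - i| <= 2
    (s1, r1), (s2, r2) = ts
    return s1 == s2 and 1 <= abs(r1 - r2) <= 2

print(is_pong([("B", 1)] * 3))                   # True
print(is_chow([("C", 1), ("C", 2), ("C", 3)]))   # True
print(is_pchow([("C", 3), ("C", 5)]))            # True
```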

**Definition 3.** *A* hand *is a sequence of 14 tiles from M*0*, denoted by H, in which no tile appears more than four times.*

**Definition 4.** *A hand H from M*0 *is* winning *or* complete *if it can be decomposed into four melds and one pair. (For ease of presentation, we do not regard a hand with seven pairs as complete.) Given a complete hand H, a* decomposition *π of H is a sequence of five subsequences of H, where π*[*i*] *is a meld for* 0 ≤ *i* ≤ 3 *and π*[4] *is a pair. In this case, we call π*[4] *the* eye *of this decomposition.*

For example, the hand *H*<sup>1</sup> = (*B*1*B*2*B*3*B*3*B*3*B*8*B*8*B*8)(*C*4*C*5*C*6)(*D*6*D*7*D*8) is a winning or complete hand, and its decomposition is *π* = (*B*1*B*2*B*3)(*B*8*B*8*B*8)(*C*4*C*5*C*6)(*D*6*D*7*D*8)(*B*3*B*3).

As is well known, the hands involved in a Mahjong game are mostly incomplete. That is, no decomposition as defined above exists for these hands. So, corresponding to the decomposition of a winning hand, another concept, the predecomposition, is further introduced here for such hands.

**Definition 5.** *Given a hand H, the predecomposition (abbr.* pDCMP*) of H is a sequence π of five subsequences, π*(1)*,..., π*(5)*, of H such that each π*(*i*) (1 ≤ *i* ≤ 5) *is a meld, pair, pchow, or empty.*

Noticeably, unlike the concept of pseudo-decomposition in [10], a single tile is not contained in the predecomposition and the position of the eye is no longer fixed. For example, with respect to the above hand *H*1, the sequence *π*<sup>1</sup> = (*B*1*B*2)(*B*3*B*3*B*3)(*B*8*B*8)(*C*4*C*5*C*6)(*D*6*D*7*D*8) is one of its pDCMPs.

**Definition 6.** *Suppose π and π*′ *are two* pDCMP*s for one hand H. We say π*′ *is* finer *than π if π*(*i*) *is identical to or a subsequence of π*′(*i*) *for* 1 ≤ *i* ≤ 5*. A* pDCMP *π is* completable *if there exists a decomposition π*∗ *for H that is finer than π. In this case, we call π*∗ *a* completion *of π. Moreover, the cost of a completable* pDCMP *π, written cost*(*π*)*, is the number of missing tiles required to complete π into a decomposition, which consists of four melds and one eye.*

In particular, for one pDCMP *π* of one hand *H*, we say it has infinite cost if it is incompletable. Moreover, for the above pDCMP *π*<sup>1</sup> of *H*1, its cost is 1 since it can be completed by one tile, *B*3.

**Definition 7.** *For one hand H, the minimal number of tile changes necessary to make H a winning hand is called the* deficiency number (*or simply* deficiency) *of H. If the deficiency of H is k, we write dfncy*(*H*) = *k. Obviously, for a winning hand H, it holds that dfncy*(*H*) = 0*.*

Based on the above notions, we can see that for one hand, *H*, its deficiency number can be obtained by finding all its possible predecompositions and then comparing their differences with the ideal decompositions.

#### *2.2. Research on Mahjong Game*

With the development of artificial intelligence (AI) techniques, games have continuously been regarded as some of the most important test platforms. In particular, for games with perfect information, such as checkers [41], chess [42], and Go [43,44], AI is now even able to beat the best human players. In contrast, for imperfect-information games [45–49], there are still few works, since players have to deal with invisible information during the game; this is especially true for Mahjong.

As stated in the Introduction, there have been various variants of Mahjong due to the unique cultures of each region and country. Among them, Riichi Mahjong is a popular version played in Japan, and most current research on Mahjong is based on it [5,8,9,49–52]. Specifically, Li et al. [51] proposed a Mahjong AI system, Suphx, based on reinforcement learning, and test results on the "Tenhou" platform showed its effectiveness. Kurita [5] abstracted the Mahjong process by defining multiple Markov decision processes and then constructed an effective search tree for optimal decision-making. Mizukami and Tsuruoka [49] built a strong Mahjong AI by modeling opponent players and performing Monte Carlo simulations. Yoshimura et al. [50] proposed a tabu search method for optimal moves without using game records, while Sato et al. [52] presented a new method to classify opponents' strategies by analyzing Mahjong playing records. Additionally, for the bloody Mahjong version, Gao and Li [8] developed a fusion model that uses deep learning and the XGBoost algorithm to extract the Mahjong situation features and derive the card strategy, respectively. A more detailed review of existing works on Mahjong AI can be found in ref. [9], where the advantages and disadvantages of each method are analyzed.

Unlike the above works, which aim to develop Mahjong AI players, there are still few studies that evaluate the quality of a Mahjong hand, which is very helpful for a player devising an appropriate strategy during the game [10–12]. In [10], Li and Yan first introduced the notion of the deficiency number for measuring the quality of a hand and developed two different calculation methods for it, namely, the recursive method and the tree-based method. After this, Wang et al. [11] presented a theoretical model of a weighted restarting automaton for computing the deficiency number of 14 tiles. In this model, the tiles are combined into melds and eyes in the process of simplification, and the number of changed tiles is counted by its weight function. Moreover, by dividing the tiles into groups, such as melds, pseudomelds, and isolated tiles, in advance and fully considering the relation between their numbers, Wang et al. [12] recently proposed an efficient algorithm to calculate the deficiency number of 14 tiles. Even though these methods have made some progress in measuring the quality of a hand, they all contain the idea of a full search, which might make their computational time too long to satisfy the response-time demand of an actual game. Therefore, it is necessary to devise a more promising technique for calculating the deficiency number.

#### *2.3. Traditional DE Algorithm*

For the convenience of later descriptions, the detailed operations of classical DE are described as follows, including initialization, mutation, crossover, and selection [14]. Specifically, the minimization problem $\min\{f(x) \mid x_{\min,j} \le x_j \le x_{\max,j},\ j = 1, 2, \dots, D\}$ is considered here, where $x = (x_1, x_2, \dots, x_D)$ is the solution vector, $D$ is the dimension of the search space, and $x_{\min,j}$ and $x_{\max,j}$ are the lower and upper bounds of the $j$-th component of the search space, respectively.

First, one population, $P_0$, consisting of $NP$ solutions is randomly created in the search space. Each solution is denoted by $x_i^0 = (x_{i,1}^0, x_{i,2}^0, \dots, x_{i,D}^0)$ with $i = 1, 2, \dots, NP$, where $NP$ is the population size. Concretely, the $j$-th component of $x_i^0$ is generated by

$$
x_{i,j}^0 = x_{\min,j} + rand \cdot (x_{\max,j} - x_{\min,j}), \tag{1}
$$

where *rand* is a random number uniformly distributed in [0, 1].

After initialization of the population, for each solution $x_i^g$ at generation $g$, the mutation operation is executed to create its mutant individual $v_i^g$. The detailed process of the operator 'DE/rand/1' is given as

$$
v_i^g = x_{r_1}^g + F \cdot (x_{r_2}^g - x_{r_3}^g), \tag{2}
$$

where $r_1$, $r_2$, and $r_3$ are three mutually distinct random integers in $[1, NP]$ with $r_1 \neq r_2 \neq r_3 \neq i$, and $F$ is a scaling factor.

Then, by combining the components of $v_i^g$ and its corresponding target individual $x_i^g$, one trial individual $u_i^g = (u_{i,1}^g, u_{i,2}^g, \dots, u_{i,D}^g)$ is created with the crossover operation. The rule of the binomial crossover method is described as:

$$
u_{i,j}^g = \begin{cases} v_{i,j}^g, & \text{if } rand \le Cr \text{ or } j = randn(i),\\ x_{i,j}^g, & \text{otherwise,} \end{cases} \tag{3}
$$

where $u_{i,j}^g$, $v_{i,j}^g$, and $x_{i,j}^g$ are, respectively, the $j$-th components of $u_i^g$, $v_i^g$, and $x_i^g$, $Cr \in [0, 1]$ is the crossover rate, and $randn(i)$ returns a random integer within $[1, D]$, which ensures that $u_i^g$ obtains at least one component from $v_i^g$.

Finally, the selection operation is conducted to update the current population by comparing each target solution $x_i^g$ with its trial vector $u_i^g$ based on their fitness values. In particular, the greedy selection strategy can be expressed as follows:

$$
x_i^{g+1} = \begin{cases} u_i^g, & \text{if } f(u_i^g) \le f(x_i^g),\\ x_i^g, & \text{otherwise,} \end{cases} \tag{4}
$$

where *f*(·) is the objective function to be optimized.

Note that with (4), DE either improves or retains the current fitness status, and never deteriorates. Meanwhile, when DE is applied to an optimization problem, the mutation, crossover, and selection operations are executed in turn until a satisfying solution is found or the prescribed termination criterion is met.
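For reference, and as a baseline for the discrete variant developed in Section 3, a minimal Python sketch of this classical DE loop, i.e., Equations (1)–(4), might look as follows; the parameter defaults are illustrative only.

```python
import random

def de(f, xmin, xmax, NP=20, F=0.5, Cr=0.5, Gmax=100):
    D = len(xmin)
    # Equation (1): random initialization within the box bounds
    pop = [[xmin[j] + random.random() * (xmax[j] - xmin[j]) for j in range(D)]
           for _ in range(NP)]
    fit = [f(x) for x in pop]
    for _ in range(Gmax):
        for i in range(NP):
            # Equation (2), DE/rand/1: three mutually distinct indices, all != i
            r1, r2, r3 = random.sample([k for k in range(NP) if k != i], 3)
            v = [pop[r1][j] + F * (pop[r2][j] - pop[r3][j]) for j in range(D)]
            # Equation (3): binomial crossover with one guaranteed component from v
            jr = random.randrange(D)
            u = [v[j] if (random.random() <= Cr or j == jr) else pop[i][j]
                 for j in range(D)]
            # Equation (4): greedy selection
            fu = f(u)
            if fu <= fit[i]:
                pop[i], fit[i] = u, fu
    best = min(range(NP), key=lambda i: fit[i])
    return pop[best], fit[best]
```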

#### **3. Proposed Algorithm**

As described and discussed in the Introduction, the deficiency number evaluates the quality of a Mahjong hand and is helpful for a player to devise an appropriate strategy, which can facilitate the development of Mahjong AI. However, due to the fact that the tiles in one hand always have a large number of possible combinations, it usually cannot be easily calculated. To address this issue, several approaches have been presented, but they are very limited and always require too much computational time. Therefore, inspired by the advantages of DE, such as simplicity, easy implementation, strong robustness, and superior performance, we propose a novel discrete DE variant (NDDE) to compute the deficiency number of a Mahjong hand in this section. Specifically, the problem of computing the deficiency number is first converted to a simpler combinatorial optimization problem, and a new DE variant is presented for it by devising proper initialization, a mapping solution method, a repairing solution technique, a fitness evaluation approach, and mutation and crossover operations.

#### *3.1. Simplified Problem of Deficiency Number*

As stated in [10], the problem of computing the deficiency number is in fact an optimization problem and can be treated as a combinatorial optimization problem, i.e., finding one proper combination of the tiles that has minimal differences from its corresponding ideal winning hand. Recently, several approaches have been proposed to calculate the deficiency number of the tiles on this basis [10–12]. However, there are often too many combinations in the search space, which might degrade the efficiency of these methods. To alleviate this drawback, we propose converting the problem of computing the deficiency number into a simpler one, where the dimension of the new search space is reduced to five. The details of the concrete processes are described in the following.

For one Mahjong hand *H*, we first search all its possible melds and pmelds and denote them by *S*. The reason for this is that the decomposition of a winning hand is constituted by four melds and one eye, and it can be completed by the predecomposition, which consists of melds and/or pmelds. Then, the problem of computing the deficiency number can be alternatively described as:

$$\min_{\pi \in \Pi} g(\pi) \tag{5}$$

where *π* = (*π*(1), *π*(2), *π*(3), *π*(4), *π*(5)) is the predecomposition of *H* with *π*(*i*) (1 ≤ *i* ≤ 5) being a meld, pair, or pchow from *S* or empty, Π is the set of all predecompositions of *H*, and *g*(*π*) denotes the differences between *π* and its ideal decomposition, i.e., having *g*(*π*) = *cost*(*π*), which will be further discussed in the next subsection.

According to the definition in Equation (5), all possible melds and pmelds of one Mahjong hand *H* must be found and contained in *S*, so as to ensure the completeness of all its predecompositions. Moreover, since a meld has the modes of chow and pong, a pmeld has the modes of pair and pchow, and a pchow always has two different modes, such as (*D*6*D*7) and (*D*6*D*8), *S* can be formed by considering each case above for each tile. Note that, to form a winning hand, a pchow must be able to constitute a chow; thus, the tile needed to complete the pchow must occur fewer than four times in *H*. Thereby, for a Mahjong hand *H*, its related set *S* can be generated as follows. First, we find the distinct tiles of *H* and denote them by *St*, while initializing the set *S* to be empty. Then, for each tile *Xi* ∈ *St*, the following five cases are successively checked to create all possible melds and pmelds, which are added into *S*.


Particularly, Cases 1, 2, and 4 can provide all the possible pmelds, and Cases 3 and 5 can give all the possible melds involved in *H*.
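The detailed list of the five cases is not reproduced here; based only on the grouping just stated (Cases 1, 2, and 4 yield pmelds; Cases 3 and 5 yield melds) and the completion condition above, a sketch of the construction of *S* could look as follows. The ordering of the checks and the exact feasibility tests are our assumptions.

```python
from collections import Counter

def build_candidates(hand):
    # hand: a list of (suit, rank) tiles, e.g., ("B", 3) for B3.
    # Returns S, the list of all melds and pmelds derivable from the hand.
    cnt = Counter(hand)
    S = []
    for (s, r) in sorted(cnt):
        if cnt[(s, r)] >= 2:
            S.append(((s, r),) * 2)                        # pair
        if cnt[(s, r)] >= 3:
            S.append(((s, r),) * 3)                        # pong
        # pchow mode 1, (Xi, Xi+1): completed by Xi-1 or Xi+2
        if r + 1 <= 9 and cnt[(s, r + 1)] >= 1:
            if any(1 <= c <= 9 and cnt[(s, c)] < 4 for c in (r - 1, r + 2)):
                S.append(((s, r), (s, r + 1)))
        # pchow mode 2, (Xi, Xi+2): completed by Xi+1, which must not be exhausted
        if r + 2 <= 9 and cnt[(s, r + 2)] >= 1 and cnt[(s, r + 1)] < 4:
            S.append(((s, r), (s, r + 2)))
        # chow (Xi, Xi+1, Xi+2)
        if cnt[(s, r + 1)] >= 1 and cnt[(s, r + 2)] >= 1:
            S.append(((s, r), (s, r + 1), (s, r + 2)))
    return S
```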

In summary, from the above descriptions, all possible melds and pmelds can be obtained for each Mahjong hand *H*, and then every predecomposition involved in it can be created from *S*. Thus, the predecompositions obtained by the tree-based search algorithm are contained in the search space Π of the new problem. Meanwhile, since the tree-based search algorithm is a full search approach, the legal solutions of the new problem can also be obtained by it. So, the proposed operations do not affect the deficiency number of one Mahjong hand, and the converted problem is equivalent to the original one. Moreover, it is easy to see that the dimension of this new problem is just five, and its search space is determined by the length of *S*. Additionally, the generation of *S* is relatively cheap, and the size of *S* is often small; therefore, the new problem has a smaller search space and is easier to solve.

#### *3.2. Proposed NDDE Algorithm*

In this subsection, the special components of the NDDE algorithm shall be introduced, including the initialization, mapping solution method, repairing solution technique, fitness evaluation approach, and mutation and crossover operations.

#### 3.2.1. Initialization of Population and Mapping Solution

According to the framework of DE, when DE is activated, an initial population with *NP* solutions will first be created. In this paper, since the objective function is a discrete one, the following special method is used to initialize the population *P*0.

For convenience, we let $N_S$ denote the size of the set $S$ and let each solution $x_i^0$ denote a predecomposition $\pi$; then, $x_i^0 = (x_{i,1}^0, x_{i,2}^0, \dots, x_{i,5}^0)$ with $i = 1, 2, \dots, NP$. Specifically, each element $x_{i,j}^0$ of $x_i^0$, $j = 1, 2, \dots, 5$, is created by

$$
x_{i,j}^0 = randint(N_S), \tag{6}
$$

where *randint*(*A*) returns a random integer uniformly distributed in [1, *A*].

Moreover, from Equation (6), one can easily see that each solution in $P_0$ is represented only by integer values, which are not directly compatible with the new problem described in Equation (5). Thus, a mapping method is further provided here to match the solutions of the population with those of the new problem. In particular, for each solution $x_i^0$ in $P_0$, its corresponding solution $\pi_i^0$ for the new problem can be obtained by

$$
\pi_i^0 = (\pi_i^0(1), \pi_i^0(2), \pi_i^0(3), \pi_i^0(4), \pi_i^0(5)), \tag{7}
$$

where $\pi_i^0(j)$ is the $x_{i,j}^0$-th element of $S$ for $j = 1, 2, \dots, 5$, which may be a meld or a pmeld.

From the above descriptions, the solutions matching the DE algorithm and the new problem can all be obtained. Noticeably, the operations of initializing the population and mapping it are both simple and easy to implement. Thus, these operations will not add severe computational burdens.
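Under the same tile encoding as above, a minimal sketch of Equations (6) and (7) is given below; note that the indices are 0-based here, whereas the paper uses $[1, N_S]$.

```python
import random

def init_population(NP, NS):
    # Equation (6): each of the five components is a random index into S
    # (0-based here, whereas the paper uses 1-based indices)
    return [[random.randint(0, NS - 1) for _ in range(5)] for _ in range(NP)]

def map_solution(x, S):
    # Equation (7): component j selects the x[j]-th meld/pmeld of S;
    # None marks a component emptied by the repairing step of Section 3.2.2
    return [S[idx] if idx is not None else None for idx in x]
```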

#### 3.2.2. Repairing Solution Technique

As described above, through the proposed mapping method, each individual in the population generates a corresponding solution for the new problem. However, there is still a shortcoming: the mapped solutions cannot be guaranteed to be feasible. In fact, for one Mahjong hand $H$ and a mapped solution $\pi_i$, the number of occurrences of some tiles in $\pi_i$ may exceed that in $H$. Thereby, a repairing technique is also necessary to correct these infeasible solutions.

For the sake of saving computational cost, the following simple repairing approach is used in this paper. For one Mahjong hand $H$ and a mapped solution $\pi_i$, all distinct tiles of $\pi_i$ are first found and recorded in one set, denoted by $S_p$. Then, for each tile $X_i$ in $S_p$, we separately count its occurrences in $\pi_i$ and $H$; if the number in $\pi_i$ is larger than that in $H$, we gradually remove melds or pmelds containing $X_i$ from $\pi_i$ until its number in $\pi_i$ is less than or equal to that in $H$. At the same time, when a meld or pmeld is removed from $\pi_i$, its corresponding element in $x_i^g$ is also set to empty. Clearly, with this repairing method, each mapped solution becomes feasible, and the correcting process is very easy to implement.
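A sketch of this repairing technique under the representation used above is given below; the order in which over-used groups are removed (left to right) is our assumption, as the paper does not specify it.

```python
from collections import Counter

def repair(x, pi, hand):
    # x: index vector; pi: mapped solution (list of melds/pmelds or None)
    hand_cnt = Counter(hand)
    used = Counter(t for g in pi if g for t in g)
    for tile in list(used):
        # remove groups containing an over-used tile until counts are feasible
        while used[tile] > hand_cnt[tile]:
            for j, g in enumerate(pi):
                if g and tile in g:
                    used.subtract(g)          # give the group's tiles back
                    pi[j], x[j] = None, None  # empty both representations
                    break
    return x, pi
```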

#### 3.2.3. Fitness Evaluation Approach

In order to determine which solutions will be selected in the next iteration, their fitness values should be evaluated as well. At each generation $g$, the fitness value of each individual $x_i^g$ in population $P_g$ is evaluated by the cost of its corresponding mapped solution $\pi_i^g$ for the new problem. That is, we let $f(x_i^g) = g(\pi_i^g)$ in this paper.

Concretely, for a mapped solution $\pi_i^g$, according to $g(\pi_i^g) = cost(\pi_i^g)$ and the definition of $cost(\pi_i^g)$, i.e., the number of missing tiles required to complete $\pi_i^g$ into a decomposition, the fitness value of $x_i^g$ can be calculated by

$$
f(x_i^g) = \begin{cases} 4 - m, & \text{if } m + n = 5 \text{ and } p = 1,\\ 5 - m, & \text{if } m + n = 5 \text{ and } p = 0,\\ 9 - 2m - n, & \text{if } m + n < 5. \end{cases} \tag{8}
$$

Herein, $m$ and $n$ are the numbers of melds and pmelds in $\pi_i^g$, respectively, and $p$ indicates whether there is a pair in $\pi_i^g$: if there is a pair in $\pi_i^g$, then $p = 1$; otherwise, $p = 0$.

Moreover, it should be mentioned that there are two other special cases in the evaluation of a solution. For one predecomposition $\pi^g$, the first case is that $m + n = 5$ and $\pi^g$ contains two identical pairs. In this case, one of the two pairs should have formed a meld, but that would require the number of copies of the tile in $\pi^g$ to increase to four. Thus, the actual cost must be increased by one due to the invalid pmeld. The other case is that $m = 4$ and $n = 0$, and the two remaining tiles are different, with their identical tiles having already formed pongs in $\pi^g$. In this case, a pair is still needed to complete $\pi^g$, but neither of the remaining tiles can be paired. Therefore, both of these two tiles must be changed to form a pair, and the actual cost of $\pi^g$ should be two.

For example, consider one Mahjong hand *H* = (*B*1*B*2*B*3*B*3*B*3*B*3*B*4*B*8)(*C*4*C*5*C*9) (*D*6*D*7*D*8) and its three predecompositions *π*1 = (*B*1*B*2*B*3)(*B*3*B*4)(*B*3*B*3)(*C*4*C*5)(*D*6*D*7*D*8), *π*2 = (*B*1*B*2*B*3)(*B*3*B*3*B*3)(*C*4*C*5)(*D*6*D*7*D*8), and *π*3 = (*B*2*B*3*B*4)(*B*1*B*3)(*B*3*B*3)(*C*4*C*5)(*D*6*D*7). Using the above evaluation approach, *π*1 has *m* = 2, *n* = 3, and *p* = 1; thus, *g*(*π*1) = 2. *π*2 has *m* = 3, *n* = 1, and *p* = 0; thus, *g*(*π*2) = 2. *π*3 has *m* = 1, *n* = 4, and *p* = 1; thus, *g*(*π*3) = 3.

In conclusion, from the above descriptions, the proposed fitness evaluation method can accurately assess the quality of each solution in the population.
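A sketch of the fitness evaluation of Equation (8) is given below, including the first special case; our reading of the special cases is an assumption, and the second one is only noted in a comment since detecting it requires the full hand rather than the mapped solution alone.

```python
def fitness(pi):
    # pi: mapped (repaired) solution, a list of melds/pmelds or None
    groups = [g for g in pi if g]
    melds  = [g for g in groups if len(g) == 3]
    pmelds = [g for g in groups if len(g) == 2]
    pairs  = [g for g in pmelds if g[0] == g[1]]
    m, n, p = len(melds), len(pmelds), int(len(pairs) > 0)
    if m + n == 5:
        cost = 4 - m if p == 1 else 5 - m
        # special case 1: two identical pairs cannot both be completed,
        # so one extra tile change is needed (the invalid pmeld)
        if len(set(pairs)) < len(pairs):
            cost += 1
    else:
        cost = 9 - 2 * m - n
        # special case 2 (m = 4, n = 0, and neither leftover tile can be
        # paired) would raise the cost to two; it is omitted here because
        # detecting it needs the full hand, not just pi
    return cost
```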

#### 3.2.4. Mutation and Crossover Operators

To fully search the decision space and ensure the computational efficiency of the algorithm, the mutation operator "DE/rand/1" and the binomial crossover operation are enhanced and employed to generate the mutant and trial individual, respectively, in the proposed NDDE algorithm.

In particular, considering that the search space involved in the NDDE algorithm is discrete and that the lengths of different solutions may vary, a discrete version of "DE/rand/1" is devised and employed in this paper. In detail, for each individual $x_i^g$ at generation $g$, its mutant individual $v_i^g = (v_{i,1}^g, v_{i,2}^g, \dots, v_{i,5}^g)$ can be generated by

$$
v_{i,j}^g = \begin{cases} round\big(x_{r_1,j}^g + F \cdot (x_{r_2,j}^g - x_{r_3,j}^g)\big), & \text{if } x_{r_1,j}^g \neq \emptyset,\ x_{r_2,j}^g \neq \emptyset, \text{ and } x_{r_3,j}^g \neq \emptyset,\\ randint(N_S), & \text{otherwise,} \end{cases} \tag{9}
$$

where $j = 1, 2, \dots, 5$; $r_1$, $r_2$, and $r_3$ are three mutually distinct random integers in $[1, NP]$ with $r_1 \neq r_2 \neq r_3 \neq i$; $F$ is a scaling factor; and $round(A)$ returns the integer nearest to $A$. In particular, we set $F = 0.5$ in this paper.

Similarly, based on the varying-variable property of the individuals and to make full use of the information of the target individual, the following modified binomial crossover operation is developed and used to generate the trial individuals. Specifically, for each individual $x_i^g$ and its mutant individual $v_i^g$, the corresponding trial individual $u_i^g = (u_{i,1}^g, u_{i,2}^g, \dots, u_{i,5}^g)$ is obtained by

$$
u_{i,j}^g = \begin{cases} x_{i,j}^g, & \text{if } x_{i,j}^g \neq \emptyset \text{ and } rand \le Cr,\\ v_{i,j}^g, & \text{otherwise,} \end{cases} \tag{10}
$$

where $Cr \in (0, 1)$ is the crossover rate; in particular, we set $Cr = 0.5$ here.

As described above, the proposed mutation and crossover operators can broadly explore the search space, fully utilize the acquired information, and are simple to implement. Consequently, they are capable of boosting the search ability of the algorithm for finding the deficiency number of one Mahjong hand.
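A sketch of the discrete mutation and crossover of Equations (9) and (10), with *F* = *Cr* = 0.5 as in the paper, is given below; representing an empty component by None and redrawing out-of-range rounded indices at random are our assumptions.

```python
import random

F, Cr = 0.5, 0.5

def mutate(pop, i, NS):
    # Equation (9): discrete DE/rand/1 with a random fallback for empty components
    r1, r2, r3 = random.sample([k for k in range(len(pop)) if k != i], 3)
    v = []
    for j in range(5):
        a, b, c = pop[r1][j], pop[r2][j], pop[r3][j]
        if a is not None and b is not None and c is not None:
            w = round(a + F * (b - c))
            # redraw if the rounded index falls outside [0, NS) (our assumption)
            v.append(w if 0 <= w < NS else random.randint(0, NS - 1))
        else:
            v.append(random.randint(0, NS - 1))
    return v

def crossover(x, v):
    # Equation (10): keep the target's non-empty components with probability Cr
    return [x[j] if (x[j] is not None and random.random() <= Cr) else v[j]
            for j in range(5)]
```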

Finally, after generating the trial individual for each target individual, the population is updated by comparing the two based on their fitness values, and the better of each pair enters the new population. To achieve this, the greedy selection strategy (see Equation (4)) is utilized in this paper. Overall, by integrating the proposed initialization, mapping solution method, repairing solution technique, fitness evaluation approach, and mutation and crossover operations, the framework of the proposed NDDE algorithm is shown in Algorithm 1. To aid understanding, a flowchart of the proposed approach for computing the deficiency number is provided in Figure 1, where *G* and *Gmax* denote the current and maximum numbers of iterations, respectively.

It should be pointed out that, unlike the existing approaches for obtaining the deficiency number in [10–12], the proposed method simplifies the original problem of calculating the deficiency number and makes full use of the benefits of DE to improve computational efficiency. Specifically, by finding in advance all possible melds, pmelds, and pairs contained in one hand and constructing the form of the solution based on the structure of the decomposition, the problem of calculating the deficiency number is converted into a simpler combinatorial optimization problem. Meanwhile, according to the discrete and varying-variable properties of the new problem, a proper initialization, mapping solution method, repairing solution technique, fitness evaluation approach, and mutation and crossover operations are separately developed, yielding a novel discrete DE algorithm. Thereby, the proposed method can effectively and efficiently compute the deficiency number of one Mahjong hand. It should also be mentioned that, for hands with a larger deficiency number, the proposed approach might be less efficient than the tree-based deterministic algorithms. This is because hands with a larger deficiency number contain only a few melds, pmelds, and/or pairs, which might lead to few child nodes at each search step in the deterministic methods and thus reduce the search cost, whereas the stochastic algorithm must always conduct the predefined number of searches for every hand. Moreover, compared with general discrete and/or combinatorial optimization problems, the simplified problem in this paper has the special characteristic that its feasible solutions may have various lengths. So, previous discrete DE versions cannot directly solve this problem. Meanwhile, for solving the simplified problem, other metaheuristic algorithms could alternatively be adopted instead of DE by designing proper operations, which we will study further in our future work.

**Figure 1.** The flowchart of the proposed method in this paper for computing deficiency number.

#### **Algorithm 1** (The framework of the proposed NDDE algorithm).
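The body of Algorithm 1 is not reproduced here; the following driver is our reconstruction of the framework from the descriptions in Section 3.2, wiring together the sketches given above (all function names are ours).

```python
def ndde(hand, NP=20, Gmax=50):
    # Reconstruction of the NDDE loop: build S, initialize, then iterate
    # mutation, crossover, mapping, repairing, evaluation, and selection.
    S = build_candidates(hand)
    NS = len(S)
    pop = init_population(NP, NS)
    fits = []
    for x in pop:
        x, pi = repair(x, map_solution(x, S), hand)
        fits.append(fitness(pi))
    for _ in range(Gmax):
        for i in range(NP):
            u = crossover(pop[i], mutate(pop, i, NS))
            u, pi = repair(u, map_solution(u, S), hand)
            fu = fitness(pi)
            if fu <= fits[i]:        # greedy selection, Equation (4)
                pop[i], fits[i] = u, fu
        if min(fits) == 0:           # a complete hand (deficiency 0) was found
            break
    return min(fits)                 # the estimated deficiency number
```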


#### **4. Experimental Analyses**

In this part, the performance of the proposed NDDE algorithm is evaluated by conducting a series of experiments on a large number of randomly generated test hands, including 118,800 hands with one type, 100,000 hands with two types, and 100,000 hands with three types. Meanwhile, the influence of the parameters involved in the NDDE algorithm is investigated in terms of both search accuracy and running time, and the tree search algorithm (TSA) in [10] and three other metaheuristic algorithms, including PSO [15], GA [13], and TLBO [20], are also compared with the NDDE algorithm. Finally, the effectiveness of the NDDE algorithm is further demonstrated in a large number of Mahjong game battles.

In these experiments, the performance of each algorithm is measured by the accuracy rate and average running time. Moreover, in order to make fair comparisons and obtain statistical conclusions, each algorithm is run 30 times independently for each hand, and three widely used statistical tests, including the *t* test [53], Wilcoxon rank sum test [54], and Friedman test [55], are further adopted to distinguish the differences between the NDDE algorithm and each compared method. All algorithms are implemented in Python 3.0 on a personal laptop with an Intel i7-6700 CPU and 16 GB RAM.

#### *4.1. Influence Analyses of NP and Gmax*

Herein, the influences of *NP* and *Gmax* on the performance of the NDDE algorithm are analyzed. As stated in Algorithm 1, *NP* determines the number of solutions created at each iteration, while *Gmax* decides the total number of iterations of the NDDE algorithm. Thus, different values for them might lead to different performances of the NDDE algorithm. Specifically, in order to show the influence of each parameter, the full factorial design (FFD) [56] is used here, and *NP* and *Gmax* are set to four and six different values, respectively, i.e., *NP* ∈ {10, 20, 30, 50} and *Gmax* ∈ {10, 30, 50, 100, 200, 500}. Tables 1 and 2 list the accuracy rate and average running time, respectively, of the NDDE algorithm on the three kinds of test hands in a single run. Note that the NDDE algorithm is terminated as soon as the actual deficiency number of a hand is found, and the corresponding running time is used to measure its performance.


**Table 1.** Accuracy rates of NDDE algorithm with various *NP* and *Gmax* on three kinds of test hands.

From Table 1, it can be seen that the accuracy rate of the NDDE algorithm is closely related to the values of *NP* and *Gmax*, and it gradually improves as they increase in all cases. Specifically, for the hands with one type, the accuracy rate exceeds 90% when *Gmax* = 30 with *NP* ∈ {30, 50}, when *Gmax* = 50 with *NP* ∈ {20, 30, 50}, and for every value of *NP* with *Gmax* ∈ {100, 200, 500}; it reaches 100% when *Gmax* = 500 and *NP* = 50. Moreover, for the hands with two types, the accuracy rate exceeds 90% except when *Gmax* = 10 with *NP* ∈ {10, 20, 30}, and it reaches 100% when *Gmax* = 200 with *NP* = 50 and when *Gmax* = 500 with *NP* ∈ {20, 30, 50}. For the hands with three types, the accuracy rate exceeds 90% except when *Gmax* = 10 and *NP* = 10, and it reaches 100% when *Gmax* = 200 with *NP* ∈ {20, 30, 50} and for every value of *NP* with *Gmax* = 500. In addition, from Table 2, one can further find that the average running time of the NDDE algorithm also depends on the values of *NP* and *Gmax*, and it gradually increases as they increase in all cases. Meanwhile, as the number of types in the hands increases, the average running time on them gradually decreases; the reason might be that the corresponding search space is reduced. Hence, the proposed NDDE algorithm can always obtain the actual deficiency number for all hands at a certain search cost.

**Table 2.** Average running time of NDDE algorithm with various *NP* and *Gmax* on three kinds of test hands.


For the sake of clarity, Figures 2 and 3 further depict the accuracy rate and average running time of the NDDE algorithm on all kinds of hands. From Figures 2 and 3, it can easily be seen that as either *NP* or *Gmax* increases, the accuracy rate of the NDDE algorithm improves for each type of hand. Meanwhile, with the increase in *NP*, the average running time of the NDDE algorithm increases in each case, while the NDDE algorithm has the smallest average running time on the hands with three types. Thus, it is essential to set suitable *NP* and *Gmax* in the NDDE algorithm; we let *NP* = 20 and *Gmax* = 50 in the following experiments due to the promising performance of this setting in terms of both accuracy rate and average running time.

**Figure 2.** Accuracy rates of NDDE algorithm with various *NP* and *Gmax* on three kinds of test hands. (**a**) Hands with one type, (**b**) hands with two types, and (**c**) hands with three types.

**Figure 3.** Average running time of NDDE algorithm with various *NP* and *Gmax* = 50 on three kinds of test hands.

#### *4.2. Comparisons and Discussions*

In this subsection, in order to verify the performance of the NDDE algorithm, one typical deterministic method, namely TSA [10], and three other famous metaheuristic algorithms, including PSO [15], GA [13], and TLBO [20], are compared with it on all the above cases of hands. Specifically, to persuasively estimate the performance of these methods, each approach is run 30 times independently for each hand, and the average accuracy rate and running time of 30 runs on each kind of hand are employed to measure its performance. Moreover, *t* tests [53], Wilcoxon rank sum tests [54], and Friedman tests [55] are also utilized to show the differences between their performances.

It should be mentioned that PSO [15] is a famous swarm intelligent optimization algorithm, GA [13] is a typical approach belonging to evolutionary computation, and TLBO [20] is a promising metaheuristic method inspired by human activities. Meanwhile, the proposed NDDE algorithm is developed based on just the basic framework of DE. So, these methods are very representative and suitable and thus chosen as the compared ones here.

#### 4.2.1. Comparisons of NDDE Algorithm with TSA

First, one typical deterministic method, namely TSA [10], is compared with the NDDE algorithm on all three kinds of hands. To clearly demonstrate the performance of the NDDE algorithm, two versions of it, named NDDE1 and NDDE2, with (*NP*, *Gmax*) set to (20, 50) and (50, 500), respectively, are employed here to compare with TSA. Tables 3 and 4 provide their average and statistical results over 30 runs on each kind of hand in terms of the accuracy rate and running time, respectively. Herein, the *pt*-value and *pw*-value denote the p-values of the *t* test [53] and the Wilcoxon rank sum test [54], respectively (the same below).

From Tables 3 and 4, it can be seen that NDDE1 has the worst accuracy among them in all cases, while NDDE2 and TSA both obtain the actual deficiency number for all hands. Meanwhile, with respect to the average running time, TSA takes the longest time in each case, and NDDE1 takes less time than NDDE2. Moreover, according to the results of the *t* test and Wilcoxon rank sum test reported in Tables 3 and 4, NDDE1 shows significant differences from TSA, and NDDE1 and NDDE2 are both significantly faster than TSA in all cases. Thus, the proposed NDDE is more effective and efficient than TSA for computing the deficiency number of one Mahjong hand.


**Table 3.** The average and statistical results of TSA, NDDE1, and NDDE2 on three kinds of hands in terms of accuracy rate.

**Table 4.** The average and statistical results of TSA, NDDE1, and NDDE2 on three kinds of hands in terms of running time.


4.2.2. Comparisons of NDDE Algorithm with Three Other Famous Metaheuristic Algorithms

To further demonstrate the benefit of the NDDE algorithm, three other famous stochastic intelligent algorithms, including PSO [15], GA [13], and TLBO [20], are also compared with it in all cases of hands above. Specifically, in these chosen compared methods, the same mapping, repairing, and evaluation methods as in the NDDE algorithm are adopted, and the size of the population and the maximum number of iterations are also set to 20 and 50, which is consistent with the setting of the NDDE algorithm. Moreover, to show the differences between these compared methods and the NDDE algorithm, the three statistical tests above are further adopted to give statistical conclusions. Tables 5 and 6 list the average and statistical results of 30 runs on each kind of hand in terms of the accuracy rate and running time, respectively, and Table 7 reports their final comparison results based on the Friedman test [55].

As seen from Tables 5 and 6, the NDDE algorithm has a better accuracy rate than PSO, GA, and TLBO in all cases, and there are significant differences between them and the NDDE algorithm according to both the *t* test and the Wilcoxon rank sum test. Meanwhile, in terms of running time, the NDDE algorithm also takes the least time on all kinds of hands and performs significantly best based on the statistical results. The reason might be that GA needs to calculate the selection probability for each individual at each generation, PSO needs to record and update the personal best individual for each solution, and TLBO additionally has to compute the mean point of the whole population and compare two target individuals based on their performances to determine the search direction. So, the NDDE algorithm has a more efficient search procedure than the others. Moreover, from Table 7, according to the Friedman test, the NDDE algorithm performs best among them on all three kinds of hands in terms of both accuracy rate and running time. Thereby, the NDDE algorithm is the most promising solver for computing the deficiency number.


**Table 5.** The average and statistical results of NDDE algorithm, PSO, GA, and TLBO on three kinds of hands in terms of accuracy rate.

**Table 6.** The average and statistical results of NDDE algorithm, PSO, GA, and TLBO on three kinds of hands in terms of running time.


**Table 7.** The final comparison results of NDDE algorithm, PSO, GA, and TLBO on all kinds of hands according to Friedman test.


Furthermore, in order to clearly illustrate the performance of the NDDE algorithm, the convergence curves of the NDDE algorithm, PSO, GA, and TLBO are also depicted in Figure 4 on six different hands, including *H*1 = (*B*4*B*4*B*6*B*6*B*6*B*7*B*7*B*7*B*7*B*8*B*9*B*9*B*9*B*9), *H*2 = (*B*3*B*5*B*5*B*5*B*5*B*6*B*6*B*6*B*7*B*7*B*8*B*8*B*9*B*9), *H*3 = (*B*1*B*2*B*4*B*5*B*5)(*C*2*C*3*C*3*C*3*C*4*C*4*C*5*C*7*C*7), *H*4 = (*B*4)(*C*1*C*1*C*3*C*4*C*4*C*4*C*4*C*5*C*7*C*8*C*8*C*9*C*9), *H*5 = (*B*4*B*6)(*C*5*C*7*C*8*C*9)(*D*1*D*1*D*2*D*2*D*3*D*7*D*7*D*8), and *H*6 = (*B*3)(*C*9)(*D*1*D*4*D*5*D*6*D*6*D*6*D*7*D*7*D*8*D*8*D*9*D*9). Herein, *H*1 and *H*2 have just one type, *H*3 and *H*4 have two types, and *H*5 and *H*6 have three types. From Figure 4, one can easily see that the NDDE algorithm always has a better convergence performance than PSO, GA, and TLBO on each hand. Therefore, the NDDE algorithm has a more promising performance.

**Figure 4.** Convergence curves of NDDE algorithm and PSO, GA, and TLBO on six test hands. (**a**) *H*1, (**b**) *H*2, (**c**) *H*3, (**d**) *H*4, (**e**) *H*5, and (**f**) *H*6.

#### *4.3. Effectiveness of NDDE on Mahjong Game Battles*

In this part, the practicality of the NDDE algorithm is further evaluated by comparing it with the tree-based search method (TSA) [10] in Mahjong battles with four players. In this test, 1000 randomly generated states of Mahjong are employed, where all players have drawn their hands and the order of tiles on the wall is fixed; for each Mahjong game, two rounds are played. Moreover, all players adopt the same strategy to make decisions for each action, such as pong, chow, and kong [10], except for the method employed to calculate the deficiency number. Specifically, for each state of the game, players 1 and 3 use the NDDE algorithm to compute the deficiency number while players 2 and 4 adopt TSA in the first round; conversely, players 1 and 3 use TSA while players 2 and 4 adopt the NDDE algorithm in the second round. The sum of the scores obtained by players 1 and 3 in the first round and by players 2 and 4 in the second round is recorded as the final score to evaluate the effectiveness of the NDDE algorithm. Herein, we set the basic score of the game to 1, and let *NP* = 20 and *Gmax* = 50 in the NDDE algorithm. After conducting these 1000 different games, the final score of the NDDE algorithm is −1.173. Importantly, it should be mentioned that the NDDE algorithm with *NP* = 20 and *Gmax* = 50 has accuracy rates of 93.325%, 99.044%, and 99.797% on the hands with one, two, and three types, respectively, as shown in Table 1. This result means that the NDDE algorithm achieves almost the same performance as TSA in real battles. Thus, the NDDE algorithm is a promising approach for calculating the deficiency number of a Mahjong hand.

#### **5. Conclusions**

In this paper, a novel DE-based approach was presented to calculate the deficiency number of one Mahjong hand, which plays an important role in Mahjong and is helpful for boosting its AI development. Concretely, in order to decrease the difficulty of computing the deficiency number, some pretreatment mechanisms were first presented to convert the original problem into a simpler combinatorial optimization problem, where the dimension of the search space of the new problem was reduced to five and the feasible solutions might have various lengths. Meanwhile, inspired by the benefits of DE, such as simplicity, ease of implementation, strong robustness, and superior performance, a novel discrete DE (NDDE) variant was specially developed for solving this new problem by devising proper initialization, a mapping solution method, a repairing solution technique, a fitness evaluation approach, and mutation and crossover operations. Compared to the existing methods for calculating the deficiency number, which all implicitly perform full searches and are thus very costly, the proposed algorithm employs the framework of a stochastic intelligent approach and converts the problem of calculating the deficiency number into a simpler one. Thereby, the proposed approach is capable of computing the deficiency number of one Mahjong hand more effectively and efficiently. Finally, the performance of the proposed algorithm was evaluated by comparing it with the tree search algorithm and three other kinds of metaheuristic methods on a large number of various test cases, and the sensitivity of the parameters involved in the NDDE algorithm was also investigated. The experimental results indicated that the proposed algorithm is more efficient and promising.

It should also be mentioned that this paper adopted only the framework of DE in the design of the algorithm, owing to its demonstrated practical success, and only the simplest version of DE was used. Therefore, in our future work, we will focus on devising other solvers for calculating the deficiency number based on other metaheuristic algorithms, existing enhanced DE variants, and discrete DE versions. Meanwhile, we will also focus on designing a hybrid method that properly integrates the merits of both metaheuristic and deterministic methods for calculating the deficiency number of a Mahjong hand.

**Author Contributions:** Conceptualization, X.Y. and Y.L.; methodology, Y.L.; software, X.Y.; validation, X.Y. and Y.L.; formal analysis, X.Y.; investigation, X.Y.; resources, Y.L.; data curation, X.Y.; writing—original draft preparation, X.Y.; writing—review and editing, X.Y.; visualization, X.Y.; supervision, Y.L.; project administration, Y.L.; funding acquisition, Y.L. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the National Natural Science Foundation of China, grant numbers 11671244 and 12071271.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Disclaimer/Publisher's Note:** The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

## *Review* **A Survey on Population-Based Deep Reinforcement Learning**

**Weifan Long 1, Taixian Hou 1, Xiaoyi Wei 1, Shichao Yan 1, Peng Zhai 1,2,3,\* and Lihua Zhang 1,4,5,\***


**Abstract:** Many real-world applications can be described as large-scale games of imperfect information, which require extensive prior domain knowledge, especially in competitive or human–AI cooperation settings. Population-based training methods have become a popular solution to learn robust policies without any prior knowledge, which can generalize to policies of other players or humans. In this survey, we shed light on population-based deep reinforcement learning (PB-DRL) algorithms, their applications, and general frameworks. We introduce several independent subject areas, including naive self-play, fictitious self-play, population-play, evolution-based training methods, and the policy-space response oracle family. These methods provide a variety of approaches to solving multi-agent problems and are useful in designing robust multi-agent reinforcement learning algorithms that can handle complex real-life situations. Finally, we discuss challenges and hot topics in PB-DRL algorithms. We hope that this brief survey can provide guidance and insights for researchers interested in PB-DRL algorithms.

**Keywords:** reinforcement learning; multi-agent reinforcement learning; self play; population play

**MSC:** 68T42

#### **1. Introduction**

Reinforcement learning (RL) [1] is a highly active research field in the machine learning community with decades of development. However, traditional RL methods have limited performance when it comes to complex, high-dimensional input spaces. Deep reinforcement learning (DRL) [2] addresses this issue by using deep neural networks as function approximators, allowing agents to use unstructured data for decision-making. DRL has shown impressive performance on a range of tasks, including game playing, robotics, and autonomous driving. Many impressive results in different fields have been achieved through DRL, such as gaming (AlphaGo [3], AlphaZero [4], AlphaStar [5]), nuclear energy (fusion control [6]), and mathematics (AlphaTensor [7]). While DRL has become increasingly popular due to its effectiveness and generality, many real-world applications require multiple agents to cooperate or compete. A multi-agent system is usually employed to study problems that are difficult or impossible for a single agent. Multi-agent reinforcement learning (MARL) is one of the effective approaches to multi-agent system problems [8]; it has been used to address problems in a variety of domains, including robotics, distributed control, telecommunications, and economics [9]. However, many real-world applications can be described as large-scale games of imperfect information, which require a lot of prior domain knowledge to compute a Nash equilibrium, especially in competitive environments. This can be a major challenge for traditional MARL methods. Population-based deep reinforcement learning (PB-DRL) has emerged as a popular solution, allowing robust policies to be trained without any prior domain knowledge and to generalize to the policies of other players.

**Citation:** Long, W.; Hou, T.; Wei, X.; Yan, S.; Zhai, P.; Zhang, L. A Survey on Population-Based Deep Reinforcement Learning. *Mathematics* **2023**, *11*, 2234. https://doi.org/ 10.3390/math11102234

Academic Editors: Jian Dong and Marjan Mernik

Received: 31 March 2023 Revised: 7 May 2023 Accepted: 8 May 2023 Published: 10 May 2023

**Copyright:** © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

Population-based approaches take advantage of the high parallelism and large search spaces typically found in optimization problems. These methods have demonstrated remarkable performance in MARL, producing exceptional agents such as Pluribus [10] and OpenAI Five [11] without any expert experience. Czarnecki et al. [12] conducted research on the importance of population-based training techniques for large-scale multi-agent environments, which can include both virtual and real-world scenarios. According to their study, population-based methods offer several benefits for training robust policies in such environments, including diversity of strategies, scalability, and adaptability to changing conditions. The population diversity in PB-DRL allows for a more robust policy because it can handle a wider range of situations and scenarios. This can be particularly important in real-world applications, where there may be plenty of variability and uncertainty. By using a population-based approach, the policy can be trained to be more robust and adaptable to different situations, which is crucial for success in real-world applications.

In contrast to prior surveys, the motivation of our survey lies in recent PB-DRL algorithms and applications specifically, which have helped to achieve surprisingly outstanding performance. Accordingly, we also introduce general frameworks of PB-DRL. In this survey, we give an account of PB-DRL approaches and associated methods. We start by reviewing selected ideas from game theory and multi-agent reinforcement learning. Then, we move on to present several recent promising approaches, applications, and frameworks of PB-DRL. Here, we first give the idea of the milestones and briefly describe the other works in each kind of PB-DRL; applications of the corresponding methods are then introduced. Finally, we finish by discussing challenges and hot topics in PB-DRL. Given the extensive literature from the MARL community and adjacent fields, we acknowledge that our survey is not exhaustive and may not cover all prior work. Specifically, our focus is on PB-DRL algorithms, including multi-agent reinforcement learning using self-play-related technology, evolution-based training methods for reinforcement learning, and general frameworks.

We followed the guidelines for Systematic Literature Reviews (SLRs) [13] to conduct a comprehensive literature search on PB-DRL. We searched four databases that are widely used and recognized in the field: Google Scholar, IEEE Xplore, ACM Digital Library, and DataBase Systems and Logic Programming (DBLP). We used advanced search options to limit the results to peer-reviewed articles published in English from 2018 to 2023, as this survey focuses on recent works, and we used the snowballing method to cover earlier milestones. We used the following search string: ("population-based" AND "reinforcement learning") OR ("evolution algorithm" AND "reinforcement learning") OR ("self-play" AND "reinforcement learning"). The initial search yielded 200 papers from Google Scholar (capturing the first 20 pages of search results), 127 papers from IEEE Xplore, 381 papers from ACM Digital Library, and 80 papers from DBLP, resulting in a total of 788 papers before screening. We screened the titles and abstracts of these papers based on their relevance to our survey topic, namely PB-DRL methods and applications. We used the following inclusion criteria: (1) the paper must focus on PB-DRL or a related concept (e.g., evolutionary algorithms, self-play); (2) the paper must report novel approaches, significant results, or comparative evaluations related to PB-DRL. To ensure the high quality of the references, we also screened by publisher; well-known conferences and journals such as Nature, Science, NeurIPS, ICML, and ICLR were included. After screening, we retained nearly 30 papers for further analysis and excluded the others for various reasons, such as being out of scope, being duplicates, or being of low quality. We then used snowballing to complement our database search and ensure that we did not miss any relevant studies, i.e., we checked the references of the included papers to identify additional relevant studies. Our aim is to provide an overview of recent developments in PB-DRL, and we hope that it can garner more attention and be applied in various industries.

#### **2. Background**

Population-based deep reinforcement learning is an approach that addresses the limitations of traditional reinforcement learning algorithms. PB-DRL algorithms maintain a population of agents that explore the environment in parallel and learn from each other to improve their collective performance. This approach has shown significant improvements in sample efficiency and generalization in various applications. In this section, we review the background knowledge needed to understand PB-DRL algorithms.

#### *2.1. Game Theory*

Game theory and MARL are closely related fields: game theory provides a theoretical framework for analyzing and understanding strategic interactions between multiple agents, while MARL provides a practical approach for agents to learn optimal strategies through experience. The research covered in this survey usually considers normal-form or extensive-form games. A normal-form game is a game in which players make decisions simultaneously without knowing the decisions made by the other players, whereas an extensive-form game is a game in which players make decisions sequentially and can observe the decisions made by other players before making their own.

Normal-form games represent the game as a matrix, described by a tuple (*π*, *U*, *n*), where *n* is the number of players, *π* = (*π*<sub>1</sub>, ... , *π*<sub>*n*</sub>) is a collection of strategies for all players, and *U* : *π* → **R**<sup>*n*</sup> is a payoff table mapping each joint policy to a scalar utility for each player. Each player aims to maximize their own payoff. Assume that *π*<sub>*i*</sub> is a mixed strategy of player *i* and *π*<sub>−*i*</sub> refers to the joint mixed strategies of all players except player *i*. *U*<sub>*i*</sub>(*π*<sub>*i*</sub>, *π*<sub>−*i*</sub>) is the expected payoff of player *i*. Then, the Nash equilibrium can be defined.

**Definition 1** (Nash equilibrium)**.** *A mixed strategy profile* *π*<sup>∗</sup> = (*π*<sub>1</sub><sup>∗</sup>, ... , *π*<sub>*n*</sub><sup>∗</sup>) *is a Nash equilibrium if, for every player i:*

$$\max\_{\pi\_i'} \mathcal{U}\_i(\pi\_i', \pi\_{-i}^\*) = \mathcal{U}\_i(\pi\_i^\*, \pi\_{-i}^\*)$$

Intuitively, in a Nash equilibrium, no player has an incentive to change their current strategy unilaterally, because doing so would result in no benefits or even negative returns. Therefore, the strategy profile remains stable. *π*<sub>*i*</sub><sup>∗</sup> is the best response (BR) of agent *i*.

In extensive-form games, players make decisions sequentially, with each player's action influencing the subsequent decisions of other players. These games generalize the normal-form game formalism to sequential decision-making scenarios. Every finite extensive-form game has an equivalent normal-form game [14], and an approximation of the Nash equilibrium, called the *ε*-Nash equilibrium or approximate Nash equilibrium, is typically considered. The corresponding BR is referred to as the *ε*-best response.

**Definition 2** (*ε*-Nash equilibrium)**.** *A mixed strategy profile* *π*<sup>∗</sup> = (*π*<sub>1</sub><sup>∗</sup>, ... , *π*<sub>*n*</sub><sup>∗</sup>) *is an ε-Nash equilibrium if, for every player i:*

$$\mathcal{U}\_i(\pi\_i^\*, \pi\_{-i}^\*) \ge \mathcal{U}\_i(\pi\_i', \pi\_{-i}^\*) - \varepsilon, \quad \text{for any policy } \pi\_i' \text{ of player } i$$
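To make Definitions 1 and 2 concrete, the following sketch (ours; matching pennies is used as an assumed example game) computes the smallest *ε* for which a given mixed-strategy profile is an *ε*-Nash equilibrium of a two-player zero-sum normal-form game:

```python
import numpy as np

# Matching pennies: A[i, j] is player 1's payoff; player 2 receives -A[i, j].
A = np.array([[1.0, -1.0],
              [-1.0, 1.0]])

def epsilon(p1, p2):
    """Smallest eps such that (p1, p2) is an eps-Nash equilibrium."""
    u1 = p1 @ A @ p2           # player 1's expected payoff
    u2 = -u1                   # zero-sum game
    br1 = np.max(A @ p2)       # best-response value for player 1
    br2 = np.max(-(p1 @ A))    # best-response value for player 2
    return max(br1 - u1, br2 - u2)

print(epsilon(np.array([0.5, 0.5]), np.array([0.5, 0.5])))  # 0.0: exact Nash
print(epsilon(np.array([0.6, 0.4]), np.array([0.5, 0.5])))  # 0.2: exploitable
```

The quantity returned here is also exactly the "exploitability" that several of the algorithms discussed below try to drive to zero.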

In the context of game theory, a real-life game can be incredibly complex, making it impractical to explicitly list all the possible strategies. As a result, researchers often construct an "empirical game" that is smaller in size but still captures the essential features of the full game. This empirical game is built by discovering strategies and using meta-reasoning techniques to navigate the strategy space. In the general framework of PB-DRL algorithms, this empirical game is also known as the "meta-game". It starts with a single policy and grows by adding policies that approximate the best responses to the meta-strategies of the other players. In other words, the meta-game is a simplified version of the real game that captures its essential features, allowing PB-DRL algorithms to learn effective strategies in a computationally efficient manner.

#### *2.2. Multi-Agent Reinforcement Learning*

Reinforcement learning is a learning approach that aims to find the optimal way of mapping situations to actions to maximize a numerical reward signal [1]. It is often formalized as a Markov Decision Process (MDP) that addresses sequential decision-making problems in environments with discrete time-steps. Deep reinforcement learning combines RL with deep neural networks, resulting in improved performance compared to RL without neural networks. DRL can handle larger state spaces, enabling it to process larger input matrices such as images, and it can learn more complex policies with fewer hand-specified features to represent state information [15,16]. Some well-known DRL algorithms used in various applications include DQN [17], TD3 [18], and PPO [19].

Multi-agent reinforcement learning refers to sequential decision-making with multiple agents, which poses additional challenges due to the fact that taking action can influence the rewards of other agents. MARL tasks are commonly divided into cooperative (the agents work together to reach a common goal, like Overcooked), competitive (each agent has its own reward function and acts selfishly to maximize only its own expected cumulative reward, like Go), and mixed settings (each agent has an arbitrary but agent-unique reward function, like football). PB-DRL has shown good performance in applications with these settings [4,20,21].

MARL algorithms can be classified into six different learning paradigms, as outlined by Yang et al. [22]: (1) independent learners with a shared policy, (2) independent learners with independent policies, (3) independent learners with a shared policy within a group, (4) one central controller controlling all agents, (5) centralized training with decentralized execution (CTDE), and (6) decentralized training with networked agents. The paradigms with independent learners (types 1–3) use independent reinforcement learning, where each agent treats the experience of other agents as part of its (non-stationary) environment [23,24]. This makes it a suitable approach for problems with a large number of agents or a high-dimensional action space. However, it may result in overfitting to the other agents during training and insufficient generalization during execution [25]. The fourth paradigm can be seen as a single-agent RL method, which struggles with large-scale strategy spaces. Type 5, CTDE, is a popular approach in which agents can exchange information with others during training and act independently during execution [26,27]. This allows for efficient training but may suffer from communication overhead during execution. Decentralized training with networked agents, type 6, involves each agent learning its own policy based on local observations and communication with its neighbors in a networked structure [28]. This can improve scalability and adaptability but may require significant communication among agents.

PB-DRL algorithms belong to the CTDE paradigm and maintain a large, diverse population to train a robust policy that can generalize to non-communication situations. In PB-DRL, a population of agents explores the environment in parallel, allowing them to learn from each other to improve their collective performance. This approach has been shown to be more sample-efficient and to generalize better in various applications, including cooperative, competitive, and mixed settings, which makes it a promising technique for addressing the challenges in multi-agent reinforcement learning.

#### **3. Population-Based Deep Reinforcement Learning**

In recent years, PB-DRL has emerged as a promising research direction in the field of DRL. The key idea behind PB-DRL is to employ multiple agents or learners that interact with their environment in parallel and exchange information to improve their performance. This approach has shown great potential in achieving superior results compared to traditional single-agent methods. In this survey paper, we focus on several popular population-based methods, including naive self-play, fictitious self-play, population play, evolutionary-based methods, and the general framework. We will discuss the basic concepts, advantages, and limitations of each method to provide a comprehensive overview of the current state of the art in this field. For a brief overview of a selected subset of methods for PB-DRL, see also Table 1.

**Table 1.** A selected subset of research areas and recent algorithms or frameworks for PB-DRL.


#### *3.1. Naive Self-Play*

Self-play (SP) is an open-ended learning training scheme in which an agent trains by playing against a mirrored copy of itself, without any supervision, in various stochastic environments. Compared with training against expert opponents, SP has shown superior performance on many complex problems. The simplest and most effective SP method is naive self-play, first proposed in [29]. As shown in Figure 1, the opponent (mirrored agent) uses the same policy network, i.e., the opponent downloads the latest policy network whenever the agent updates its policy network. Denote *π* as the policy being trained, *π*<sup>zoo</sup> as the policy zoo, *π*′ as the policy set of the opponents, Ω as the policy sampling distribution, and *G* as the gating function for *π*<sup>zoo</sup> [45]. The policy sampling distribution Ω is

$$\Omega(\pi'|\pi^{zoo},\pi) = \begin{cases} 1, & \forall \pi' \in \pi^{zoo}: \pi' = \pi \\ 0, & \text{otherwise} \end{cases} \tag{1}$$

Since the policy zoo *π*<sup>zoo</sup> keeps only the latest version of the policy *π*, it always clears the old policies and inserts *π*, i.e., *π*<sup>zoo</sup> = {*π*}.
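As a minimal illustration of Equation (1) (the class and method names are ours), the naive self-play zoo degenerates to a single slot that always holds the latest policy:

```python
class NaiveSelfPlayZoo:
    """Policy zoo for naive self-play: only the latest policy survives."""

    def __init__(self, policy):
        self.zoo = [policy]

    def sample_opponent(self):
        # Equation (1): all probability mass lies on the current policy.
        return self.zoo[-1]

    def update(self, policy):
        # Clear the old policies and insert the new one: zoo = {pi}.
        self.zoo = [policy]
```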

**Figure 1.** Overview of naive self-play.

A variety of works have followed this method, since naive self-play is simple and effective. TD-Gammon [46] features naive SP to learn a policy using the TD(*λ*) algorithm; at the time, this work outperformed supervised learning with expert experience. AlphaGo [3], which defeated the world champion of Go in 2017, uses a combination of supervised learning on expert datasets and SP technology; SP is used to update the policy and to generate more data. SP-based applications have developed rapidly in both academia and industry. One year after AlphaGo, AlphaZero [47] gained prominence. In contrast to AlphaGo, AlphaZero does not require domain-specific human knowledge yet achieves outstanding performance. Instead, it learns the game policy by playing against itself, using only the game rules.

Naive SP is also a solution for handling many-to-many environments, as demonstrated by JueWu [48], which uses this approach for two players each controlling five heroes in Honor of Kings during the lineup and random lineup stages. Another study applied naive SP to an open-ended environment (hide-and-seek [49]), showing that it can lead to emergent auto-curricula with many distinct and compounding phase shifts in agent strategy.

Despite its effectiveness, naive SP may not be sufficient to learn a robust policy due to the lack of diversity in opponent policies. Fictitious self-play is a solution to this problem, where the agent plays against a mixture of its previous policies and fictional policies that are generated by sampling from a distribution over policies learned during training.

#### *3.2. Fictitious Self-Play*

Fictitious play, introduced by Brown [30], is a popular method for learning a Nash equilibrium in normal-form games. The premise is that players repeatedly play a game and, at each iteration, choose the best response to a uniform mixture of all previous policies. As shown in Figure 2, fictitious self-play (FSP) [31] is a machine learning framework that implements generalized weakened fictitious play in behavioral strategies in a sample-based fashion. It can avoid cycles by playing against all previous policies. FSP iteratively samples episodes of the game from SP. These episodes constitute datasets that are used by reinforcement learning to compute approximate best responses and by supervised learning to compute perturbed models of average strategies.

**Figure 2.** Overview of fictitious self-play.

Neural fictitious self-play (NFSP) [32] combines FSP with neural network function approximation. NFSP keeps two kinds of memories: one, denoted M<sub>*RL*</sub>, stores the experience of game transitions, while the other, M<sub>*SL*</sub>, stores best-response behavior. Each agent computes an approximate best response *β* from M<sub>*RL*</sub> and updates its average policy Π by supervised learning from M<sub>*SL*</sub>. In principle, each agent could learn a best response by playing against the average policies of the other agents. However, an agent cannot access its own best-response policy *β*, which is needed to train its average policy Π, while its average policy Π is in turn needed for the best-response training of the other agents. NFSP therefore uses an approximation of the anticipatory dynamics of continuous-time dynamic fictitious play [50], in which players choose the best response to the short-term predicted average policy of their opponents, Π<sub>*t*</sub><sup>−*i*</sup> + *η*(*d*/*dt*)Π<sub>*t*</sub><sup>−*i*</sup>, where *η* is the anticipatory parameter. NFSP takes *β*<sub>*t*+1</sub> − Π<sub>*t*</sub> ≈ (*d*/*dt*)Π<sub>*t*</sub> as a discrete-time approximation. During play, all agents mix their actions according to *σ* = Π + *η*(*β* − Π). With this approach, each agent can learn an approximate best response to the predicted average policies of its opponents. In other words, the policy sampling distribution Ω of all agents is

$$\Omega(\pi) = \begin{cases} \beta, & \text{with probability } \eta \\ \Pi, & \text{with probability } 1-\eta \end{cases} \tag{2}$$

M<sub>*RL*</sub> uses a circular buffer to store a transition at every step, but M<sub>*SL*</sub> only stores transitions while the agent follows the best-response policy *β*.
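A minimal sketch of this action-selection and memory scheme is given below (ours; `best_response` and `avg_policy` are assumed callables mapping a state to an action, and the buffer sizes are illustrative):

```python
import random
from collections import deque

def nfsp_step(state, best_response, avg_policy, eta, m_rl, m_sl):
    """One NFSP action choice under sigma = Pi + eta * (beta - Pi)."""
    following_br = random.random() < eta
    action = best_response(state) if following_br else avg_policy(state)
    m_rl.append((state, action))       # M_RL: circular buffer, every step
    if following_br:
        m_sl.append((state, action))   # M_SL: best-response behavior only
    return action

m_rl = deque(maxlen=200_000)  # circular RL memory (size is an assumption)
m_sl = []                     # SL memory, simplified here to a plain list
```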

The Exploitability Descent (ED) algorithm [33] is a PB-DRL method that directly optimizes policies against worst-case opponents without the need to compute average policies. In contrast to the NFSP algorithm, which requires a large reservoir buffer to compute an approximate equilibrium, ED focuses on decreasing the "exploitability" of each player, i.e., how much a player could gain by switching to a best response. The algorithm has two steps for each player on each iteration. The first step is identical to the FP algorithm: the best response to the policy of each player is computed. The second step performs gradient ascent on the policy to increase the utility of each player against the respective best responder, aiming to decrease the exploitability of each player. In a tabular setting with Q-values and L2 projection, the policy gradient ascent update is defined by the equation

$$\begin{split} \theta\_S^t &= P\_{\ell\_2}\big(\theta\_S^{t-1} + \alpha^t \langle \nabla\_{\theta\_S} \pi\_{\theta}^{t-1}(S),\, Q^b(S) \rangle\big) \\ &= P\_{\ell\_2}\big(\theta\_S^{t-1} + \alpha^t Q^b(S)\big), \end{split} \tag{3}$$

where *Q*<sup>*b*</sup>(*S*) is the expected return at state *S* under the joint policy set *b*, *P*<sub>ℓ2</sub> is the L2 projection, ∇<sub>*θ*<sub>*S*</sub></sub>*π*<sub>*θ*</sub><sup>*t*−1</sup>(*S*) is an identity matrix, and *α* is the step size. In other words, the ED algorithm directly optimizes policies against worst-case opponents, making it a promising approach for addressing games with complex strategy spaces.
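A tabular sketch of the update in Equation (3) is shown below (ours); the Euclidean projection onto the probability simplex plays the role of *P*<sub>ℓ2</sub>, and the Q-values are assumed to come from evaluating the best-responding opponents:

```python
import numpy as np

def project_simplex(v):
    """Euclidean (L2) projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def ed_update(policy_s, q_s, alpha):
    """Equation (3): ascend along the Q-values at state s, then project."""
    return project_simplex(policy_s + alpha * q_s)

print(ed_update(np.array([0.5, 0.5]), np.array([1.0, 0.0]), alpha=0.2))
# -> [0.6 0.4]: the policy shifts toward the higher-valued action
```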

A related approach from another perspective is *δ*-Uniform FSP [34], which learns a policy that can beat older versions of itself sampled uniformly at random. The authors use a percentage threshold *δ* ∈ [0, 1] to select the old policies that are eligible for sampling from the policy zoo *π*<sup>zoo</sup>, i.e., the opponent strategy *π*′ is sampled from

$$\Omega(\pi'|\pi^{zoo}, \pi) = \mathrm{Uniform}\big(\delta|\pi^{zoo}|,\ |\pi^{zoo}|\big) \tag{4}$$

Notably, the algorithm is the same as naive SP when *δ* = 1. After every episode, the training policy is always inserted into the policy zoo *π*<sup>zoo</sup>. Thus, *π*<sup>zoo</sup> is updated with *π*<sup>zoo</sup> = *π*<sup>zoo</sup> ∪ {*π*}.
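In code, Equation (4) amounts to sampling an index from the newest part of the zoo (a sketch under our naming; `rng` is a NumPy random generator):

```python
import numpy as np

def sample_delta_uniform(zoo, delta, rng):
    """Equation (4): sample uniformly among the newest policies.

    delta = 0 makes the whole zoo eligible; delta = 1 leaves only the
    latest policy, recovering naive self-play.
    """
    lo = int(delta * (len(zoo) - 1))
    return zoo[rng.integers(lo, len(zoo))]

rng = np.random.default_rng(0)
opponent = sample_delta_uniform(list(range(10)), delta=0.8, rng=rng)
```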

While AlphaStar does use FSP as one of its learning algorithms, Prioritized FSP (PFSP) is actually a modification proposed by the AlphaStar team in their subsequent paper [5]. The authors argue that, under regular FSP, many games are wasted against players that are defeated in almost 100% of games, and they propose Prioritized FSP, which samples policies by their expected win rate. Policies that are expected to win with higher probability against the current agent have higher priority and are sampled more frequently. The opponent sampling distribution Ω can be written as

$$\Omega(\pi'|\pi^{zoo},\pi) = \frac{f(P(\pi \text{ beats } \pi'))}{\sum\_{\pi'' \in \pi^{zoo}} f(P(\pi \text{ beats } \pi''))} \tag{5}$$

where *f* is a weighting function, e.g., *f*(*x*) = (1 − *x*)<sup>*p*</sup>. The policy zoo, named the league in the paper, is complex; we will introduce its update method later.
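A sketch of the resulting sampling distribution follows (ours; the win-rate estimates are assumed to be tracked elsewhere):

```python
import numpy as np

def pfsp_distribution(p_beat, p=2.0):
    """Equation (5) with f(x) = (1 - x) ** p.

    p_beat[i] estimates the probability that the learner beats zoo
    policy i; hard opponents (low p_beat) are sampled more often.
    """
    weights = (1.0 - np.asarray(p_beat)) ** p
    return weights / weights.sum()

# Nearly all sampling mass goes to the opponent we beat only 10% of the time.
print(pfsp_distribution([0.95, 0.50, 0.10]))  # ~[0.002, 0.235, 0.762]
```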

OpenAI Five also employs a similar method, as described in [11]. The method consists of training with a naive self-play approach for 80% of the games and using past sampled policies for the remaining 20%. Similar to the Prioritized FSP method, OpenAI Five uses a dynamic sampling system that relies on a dynamically generated quality score *q*. This system samples opponent agents according to a softmax distribution, where the probability *p* of choosing an opponent is proportional to *e*<sup>*q*</sup>. If OpenAI Five wins the game, *q* is updated with a learning rate constant *η* as follows:

$$q = q - \frac{\eta}{Np} \tag{6}$$

where *N* is the size of the policy zoo. Every 10 iterations, the policy of the current agent is added to the policy zoo with an initial quality score equal to the maximum quality score in the zoo.
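A sketch of this quality-score mechanism (ours; the learning-rate value is illustrative):

```python
import numpy as np

def sample_opponent(q, rng):
    """Sample an opponent index with probability proportional to e^q."""
    probs = np.exp(q - q.max())
    probs /= probs.sum()
    return rng.choice(len(q), p=probs), probs

def on_victory(q, i, probs, eta=0.01):
    """Equation (6): lower the quality score of the defeated opponent i."""
    q[i] -= eta / (len(q) * probs[i])

q = np.zeros(4)                      # initial quality scores
rng = np.random.default_rng(0)
i, probs = sample_opponent(q, rng)
on_victory(q, i, probs)              # the beaten opponent is sampled less often
```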

While self-play can bring remarkable performance improvements to reinforcement learning, it performs poorly in non-transitive games because it always plays against itself. Specifically, the opponent's policy is sampled from only one policy, which means the training agent only learns from a single type of opponent. This approach works well in situations where a higher-ranked player can always beat a lower-ranked player. Population-based training methods yield more robust policies.

#### *3.3. Population-Play*

Another population-based method for multi-agent systems is population-play (PP), which builds upon the concept of SP to involve multiple players and their past generations [5,35], as shown in Figure 3. With PP, a group of agents is developed and trained to compete not only with each other but also with agents from prior generations.

**Figure 3.** Overview of (naive) population-play.

To train an exceptional agent, AlphaStar [5] maintains three types of opponent pools: Main Agents, League Exploiters, and Main Exploiters. Main Agents are trained with a combination of 35% SP and 50% PFSP against all past players in the league; the agent plays the remaining 15% of matches against opponents that it could previously beat but can no longer defeat, as well as past opponents that previously exploited its weaknesses. League Exploiters are used to find a policy that league agents cannot defeat. They are trained using PFSP against agents in the league and are added to the league if they defeat all agents in the league with a winning rate of more than 70%. Main Exploiters play against Main Agents to identify their weaknesses. If the current probability of winning is less than 20%, Main Exploiters employ PFSP against players created by Main Agents; otherwise, Main Exploiters play directly against the current Main Agents.

For the Win (FTW) [35] is a training method designed for the game of Capture the Flag, which involves training a diverse population of different agents by having them learn from playing with each other. The training process involves sampling agents from the population to play as teammates and opponents, which is done using a stochastic matchmaking scheme that biases co-players to be of similar skill to the player. This ensures that a diverse set of teammates and opponents participate in training, and helps to promote robustness in the learned policies. A population-based training method is implemented to enhance the performance of weaker players and improve the overall ability of all players.

PP can accommodate a wide range of agents, making it also suitable for deployment in cooperative settings. However, Siu et al. [51] observed that in such scenarios, human players tended to favor rule-based agents over RL-based ones. This finding highlights the need to take into account human perceptions of AI when designing and developing systems intended for real-world adoption.

To address this issue, fictitious co-play (FCP) [20] aims to produce robust partners that can assist humans with different styles and skill levels without relying on human-generated data (i.e., zero-shot coordination with humans). FCP is a two-stage approach. In the first stage, N partner agents are trained independently in self-play to create a diverse pool of partners. In the second stage, FCP trains a best-response agent against the diverse pool to achieve robustness. Hidden-utility self-play [36] follows the FCP framework and uses a hidden reward function to model human bias with domain knowledge to solve the human–AI cooperation problem. A similar work for assistive robots learns a good latent representation for human policies [52].

#### *3.4. Evolution-Based Training Methods*

Evolutionary algorithms are a family of optimization algorithms inspired by the process of natural selection. They involve generating a population of candidate solutions and iteratively improving them by applying operators such as mutation, crossover, and selection, which mimic the processes of variation, reproduction, and selection in biological evolution. These algorithms are widely used for solving complex optimization problems in various fields, including engineering, finance, and computer science. Evolution-based DRL is a type of PB-DRL that approaches training from an evolutionary perspective and often incorporates swarm intelligence techniques, particularly evolutionary algorithms. In this subsection, we focus on recent hybrid DRL algorithms that combine evolutionary approaches with deep reinforcement learning to accelerate the training phase. These algorithms can be used alongside SP or PP algorithms [35].

Population-based training (PBT), introduced in [37], is an online evolutionary process that adapts internal rewards and hyperparameters while performing model selection. Multiple agents are trained in parallel and periodically exchange information by copying weights and hyperparameters: the agents evaluate their performance, and underperforming agents are replaced by mutated versions of better-performing agents. This process continues until a satisfactory performance is achieved or a maximum budget is reached.
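A single exploit/explore round of PBT can be sketched as follows (ours; the 20% fraction and the perturbation factors are common choices, not prescribed by [37]):

```python
import copy
import random

def pbt_round(population, frac=0.2, perturb=(0.8, 1.2)):
    """One PBT round: population is a list of dicts with keys
    'weights', 'hparams', and 'score'."""
    ranked = sorted(population, key=lambda member: member['score'])
    k = max(1, int(frac * len(ranked)))
    for loser, winner in zip(ranked[:k], ranked[-k:]):
        loser['weights'] = copy.deepcopy(winner['weights'])        # exploit
        loser['hparams'] = {name: value * random.choice(perturb)   # explore
                            for name, value in winner['hparams'].items()}
```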

Majumdar et al. [38] propose multi-agent evolutionary reinforcement learning (MERL) as a solution to the sample inefficiency of PBT in cooperative MARL environments where the team reward is sparse and the agent-specific rewards are dense. MERL is a split-level training platform that combines both gradient-based and gradient-free optimization methods, without requiring domain-specific reward shaping. The gradient-free optimizer is used to maximize the team objective by employing an evolutionary algorithm. Specifically, the evolutionary population maintains a variety of teams and uses evolutionary algorithms to maximize team rewards (fitness). The gradient-based optimizer maximizes the local reward of each agent by using a replay buffer shared with the other team members in the evolutionary population. Collaborative evolutionary reinforcement learning (CERL) [39] is a similar work that addresses the sample inefficiency of PBT. It uses a collective replay buffer to share all information across the population.

Deep evolutionary reinforcement learning (DERL) [40] is a framework for creating embodied agents that combines evolutionary algorithms with DRL, aiming to find diverse solutions. DERL decouples the processes of learning and evolution in a distributed asynchronous manner, using tournament-based steady-state evolution. Similar to PBT [37], DERL maintains a population to encourage diverse solutions. The average final reward is used as the fitness function, and a tournament-based selection method is used to choose the parents for generating children via mutation operations. Liu et al. [53] demonstrated that end-to-end PBT can lead to emergent cooperative behaviors in the soccer domain. They also applied an evaluation scheme based on Nash averaging to address the diversity and exploitability problem.

#### *3.5. General Framework*

The policy-space response oracles (PSRO) framework is currently the most widely used general framework for PB-DRL. It unifies various population-based methods, such as SP and PP, with empirical game theory to solve games effectively [25]. As shown in Figure 4, PSRO divides these algorithms into three modules: meta-strategy, best-response solution, and policy zoo expansion. The first module, meta-strategy, involves solving the meta-game using a meta-solver to obtain the meta-strategy (policy distribution) over each policy zoo. The second module, best-response solution, involves each agent sampling the policies of the other agents *π*<sub>−*i*</sub> and computing its best response *π*<sub>*i*</sub> with *π*<sub>−*i*</sub> fixed. The third module, policy zoo expansion, involves adding the best response to the corresponding policy zoo. The process starts with a single policy. In each episode, one player trains its policy *π*<sub>*i*</sub> against a fixed policy set, which is sampled from the meta-strategies of its opponents (*π*<sub>−*i*</sub> ∼ *π*<sup>zoo</sup><sub>−*i*</sub>). At the end of every epoch, each policy zoo expands by adding the approximate best response to the meta-strategy of the other players, and the expected utilities for the new policy combinations, computed via simulation, are added to the payoff matrix.
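The three modules fit into a short loop; the skeleton below (ours) assumes a two-player zero-sum setting, with `best_response`, `evaluate`, and `meta_solver` supplied as stubs for the oracle, the simulator, and the meta-game solver, respectively:

```python
import numpy as np

def psro(init_policy, best_response, evaluate, meta_solver, epochs):
    """Two-player PSRO skeleton.

    best_response(zoo, sigma) -> new policy vs. the mixture sigma over zoo
    evaluate(p, q)            -> player 1's expected utility of p vs. q
    meta_solver(payoffs)      -> meta-strategies (sigma1, sigma2)
    """
    zoos = [[init_policy], [init_policy]]
    payoffs = np.array([[evaluate(init_policy, init_policy)]])
    for _ in range(epochs):
        sigma1, sigma2 = meta_solver(payoffs)   # 1. solve the meta-game
        br1 = best_response(zoos[1], sigma2)    # 2. approximate best responses
        br2 = best_response(zoos[0], sigma1)
        zoos[0].append(br1)                     # 3. expand the policy zoos
        zoos[1].append(br2)
        # Re-simulate the payoff matrix for the enlarged zoos.
        payoffs = np.array([[evaluate(p, q) for q in zoos[1]]
                            for p in zoos[0]])
    return zoos, payoffs
```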

**Figure 4.** Overview of PSRO.

Although PSRO has demonstrated its performance, several drawbacks have been identified and addressed by recent research. One such extension is the Rectified Nash response (PSRO<sub>*rN*</sub>) [41], which addresses the diversity issue and introduces adaptive sequences of objectives that facilitate open-ended learning. The effective diversity of the population is defined as:

$$d(\pi^{zoo}) = \sum\_{i,j=1}^{n} \lfloor \phi(w\_i, w\_j) \rfloor\_+ \cdot p\_i \cdot p\_j \tag{7}$$

where *n* = |*π*<sup>zoo</sup>|, *φ*(*w*<sub>*i*</sub>, *w*<sub>*j*</sub>) is the payoff of policy *w*<sub>*i*</sub> against policy *w*<sub>*j*</sub>, *p* is the Nash equilibrium on *π*<sup>zoo</sup>, and ⌊*x*⌋<sub>+</sub> is the rectifier, denoted by ⌊*x*⌋<sub>+</sub> = *x* if *x* ≥ 0 and ⌊*x*⌋<sub>+</sub> = 0 otherwise. Equation (7) encourages agents to play against opponents whom they can beat. Perhaps surprisingly, the authors found that building objectives around the weaknesses of agents does not actually encourage diverse skills. To elaborate, when the weaknesses of an agent are emphasized during training, the gradients that guide its policy updates will be biased towards improving those weaknesses, potentially leading to overfitting to a narrow subset of the state space. This can result in a lack of diversity in the learned policies and a failure to generalize to novel situations. Several other works have also focused on the diversity aspect of PSRO frameworks. In [42], the authors propose a geometric interpretation of behavioral diversity in games (Diverse PSRO) and introduce a novel diversity metric that uses a determinantal point process (DPP). The diversity metric is based on the expected cardinality of random samples from a DPP in which the ground set is the strategy population. It is denoted as:

$$\text{Diversity}(\pi^{\mathsf{zoo}}) = E\_{\pi' \sim \mathbb{P}\_{L\_{\pi^{\mathsf{zoo}}}}} \left[ |\pi'| \right] = \text{Tr} (I - (L\_{\pi^{\mathsf{zoo}}} + I)^{-1}),\tag{8}$$

where the DPP defines a probability measure ℙ, *π*′ is a random subset drawn from the DPP, and *L*<sub>*π*<sup>zoo</sup></sub> is the DPP kernel. They incorporate this diversity metric into best-response dynamics to improve overall diversity. Similarly, [54] notes the absence of widely accepted definitions of diversity and offers a redefined behavioral diversity measure. The authors propose response diversity as another way to characterize diversity, through the responses of policies when facing different opponents.
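Both diversity metrics reduce to a few lines of linear algebra; the sketch below (ours) assumes the payoff matrix *φ*, the Nash distribution *p*, and the DPP kernel *L* have already been estimated:

```python
import numpy as np

def effective_diversity(phi, p):
    """Equation (7): rectified payoffs weighted by the Nash distribution p."""
    return float(p @ np.maximum(phi, 0.0) @ p)

def dpp_diversity(L):
    """Equation (8): expected cardinality of a DPP with kernel L,
    i.e., Tr(I - (L + I)^{-1})."""
    n = L.shape[0]
    return float(np.trace(np.eye(n) - np.linalg.inv(L + np.eye(n))))
```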

Pipeline PSRO [43] is a scalable method that aims to improve the efficiency of PSRO in finding an approximate Nash equilibrium, a common problem of most PSRO-related frameworks. It achieves this by maintaining a hierarchical pipeline of reinforcement learning workers, allowing it to parallelize PSRO while ensuring convergence. The method includes two classes of policies: fixed and active. Active policies are trained in a hierarchical pipeline, while fixed policies are not trained further. When the performance improvement of the lowest-level active worker in the pipeline does not meet a given threshold within a certain time period, that policy becomes fixed, and a new active policy is added to the pipeline. Another work has improved the computational efficiency and exploration efficiency by introducing a new subroutine of no-regret optimization [55].

The PSRO framework has another branch, which optimizes the meta-solver concept. Alpha-PSRO [44] extends the original PSRO paper to apply readily to general-sum, many-player settings, using *α*-Rank [56], a ranking method that considers all pairwise comparisons between policies, as the meta-solver. Alpha-PSRO defines the preference-based best response (PBR), an oracle that finds policies that maximize their rank against the population. Alpha-PSRO works by expanding the strategy pool through constructing a meta-game and calculating a payoff matrix. The meta-game is then solved to obtain a meta-strategy, and finally, a best response is calculated to find an approximately optimal response. Joint PSRO [57] uses correlated equilibrium as the meta-solver, and Mean-Field PSRO [58] proposes newly defined mean-field no-adversarial-regret learners as the meta-solver.

#### **4. Challenges and Hot Topics**

In the previous section, we discussed several PB-DRL algorithms that have shown significant improvements in real-life game scenarios. However, the application of these algorithms also faces several challenges that need to be addressed to further advance the field of PB-DRL.

#### *4.1. Challenges*

One of the most significant challenges in PB-DRL is the need for increased diversity within the population. Promoting diversity not only helps AI agents avoid checking the same policies repeatedly, but also enables them to discover niche skills, avoid being exploited, and maintain robust performance when encountering unfamiliar types of opponents [22]. As the population grows, it becomes more challenging to maintain diversity and ensure efficient exploration of the search space. Without adequate diversity, the population may converge prematurely to suboptimal solutions, leading to stagnation of the learning process. Overfitting to policies in the policy zoo is a significant obstacle to generalization [25,59]. Although population diversity has been widely discussed in the evolutionary algorithm community at the genotype level, the phenotype level, and combinations of the two [60], those methods typically operate on a fixed set of candidate solutions, whereas PB-DRL is often used in dynamic and uncertain environments where the population size and diversity can change over time. Additionally, since policies are usually represented as neural networks, directly using the difference-based or distance-based methods that are widely used in evolutionary computation is not a suitable choice. Some heuristic algorithms have been proposed. Balduzzi et al. [41] design an opponent selection method that expands the policy game space to improve diversity. Another approach is to incorporate different levels of hierarchy within the population to maintain diversity [44]. An interesting work [42] models behavioral diversity for learning in games by using a determinantal point process as the diversity metric. Other techniques that improve the diversity of the policy pool can be found in [61–63].

The need for increased efficiency is a significant challenge in PB-DRL, as evaluating each individual within a growing population becomes computationally expensive, resulting in a reduced learning rate. PB-DRL is often applied to large-scale environments with high-dimensional state and action spaces, making the evaluation of each individual within a population even more computationally expensive. For instance, AlphaStar trained its league over a period of 44 days using 192 8-core TPUs, 12 128-core TPUs, and 1800 CPUs, potentially costing millions of dollars in rented cloud computing services [5]. One promising approach to improving efficiency in PB-DRL is to develop more sample-efficient algorithms. This can be achieved through various means, such as monotonic improvement in exploitability [55] and regret bounds [64]. Another approach to improving efficiency in PB-DRL is to use distributed computing techniques [43,65]. These techniques can enable faster evaluation of individuals within a population, as well as better parallelization of the learning process. For example, a line of recent work on distributed deep reinforcement learning [66] is often used to accelerate the training process in PB-DRL, including SEED RL [67], Gorila [68], and IMPALA [69]. In addition to the technical challenges of improving efficiency in PB-DRL, there are also practical challenges related to the cost and availability of computing resources. One possible solution to this challenge is to develop more energy-efficient algorithms that can run on low-power devices or take advantage of specialized hardware, such as GPUs or TPUs. Flajolet et al. [70] indicate that the judicious use of compilation and vectorization allows population-based training to be performed on a single machine with one accelerator, with minimal overhead compared to training a single agent.

#### *4.2. Hot Topics*

Despite these challenges, it is essential to note that this field is rapidly evolving. Currently, there are several hot topics and future directions in PB-DRL worth exploring, and researchers are actively engaged in these endeavors.

*Games*: PB-DRL has demonstrated outstanding performance in many games, including board games [47], online games [5,11], and more. As a result, game manufacturers have become interested in exploring several directions in this area.


*Zero-shot coordination*: The zero-shot coordination (ZSC) problem refers to the situation where agents must independently produce strategies for a collaborative game that are compatible with novel partners not seen during training [63]. Population-based reinforcement learning has been used for this problem, starting with FCP [20], and there is ongoing research under the keywords "zero-shot human-AI coordination". Researchers aim to develop a sufficiently robust agent capable of generalizing effectively to human policies. Many methods have been applied to this problem, such as lifetime learning [71], population diversity [62,63], and modeling human bias [36].

*Robotics*: Reinforcement learning has become increasingly prevalent in the robotics field [72–74]. The use of PB-DRL has also expanded to robots, including robotic manipulation [75], assistance with robots [52], multi-robot planning [76], and robot table tennis [77]. In a recent study, it was shown that PB-DRL could generate varied environments [78], which is advantageous for developing robust robotics solutions.

*Financial markets*: Population-based algorithms and concepts have immense potential for use in financial markets and economic forecasting [79]. Despite the widespread use of MARL in financial trading, the application of PB-DRL to financial markets appears to be underutilized in both academic and industry-related research. This is partly due to the high demands placed on simulation environments when working with PB-DRL. Once an environment that meets the requirements is created, PB-DRL will show its power.

#### **5. Conclusions**

In this paper, we have provided a comprehensive survey of representative population-based deep reinforcement learning (PB-DRL) algorithms, applications, and general frameworks. We categorize PB-DRL research into the following areas: naive self-play, fictitious self-play, population-play, evolution-based training methods, and general frameworks. We compare the main ideas of the different types of algorithms by summarizing the various types of PB-DRL algorithms and describing how they have been used in real-life applications. Furthermore, we introduce evolution-based training methods to expound on common ways to adjust hyperparameters or accelerate training. General frameworks for PB-DRL are also introduced for different game settings, providing a general training process and theoretical proofs. Finally, we discuss the challenges and opportunities of this exciting field. We aim to provide a valuable reference for researchers and engineers working on practical problems.

**Author Contributions:** Conceptualization, W.L. and P.Z.; methodology, W.L. and P.Z.; software, T.H.; validation, X.W. and S.Y.; formal analysis, X.W.; investigation, W.L. and X.W.; resources, W.L.; writing—original draft preparation, W.L.; writing—review and editing, T.H., X.W. and P.Z.; visualization, S.Y.; supervision, L.Z.; project administration, P.Z. and L.Z.; funding acquisition, L.Z. All authors have read and agreed to the published version of the manuscript.

**Funding:** The work reported in this paper was supported by the National Key R&D Program of China (Grant Numbers: 2021ZD0113502 and 2021ZD0113503), the Shanghai Municipality Science and Technology Major Project (Grant Number: 2021SHZDZX0103), the China Postdoctoral Science Foundation (Grant Numbers: BX20220071 and 2022M720769), and Research on Basic and Key Technologies of Intelligent Robots (Grant Number: KEH2310017).

**Data Availability Statement:** Not applicable.

**Acknowledgments:** Many thanks to FDU IPASS Group for taking the time to proofread this article.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Disclaimer/Publisher's Note:** The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
