#### **1. Introduction**

Combinatorial optimization is an important branch of operations research. It refers to solving problems over combinations of variables by minimizing (or maximizing) an objective function under given constraints, and is based on mathematical methods for finding optimal arrangements, groupings, orderings, or selections of discrete events. As a research hot-spot in combinatorial optimization, the Max-cut problem is one of the 21 typical non-deterministic polynomial (NP)-complete problems proposed by Richard M. Karp [1]. It refers to obtaining a maximum cut of a given undirected graph; that is, a partition of its vertices into two cut sets such that the sum of the weights of all edges crossing the two cut sets is maximized [2]. The Max-cut problem has a wide range of applications in engineering, such as Very Large Scale Integration (VLSI) circuit design, statistical physics, image processing, and communication network design [3]. As a solution of the Max-cut problem can be used to measure the robustness of a network [4] and as a standard for network classification [5], it can also be applied to social networks.

It has been discovered that many classic combinatorial optimization problems derived from engineering, economics, and other fields are NP-hard. The Max-cut problem considered in this paper is among them. Algorithms for combinatorial optimization problems can be roughly divided into two categories: the first consists of exact solution approaches, including enumeration methods [6] and branch and bound methods [7]; the second consists of heuristic algorithms, including genetic algorithms, ant colony algorithms, simulated annealing, neural networks, the Lin–Kernighan heuristic (LKH), and so on [8]. No polynomial-time algorithm is known that finds a global optimal solution of an NP-hard problem. Compared with exact approaches, heuristic algorithms can deal with large-scale problems efficiently: they have advantages in computing efficiency and can be applied to large-scale problems with huge numbers of variables. A large number of heuristic algorithms, such as evolutionary algorithms and ant colony algorithms, have been proposed to solve the Max-cut problem. Their most obvious disadvantage, however, is that they easily fall into local optima. For this reason, more and more experts have begun working on the research and innovation of novel and effective algorithms for large-scale Max-cut problems.

Deep learning is a research field which has developed very rapidly in recent years, achieving great success in many sub-fields of artificial intelligence. At its root, deep learning is a sub-problem of machine learning. Its main purpose is to automatically learn effective feature representations from a large amount of data, such that it can better solve the credit assignment problem (CAP) [9]; that is, the contribution or influence of different components in a system, or of their parameters, to the final output of the system. The emergence of deep neural networks has made it possible to tackle large-scale combinatorial optimization problems. In recent years, with the development of the combination of deep neural networks and operations research for large-scale combinatorial optimization, scholars have explored how to apply deep neural networks in these fields and have achieved certain results. The related research has mainly focused on algorithm design for combinatorial optimization problems based on pointer networks. Vinyals used the attention mechanism [10] to integrate a pointer structure into the sequence-to-sequence model, thus creating the pointer network. Bello improved the pointer network structure and used a policy gradient algorithm combined with temporal difference learning to train the pointer network with reinforcement learning to solve combinatorial optimization problems [11]. Mirhoseini removed the encoding part of a recurrent neural network (RNN) and used the embedded input to replace the hidden state of the RNN; with this modification, the computational complexity was greatly reduced and the efficiency of the model was improved [12]. In Reference [13], a purely data-driven method to obtain approximate solutions of NP-hard problems was proposed. In Reference [14], a pointer network was used to establish a flight decision prediction model. Khalil solved classical combinatorial optimization problems by Q-learning [15].
The pointer network model has also been used to solve the unweighted Max-cut problem [16]. Similarly, Reference [17] solved the unconstrained boolean quadratic programming problem (UBQP) through the pointer network.

The section arrangement of this paper is as follows. Section 2 mainly introduces the Max-cut problem and the method for generating its benchmark. Section 3 demonstrates the pointer network model, including the Long Short-Term Memory network and Encoder–Decoder. Section 4 introduces two ways to train the pointer network model to solve the Max-cut problem, namely supervised learning and reinforcement learning. Section 5 illustrates the details of the experimental procedure and the results. Section 6 provides the conclusions.

#### **2. Motivation and Data Set Structure**

#### *2.1. Unified Model of the Max-Cut Problem*

The definition of the Max-cut problem is given as follows.

An undirected graph *G* = (*V*, *E*) consists of a set of vertices *V* = {1, 2, ..., *n*} and a set of edges *E* ⊆ *V* × *V*, where *w*<sub>*i*,*j*</sub> is the weight on the edge connecting vertex *i* and vertex *j*. For any proper subset *S* of the vertex set *V*, let:

$$\delta(S) = \{ e_{i,j} \in E : i \in S, j \in V - S \}, \tag{1}$$

where *δ*(*S*) is a set of edges, one end of which belongs to *S* and the other end belongs to *V*−*S*. Then, the cut *cut*(*S*) determined by *S* is:

$$cut(S) = \sum_{e_{i,j} \in \delta(S)} w_{i,j}. \tag{2}$$

In simple terms, the Max-cut problem is to find a partition (*S*, *V*−*S*) of the vertex set *V* that maximizes the total weight of the edges crossing the cut.
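As a concrete illustration of Equations (1) and (2), the following sketch computes *δ*(*S*) and *cut*(*S*) for a small graph; the function names and the toy edge-weight dictionary are our own, not part of the formulation:

```python
# Illustrative helpers for Equations (1) and (2); names are our own.

def delta(S, edges):
    """delta(S): edges with exactly one endpoint in S (Equation (1))."""
    return [(i, j) for (i, j) in edges if (i in S) != (j in S)]

def cut_value(S, weights):
    """cut(S): sum of the weights over delta(S) (Equation (2))."""
    return sum(weights[e] for e in delta(S, weights))

# A 4-vertex graph given as an edge-weight dictionary.
w = {(1, 2): 3, (1, 4): 4, (2, 3): 5, (2, 4): 2, (3, 4): 1}
print(cut_value({1, 3}, w))  # 13: edges (1,2), (1,4), (2,3), (3,4) cross the cut
```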

#### *2.2. Benchmark Generator of the Max-Cut Problem*

When applying deep learning to train and solve the Max-cut problem, whether supervised learning or reinforcement learning, a large number of training samples are necessary. The method of data set generation introduced here is to transform the {−1,1} quadratic programming problem into the Max-cut problem.

First of all, the benchmark generator method for the boolean quadratic programming (BQP) problem, proposed by Michael X. Zhou [18], is used to generate random {−1,1} quadratic programming problems, which can be solved in polynomial time. Next, inspired by [19], we transform the results of the previous step into solutions of the Max-cut problem. The specific implementation is described below.

Michael X. Zhou transformed the quadratic programming problem shown by Equation (3) into the dual problem shown by Equation (4) through the Lagrangian dual method.

$$\min \left\{ f(\mathbf{x}) = \frac{1}{2} \mathbf{x}^T \mathbf{Q} \mathbf{x} - \mathbf{c}^T \mathbf{x} \, \middle| \, \mathbf{x} \in \{-1, 1\}^n \right\},\tag{3}$$

where *Q* = *Q*<sup>*T*</sup> ∈ ℝ<sup>*n*×*n*</sup> is a given indefinite matrix, and *c* ∈ ℝ<sup>*n*</sup> is a given non-zero vector.

The dual problem is described as follows:

$$\begin{array}{ll}\text{find} & Q, c, x, \lambda \\ \text{s.t.} & (Q + \operatorname{diag}(\lambda))\, x = c \\ & Q + \operatorname{diag}(\lambda) \succ 0 \\ & x \in \{-1, 1\}^n \end{array} \tag{4}$$

Then, according to the paper [19], the solution of the {−1,1} quadratic programming problem can be transformed into the solution of the Max-cut problem.

The integer programming for the Max-cut problem is given by:

$$\begin{aligned} \max \quad & \frac{1}{2} \sum_{i<j} w_{i,j} (1 - x_i \cdot x_j) \\ \text{s.t.} \quad & x_i \in \{-1, 1\}, \quad i = 1, \ldots, n, \end{aligned} \tag{5}$$

where *i* in *x*<sub>*i*</sub> ∈ {−1, 1} represents the vertex *i*, and −1 and 1 represent membership of the two sets. If *x*<sub>*i*</sub> · *x*<sub>*j*</sub> is equal to 1, the vertices of edge (*i*, *j*) are in the same set, and (*i*, *j*) ∈ *E* is not a cut edge; if *x*<sub>*i*</sub> · *x*<sub>*j*</sub> is equal to −1, the vertices of edge (*i*, *j*) are in different sets, and (*i*, *j*) ∈ *E* is a cut edge. If (*i*, *j*) ∈ *E* is a cut edge, (1 − *x*<sub>*i*</sub> · *x*<sub>*j*</sub>)/2 is equal to 1; if (*i*, *j*) ∈ *E* is not a cut edge, (1 − *x*<sub>*i*</sub> · *x*<sub>*j*</sub>)/2 is equal to 0. Thus, the objective function represents the sum of the weights of the cut edges of the Max-cut. Define *S* = {*i* : *x*<sub>*i*</sub> = 1} and *S̄* = {*i* : *x*<sub>*i*</sub> = −1}; then the weight of the cut is *w*(*S*, *S̄*) = ∑<sub>*i*<*j*</sub> *w*<sub>*i*,*j*</sub>(1 − *x*<sub>*i*</sub> · *x*<sub>*j*</sub>)/2.
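The {−1, 1} integer program can be checked by brute force on a small instance (the edge-weight dictionary below is our own toy example; enumeration over all 2<sup>*n*</sup> assignments is only feasible for tiny *n*):

```python
from itertools import product

# Brute-force the {-1, 1} integer program on a tiny toy instance (our own
# example); (1 - x_i * x_j) / 2 is 1 exactly on cut edges, 0 otherwise.
w = {(1, 2): 3, (1, 4): 4, (2, 3): 5, (2, 4): 2, (3, 4): 1}
n = 4

def weight_of_cut(x):
    """Objective value: sum of w_ij * (1 - x_i * x_j) / 2 over all edges."""
    return sum(w_ij * (1 - x[i] * x[j]) / 2 for (i, j), w_ij in w.items())

# Enumerate every assignment x in {-1, 1}^n and keep the best cut.
best = max(
    (dict(zip(range(1, n + 1), assign)) for assign in product((-1, 1), repeat=n)),
    key=weight_of_cut,
)
print(weight_of_cut(best))  # 13.0: the partition {1, 3} vs. {2, 4}
```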

The pseudocode for generating the benchmark of the Max-cut problem is shown in Algorithm 1, where the parameter *base* is used to control the value range of the elements in matrix *Q*.

**Algorithm 1** A benchmark generator for the Max-cut problem

**Input:** Dimension: *n*; *base* = 10;

**Output:** Matrix: *Q*; Vector: *x*

1: Randomly generate an *n*-dimensional matrix that conforms to the standard normal distribution to obtain *Q*;
2: Symmetrize *Q* and scale its entries by *base*; randomly let some of the values of *Q* (except the main diagonal) be zero, for 1 ≤ *i* ≤ *j* ≤ *n*;
3: Solve the dual problem (4) to obtain the optimal solution *x* of the {−1,1} quadratic programming problem.
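A minimal sketch of the matrix-generation step of Algorithm 1, under our own assumptions about the unstated details (the sparsification probability `zero_prob` and the exact scaling and rounding rules are illustrative; solving the dual problem (4) for *x* is not shown):

```python
import numpy as np

def generate_q(n, base=10, zero_prob=0.3, seed=0):
    """Sketch of Algorithm 1's matrix step. Assumptions (ours): symmetrize a
    standard-normal matrix, scale by `base`, round, and zero the diagonal;
    `zero_prob` and the sparsification rule are illustrative."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n, n))
    q = np.round(base * (a + a.T) / 2.0)   # symmetric; range controlled by base
    mask = np.triu(rng.random((n, n)) < zero_prob, 1)
    q[mask | mask.T] = 0.0                 # drop entries symmetrically
    np.fill_diagonal(q, 0.0)               # no self-loops
    return q

q = generate_q(6)
```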


This method for obtaining Max-cut benchmark data sets effectively overcomes the difficulty of training a network model for the Max-cut problem when a large number of training samples is lacking. However, the method has a common defect: in a training set obtained by using the dual problem to deduce the solution of the original problem, the data samples obey certain rules. This may make it difficult for deep learning to learn the general rules of the Max-cut problem during training.

Therefore, in addition to the above method, we consider using the benchmark generator in the Biq Mac Library to solve the Max-cut problem. The Biq Mac Library offers a collection of Max-cut instances, and Biq Mac is a branch and bound code based on semi-definite programming (SDP). The dimension of the problems (i.e., the number of variables or the number of vertices in the graph) ranges from 60 to 100. These instances are mainly used to test the pointer network model for the Max-cut problem.

#### **3. Models**

#### *3.1. Long Short-Term Memory*

It is difficult for traditional neural networks to use information about previous events to classify subsequent events. However, an RNN can continuously operate on information in a cyclic manner to ensure that the information persists, thereby effectively processing time-series data of any length. Given an input sequence *x*<sub>1:*T*</sub> = (*x*<sub>1</sub>, *x*<sub>2</sub>, ..., *x*<sub>*t*</sub>, ..., *x*<sub>*T*</sub>), the RNN updates the activity value *h*<sub>*t*</sub> of the hidden layer with feedback and calculates the output sequence *y*<sub>1:*T*</sub> = (*y*<sub>1</sub>, *y*<sub>2</sub>, ..., *y*<sub>*t*</sub>, ..., *y*<sub>*T*</sub>) using the following equations:

$$h_t = \operatorname{sigmoid}(M^{hx} x_t + M^{hh} h_{t-1}), \tag{6}$$

$$y_t = M^{yh} h_t. \tag{7}$$

As long as the alignment between input and output is known in advance, an RNN can easily map sequences to sequences. However, the RNN cannot solve the problem when the input and output sequences have different lengths or have complex, non-monotonic relationships [20]. In addition, when the input sequence is long, the problem of exploding and vanishing gradients will occur [21], which is also known as the long-range dependency problem. In order to solve these problems, many improvements have been made to RNNs; the most effective way, thus far, is to use a gating mechanism.

A long short-term memory (LSTM) network [22] is a variant of the RNN and an outstanding embodiment of the gating mechanism. Figure 1 shows the structure of the loop unit of an LSTM. By applying the LSTM loop unit with its gating mechanism, the entire network can establish long-term timing dependencies to better control the path of information transmission. The equations of the LSTM model can be briefly described as:

$$
\begin{bmatrix} \tilde{c}_t \\ o_t \\ i_t \\ f_t \end{bmatrix} = \begin{bmatrix} \tanh \\ \sigma \\ \sigma \\ \sigma \end{bmatrix} \left( M \begin{bmatrix} x_t \\ h_{t-1} \end{bmatrix} + b \right), \tag{8}
$$

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \tag{9}$$

$$h\_t = o\_t \odot \tanh(c\_t),\tag{10}$$

where *x*<sub>*t*</sub> ∈ ℝ<sup>*e*</sup> is the input at the current time; *M* ∈ ℝ<sup>4*d*×(*d*+*e*)</sup> and *b* ∈ ℝ<sup>4*d*</sup> are the network parameters; *σ*(·) is the logistic function, with output interval (0, 1); *h*<sub>*t*−1</sub> is the external state at the previous time; ⊙ is the element-wise product of vectors; *c*<sub>*t*−1</sub> is the memory unit at the previous moment; and *c̃*<sub>*t*</sub> is the candidate state obtained by the non-linear function. At each time *t*, the internal state *c*<sub>*t*</sub> of the LSTM records historical information up to the current time. The three gates used to control the path of information transmission are *f*<sub>*t*</sub>, *i*<sub>*t*</sub>, and *o*<sub>*t*</sub>. The functions of the three gates are: the forget gate *f*<sub>*t*</sub> controls how much information of the previous internal state *c*<sub>*t*−1</sub> needs to be forgotten; the input gate *i*<sub>*t*</sub> controls how much information of the candidate state *c̃*<sub>*t*</sub> needs to be saved; and the output gate *o*<sub>*t*</sub> controls how much information of the internal state *c*<sub>*t*</sub> needs to be output to the external state *h*<sub>*t*</sub>.


**Figure 1.** Long short-term memory (LSTM) loop unit structure.
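Equations (8)–(10) can be sketched directly in numpy (the row-block ordering of *M* follows Equation (8); all shapes, names, and random values are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, M, b):
    """One LSTM step following Equations (8)-(10); the row blocks of M are
    ordered [candidate, output, input, forget], as in Equation (8)."""
    d = h_prev.shape[0]
    z = M @ np.concatenate([x_t, h_prev]) + b
    c_tilde = np.tanh(z[:d])       # candidate state
    o_t = sigmoid(z[d:2 * d])      # output gate
    i_t = sigmoid(z[2 * d:3 * d])  # input gate
    f_t = sigmoid(z[3 * d:])       # forget gate
    c_t = f_t * c_prev + i_t * c_tilde  # Equation (9)
    h_t = o_t * np.tanh(c_t)            # Equation (10)
    return h_t, c_t

# Shapes as in the text: x_t in R^e, M in R^{4d x (d+e)}, b in R^{4d}.
e, d = 3, 2
rng = np.random.default_rng(1)
M, b = rng.standard_normal((4 * d, d + e)), rng.standard_normal(4 * d)
h, c = lstm_step(rng.standard_normal(e), np.zeros(d), np.zeros(d), M, b)
```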

In our algorithm, the purpose of the LSTM is to estimate the conditional probability *p*(*y*<sub>1</sub>, ..., *y*<sub>*T*′</sub> | *x*<sub>1</sub>, ..., *x*<sub>*T*</sub>), where (*x*<sub>1</sub>, ..., *x*<sub>*T*</sub>) is the input sequence, *y*<sub>1</sub>, ..., *y*<sub>*T*′</sub> is the corresponding output sequence, and the length *T*′ may differ from *T*. The LSTM first obtains a fixed-dimensional representation *X* of the input sequence (*x*<sub>1</sub>, ..., *x*<sub>*T*</sub>) (given by the last hidden state of the LSTM), and then calculates the probability of *y*<sub>1</sub>, ..., *y*<sub>*T*′</sub> with an LSTM whose initial hidden state is set to the representation *X* of *x*<sub>1</sub>, ..., *x*<sub>*T*</sub>:

$$p(y_1, \ldots, y_{T'} \mid x_1, \ldots, x_T) = \prod_{t=1}^{T'} p(y_t \mid X, y_1, \ldots, y_{t-1}), \tag{11}$$

where each distribution *p*(*y*<sub>*t*</sub> | *X*, *y*<sub>1</sub>, ..., *y*<sub>*t*−1</sub>) is represented by a softmax over all variables in the input Max-cut problem matrix.

#### *3.2. Encoder–Decoder Model*

The encoder–decoder model is also called the asynchronous sequence-to-sequence model; that is, the input sequence and the output sequence neither need to have a strict correspondence relationship, nor do they need to maintain the same length. Compared with traditional structures, it greatly expands the application scope of the model. It can directly model sequence problems in a pure data-driven manner and can train the model using an end-to-end method. It can be seen that it is very suitable for solving combinatorial optimization problems.

In the encoder–decoder model (shown in Figure 2), the input is a sequence *x*<sub>1:*T*</sub> = (*x*<sub>1</sub>, ..., *x*<sub>*T*</sub>) of length *T*, and the output is a sequence *y*<sub>1:*T*′</sub> = (*y*<sub>1</sub>, ..., *y*<sub>*T*′</sub>) of length *T*′. The implementation is realized by first encoding and then decoding. Firstly, a sample *x* is input into an RNN (the encoder) at different times to obtain its encoding *h*<sub>*T*</sub>. Secondly, another RNN (the decoder) is used to obtain the output sequence *ŷ*<sub>1:*T*′</sub>. In order to establish the dependence between the output sequences, a non-linear autoregressive model is usually used in the decoder:

$$h_t = f_1(h_{t-1}, x_t), \forall t \in [1, T], \tag{12}$$

$$h_{T+t} = f_2(h_{T+t-1}, \hat{y}_{t-1}), \forall t \in [1, T'], \tag{13}$$

$$\hat{y}_t = g(h_{T+t}), \forall t \in [1, T'], \tag{14}$$

where *f*<sub>1</sub>(·) and *f*<sub>2</sub>(·) are RNNs used as the encoder and decoder, respectively; *g*(·) is a classifier; and *ŷ*<sub>*t*</sub> are the vector representations used to predict the output.

**Figure 2.** Encoder–decoder model.
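The encode-then-decode loop of Equations (12)–(14) can be sketched with *f*<sub>1</sub>, *f*<sub>2</sub>, and *g* supplied as plain callables (the random linear toy functions in the usage are our own stand-ins for trained RNNs and a trained classifier):

```python
import numpy as np

def encode_decode(xs, T_out, f1, f2, g, h0):
    """Schematic of Equations (12)-(14): encode x_1..x_T into h_T, then
    decode autoregressively, feeding each prediction back in."""
    h = h0
    for x_t in xs:            # Equation (12)
        h = f1(h, x_t)
    ys, y_prev = [], np.zeros_like(g(h))  # zero start symbol (our choice)
    for _ in range(T_out):
        h = f2(h, y_prev)     # Equation (13)
        y_prev = g(h)         # Equation (14)
        ys.append(y_prev)
    return ys

# Toy f1, f2, g: random linear maps standing in for trained components.
rng = np.random.default_rng(0)
W1, W2, Wg = (rng.standard_normal(s) for s in ((4, 4), (4, 3), (3, 4)))
f1 = lambda h, x: np.tanh(W1 @ h + x)
f2 = lambda h, y: np.tanh(W1 @ h + W2 @ y)
g = lambda h: np.exp(Wg @ h) / np.exp(Wg @ h).sum()
ys = encode_decode([rng.standard_normal(4) for _ in range(5)], 3, f1, f2, g, np.zeros(4))
```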

#### *3.3. Pointer Network*

The amount of information that can be stored in a neural network is called the network capacity. Generally speaking, if more information needs to be stored, then more neurons are needed or the network must be more complicated, which will cause the number of necessary parameters of the neural network to increase exponentially. Although general RNNs have strong capabilities, when dealing with complex tasks, such as processing large amounts of input information or complex computing processes, the computing power of computers is still a bottleneck that limits the development of neural networks.

In order to reduce the computational complexity, we use the mechanisms of the human brain to solve the information overload problem. In such a way, we add an attention mechanism to the RNN. When the computing power is limited, it is used as a resource allocation scheme to allocate computing resources to more important tasks.

A pointer network is a typical application combining an attention mechanism and a neural network. We use the attention distribution as a soft pointer to indicate the location of relevant information. In order to save computing resources, it is not necessary to input all the information into the neural network; only the information related to the task needs to be selected from the input sequence *X*. A pointer network [9] is also an asynchronous sequence-to-sequence model. The input is a sequence *X* = (*x*<sub>1</sub>, ..., *x*<sub>*T*</sub>) of length *T*, and the output is a sequence *y*<sub>1:*T*′</sub> = (*y*<sub>1</sub>, *y*<sub>2</sub>, ..., *y*<sub>*T*′</sub>). Unlike general sequence-to-sequence tasks, the output sequence here consists of indices of the input sequence. For example, when the input is a group of out-of-order numbers, the output is the index sequence that sorts the input numbers by size (e.g., if the input is 20, 5, 10, then the output is 1, 3, 2).

The conditional probability *p*(*y*<sub>1:*T*′</sub> | *x*<sub>1:*T*</sub>) can be written as:

$$\begin{split} p(y_{1:T'} \mid x_{1:T}) &= \prod_{i=1}^{T'} p(y_i \mid y_{1:i-1}, x_{1:T}) \\ &\approx \prod_{i=1}^{T'} p(y_i \mid x_{y_1}, \ldots, x_{y_{i-1}}, x_{1:T}), \end{split} \tag{15}$$

where the conditional probability *p*(*y*<sub>*i*</sub> | *x*<sub>*y*<sub>1</sub></sub>, ..., *x*<sub>*y*<sub>*i*−1</sub></sub>, *x*<sub>1:*T*</sub>) can be calculated using the attention distribution. Suppose that an RNN is used to encode *x*<sub>*y*<sub>1</sub></sub>, ..., *x*<sub>*y*<sub>*i*−1</sub></sub>, *x*<sub>1:*T*</sub> to obtain the vector *h*<sub>*i*</sub>; then

$$p(y\_i | y\_{1:i-1}, x\_{1:T}) = \text{softmax}(s\_{i,j}),\tag{16}$$

where *s*<sub>*i*,*j*</sub> is the unnormalized attention distribution of each input vector at the *i*th step of the decoding process,

$$s_{i,j} = v^T \tanh(U_1 x_j + U_2 h_i), \forall j \in [1, T], \tag{17}$$

where *v*, *U*<sub>1</sub>, and *U*<sub>2</sub> are learnable parameters.
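Equations (16) and (17) amount to a softmax over per-position scores; a minimal numpy sketch (all dimensions, names, and random values are illustrative):

```python
import numpy as np

def pointer_distribution(xs, h_i, U1, U2, v):
    """Scores s_{i,j} of Equation (17) for every input position j, then the
    softmax of Equation (16) as the pointer distribution."""
    s = np.array([v @ np.tanh(U1 @ x_j + U2 @ h_i) for x_j in xs])
    e = np.exp(s - s.max())  # numerically stable softmax
    return s, e / e.sum()

# Illustrative dimensions: T inputs in R^e, decoder state in R^d.
rng = np.random.default_rng(0)
T, e_dim, d, k = 4, 3, 2, 5
xs = [rng.standard_normal(e_dim) for _ in range(T)]
h_i = rng.standard_normal(d)
U1, U2 = rng.standard_normal((k, e_dim)), rng.standard_normal((k, d))
v = rng.standard_normal(k)
s, p = pointer_distribution(xs, h_i, U1, U2, v)
```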

Figure 3 shows an example of a pointer network.

**Figure 3.** The architecture of pointer network (encoder in green, decoder in purple).

#### **4. Learning Mechanism**

Machine learning methods can be classified according to different criteria. Generally speaking, according to the information provided by the training samples and different feedback mechanisms, we classify machine learning algorithms into three categories: supervised learning, unsupervised learning, and reinforcement learning. Our algorithm uses supervised learning (SL) and reinforcement learning (RL) to train the pointer network model to obtain the solution of the Max-cut problem, which will be described in detail below.

#### *4.1. Supervised Learning*

#### 4.1.1. Input and Output Design

The feature of the Max-cut problem is that each variable is either 0 or 1, such that the problem is equivalent to selecting, from all variables, the set of variables with the value 1 that maximizes the objective function. This is a typical selection problem in combinatorial optimization. The goal of supervised learning is to learn the relationship between the input *x* and the output *y* by modeling *y* = *f*(*x*; *θ*) or *p*(*y*|*x*; *θ*). For the Max-cut problem, the pointer network uses an *n* × *n* symmetric matrix *Q* to represent the input sequence of the *n* nodes, where *q*<sub>*ij*</sub> is an element of the symmetric matrix which represents the weight of the connection between vertex *i* and vertex *j* (*q*<sub>*ij*</sub> ≥ 0, where *q*<sub>*ij*</sub> = 0 means there is no connection between vertex *i* and vertex *j*). The output sequence of the pointer network is represented by *X* = *x*<sub>1</sub>, *x*<sub>2</sub>, ..., *x*<sub>*n*</sub>, which contains two values, namely 0 and 1. Vertices with the value 0 and vertices with the value 1 are divided into two different sets. The sum of the weights of all edges across the two cut sets is the value of the Max-cut solution.

The following example is used to explain the input and output design of the pointer network to solve the Max-cut problem.

#### **Example 1.**

$$\begin{array}{l} f(x) = 3x_1 x_2 + 4x_1 x_4 + 5x_2 x_3 + 2x_2 x_4 + x_3 x_4 \\ x_i \in \{0, 1\} \quad (i = 1, \ldots, 4) \end{array} \tag{18}$$

The symmetric matrix *Q* of the above problem can be expressed as:

$$Q = \begin{pmatrix} 0 & 3 & 0 & 4 \\ 3 & 0 & 5 & 2 \\ 0 & 5 & 0 & 1 \\ 4 & 2 & 1 & 0 \end{pmatrix},$$

and the characteristics of the variables *x*<sub>1</sub>, *x*<sub>2</sub>, *x*<sub>3</sub>, and *x*<sub>4</sub> are represented by the vectors *q*<sub>1</sub> = (0, 3, 0, 4)<sup>*T*</sup>, *q*<sub>2</sub> = (3, 0, 5, 2)<sup>*T*</sup>, *q*<sub>3</sub> = (0, 5, 0, 1)<sup>*T*</sup>, and *q*<sub>4</sub> = (4, 2, 1, 0)<sup>*T*</sup>, respectively.
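The matrix of Example 1 and its column vectors can be written out directly; the cut computation at the end is our own check of the partition that selects *x*<sub>1</sub> = *x*<sub>3</sub> = 1:

```python
import numpy as np

# The symmetric matrix Q of Example 1; column j is the feature vector q_j.
Q = np.array([
    [0, 3, 0, 4],
    [3, 0, 5, 2],
    [0, 5, 0, 1],
    [4, 2, 1, 0],
], dtype=float)

q1, q2, q3, q4 = Q.T  # q1 = (0, 3, 0, 4)^T, etc.

# Weight of the cut {1, 3} vs. {2, 4}: sum the edges crossing the cut.
x = np.array([1, 0, 1, 0])
cut = sum(Q[i, j] for i in range(4) for j in range(i + 1, 4) if x[i] != x[j])
print(cut)  # 3 + 4 + 5 + 1 = 13.0
```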

For the Max-cut problem, the optimal solution of the above example is *x* = (1, 0, 1, 0)<sup>*T*</sup>. The sequence (*q*<sub>1</sub>, *q*<sub>2</sub>, *q*<sub>3</sub>, *q*<sub>4</sub>) is the input of the pointer network, and the known optimal solution is used to train the network model and guide the model to select *q*<sub>1</sub> and *q*<sub>3</sub>. The input vectors selected by the decoder correspond to variables with the value 1, while the unselected vectors correspond to variables with the value 0.

For the output part of the pointer network model, given the *n* × *n* input matrix, we design a matrix of dimension (*n* + 1) × (*n* + 1) to represent the network output. Taking Example 1 as an example, the output result is a label that can be described by the matrix *Olabel*:

$$O\_{label} = \begin{pmatrix} 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \end{pmatrix}.$$

The relationship between *Olabel* and the variable *x* is:

$$x_j = \begin{cases} 1, & \text{if } o_{ij} = 1,\ j \neq 0; \\ \text{EOS}, & \text{if } o_{ij} = 1,\ j = 0; \\ 0, & \text{otherwise}, \end{cases} \tag{19}$$

We use EOS = (1, 0, ···, 0)<sup>*T*</sup> to indicate the end of the pointer network solution process. After the model training is completed, the softmax probability distribution of the output matrix is obtained; the corresponding result may be as described by the matrix *Opredict*. In the solution phase, we select the position with the highest probability in each output probability distribution and set it to 1, and the rest of the positions to 0. According to the result of *Opredict*, the pointer network selects the variables *x*<sub>1</sub> and *x*<sub>3</sub> with the value 1, and the remaining variables have the value 0, which is consistent with the result selected by *Olabel*:

$$O\_{predict} = \begin{pmatrix} 0.03 & 0.8 & 0.02 & 0.1 & 0.05 \\ 0.1 & 0 & 0.2 & 0.7 & 0 \\ 0.9 & 0.03 & 0.03 & 0.01 & 0 \\ 1 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \end{pmatrix}.$$
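The decoding rule described above (row-wise argmax, stopping at the EOS column) can be sketched as:

```python
import numpy as np

# Per row, take the argmax of the softmax distribution; column 0 plays
# the role of EOS and stops the decoding process.
O_predict = np.array([
    [0.03, 0.8, 0.02, 0.1, 0.05],
    [0.1, 0.0, 0.2, 0.7, 0.0],
    [0.9, 0.03, 0.03, 0.01, 0.0],
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0, 0.0],
])

selected = []
for row in O_predict:
    j = int(np.argmax(row))
    if j == 0:          # EOS column: end of the solution process
        break
    selected.append(j)  # variable x_j takes the value 1

print(selected)  # [1, 3]: x1 = x3 = 1, consistent with O_label
```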

#### 4.1.2. Algorithm Design

When training deep neural networks, for *N* given training samples {(**x**<sup>(*n*)</sup>, **y**<sup>(*n*)</sup>)}<sub>*n*=1</sub><sup>*N*</sup>, the softmax regression in supervised learning uses the cross entropy as a loss function and uses gradient descent to optimize the parameter matrix *W*. The goal of neural network training is to learn the parameters which minimize the value of the cross-entropy loss function. In practical applications, the mini-batch stochastic gradient descent (SGD) method has the advantages of fast convergence and small computational overhead, so it has gradually become the main optimization algorithm used in large-scale machine learning [23]. Therefore, during the training process, we use mini-batch SGD: at each iteration, we randomly select a small number of training samples to calculate the gradient and update the parameters. Assuming that the number of samples per mini-batch is *K*, the training process of softmax regression is: initialize *W*<sub>0</sub> ← 0, and then iteratively update by the following equation:

$$W_{t+1} \leftarrow W_t + \alpha \left( \frac{1}{K} \sum_{n=1}^{K} \mathbf{x}^{(n)} \left( \mathbf{y}^{(n)} - \hat{\mathbf{y}}_{W_t}^{(n)} \right)^T \right), \tag{20}$$

where *α* is the learning rate and *ŷ*<sub>*W<sub>t</sub>*</sub><sup>(*n*)</sup> is the output of the softmax regression model when the parameter is *W*<sub>*t*</sub>.

The training process of mini-batch SGD is shown in Algorithm 2.

#### **Algorithm 2** Mini-batch SGD of pointer network

**Input:** training set: *D* = {(*x*<sup>(*n*)</sup>, *y*<sup>(*n*)</sup>)}<sub>*n*=1</sub><sup>*N*</sup>; mini-batch size: *K*; number of training steps: *L*; learning rate: *α*;

**Output:** optimal: *W*

1: initialize *W* ← 0;
2: **repeat**
3: randomly shuffle the samples in the training set *D*;
4: **for** *t* = 1, ..., *L* **do**
5: select samples (*x*<sup>(*n*)</sup>, *y*<sup>(*n*)</sup>) from the training set *D*;
6: update the parameter *W* by Equation (20);
7: **end for**
8: **until** the training error no longer decreases
9: **return** *W*
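One update of Equation (20) can be sketched in numpy (the column-per-sample convention, shapes, and random data are our own):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=0))
    return e / e.sum(axis=0)

def sgd_step(W, X_batch, Y_batch, alpha):
    """One mini-batch update of Equation (20) for softmax regression:
    W <- W + alpha * (1/K) * sum_n x^(n) (y^(n) - yhat^(n))^T."""
    Y_hat = softmax(W.T @ X_batch)  # predictions for the K samples
    return W + alpha * (X_batch @ (Y_batch - Y_hat).T) / X_batch.shape[1]

# Toy run: 3 features, 2 classes, mini-batch size K = 4.
rng = np.random.default_rng(0)
W = np.zeros((3, 2))                          # W_0 <- 0, as in the text
X = rng.standard_normal((3, 4))               # K sample vectors x^(n)
Y = np.eye(2)[:, rng.integers(0, 2, size=4)]  # one-hot labels y^(n)
W = sgd_step(W, X, Y, alpha=0.1)
```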


#### *4.2. Reinforcement Learning*

Reinforcement learning is a very attractive method in machine learning. It can be described as an agent continuously learning from interaction with the environment to achieve a specific goal (such as obtaining the maximum reward value). The difference between reinforcement learning and supervised learning is that reinforcement learning does not need to give the "correct" strategy as supervised information, it only needs to give the return of the strategy and then adjust the strategy to achieve the maximum expected return. Reinforcement learning is closer to the nature of biological learning and can cope with a variety of complex scenarios, thus coming closer to the goal of general artificial intelligence systems.

The basic elements in reinforcement learning include:

- the *state* *s*, a description of the environment;
- the *action* *a*, a description of the agent's behavior;
- the *policy* *π*(*a*|*s*), by which the agent decides its action according to the current state;
- the *reward* *r*, a scalar signal given by the environment after the agent takes an action; and
- the *value function*, the expected return used to evaluate states or state–action pairs.
For simplicity, we consider the interactions between *agent* and *environment* as a discrete time-series in this paper. Figure 4 shows the interaction between an *agent* and an *environment*.

**Figure 4.** Agent–environment interaction.

#### 4.2.1. Input and Output Design

The pointer network input under reinforcement learning is similar to that under supervised learning. The only difference is that, when applying reinforcement learning, a special symbol *Split* needs to be added, as reinforcement learning only focuses on those variables selected before the symbol *Split*. *Split* is a separator that divides the variables into two types. We use the following rules: when inputting into the pointer network, all variables before *Split* are set to 1, and all variables after *Split* are set to 0. We use the zero vector to represent the *Split*. Therefore, in order to change the *n*-dimensional matrix *Q* into *n* + 1 dimensions, we add a row and a column of zeros to the last row and the last column of matrix *Q*. Under this rule, taking Example 1 as an example, the matrix *Q* is converted into the matrix *P*, and the input sequence of the pointer network is (*p*<sub>1</sub>, *p*<sub>2</sub>, *p*<sub>3</sub>, *p*<sub>4</sub>, *Split*):

$$P = \begin{pmatrix} 0 & 3 & 0 & 4 & 0 \\ 3 & 0 & 5 & 2 & 0 \\ 0 & 5 & 0 & 1 & 0 \\ 4 & 2 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{pmatrix}.$$

Similar to supervised learning, at the output of the pointer network, a symbol EOS is added to divide the set of output vertices. As in Example 1, the output of the pointer network is (1, 3, EOS, 2, 4), which means that the four vertices are divided into two sets, which are (1, 3) and (2, 4). The numbers in front of EOS indicate that the value at these vertex positions is 1, and the numbers after EOS indicate that the value at these positions is 0. Thus, the max-cut value can be calculated according to the divided sets, and it is this value that is used as the reward in reinforcement learning.
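Decoding such an output sequence into a reward can be sketched as follows (using the matrix of Example 1; the variable names are ours):

```python
import numpy as np

# Vertices before EOS take the value 1, vertices after it take 0; the cut
# value of the resulting partition serves as the reinforcement-learning
# reward.
Q = np.array([
    [0, 3, 0, 4],
    [3, 0, 5, 2],
    [0, 5, 0, 1],
    [4, 2, 1, 0],
], dtype=float)

output = [1, 3, "EOS", 2, 4]
split = output.index("EOS")
ones = set(output[:split])  # vertices assigned the value 1

reward = sum(
    Q[i - 1, j - 1]
    for i in range(1, 5)
    for j in range(i + 1, 5)
    if (i in ones) != (j in ones)  # edge crosses the cut
)
print(reward)  # 13.0
```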

#### 4.2.2. Actor–Critic Algorithm

The actor–critic algorithm is a reinforcement learning method which combines a policy gradient and temporal difference learning. We combine the input–output structure characteristics of the Max-cut problem with the actor–critic algorithm in reinforcement learning to train the pointer network model. The actor–critic algorithm used for solving such combinatorial optimization problems uses the same pointer network encoder for both the actor network and the critic network. First, the actor network encodes the input sequence. Next, the decoder part selects the variable with value 1, according to the probability. The critic network encodes the input sequence, then predicts the optimal value of the Max-cut problem using a value function.

In the actor–critic algorithm, *φ*(*s*) is the input to the actor network, which corresponds to the given symmetric matrix *Q* in the Max-cut problem; that is, *Q* is used as the input sequence of the actor network. The actor refers to the policy function *πθ* (*s*, *a*), which can learn a strategy to obtain the highest possible reward. For the Max-cut problem, *πθ* (*s*, *a*) represents the strategy scheme in which variables are selected as 1. The critic refers to the value function *Vφ*(*s*), which estimates the value function of the current strategy. With the help of the value function, the actor–critic algorithm can update the parameters in a single step, without having to wait until the end of the round to update. In the actor–critic algorithm, the policy function *πθ* (*s*, *a*) and the value function *Vφ*(*s*) are both functions that need to be learned simultaneously during the training process.

Assuming the return *G* (*τt*:*T*) from time *t*, we use Equation (21) to approximate it:

$$
\hat{G}\left(\tau\_{t:T}\right) = r\_{t+1} + \gamma V\_{\Phi}\left(s\_{t+1}\right),
\tag{21}
$$

where *s*<sub>*t*+1</sub> is the state at time *t* + 1 and *r*<sub>*t*+1</sub> is the instant reward.

In each step of the update, the strategy function *π*<sub>*θ*</sub>(*s*, *a*) and the value function *V*<sub>*φ*</sub>(*s*) are learned. On the one hand, the parameter *φ* is updated such that the value function *V*<sub>*φ*</sub>(*s*<sub>*t*</sub>) is close to the estimated real return *Ĝ*(*τ*<sub>*t*:*T*</sub>):

$$\min\_{\phi} \left( \hat{G}\left(\tau\_{t:T}\right) - V\_{\phi}\left(s\_{t}\right) \right)^{2}. \tag{22}$$

On the other hand, the value function *V*<sub>*φ*</sub>(*s*<sub>*t*</sub>) is used as a baseline function to update the parameter *θ*, in order to reduce the variance of the policy gradient:

$$
\theta \leftarrow \theta + \alpha \gamma^t \left( \hat{G}\left(\tau_{t:T}\right) - V_{\phi}\left(s_t\right) \right) \frac{\partial}{\partial \theta} \log \pi_{\theta}\left(a_t|s_t\right). \tag{23}
$$

In each update step, the actor performs an action *a* according to the current environment state *s* and the strategy *π*<sub>*θ*</sub>(*a*|*s*); the environment state becomes *s*′, and the actor obtains an instant reward *r*. The critic (the value function *V*<sub>*φ*</sub>(*s*)) adjusts its own scoring standard according to the real reward given by the environment and the previous score (*r* + *γV*<sub>*φ*</sub>(*s*′)), such that its score is closer to the real return of the environment. The actor then adjusts its strategy *π*<sub>*θ*</sub> according to the critic's score and strives to do better next time. At the beginning of training, the actor performs randomly and the critic gives random scores. Through continuous learning, the critic's scores become more and more accurate, and the actor's actions become better and better.
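A single update following Equations (21)–(23) can be sketched with a tabular value function and a softmax policy (both simplifications are ours; the paper uses neural networks for the actor and the critic):

```python
import numpy as np

def actor_critic_update(V, logits, s, a, r, s_next, gamma, alpha, beta, t):
    """One actor-critic step per Equations (21)-(23), with a tabular V and a
    softmax-over-logits policy (our simplifications). Returns updated copies."""
    g_hat = r + gamma * V[s_next]          # TD target, Equation (21)
    td_error = g_hat - V[s]
    V = V.copy()
    V[s] += beta * td_error                # gradient step on Equation (22)
    # Gradient of log pi(a|s) for a softmax policy: e_a - pi(.|s).
    pi = np.exp(logits[s] - logits[s].max())
    pi /= pi.sum()
    grad_log = -pi
    grad_log[a] += 1.0
    logits = logits.copy()
    logits[s] += alpha * (gamma ** t) * td_error * grad_log  # Equation (23)
    return V, logits

# Toy run: 2 states, 2 actions, one transition with reward 1.
V = np.zeros(2)
logits = np.zeros((2, 2))
V, logits = actor_critic_update(V, logits, s=0, a=1, r=1.0, s_next=1,
                                gamma=0.9, alpha=0.1, beta=0.5, t=0)
```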

Algorithm 3 shows the training process of the actor–critic algorithm.

#### **Algorithm 3** Actor–critic algorithm

**Input:** state space: *S*; action space: *A*; differentiable strategy function: *π*<sub>*θ*</sub>(*a*|*s*); differentiable state value function: *V*<sub>*φ*</sub>(*s*); discount rate: *γ*; learning rate: *α* > 0, *β* > 0;