3.4.2. A3C Method

In DRL, there are some key definitions [30]. The value function *V*(*s*) denotes the expected return for the agent starting from a given state *s*. The policy function Π(*s*, *a*) denotes the probability that the agent takes action *a* in a given state *s*.

Compared with value-based DRL methods such as the Deep Q Network (DQN) and its variants, A3C, as a representative policy-based method, can handle continuous state and action spaces and is more time-efficient. A3C is an on-policy, online algorithm that uses samples generated by the current policy to efficiently update the network parameters and increase robustness [12].

As shown in Figure 4, A3C executes multiple agents asynchronously and in parallel on multiple instances of the environment. Every agent has a local network with the same architecture as the global network, which is used to approximate the policy and value functions. After an agent collects a certain number of samples and computes the gradient of its local network, it uploads this gradient to the global network. Once the global network has updated its parameters with the uploaded gradient, the agent downloads the global parameters to replace its current local network. Note that agents cannot upload their gradients simultaneously.
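To make this upload/download cycle concrete, the following is a minimal Python/PyTorch sketch of the parameter-sharing pattern described above. The network architecture, the surrogate loss, and the `worker` function are placeholders introduced only for illustration, not details of the method in [12]; the lock simply reflects that gradient uploads are serialized.

```python
import threading
import torch
import torch.nn as nn

# Placeholder network standing in for the shared actor-critic architecture.
class Net(nn.Module):
    def __init__(self, obs_dim=4, act_dim=2):
        super().__init__()
        self.pi = nn.Linear(obs_dim, act_dim)   # policy head
        self.v = nn.Linear(obs_dim, 1)          # value head

    def forward(self, s):
        return self.pi(s), self.v(s)

global_net = Net()
optimizer = torch.optim.Adam(global_net.parameters(), lr=1e-3)
lock = threading.Lock()  # gradient uploads are applied one worker at a time

def worker(worker_id, steps=10):
    local_net = Net()
    local_net.load_state_dict(global_net.state_dict())      # download global parameters
    for _ in range(steps):
        # Placeholder rollout: random states and a surrogate loss.
        s = torch.randn(5, 4)
        logits, value = local_net(s)
        loss = logits.pow(2).mean() + value.pow(2).mean()
        local_net.zero_grad()
        loss.backward()                                      # compute the local gradient
        with lock:                                           # uploads are serialized
            optimizer.zero_grad()
            for lp, gp in zip(local_net.parameters(), global_net.parameters()):
                gp.grad = lp.grad.clone()                    # upload gradient to global net
            optimizer.step()                                 # global update
            local_net.load_state_dict(global_net.state_dict())  # re-download parameters

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```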

**Figure 4.** The flowchart of Asynchronous Advantage Actor-Critic (A3C).

It is difficult for the predecessor of A3C, Actor-Critic (AC), to converge, because the consecutive sequences of samples used to update the networks are strongly correlated in time. To solve this problem, A3C adopts an asynchronous framework to construct approximately independent and identically distributed training data. Compared with Deep Deterministic Policy Gradient (DDPG) [8], which uses experience replay to reduce the correlation between training samples, A3C decorrelates the agents' data better by using asynchronous actor-learners interacting with the environment in parallel.

A3C includes two neural networks: the actor-network and the critic-network. The actor-network is used to approximate the policy function (i.e., Π(*at*|*st*; *θa*)); for continuous action spaces it outputs the mean and variance of the action distribution, and an action is then sampled from the built distribution. The critic-network is used to approximate the value function (i.e., *V*(*st*; *θv*)); it outputs an estimate of the value of the current state, which is used to adjust the probabilities of choosing actions. As seen in Algorithm 1 [12], each actor-learner shares experience by communicating with the global network, which greatly increases training efficiency. The networks are updated every *tmax* steps or when the end of an episode is reached. The gradient of the actor-network is:
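As an illustration of these two heads, a minimal PyTorch sketch for a continuous action space is given below. The shared trunk, layer sizes, and the use of a Gaussian policy are assumptions made for the example, not architectural details taken from [12].

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class ActorCritic(nn.Module):
    """Illustrative actor and critic heads for a continuous action space."""
    def __init__(self, obs_dim=3, act_dim=1, hidden=64):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, act_dim)         # mean of the action distribution
        self.log_sigma = nn.Linear(hidden, act_dim)  # log std, exponentiated below
        self.value = nn.Linear(hidden, 1)            # state-value estimate V(s)

    def forward(self, s):
        h = self.shared(s)
        dist = Normal(self.mu(h), self.log_sigma(h).exp())  # policy distribution Pi(a|s)
        return dist, self.value(h)

net = ActorCritic()
state = torch.randn(1, 3)
dist, v = net(state)
action = dist.sample()            # sample an action from the built distribution
log_prob = dist.log_prob(action)  # needed later for the policy gradient
```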

$$\nabla_{\theta_a} \log \Pi(a_t | s_t; \theta_a) A(s_t, a_t; \theta_v) + \beta \nabla_{\theta_a} H(\Pi(s_t; \theta_a)) \tag{4}$$

where *H* is the entropy of the policy Π, and the hyper-parameter *β* controls the strength of the entropy regularization term. *A*(*st*, *at*; *θv*) is an estimate of the advantage function given by:

$$\sum_{i=0}^{k-1} \gamma^i r_{t+i} + \gamma^k V(s_{t+k}; \theta_v) - V(s_t; \theta_v) \tag{5}$$

where *k* can vary from state to state, and is upper-bounded by *tmax* [12].
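A short numerical sketch of Equations (4) and (5) follows. The rewards, value estimates, log-probabilities, entropy, and *β* are arbitrary placeholder values used only to show how the *k*-step advantage and the entropy-regularized policy loss are assembled.

```python
import torch

def n_step_advantage(rewards, values, bootstrap_value, gamma=0.99):
    """k-step advantage: sum_i gamma^i r_{t+i} + gamma^k V(s_{t+k}) - V(s_t), Equation (5)."""
    returns = []
    R = bootstrap_value                  # V(s_{t+k}; theta_v), or 0 if the episode terminated
    for r in reversed(rewards):          # accumulate the discounted return backwards
        R = r + gamma * R
        returns.insert(0, R)
    return torch.tensor(returns) - values

# Toy rollout of k = 3 steps with illustrative numbers.
rewards = [1.0, 0.5, 0.0]
values = torch.tensor([0.9, 0.6, 0.3])   # V(s_t; theta_v) from the critic
adv = n_step_advantage(rewards, values, bootstrap_value=0.2)

# Policy loss from Equation (4): log-probabilities weighted by advantages,
# plus an entropy bonus scaled by beta (all placeholder values here).
log_probs = torch.tensor([-1.2, -0.8, -1.0], requires_grad=True)
entropy = torch.tensor(1.4)
beta = 0.01
policy_loss = -(log_probs * adv.detach()).sum() - beta * entropy
```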

**Algorithm 1.** Pseudo-code of each actor-learner in Asynchronous Advantage Actor-Critic (A3C).
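The pseudo-code itself is not reproduced here; the sketch below outlines what each actor-learner in [12] does: synchronize with the global network, roll out at most *tmax* steps, compute *k*-step advantages, and push the accumulated gradient back. The toy environment, the Gaussian policy, the hyper-parameter values, and the 0.5 weight on the critic loss are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

GAMMA, BETA, T_MAX = 0.99, 0.01, 5  # illustrative hyper-parameters

class ACNet(nn.Module):
    """Small actor-critic network; the architecture is a placeholder."""
    def __init__(self, obs_dim=3, act_dim=1):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh())
        self.mu = nn.Linear(64, act_dim)
        self.log_sigma = nn.Linear(64, act_dim)
        self.v = nn.Linear(64, 1)

    def forward(self, s):
        h = self.body(s)
        return Normal(self.mu(h), self.log_sigma(h).exp()), self.v(h)

class ToyEnv:
    """Hypothetical continuous-control environment used only to run the sketch."""
    def __init__(self, horizon=20):
        self.horizon, self.t = horizon, 0

    def reset(self):
        self.t = 0
        return torch.randn(3).numpy()

    def step(self, action):
        self.t += 1
        return torch.randn(3).numpy(), -float(abs(action).sum()), self.t >= self.horizon

def actor_learner(global_net, optimizer, env, episodes=1):
    local_net = ACNet()
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            local_net.load_state_dict(global_net.state_dict())   # synchronize with global net
            log_probs, entropies, values, rewards = [], [], [], []
            for _ in range(T_MAX):                                # roll out at most t_max steps
                dist, v = local_net(torch.as_tensor(s, dtype=torch.float32))
                a = dist.sample()
                s, r, done = env.step(a.numpy())
                log_probs.append(dist.log_prob(a).sum())
                entropies.append(dist.entropy().sum())
                values.append(v.squeeze())
                rewards.append(r)
                if done:
                    break
            # Bootstrap with V(s_{t+k}) unless the episode terminated (Equation (5)).
            R = 0.0 if done else local_net(torch.as_tensor(s, dtype=torch.float32))[1].item()
            actor_loss, critic_loss = 0.0, 0.0
            for lp, ent, v, r in zip(reversed(log_probs), reversed(entropies),
                                     reversed(values), reversed(rewards)):
                R = r + GAMMA * R
                adv = R - v
                actor_loss = actor_loss - lp * adv.detach() - BETA * ent   # Equation (4)
                critic_loss = critic_loss + adv.pow(2)
            loss = actor_loss + 0.5 * critic_loss   # 0.5 critic weight is an assumed choice
            local_net.zero_grad()
            loss.backward()
            # Upload the accumulated gradient and apply it to the global network.
            optimizer.zero_grad()
            for local_p, global_p in zip(local_net.parameters(), global_net.parameters()):
                global_p.grad = local_p.grad.clone()
            optimizer.step()

global_net = ACNet()
optimizer = torch.optim.Adam(global_net.parameters(), lr=1e-4)
actor_learner(global_net, optimizer, ToyEnv())
```

In a complete implementation, several such actor-learners would run in parallel threads or processes against the shared global network, as in the flowchart of Figure 4.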

