#### *2.1.3. Reward*

As the goal of the algorithm is to eliminate voltage issues at a minimal cost, the reward includes both the costs inherent to control actions (changing set-points of control variables, which may reflect the costs of relying on ancillary services [35]), and the costs of violating network constraints (which may damage the equipment). In this way, the immediate reward $r_t$ at time step $t$ is defined as follows:

$$r_t = -\sum_{g \in \mathcal{G}} \left( C_Q \,|\Delta Q_{g,t}| + C_P \,|\Delta P_{g,t}| \right) - C_{TR}\, |\Delta \mathrm{Tap}_t| + \begin{cases} +R_{\text{pos}} & \forall \, V_{n,t} \in [\underline{V}, \overline{V}] \\ -R_{\text{neg}} \,(\underline{V} - V_{n,t}) & \forall \, V_{n,t} < \underline{V} \\ -R_{\text{neg}} \,(V_{n,t} - \overline{V}) & \forall \, V_{n,t} > \overline{V} \end{cases} \tag{6}$$

where $\underline{V}$ and $\overline{V}$ are respectively the lower and upper bounds delimiting the safe voltage levels. The coefficients $C_P$ and $C_Q$ represent the costs of modifying the active and reactive powers of DG units, while $C_{TR}$ stands for the (high) cost of changing the transformer tap position. Typically, we have $C_Q < C_P < C_{TR}$. Indeed, modifying the reactive power of generators can be done at almost no cost (using the power electronics converters), while the curtailment of active power incurs a loss of generated energy that ultimately results in a financial loss [36]. Moreover, high costs are associated with a tap change of the transformer due to the aging effects on the tap changer contacts. The terms $R_{\text{pos}}$ and $R_{\text{neg}}$ respectively reflect the positive reward for the nodes having voltages within the safe range, and the negative reward (i.e., penalty) for nodes outside the permitted zone. In general, all these costs need to be properly weighted (see Section 4). Indeed, if the costs of actions $C_Q$, $C_P$ and $C_{TR}$ are too high with respect to $R_{\text{pos}}$ and $R_{\text{neg}}$, the agent may choose to suffer the negative rewards related to voltage violations (rather than correcting the voltage problem). Conversely, if the costs of actions are too low, unnecessary actions may be taken (to ensure the positive rewards related to safe voltage levels).
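To make the structure of (6) concrete, the following is a minimal Python sketch of the reward computation. The cost coefficients, voltage limits and the assumption that the voltage bonus/penalty is accumulated over all monitored nodes are illustrative choices, not values taken from the paper.

```python
import numpy as np

def immediate_reward(dq, dp, dtap, voltages,
                     c_q=0.1, c_p=1.0, c_tr=10.0,
                     r_pos=0.05, r_neg=100.0,
                     v_min=0.95, v_max=1.05):
    """Immediate reward in the spirit of Equation (6).

    dq, dp   : per-generator reactive/active set-point changes (arrays)
    dtap     : transformer tap-position change (scalar)
    voltages : per-node voltage magnitudes in p.u. (array)
    All coefficient values and limits are illustrative only.
    """
    # Cost of control actions (reactive power, curtailment, tap change)
    action_cost = c_q * np.abs(dq).sum() + c_p * np.abs(dp).sum() + c_tr * abs(dtap)

    # Per-node voltage term: bonus inside [v_min, v_max], penalty proportional
    # to the violation magnitude outside the safe range
    voltage_term = 0.0
    for v in np.atleast_1d(voltages):
        if v < v_min:
            voltage_term -= r_neg * (v_min - v)
        elif v > v_max:
            voltage_term -= r_neg * (v - v_max)
        else:
            voltage_term += r_pos

    return -action_cost + voltage_term
```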

#### *2.2. Reinforcement Learning Algorithm*

As with most machine learning techniques, it is important to differentiate between the training and test stages.

During the training, the goal of the agent is to learn the best policy $\pi^*$, i.e., to select actions that maximize the cumulative future reward $G_t = \sum_{j=t}^{T} \gamma^{j-t} r_j$ with a discount factor $\gamma \in [0, 1]$. This can be achieved by approximating the optimal action-value function $Q^*(s, a) = \mathbb{E}_{\pi^*}(G_t \mid s, a)$, which is the expected discounted return of taking action $a$ in state $s$, then continuing by choosing actions optimally. Indeed, once the $Q^*$-values are obtained, the optimal policy can be easily constructed by taking the action given by $a_t^* = \arg\max_{a \in \mathcal{A}} Q^*(s_t, a)$. Using Bellman's principle of optimality, $Q^*(s_t, a_t)$ can be expressed as

$$Q^*(s_t, a_t) = \mathbb{E}_{s_{t+1} \sim \mathcal{E}} \left[ r_t + \gamma \max_{a_{t+1}} Q^*(s_{t+1}, a_{t+1}) \right] \tag{7}$$

where the next state $s_{t+1}$ is sampled from the environment's transition rules $\mathcal{P}(s_{t+1} \mid s_t, a_t)$. In general, an agent starts from an initial (poor) policy that is progressively improved through many experiences (during which the agent learns how to maximize its rewards).
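As a small illustration of these quantities, the sketch below computes the discounted return $G_t$ and the one-step Bellman target of Equation (7) for toy inputs; the reward values, Q-values and discount factor are made up for the example.

```python
def discounted_return(rewards, gamma=0.99):
    """Cumulative discounted return G_t = sum_j gamma^(j-t) * r_j,
    here for t = 0, given the reward sequence r_0, ..., r_T."""
    g = 0.0
    for j, r in enumerate(rewards):
        g += (gamma ** j) * r
    return g

def bellman_target(r_t, q_next_values, gamma=0.99):
    """One-step target r_t + gamma * max_a Q(s_{t+1}, a) of Equation (7),
    given the Q-values of all candidate actions in the next state."""
    return r_t + gamma * max(q_next_values)

# Illustrative usage
print(discounted_return([1.0, 0.5, -2.0]))    # 1.0 + 0.99*0.5 + 0.99**2*(-2.0)
print(bellman_target(0.5, [1.2, 0.7, -0.3]))  # 0.5 + 0.99*1.2
```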

When the training is completed, i.e., during the test (in practical field operations), the trained agent selects the greedy action $a_t^*$ according to its learned policy.

This general principle is the source of many different RL algorithms, each with characteristics that suit different needs. In this context, the choice of the most suitable technique for the voltage control task is mainly driven by the fact that both the state and action spaces are continuous. Hence, well-known algorithms such as (deep) Q-learning are not considered, as they only handle a discrete action space. In this work, we therefore focus on the deep deterministic policy gradient (DDPG) technique.

#### *2.3. Deep Deterministic Policy Gradient (DDPG) Algorithm*

The deep deterministic policy gradient (DDPG) algorithm relies on an actor-critic architecture [37], which is depicted in Figure 1. The goal of the actor is to learn a deterministic policy $\mu_\phi(s)$ that selects the action $a$ based on the state $s$. The quality of this action is estimated by the critic, which computes the corresponding value $Q_\theta(s, a)$. To achieve good generalization capabilities, both the actor and the critic are estimated using deep neural networks, which are universal non-linear approximators that remain robust when the state and action spaces become large.

**Figure 1.** Working principle of the DDPG agent, which relies on an actor-critic architecture.

Overall, starting from an initial state $s_t$, the actor neural network (characterized by weight parameters $\phi$) determines the action $a_t = \mu_\phi(s_t)$. This action is then applied to the environment, which yields the reward $r_t$ and the next state $s_{t+1}$. The experience tuple $(s_t, a_t, r_t, s_{t+1})$ is then stored in the replay memory. Once the replay memory contains enough experiences, a random mini-batch of $D$ experiences is sampled. For each sample in the mini-batch, the state $s_t$ and the action $a_t$ are fed into the critic neural network (characterized by weight parameters $\theta$), which yields the $Q$-value. Both networks are then jointly updated, and the procedure is iterated until convergence.
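The replay memory described above can be sketched in a few lines of Python; the buffer capacity and batch size below are illustrative assumptions.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size buffer storing experience tuples (s_t, a_t, r_t, s_{t+1})."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are discarded first

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=64):
        # Uniform sampling breaks the correlation between consecutive experiences
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

In practice, one experience would be stored at every interaction step, and the networks would only start being updated once the buffer holds at least one mini-batch of samples.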

Practically, the critic network is trained by adjusting its parameters $\theta_i$ (at regular intervals $i \in \mathcal{I}$ during the learning phase) so as to minimize the mean-squared Bellman error (MSBE) (9). In contrast to supervised learning, the actual (i.e., optimal) target value $r_t + \gamma \max_{a_{t+1}} Q^*(s_{t+1}, a_{t+1})$ is unknown, and is thus substituted with an approximate target value $y_t$ (using the estimate $Q_{\theta_i}$):

$$y_t = r_t + \gamma \max_{a_{t+1}} Q_{\theta_i}(s_{t+1}, a_{t+1}) \tag{8}$$

where $a_{t+1}$ is given by the actor network, i.e., $a_{t+1} = \mu_{\phi_i}(s_{t+1})$.

Contrary to supervised learning, where the output of the neural network and the target value (i.e., ground truth) are completely independent, the target value $y_t$ in (8) depends on the parameters $\theta_i$ and $\phi_i$ that are being optimized during training. This link between the critic's output $Q_{\theta_i}(s_t, a_t)$ and its target $r_t + \gamma \max_{a_{t+1}} Q_{\theta_i}(s_{t+1}, a_{t+1})$ may cause the learning procedure to diverge. A solution to this problem is to use separate target networks (for both the critic and the actor), which are responsible for computing the target values. Practically, these target networks are time-delayed copies of the original networks, with parameters $\theta_{i,\text{targ}}$ and $\phi_{i,\text{targ}}$ that slowly track the (reference) learned networks. As explained in [37], these target networks are not trained directly; they break the dependency between the values computed by the networks and their targets, thereby improving the stability of learning.
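A common way to realize this slow tracking is a Polyak (soft) update of the target parameters; the sketch below assumes PyTorch modules and an update rate `tau` that is not specified in the paper.

```python
import torch

def soft_update(net, target_net, tau=0.005):
    """Let the target network slowly track the learned network:
    theta_targ <- tau * theta + (1 - tau) * theta_targ."""
    with torch.no_grad():
        for p, p_targ in zip(net.parameters(), target_net.parameters()):
            p_targ.mul_(1.0 - tau)
            p_targ.add_(tau * p)
```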

As a result, the critic network is trained (i.e., updated) by minimizing the following MSBE loss function $\mathcal{L}(\theta_i)$ with stochastic gradient descent:

$$\mathcal{L}(\theta_i) = \sum_{D} \left( \underbrace{Q_{\theta_i}(s_t, a_t)}_{(i)} - \underbrace{\left( r_t + \gamma\, Q_{\theta_{i,\text{targ}}}\!\left(s_{t+1}, \mu_{\phi_{i,\text{targ}}}(s_{t+1})\right) \right)}_{(ii)} \right)^2 \tag{9}$$

Starting from random values $\theta_{i=0}$, the parameters $\theta_i$ are thus progressively updated towards the optimal action-value function $Q^*$ by minimizing the difference between (i) the output of the critic and (ii) the target (computed with the target networks), which provides an estimate of the $Q$-function using both the outcome $r_t$ of the simulation model and the action $a_{t+1}$ from the target actor network. The update is performed on a mini-batch $D$ of different experiences $(s_t, a_t, r_t, s_{t+1}) \sim U(D)$, drawn uniformly at random from the pool of historical samples. This (replay buffer) procedure breaks the similarity between consecutive training samples, thus avoiding updates that drive the model towards a poor local minimum.
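One critic update step on a sampled mini-batch could look as follows; this is a sketch assuming PyTorch, critic/actor modules whose forward passes take `(state, action)` and `state` respectively, and a mean-squared form of the summed loss in (9). All module and optimizer names are hypothetical.

```python
import torch
import torch.nn as nn

def critic_update(critic, critic_targ, actor_targ, critic_opt, batch, gamma=0.99):
    """One gradient step on the MSBE loss of Equation (9) for a mini-batch
    of experiences (s_t, a_t, r_t, s_{t+1})."""
    s, a, r, s_next = batch  # tensors; r must match the critic's output shape

    # (ii) target computed with the (frozen) target networks
    with torch.no_grad():
        a_next = actor_targ(s_next)
        y = r + gamma * critic_targ(s_next, a_next)

    # (i) current critic estimate, then squared Bellman error
    q = critic(s, a)
    loss = nn.functional.mse_loss(q, y)

    critic_opt.zero_grad()
    loss.backward()
    critic_opt.step()
    return loss.item()
```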

In parallel, the actor network is trained (on the same mini-batch $D$) with the goal of adapting its parameters $\phi_i$ so as to provide actions $a_t$ that maximize $Q_{\theta_i}$. This amounts to maximizing the following function $\mathcal{L}(\phi_i)$, which is achieved with a gradient ascent algorithm:

$$\mathcal{L}(\phi_i) = \sum_{D} Q_{\theta_i}\left(s_t, \mu_{\phi_i}(s_t)\right) \tag{10}$$
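In deep learning frameworks, gradient ascent on (10) is typically implemented as gradient descent on its negative. The sketch below uses the same assumed PyTorch modules as the critic update above and only steps the actor's optimizer.

```python
def actor_update(actor, critic, actor_opt, states):
    """One gradient step for the actor: increase Q_theta(s, mu_phi(s)) over the
    mini-batch by descending the negative of the objective in Equation (10)."""
    actions = actor(states)
    loss = -critic(states, actions).mean()  # ascent on Q via descent on -Q

    actor_opt.zero_grad()
    loss.backward()
    actor_opt.step()   # only the actor's parameters are updated here
    return -loss.item()
```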

To ensure that the DDPG algorithm properly explores its environment during the training phase, noise $\epsilon_t$ is added to the action, i.e., $a_t = \mu_\phi(s_t) + \epsilon_t$. In particular, we use an exponentially decaying noise so as to favor exploration at the start of the training, which is then progressively decreased to stimulate exploitation as the agent converges towards the optimal policy. Naturally, when the model is trained (and used during test time), no noise is added to the optimal action $a^*$.
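A minimal sketch of such an exploration schedule is given below; the use of zero-mean Gaussian noise, as well as the initial scale and decay rate, are illustrative assumptions rather than choices stated in the paper.

```python
import numpy as np

def noisy_action(actor_action, episode, sigma0=0.2, decay=0.995, rng=None):
    """Add zero-mean Gaussian noise whose scale decays exponentially with the
    episode index, favouring exploration early on and exploitation later."""
    rng = rng or np.random.default_rng()
    sigma = sigma0 * (decay ** episode)  # exponential decay of the noise scale
    return actor_action + rng.normal(0.0, sigma, size=np.shape(actor_action))
```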
