**4. Case Study**

To solve the voltage control problem, the DDPG algorithm is implemented in Python using the PyTorch and Gym libraries. The solution is tested on the 11 kV radial distribution system with *N* = 77 buses shown in Figure 3 [39]. Bus 1 is the high-voltage (HV) connection point, which is considered as the slack node. The substation (between buses 1 and 2) supplies 8 different feeders, for a total of 75 loads. The maximum (peak) active and reactive consumption powers are equal to 24.27 MW and 4.85 Mvar, respectively. The system also hosts 22 (identical) distributed generators, with an installed power of 4 MW.

The objective of the DRL-based agent is to maintain the voltage magnitudes of the 77 buses within the desired range. In order to illustrate the effectiveness of the proposed control scheme, the allowed voltage limits are defined by a very conservative range of [0.99, 1.01] p.u., and the initial reactive powers of the DGs are set to zero. The reward function (6) reflects a compromise between the costs of voltage violations and those of corrective actions. We give more weight to maintaining safe voltage levels by defining *R*pos = 0.1 and *R*neg = 15, while *C*TR, *C*P and *C*Q are set to 1, 0.1 and 0.04, respectively.
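
For reference, these cost coefficients can be gathered in a small configuration block. Since the exact form of the reward (6) is not reproduced in this section, the `total_reward` combination below is only a hypothetical sketch of the trade-off it encodes, not the paper's actual formula.

```python
# Reward weights from the case study. The exact form of Equation (6) is
# not reproduced here, so total_reward() is only a hypothetical sketch
# of the compromise between violation costs and corrective-action costs.
R_POS, R_NEG = 0.1, 15.0           # weights on safe vs. violated voltage levels
C_TR, C_P, C_Q = 1.0, 0.1, 0.04    # tap-change, active- and reactive-power costs

def total_reward(r_voltage, n_tap_moves, dP, dQ):
    """Hypothetical combination: voltage term minus weighted action costs."""
    action_cost = (C_TR * abs(n_tap_moves)
                   + C_P * sum(abs(p) for p in dP)
                   + C_Q * sum(abs(q) for q in dQ))
    return r_voltage - action_cost
```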

A total of 12,000 initial operating states (that need to be processed by the DRL-based agent) are generated with the simulation model, among which 10,000 are used to train the agent, while the remaining 2000 scenarios are kept (as a test set) to evaluate the performance of the resulting model. It should be noted that, in this work, the agent has a single step to process each of the generated scenarios (it cannot rely on several interactions with the environment to solve a voltage problem). The value of the discount factor *γ* is thereby fixed to 1.
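
Because each scenario is a one-step episode, the interaction loop is particularly simple. The sketch below illustrates this setup; `scenarios`, `env` and `agent` are hypothetical placeholders (not from the paper), while the 10,000/2000 split and *γ* = 1 follow the text.

```python
import random

GAMMA = 1.0  # single-step episodes, so no discounting is needed

# `scenarios` is a hypothetical list holding the 12,000 generated operating
# states; `env` and `agent` stand in for the simulation model and the DDPG agent.
random.shuffle(scenarios)
train_set, test_set = scenarios[:10000], scenarios[10000:]

for scenario in train_set:
    state = env.reset(scenario)            # load one initial operating state
    action = agent.act(state)              # single corrective action
    next_state, reward, done, info = env.step(action)
    agent.remember(state, action, reward, next_state, done)
    agent.learn()                          # one DDPG update per interaction
```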

To have an overview of the global network conditions in the case where no control action is performed, we show in Figure 4 the distribution of nodal voltage levels (for the 12,000 simulated states) using a boxplot representation. We observe that violations of the voltage limits [0.99, 1.01] p.u. occur more than 50% of the time. In particular, the distribution is asymmetrical, skewed towards over-voltage issues (due to the high penetration of distributed generation), which occur in 40.1% of the simulated samples.

**Figure 3.** Schematic diagram of the 77-bus distribution system. The section between buses 1 and 2 is the substation, which supplies 8 different feeders.

**Figure 4.** Boxplot representing the (nodal) distributions of the voltage levels for the 77 buses among the 12,000 simulated states.
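
A diagnostic of this kind is straightforward to reproduce. The snippet below sketches the boxplot of Figure 4, assuming `voltages` is a (12,000 × 77) array of simulated nodal voltage magnitudes; random data is used here as a stand-in for the actual power-flow results.

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in data: per-state nodal voltage magnitudes in p.u.
# (in the paper these come from power-flow simulations).
voltages = np.random.normal(1.005, 0.01, size=(12000, 77))

plt.boxplot(voltages, showfliers=False)
plt.axhline(0.99, linestyle="--", color="r")   # lower voltage limit
plt.axhline(1.01, linestyle="--", color="r")   # upper voltage limit
plt.xlabel("Bus index")
plt.ylabel("Voltage magnitude (p.u.)")
plt.show()
```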

#### *4.1. Impact of DDPG Parameters*

In the proposed case study, the state space *st* is of size 100, i.e., 77 dimensions for the nodal voltages *Vn*,*t*, 22 dimensions for the (predicted) maximum powers *Pg*,*t*+1 of the 22 generators, and 1 dimension for the position of the tap changer *Tapt*. The action space *at* is of size 45, i.e., 2 × 22 = 44 dimensions corresponding to the changes in active and reactive power of the 22 generators, and 1 dimension for changing the position of the tap changer. Hence, as sketched in Figure 5, the actor network has an input layer of size 100 (i.e., composed of 100 neurons) and an output layer of size 45. The critic network is characterized by a 145-dimensional input layer and a single output.
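
As an illustration, the state and action vectors can be assembled as follows. Only the dimensions are given in the text, so the ordering of the entries below is an assumption.

```python
import numpy as np

N_BUS, N_GEN = 77, 22  # buses and distributed generators

def build_state(voltages, p_max_forecast, tap_position):
    """100-dimensional state: 77 nodal voltages, 22 forecast maximum
    DG powers, and the tap-changer position (ordering assumed)."""
    return np.concatenate([voltages, p_max_forecast, [tap_position]])

def split_action(action):
    """45-dimensional action: 22 active-power changes, 22 reactive-power
    changes, and 1 tap-changer change (ordering assumed)."""
    return action[:N_GEN], action[N_GEN:2 * N_GEN], action[-1]
```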

Based on this (fixed) information, we then optimized the hyper-parameters of the DRL-based agent, which consists in tuning its complexity by adding extra hidden layers to the architectures of both the actor and critic neural networks. The best performance was achieved by connecting the input and output layers (for both the actor and the critic networks) with 5 fully connected hidden layers of 20 units each. The activation functions of the hidden layers are ReLUs (rectified linear units). The hyperbolic tangent function is used for the output layer of the actor, while a linear function is employed for the critic. The batch size is set to 16 samples, and the target networks are updated (during the training) with a delay of 10 iterations. Both actor and critic networks are initialized with random weights in the range [−0.1, 0.1].

**Figure 5.** Representation of the neural network architectures for both actor and critic.
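
A minimal PyTorch sketch of these architectures is given below, following the dimensions and initialization reported above; the `mlp` helper is a hypothetical utility, not from the paper.

```python
import torch.nn as nn

def mlp(sizes, out_activation=nn.Identity()):
    """Fully connected network with ReLU hidden layers."""
    layers = []
    for i in range(len(sizes) - 1):
        layers.append(nn.Linear(sizes[i], sizes[i + 1]))
        layers.append(nn.ReLU() if i < len(sizes) - 2 else out_activation)
    net = nn.Sequential(*layers)
    for p in net.parameters():            # random init in [-0.1, 0.1]
        nn.init.uniform_(p, -0.1, 0.1)
    return net

# Actor: 100 inputs -> 5 hidden layers of 20 units -> 45 outputs (tanh).
actor = mlp([100, 20, 20, 20, 20, 20, 45], out_activation=nn.Tanh())
# Critic: 145 inputs (state + action) -> 5 hidden layers -> 1 linear output.
critic = mlp([145, 20, 20, 20, 20, 20, 1])
```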

The exploration-exploitation parameter (i.e., the extra noise added to the actions during the training) is $\epsilon_t = \mathcal{N}(0, 0.2) \times \left(0.005 + 0.995\, e^{-k/\Delta T}\right)$, where $\mathcal{N}(0, 0.2)$ is a zero-mean Gaussian noise with a standard deviation of 0.2, which decays exponentially along the training iterations *k*. The decay period Δ*T* is equal to 5000 episodes. In general, this action noise has a significant impact on the learning abilities of the DRL-based agent. This observation is illustrated in Figure 6, where we depict two learning curves for which all parameters of the agents are identical, except for the action noise. In particular, the optimal calibration of $\mathcal{N}(0, 0.2)$ is compared to a perturbation of $\mathcal{N}(0, 0.6)$ (with the same decay over the training samples).
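
This decay schedule can be written compactly; the function below implements the expression above (the action dimension of 45 comes from the state/action description in Section 4.1).

```python
import numpy as np

def action_noise(k, action_dim=45, sigma=0.2, decay_period=5000):
    """Exploration noise eps_k = N(0, sigma) * (0.005 + 0.995 e^{-k/dT})."""
    scale = 0.005 + 0.995 * np.exp(-k / decay_period)
    return np.random.normal(0.0, sigma, size=action_dim) * scale
```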

In general, when the perturbations are too small, the training may fail to properly explore the search space (which increases the probability of ending up in a local minimum), while oversized perturbations may negatively affect the learning (and even lead the algorithm to repeatedly perform the same action).

**Figure 6.** Evolution of the total immediate rewards *rt* across training episodes for two different configurations of the action noise $\epsilon_t$.

For the best model (right part of Figure 6), we see that the DDPG control scheme quickly learns (after around 7500 interactions with the environment) a stable and efficient policy. At the beginning (during the first 2000 training steps), the agent randomly selects actions, which leads to many situations where it deteriorates the electrical network conditions. In the course of the learning procedure, however, the agent progressively improves and starts solving the voltage issues with less costly decisions. The agent eventually converges to total rewards of *r* ≈ 5. In contrast, the other model (left part of Figure 6) converges at a much lower performance (total rewards of *r* ≈ −7.5), which roughly corresponds to the reward obtained when no action is performed. In general, the main advantage of the proposed framework lies in its generic design, which makes it broadly applicable (e.g., to any distribution system), and in its ability to adapt to varying operating conditions. Evidently, when the methodology is applied to another environment, the DDPG agent needs to be re-trained from scratch, and its hyper-parameters (e.g., the training noise, as well as the number of hidden layers and neurons for both actor and critic networks) also need to be adapted.

#### *4.2. Impact of Endogenous Uncertainties*

The impact of endogenous uncertainties (regarding the physical parameters of the distribution system) is evaluated through the analysis of three cases: in case 1, the endogenous uncertainties are disregarded in both the training and test stages; in case 2, they are disregarded during the training but modeled in the test set; and in case 3, they are incorporated in both the training and test procedures.

The simulation results for the three cases are summarized in Figure 7. Practically, we represent the evolution of the negative reward (which is a measure of the voltage violations) in both the training and test phases. This negative reward *r*neg is equal to 0 in the perfect situation where all nodal voltages lie within $[\underline{V}, \overline{V}]$ = [0.99, 1.01] p.u., and takes increasingly negative values as the severity of the voltage violations grows, i.e.,

$$r_{\text{neg}} = \begin{cases} 0, & \forall\, V_n \in [\underline{V}, \overline{V}] \\ -R_{\text{neg}}\,(\underline{V} - V_n), & \forall\, V_n < \underline{V} \\ -R_{\text{neg}}\,(V_n - \overline{V}), & \forall\, V_n > \overline{V} \end{cases} \tag{11}$$
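
For concreteness, Equation (11) translates directly into a few lines of Python. Whether the per-node penalties are summed or only the worst violation is kept is not specified above, so the aggregation below (a sum over the nodes) is an assumption.

```python
import numpy as np

R_NEG, V_MIN, V_MAX = 15.0, 0.99, 1.01  # R_neg and the voltage limits (p.u.)

def negative_reward(voltages):
    """Equation (11): 0 inside [V_MIN, V_MAX], linear penalty outside.
    Summing over the nodes is an assumed aggregation."""
    under = np.minimum(voltages - V_MIN, 0.0)  # negative where V_n < V_MIN
    over = np.maximum(voltages - V_MAX, 0.0)   # positive where V_n > V_MAX
    return float(np.sum(R_NEG * under - R_NEG * over))
```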

**Figure 7.** Evolution of the reward *r*neg in the three studied cases, in both the training and test stages.

We observe that when the uncertainties associated with the model parameters are neglected during the training (cases 1 and 2), the RL agent quickly finds actions that remove the voltage violations, i.e., the upper bound of the negative reward *r*neg = 0 is almost reached in around 2000 episodes. This performance requires more than 4000 episodes when dealing with endogenous uncertainties, due to the increased difficulty of the task. This effect also translates into a higher variability of the reward. Interestingly, by comparing the evolution of *r*neg with the total reward *r* in Figure 6 during the training, we see that even though the agent is able to mitigate the voltage issues after 4000 training episodes, the cost-efficiency of the actions can still be improved (which is realized during the next 4000 episodes).

To quantify the impact of neglecting the endogenous model uncertainties, the mean values of the negative reward *r*neg in (11), over the last 2000 episodes of the training phase and over the 2000 new episodes of the test set, are provided in Table 1 for the three studied cases.


**Table 1.** Average value of the negative reward *r*neg across the training and test sets.

As expected, the agent that is agnostic to the endogenous uncertainties on the physical parameters of the system during the training (cases 1 and 2) achieves a lower out-of-sample performance when these effects are modeled in the test set. Specifically, the reward *r*neg drops from −0.53 (in case 1, where endogenous uncertainties are also disregarded at the test stage) down to −0.93 in the realistic case 2. In this latter situation, the agent expects a reward of around −0.6 (at the end of its learning), while it actually obtains a disappointing ex-post outcome of −0.93. This problem can be efficiently alleviated by incorporating the endogenous uncertainties within the learning procedure. In that framework (case 3), the training and test rewards are close to each other, i.e., *r*neg ≈ −0.67, which illustrates the good performance of the proposed method.
