## **1. Introduction**

The massive integration of Distributed Generation (DG) units in electric distribution networks poses significant challenges for system operators [1–5]. Indeed, distribution networks were historically sized (with a radial structure) to meet maximum load demands while avoiding under-voltages at the end of the lines. However, in the presence of local generation, the opposite over-voltage problem may appear. In case of severe voltage violations, the inverters of DG units are temporarily cut off. This induces not only a loss of renewable-based energy, but also a deterioration of the delivered power quality (due to the resulting voltage and current transients) that accelerates equipment degradation [6]. In this context, the objective of modern Distribution System Operators (DSOs) is to adopt a reliable and cost-efficient strategy that maintains a safe voltage profile in both normal and contingency conditions, with the goal of enhancing the system's ability to accommodate new renewable-based resources. To that end, researchers have developed a wide range of techniques aimed at avoiding costly investment plans that simply upgrade/reinforce the network. Moreover, static (experience-based) strategies built on past observations have shown limitations, as they are often sub-optimal and unable to react within a very short time frame (to prevent cascading faults just after a disturbance) [7].

Theoretically, different methods can be applied for voltage management of Medium-Voltage (MV) distribution systems, but the most common ones rely on the on-load tap changer mechanism of the transformer, reactive power compensation, and curtailment of DG active power [8,9]. It is generally known that each of these voltage control methods has its own advantages and drawbacks, and there is no single perfect voltage regulation method [10]. Recently, a growing body of literature has focused purely on local strategies, in which resources rely only on local measurements of voltage magnitude and do not exchange information with other agents [11,12]. Such local algorithms are easy to implement in practice, but the lack of a global vision may prevent the voltage control problem from being solved cost-efficiently. As an alternative, distributed strategies, which require communication capabilities between neighbouring agents, are also considered. Such approaches enable resources (that are physically close) to share information in order to cooperatively achieve the desired target levels, while considering other objectives such as loss minimization [13–15]. However, to further improve the optimality of the control solution, centralized voltage control algorithms, mainly based on an Optimal Power Flow (OPF) formulation, have also been proposed [16–18].

In general, although the latter centralized model-based techniques have shown promising performance, they are plagued by two main issues.

Firstly, they require solving challenging optimization problems, which are non-linear and non-convex (owing to the AC power flow equations used to comply with the physical constraints of the electrical distribution system) and subject to uncertainties (from stochastic load and generation changes, and unexpected contingencies). OPF-based methods thereby face scalability issues that make them of little relevance for real-time operation. This is partly addressed by using efficient nonlinear programming techniques [19], or through convex approximations of the power flow constraints, which mainly rely on second-order cone programs [20] or linear reformulations based on sensitivity analysis [21–23]. However, modeling errors inevitably arise and may lead to unsafe and sub-optimal solutions. Moreover, the recent trend of operating modern distribution networks in closed-loop mode makes traditional approximations even less accurate [24].
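
To make the sensitivity-based linear reformulation concrete, a common first-order approximation expresses the nodal voltage-magnitude deviations as linear functions of the nodal power injections; the notation below ($S_P$, $S_Q$) is ours for illustration and is not taken from the cited works:

$$
\Delta V \approx S_P \,\Delta P + S_Q \,\Delta Q,
\qquad
S_P = \left.\frac{\partial V}{\partial P}\right|_{(P^0,\,Q^0)},
\quad
S_Q = \left.\frac{\partial V}{\partial Q}\right|_{(P^0,\,Q^0)}
$$

Since the sensitivity matrices are evaluated at a given operating point $(P^0, Q^0)$, the approximation degrades as the operating point drifts away from it, which is one source of the modeling errors mentioned above.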

Secondly, a common feature of model-based techniques is the assumption that the physical parameters of the distribution network are perfectly known, which is impractical given the high complexity of these systems. In reality, the real-time characteristics of the network components are not static and are governed by complex dynamic dependencies [25]. For instance, parameter deviations can arise from atmospheric conditions and aging. Important effects are thereby often neglected: load power factors are not known precisely, there is a complex dependence structure between load and voltage levels, line impedances vary with conductor temperature, and the shunt admittances of lines as well as the internal resistance of transformers also affect network conditions [26].

The first issue (related to the high computational cost of model-based control algorithms) has motivated the use of reinforcement learning (RL). These data-driven methods have the advantage of learning their operating strategy directly from historical data, in a model-free fashion (without any assumption on the functional form of the model). Consequently, they can show good robustness in very complex environments with measurement noise [27,28]. A novel deep reinforcement learning (DRL)-based voltage control scheme (named Grid Mind) is developed in [29]. In particular, two different techniques are compared, namely deep Q-network (DQN) and deep deterministic policy gradient (DDPG), and both have shown promising outcomes. In [30], voltage regulation is improved using an RL-based policy that determines the optimal tap setting of transformers. A voltage control solution combining actions on two different time scales is implemented in [31], where DQN is applied for the (slow) operation of capacitor banks. Finally, multi-agent frameworks have been developed in [32,33] to enable decentralized execution of the control procedure without requiring a central controller.

However, all these methods disregard the endogenous uncertainties in network parameters, which may mislead the DSO into believing that the control strategy satisfies technical constraints while it actually results in unsafe conditions. In this context, the main contribution of this paper is to propose a self-learning voltage control tool based on deep reinforcement learning (DRL), which accounts for the limited knowledge of both the network parameters and the future (very-short-term) working conditions. The proposed tool can support DSOs in making autonomous and quick control actions to maintain nodal voltages within their safe range, in a cost-optimal manner (through the optimal use of ancillary services in a market environment). In this work, it is assumed that, for voltage control purposes, we can act on the active and reactive powers of DG units as well as on the transformer tap position. The resulting problem is formulated as a single-agent centralized control model.

The main advantage of the proposed method lies in its ability to learn from scratch (in an off-line fashion) and gradually master the system operation. Hence, the computational burden is transferred to a pre-processing stage (when the model is calibrated/learned through many simulations), such that the computational effort of the real-time control process (in actual field operation, once the agent is trained) is insignificant (<1 s). Also, the model-free tool makes it possible to immunize the voltage control procedure against uncertainties in both exogenous (load conditions) and endogenous (network parameters) variables, while accounting for approximations in the power flow models describing the system operation. Results from a case study on a 77-bus, 11 kV radial distribution system reveal that the proposed tool determines an optimal policy that leads to safe grid operation at low cost.

The remainder of this paper is organized as follows. Section 2 introduces the theoretical background in reinforcement learning, with a particular focus on the deep deterministic policy gradient (DDPG) algorithm, which can handle high-dimensional (and continuous) action spaces. Section 3 describes the simulation environment, including the different sources of uncertainty. The developed method is tested (using new representative network conditions) in Section 4 on a realistic 77-bus system, where we validate its robustness through numerical simulations. Finally, conclusions and perspectives for future research are given in Section 5.

## **2. Reinforcement Learning Background**

In this section, we introduce the basics of reinforcement learning (RL), while drawing practical connections to the voltage control problem.

#### *2.1. Markov Decision Process*

Firstly, the problem has to be formulated as a Markov Decision Process (MDP). The general principle consists of an agent interacting with an environment $\mathcal{E}$ over a number of discrete time steps until the agent reaches a terminal state. In particular, at each step $t$, the agent observes a state $s_t$ from the state space $\mathcal{S}$ and selects an action $a_t \in \mathcal{A}$ according to its policy $\pi(a_t|s_t)$. As a result, the agent ends up in the next state $s_{t+1} \sim \mathcal{P}(s_{t+1}|s_t, a_t)$ while receiving an immediate scalar reward $r_t$ drawn from the distribution $\mathcal{R}(r_t|s_t, a_t)$, in accordance with the natural laws of the environment. The next state $s_{t+1}$ depends only on the action $a_t$ taken in state $s_t$ (and not on the prior history), a characteristic referred to as the Markov property.
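
For concreteness, the interaction just described can be sketched as the following loop; the `Env` and `Agent` interfaces and their method names are illustrative assumptions for exposition, not code from this work:

```python
# Minimal sketch of the agent-environment interaction loop of an MDP.
# `Env` and `Agent` are hypothetical interfaces, not the paper's code.

class Agent:
    def act(self, state):
        """Select an action a_t according to the policy pi(a_t | s_t)."""
        raise NotImplementedError

    def observe(self, state, action, reward, next_state, done):
        """Consume one transition (s_t, a_t, r_t, s_{t+1}) for learning."""
        raise NotImplementedError

def run_episode(env, agent):
    """Roll out one episode until a terminal state is reached."""
    state = env.reset()  # initial state s_0
    done = False
    episode_return = 0.0
    while not done:
        action = agent.act(state)                    # a_t ~ pi(.|s_t)
        next_state, reward, done = env.step(action)  # s_{t+1} ~ P(.|s_t, a_t)
        agent.observe(state, action, reward, next_state, done)
        episode_return += reward
        # Markov property: the next transition depends only on (s_t, a_t).
        state = next_state
    return episode_return
```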

In this work, the agent is the central controller which regulates the voltage level within its control area, and the environment is the electrical distribution network (including the realization of the different sources of uncertainty affecting its operation). The state-transition model $\mathcal{P}(s_{t+1}|s_t, a_t)$ and the reward function $\mathcal{R}(r_t|s_t, a_t)$ are inherently stochastic, and the problem can thus be formulated using reinforcement learning.
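
To make this mapping concrete, the sketch below shows one way such an environment could be exposed to the agent; the class, its method names, the voltage band, and the reward shape are all illustrative assumptions rather than the actual environment of this work (which is described in Section 3):

```python
import numpy as np

class VoltageControlEnv:
    """Illustrative gym-style interface for the voltage-control MDP.

    `network` stands for a hypothetical power-flow simulator of the
    feeder; its methods are placeholders for exposition only.
    """

    V_MIN, V_MAX = 0.95, 1.05  # assumed safe voltage band (p.u.)

    def __init__(self, network, n_dg):
        self.network = network  # power-flow simulator (hypothetical)
        self.n_dg = n_dg        # number of controllable DG units

    def reset(self):
        # Sample a new operating condition (loads, generation, parameters).
        return self.network.sample_initial_state()

    def step(self, action):
        # action = [active-power setpoints of the DG units,
        #           reactive-power setpoints, transformer tap position]
        p = action[: self.n_dg]
        q = action[self.n_dg : 2 * self.n_dg]
        tap = action[-1]
        voltages, cost = self.network.apply(p, q, tap)  # runs an AC power flow

        # Reward: penalize the cost of ancillary services and any
        # excursion of nodal voltages outside the safe band.
        violation = (np.maximum(voltages - self.V_MAX, 0.0)
                     + np.maximum(self.V_MIN - voltages, 0.0))
        reward = -cost - 100.0 * violation.sum()  # penalty weight is arbitrary

        done = self.network.end_of_horizon()
        return voltages, reward, done
```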
