**1. Introduction**

With the development of renewable energy technologies, our dependence on conventional energy has been gradually declining. As the core of the third industrial revolution, a new concept named the energy Internet (EI) has been proposed and investigated extensively [1,2], in which a new architecture of energy supply and demand is constructed through the integration of information and energy [3–5]. Typically, an EI scenario can have access to the utility grid. Alternatively, when disconnected from the main power grid, multiple sub-grids interconnected via energy routers are able to function normally [6,7]. For the detailed definition, architecture and key technologies of EI, readers can refer to [8,9] and the references therein.

Due to the increased uncertainty in power generation and usage compared with traditional power grids, one of the challenges faced by EI is how to match power demand with supply while maintaining the safety and reliability of the whole network. The problem of resilient multi-scale coordination control against a set of adversarial or non-cooperative nodes in directed networks has been investigated in [10]. In power systems, static and transient voltage stability analyses have been extensively studied; see, e.g., [11]. Transient voltage stability problems, such as voltage sag, may occur in a local network that is not robust to a large disturbance. Notably, such transient voltage stability issues also exist in EI and are worth investigating [12].

The loss of reactive power can increase the voltage loss and may also lead to voltage fluctuation. Reactive power compensation is therefore of great significance for the safe and reliable operation of EI. The following four targets can be achieved by proper reactive power compensation: (1) stabilizing the grid voltage, (2) increasing the power factor, (3) improving the equipment utilization rate, and (4) reducing the loss of network active power; see, e.g., [13]. In order to guarantee the normal operation of EI, the problem caused by reactive power consumption can be solved by installing a static var generator (SVG) [14]. The SVG, also known as STATCOM, is a commonly used device for this purpose [15]. The working principle of the SVG is as follows: a voltage source inverter is connected in parallel to EI, and either the amplitude and phase of its output voltage on the AC side are adjusted, or the current on the AC side is adjusted directly, so that the device absorbs or emits reactive power. Thus, reactive power compensation can be implemented dynamically. On the basis of stability assessment and prediction, SVGs are installed to maintain the safe and stable operation of EI. When voltage stability cannot be restored by the self-healing ability of EI, the locations at which SVGs are installed and the output reactive power set for each SVG affect both the voltage stability and the time needed to restore it. When voltage stability can be restored by the self-healing ability of EI, installing an SVG accelerates the restoration. Additionally, the impact on power consumption on the customer side can be reduced [16].
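The working principle above can be made concrete with a standard simplified relation, given here only as an illustrative sketch and not as the model used in this paper. The symbols $V_s$, $V_c$ and $X$ are introduced here as assumptions: $V_s$ is the per-phase RMS voltage at the point of connection, $V_c$ is the per-phase RMS output voltage of the inverter, and $X$ is the coupling reactance between them. Assuming a lossless coupling and $V_c$ in phase with $V_s$, the reactive power delivered by the SVG per phase is approximately

$$
Q_{\mathrm{SVG}} \approx \frac{V_s \,(V_c - V_s)}{X}.
$$

Under these assumptions, raising the inverter voltage above the grid voltage ($V_c > V_s$) gives $Q_{\mathrm{SVG}} > 0$, so the device emits reactive power, whereas lowering it ($V_c < V_s$) gives $Q_{\mathrm{SVG}} < 0$, so the device absorbs reactive power.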

Within EI scenarios, transient short-circuit failures may cause great economic loss [17]. The judgment of transient voltage stability is not only the basis for subsequent decision optimization of reactive power compensation, but also the key to maintaining the normal operation of EI. Additionally, the credibility of the subsequent decision optimization depends on how accurately the stability state is judged. At present, the mainstream conventional methods for judging the transient voltage stability state are time domain simulation approaches [18] and direct methods [19], both of which are based on deterministic analysis. Due to the intermittence and volatility of power generation by renewable energy sources, the transient voltage stability state cannot be judged adequately by deterministic approaches.

In recent years, with the development of big data and data mining technologies, machine learning algorithms have been applied to the judgment of the transient voltage stability state [20]. These algorithms mainly include artificial neural networks, decision trees, support vector machines (SVM) [21] and other shallow machine learning algorithms. To illustrate, an intelligent algorithm using a feedforward neural network for online voltage stability assessment and monitoring has been studied in [22], where the voltage, active power and reactive power of generators and loads are used as characteristic inputs for online voltage stability evaluation. In [21], the SVM algorithm was applied with the voltage level, generator rate and rotor angle selected as input features for the prediction and evaluation of transient voltage stability after a fault occurs. In [23], the authors propose a voltage safety evaluation method based on regularly updating the decision tree. In [24], a multi-layer perceptron neural network is employed, with the voltage value and reactive power generation selected as new features for online voltage stability testing and evaluation. In [25], the extreme learning machine is used for voltage stability margin evaluation.

It is notable that conventional evaluation methods can hardly achieve speed and accuracy at the same time. Shallow machine learning algorithms have limited capacity to process the input features of complex classification problems, and thus cannot meet the accuracy requirement of transient voltage stability prediction and evaluation in EI. In recent years, deep learning has been applied extensively to transient voltage stability prediction and evaluation to address this challenge. Deep learning has a strong feature extraction ability and can cope with the curse of dimensionality that arises from the many nodes and many features in EI; see, e.g., [26]. At present, the commonly used deep learning algorithms include the deep belief network [27,28], recurrent neural network [29], stacked denoising auto-encoders [30] and convolutional neural network (CNN) [31]. The combination of deep learning and reinforcement learning forms the deep reinforcement learning approach [32,33]. Reinforcement learning can be viewed as a process of exploration in an unknown environment [34]. In mapping environment states to actions, the agent not only finds the action with the maximum reward value through exploration, but also approaches the optimal overall outcome through continuous trial and error. The ultimate goal is thus to obtain the maximum cumulative reward value. Reinforcement learning mainly includes four key aspects: strategy, reward, evaluation and environment. Such methods explore the unknown environment, and different strategies select different actions, which yield different reward and punishment values, so that the quality of each strategy can be evaluated. It is worth mentioning that the quality of a strategy is judged not only by the reward value obtained immediately after an action is completed, but also by the follow-up actions and the reward value eventually obtained. Based on the environment of a discrete-time Markov decision process, the Q learning algorithm is one of the most important algorithms in reinforcement learning [35].
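To illustrate the Q learning idea mentioned above, the following is a minimal tabular sketch on a toy problem. It is not the deep reinforcement learning model developed in this paper: the four-state chain environment, the function names `chain_step` and `q_learning`, and all hyperparameter values are hypothetical choices made only for illustration. The update rule itself is the standard one, in which the value of a state-action pair is moved toward the immediate reward plus the discounted best value of the next state.

```python
import random


def chain_step(state, action):
    """Hypothetical 4-state chain: action 1 moves right, action 0 moves left.

    Reaching the last state (index 3) yields reward 1 and ends the episode.
    """
    next_state = min(state + 1, 3) if action == 1 else max(state - 1, 0)
    done = next_state == 3
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def q_learning(step, n_states, n_actions, episodes=500, alpha=0.1,
               gamma=0.9, epsilon=0.1, max_steps=100, seed=0):
    """Tabular Q learning with an epsilon-greedy behaviour policy."""
    rng = random.Random(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        state = 0  # every episode starts in the first state
        for _ in range(max_steps):
            if rng.random() < epsilon:
                # Explore: pick a random action.
                action = rng.randrange(n_actions)
            else:
                # Exploit: pick a best-valued action, breaking ties at random.
                best = max(q[state])
                action = rng.choice(
                    [a for a, v in enumerate(q[state]) if v == best])
            next_state, reward, done = step(state, action)
            # Target: immediate reward plus discounted best follow-up value.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q


if __name__ == "__main__":
    q_table = q_learning(chain_step, n_states=4, n_actions=2)
    # Greedy action in each non-terminal state after training.
    greedy = [max(range(2), key=lambda a: q_table[s][a]) for s in range(3)]
    print(greedy)
```

The target term shows why the quality of an action depends on more than its immediate reward: moving right from the first state earns no reward by itself, yet its value grows because it leads toward the rewarded final state.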

The data in EI include information about the state of each node at each time point and information about the network topology, and such data have both temporal and spatial correlation. The conventional simplified power network model [36] based on simulation fails to make full use of the real-time information obtained by massive data acquisition devices. In addition, the decision-making on reactive power compensation in existing power grids is mainly based on manual operations. In this paper, reactive power optimization for transient voltage stability in EI is studied. Based on data with sufficient information, a deep reinforcement learning model is used to judge the transient voltage stability state. In order to avoid losing temporal and spatial information while training the model, CNN is selected to predict the transient voltage stability. Next, based on the stability prediction results, the deep reinforcement learning algorithm is applied to the decision optimization of reactive power compensation. Simulations show the effectiveness of the proposed method.

The contribution of this paper can be outlined as follows:


The rest of this paper is organized as follows. Section 2 provides the problem formulation. Section 3 presents the workflow of the reactive power decision optimization algorithm. Section 4 provides simulation results. Finally, Section 5 concludes the paper.
