**3. Optimization Algorithm**

In this section, we introduce the flow of the reactive power decision optimization algorithm, which is based on the judgement of the transient voltage stability state.

### *3.1. The Algorithm for Judging the Transient Voltage Stability State Based on CNN*

For the prediction of transient voltage stability, the selected input features are normally time-domain data only. Mainly the historical data of a single node is considered, without taking into account the overall spatial characteristics of the grid. As a result, the historical data of the other nodes, which contains a large amount of valid information useful for predicting the stability of the node in question, is missed. In addition, the massive amount of collected PMU data fails to be properly exploited. In this paper, the algorithm for judging the transient voltage stability state is designed based on the analysis of each node's data acquired by PMU. Meanwhile, the distance between the time period of the selected features and the time to be predicted is enlarged, so that the prediction can be obtained further in advance.

The detailed algorithm for judging the transient voltage stability is as follows:

Step 1: Establishment of the model input sample matrix.

Real-time data is acquired from the PMU data acquisition devices deployed at the key nodes of the real power grid; output data can also be obtained from simulation software. The values of voltage *U*, frequency *f*, active power *P* and reactive power *Q* of the key nodes in the EI during a characteristic time period *T* are collected. The input sample matrix of the model is as follows:

$$
\begin{bmatrix}
U_{1,1} & f_{1,1} & P_{1,1} & Q_{1,1} & \cdots & U_{1,M} & f_{1,M} & P_{1,M} & Q_{1,M} \\
U_{2,1} & f_{2,1} & P_{2,1} & Q_{2,1} & \cdots & U_{2,M} & f_{2,M} & P_{2,M} & Q_{2,M} \\
U_{3,1} & f_{3,1} & P_{3,1} & Q_{3,1} & \cdots & U_{3,M} & f_{3,M} & P_{3,M} & Q_{3,M} \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \vdots \\
U_{N,1} & f_{N,1} & P_{N,1} & Q_{N,1} & \cdots & U_{N,M} & f_{N,M} & P_{N,M} & Q_{N,M}
\end{bmatrix}
$$

where each of voltage *U*, frequency *f*, active power *P* and reactive power *Q* carries the subscript (*i*, *j*): the first subscript *i* denotes the *i*-th sample, and the second subscript *j* denotes the *j*-th time collection point.
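For illustration, a minimal sketch of how such a matrix could be assembled from per-node PMU records is given below. The container layout and names (`samples`, the `"U"`/`"f"`/`"P"`/`"Q"` keys) are assumptions for the example, not the paper's implementation.

```python
import numpy as np

def build_input_matrix(samples):
    """Assemble the model input matrix from PMU records.

    `samples` is a list of N samples; each sample holds four arrays of
    length M (one value per time collection point) for U, f, P and Q.
    """
    rows = []
    for s in samples:
        # Interleave U, f, P, Q per collection point, matching the
        # column order [U_{i,1}, f_{i,1}, P_{i,1}, Q_{i,1}, ..., Q_{i,M}].
        row = np.column_stack([s["U"], s["f"], s["P"], s["Q"]]).ravel()
        rows.append(row)
    return np.vstack(rows)  # shape: (N, 4 * M)
```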

Step 2: Determination and labeling of the input data stability.

According to the industrial standard, each input sample is labeled with its stability. The value of voltage *U* at a specific time is used to determine whether the voltage is stable. If the node voltage *U* has recovered to 0.8 times the standard value, the sample is regarded as stable and denoted as "1"; otherwise, it is regarded as unstable and denoted as "0".
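A sketch of this labeling rule, assuming the judgement is made at one specific time index `t_check` (the exact instant is an assumption here):

```python
def label_stability(u_series, u_nominal, t_check):
    """Return 1 ("stable") if the node voltage has recovered to
    0.8 times the standard value at the checked time, else 0."""
    return 1 if u_series[t_check] >= 0.8 * u_nominal else 0
```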

Step 3: Expansion of data.

Considering the imbalance between positive (stable) and negative (unstable) samples, the input sample data is expanded by a translating window, in order to avoid bias in the training process. This process is illustrated in Figure 2 and in the sketch that follows it.

**Figure 2.** Data expansion mode in the case of unbalanced samples.
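A possible implementation of the translating-window expansion is sketched below; the window length and stride are free parameters, and applying a smaller stride to the minority class is one way to rebalance the sample set.

```python
import numpy as np

def expand_by_window(series, window_len, stride=1):
    """Expand one recorded trajectory of shape (T, n_features) into
    overlapping training windows of shape (n_windows, window_len,
    n_features) by translating a fixed-length window along time."""
    windows = [series[start:start + window_len]
               for start in range(0, len(series) - window_len + 1, stride)]
    return np.stack(windows)
```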

Step 4: Construction of CNN.

The CNN is composed of an input layer, convolution layers, pooling layers, a fully connected layer and an output layer. Appropriate numbers of layers and convolutional kernels, together with the other parameters, are selected to achieve a better prediction effect.
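For concreteness, a minimal PyTorch sketch of such a network follows. The layer counts, kernel sizes and the (4, *M*) channel layout of the input are assumptions for the example; the paper tunes its own values.

```python
import torch.nn as nn

class StabilityCNN(nn.Module):
    """Binary stable/unstable classifier over (batch, 4, M) inputs,
    where the 4 channels are U, f, P and Q."""
    def __init__(self, n_channels=4, n_points=20):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(n_channels, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool1d(2),
        )
        # Two pooling layers shrink the time axis by a factor of 4.
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * (n_points // 4), 2),  # logits: unstable / stable
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```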

Step 5: Offline training and online evaluation.

The combination of offline training and online evaluation is shown in Figure 3. According to the transient stability rule, the transient stability assessment model is obtained by offline training on historical or simulation data. Then, real-time data is fed into the trained model for online testing, and the stability assessment results are obtained.

**Figure 3.** Combination model of offline training and online evaluation.
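The two phases could be organized roughly as below; the optimizer, loss and loop structure are conventional choices assumed for illustration.

```python
import torch
import torch.nn as nn

def train_offline(model, loader, epochs=10, lr=1e-3):
    """Offline phase: fit the assessment model on labeled historical
    or simulation samples."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for x, y in loader:  # x: (batch, 4, M), y: 0/1 stability labels
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

@torch.no_grad()
def assess_online(model, x_live):
    """Online phase: feed real-time PMU windows to the trained model
    and return the stability verdicts (1 = stable)."""
    model.eval()
    return model(x_live).argmax(dim=1)
```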

### *3.2. Reactive Power Decision Optimization Algorithm*

Based on the judgement of the transient voltage stability state, the reactive power decision optimization algorithm proceeds as follows:

Step 1: State perception.

The output data is obtained through BPA. During a characteristic time period *T*, the voltage *U*, frequency *f*, active power *P* and reactive power *Q* of each key node are selected to form the input sample matrix of the model, which serves as the current state *s*.

Step 2: Stability prediction.

Using the judgement method of Section 3.1, the state information perceived in Step 1 is taken as the input of the model, and the output indicates whether the grid will restore stability within a certain time period. This stability output serves as an important basis for calculating the reward value in the subsequent deep reinforcement learning approach.

Step 3: Capture of action.

The locations of the SVGs and the compensation value of each SVG are used as the action *a* of the operator *Agent* in the deep reinforcement learning algorithm. The action value is acquired according to the setting mode in the effective action collection and then converted into one-hot form.
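The one-hot conversion of a discrete action index might look like the following; the mapping from SVG locations and compensation values to indices in the effective action collection is assumed.

```python
import numpy as np

def action_to_one_hot(action_index, n_actions):
    """Encode one entry of the effective action collection (an SVG
    placement / compensation-value combination) as a one-hot vector."""
    one_hot = np.zeros(n_actions)
    one_hot[action_index] = 1.0
    return one_hot
```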

Step 4: Perception of the next state.

Given the state *s* perceived in Step 1, the positions and compensation values of the SVGs are set in BPA by executing the action *a* obtained in Step 3. The next state *s*′ is then obtained by running the simulation.

Step 5: Reward value setting.

There are two goals for reactive power optimization. The first is to enable the grid to recover within a certain time period after a short-circuit fault occurs. The other is to use distributed reactive power compensation, so as to reduce the compensation value required of each individual reactive power compensator. The calculation rule of the reward value *r* is set in conjunction with the stability prediction from Step 2 and the action value acquired in Step 3.
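The paper's exact reward rule is not reproduced here; the sketch below only illustrates one plausible shape combining the two goals, with purely illustrative constants.

```python
def reward(stable, compensation_values, penalty_weight=0.01):
    """Hypothetical reward: a bonus when the grid is predicted to
    recover stability (Step 2), minus a term that grows with the total
    reactive compensation so the Agent prefers small, distributed
    compensation across the SVGs."""
    recovery_term = 1.0 if stable else -1.0
    compensation_term = penalty_weight * sum(abs(q) for q in compensation_values)
    return recovery_term - compensation_term
```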

Step 6: Experience replay.

The collected states, actions, rewards and other data are stored in the self-defined database *memory\_replay*. During training, mini-batches are drawn from it at random. In this way, the dependency between consecutive observations is avoided, and the operator *Agent* is not dominated by its most recent experience, which would otherwise cause what happened before to be "forgotten". Weakening the correlation between samples improves the efficiency of data usage; therefore, the algorithm converges more easily and its generalization ability is improved.
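A minimal buffer along these lines (class and method names hypothetical) could be:

```python
import random
from collections import deque

class MemoryReplay:
    """Fixed-capacity experience store (the self-defined memory_replay
    database). Uniform random mini-batches break the temporal
    correlation between consecutive transitions."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)
```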

Step 7: Training of *Q* network.

The CNN is used to fit the *Q* value function. Using the experience replay technique of Step 6, mini-batches are randomly drawn from the database *memory\_replay* for training.

The goal is to obtain the action combination with the highest *Q* value and to output this value.
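A single fitting step of such a *Q* network, assuming a conventional DQN setup with a periodically synchronized target network (an assumption; the paper may organize this differently), might look like:

```python
import torch
import torch.nn as nn

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One mini-batch update of the Q network from memory_replay."""
    states, actions, rewards, next_states = batch
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Bootstrapped target: reward plus the discounted highest
        # Q value over the next state's actions.
        q_target = rewards + gamma * target_net(next_states).max(dim=1).values
    loss = nn.functional.mse_loss(q_pred, q_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def best_action(q_net, state):
    """Output the action with the highest Q value for a single state."""
    with torch.no_grad():
        return q_net(state.unsqueeze(0)).argmax(dim=1).item()
```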
