**1. Introduction**

Nonlinear programming problems are very common in power system operation, including reactive power optimization (RPO) [1], unit commitment (UC) [2], and economic dispatch (ED) [3]. To tackle such problems, several optimization approaches have been adopted, such as the Newton method [4], quadratic programming [5], and the interior-point method [6]. However, these methods are essentially gradient-based mathematical optimization methods, which highly depend on an accurate system model. In the presence of nonlinearities, discontinuous functions, and constraints, many local minima usually exist, and the algorithm may easily fall into one of them [7].

In the past decades, artificial intelligence (AI) [8–18] has been widely used as an effective alternative because of its independence from an accurate system model and its strong global optimization ability. Inspired by the nectar-gathering behavior of bees in nature, the artificial bee colony (ABC) algorithm [19] has been applied to optimal distributed generation allocation [8], global maximum power point (GMPP) tracking [20], multi-objective UC [21], and so on; it has the merits of a simple structure, high robustness, strong universality, and efficient local search.

However, the ABC mainly relies on simple collective intelligence without self-learning or knowledge transfer, a weakness shared by AI algorithms such as the genetic algorithm (GA) [9], particle swarm optimization (PSO) [10], the group search optimizer (GSO) [11], the ant colony system (ACS) [12], the interactive teaching–learning optimizer (ITLO) [13], the grouped grey wolf optimizer (GGWO) [14], the memetic salp swarm algorithm (MSSA) [15], dynamic leader-based collective intelligence (DLCI) [16], and evolutionary algorithms (EA) [17]. A relatively low search efficiency may therefore result, particularly when a new optimization task of a complex industrial system is considered [22], e.g., the optimization of a large-scale power system with different complex new tasks. In fact, the computational time of these algorithms can be effectively reduced for RPO or optimal power flow (OPF) via external equivalents of some areas (e.g., distribution networks) [23], because the optimization scale and difficulty decrease significantly as the number of optimization variables and constraints decreases. However, the optimization results are then largely determined by the accuracy of the external equivalent model [24] and are usually worse than those obtained by global optimization. Hence, this paper aims to propose a fast and efficient AI algorithm for global optimization.

Previous studies [25] discovered that bees have evolved an instinct to memorize the weather conditions favorable to their preferred flowers, e.g., temperature, air humidity, and illumination intensity, which can rapidly guide bees to the best nectar source in a new environment with high probability; the survival and prosperity of the whole species in various environments can thus be guaranteed by exploiting such knowledge. This natural phenomenon, resulting from the struggle for existence in a harsh and unknown environment, can be regarded as knowledge transfer, which has been widely investigated in machine learning and data mining [26]. In practice, prior knowledge is extracted from source tasks and then applied to a new but analogous task, so that fewer training data are needed to achieve a higher learning efficiency [27], e.g., learning to ride a bicycle before learning to ride a motorcycle. In fact, knowledge transfer-based optimization is essentially a knowledge-based, history data-driven method [28], which can accelerate the optimization of a new task by exploiting prior knowledge. Likewise, reinforcement learning (RL) can be accelerated by knowledge transfer [29], such that agents learn new tasks faster and interact less with the environment. As a consequence, knowledge transfer reinforcement learning (KTRL) has been developed [30] by combining AI and behavioral psychology [31]; it can be divided into behavior transfer and information transfer.

In this paper, behavior transfer is used with Q-learning [32] to accelerate the learning of a new task, which is called *Q*-value transfer; the *Q*-value matrix serves for knowledge learning, storage, and transfer. However, the practical application of conventional Q-learning is restricted to new tasks of small size due to its computational burden. To overcome this obstacle, an associated state-action chain is introduced, by which the solution space is decomposed into several low-dimensional solution spaces. On this basis, this paper proposes a transfer bee optimizer (TBO) based on Q-learning and behavior transfer. The main novelties and contributions of this work are given as follows:


The remainder of this paper is organized as follows: Section 2 presents the basic principles of the TBO, Section 3 designs the TBO for RPO, Section 4 shows the simulation results, and Section 5 concludes the paper.

#### **2. Transfer Bees Optimizer**

#### *2.1. State-Action Chain*

Q-learning typically finds and learns different state-action pairs through a look-up table, i.e., *Q*(*s*, *a*), but this is insufficient for a complex task with multiple controllable variables because of the curse of dimensionality, as illustrated in Figure 1a. Suppose the number of optional actions of the controlled variable $x_i$ is $m_i$; then $|A| = m_1 m_2 \cdots m_n$, where $n$ is the number of controlled variables and $A$ is the action set. As $n$ increases, the dimension of the *Q*-value matrix grows very fast, so convergence becomes slow and may even fail. Hierarchical reinforcement learning (HRL) [33] is commonly used to avoid this obstacle by decomposing the original complex task into several simpler subtasks, e.g., MAXQ [34]. However, it easily falls into a near-optimum of the overall task because it is difficult to design and coordinate all the subtasks.

In contrast, the TBO decomposes the whole solution space into several low-dimensional solution spaces via an associated state-action chain. In this framework, each controlled variable has its own memory matrix $Q^i$, so the size of each *Q*-value matrix is significantly reduced, storage and transfer become convenient, and the controlled variables are linked to each other, as shown in Figure 1b.
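The memory saving of the state-action chain can be illustrated with a small sketch (not from the paper's code; the function names are ours): a joint Q-table needs $m^n$ entries per state, while the chain keeps $n$ separate $m \times m$ matrices.

```python
# Illustrative sketch: memory footprint of a joint Q-table versus the
# per-variable Q-matrices of the associated state-action chain.
# Assumes n controlled variables, each with m discrete actions.

def joint_q_entries(m, n):
    # Conventional Q-learning: one table over the joint action set,
    # |A| = m1 * m2 * ... * mn = m^n entries per state.
    return m ** n

def chained_q_entries(m, n):
    # State-action chain: each variable x_i keeps its own matrix Q^i whose
    # states are the previous variable's actions, i.e., an m x m table.
    return n * m * m

print(joint_q_entries(5, 10))    # 9765625 joint entries
print(chained_q_entries(5, 10))  # 250 chained entries
```

Even for a modest case (10 variables, 5 actions each), the chain shrinks the storage from millions of entries to a few hundred, which is why each $Q^i$ remains convenient to store and transfer.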

**Figure 1.** Comparison of Q-learning and transfer bee optimizer (TBO).

#### *2.2. Knowledge Learning and Behavior Transfer*

#### 2.2.1. Knowledge learning

Figure 2 shows that all bees search for a nectar source according to their prior knowledge (*Q*-value matrix); the acquired knowledge is then updated after each interaction with the nectar source, so that a cycle of knowledge learning and conscious exploration is completed. As shown in Figure 1a, traditional Q-learning usually adopts a single RL agent to acquire knowledge [18,35], which causes inefficient learning. In contrast, the TBO adopts swarm collaborative exploration for knowledge learning, which updates multiple elements of the *Q*-value matrix at the same time and thus greatly improves the learning efficiency; it can be described as [32]

$$\begin{aligned} Q_{k+1}^{i}(s_k^{ij}, a_k^{ij}) &= Q_k^{i}(s_k^{ij}, a_k^{ij}) + \alpha \Big[ R^{ij}(s_k^{ij}, s_{k+1}^{ij}, a_k^{ij}) \\ &\quad + \gamma \max_{a^i \in A_i} Q_k^{i}(s_{k+1}^{ij}, a^i) - Q_k^{i}(s_k^{ij}, a_k^{ij}) \Big], \\ &\quad j = 1, 2, \dots, J; \; i = 1, 2, \dots, n \end{aligned} \tag{1}$$

where α is the knowledge learning factor; γ is the discount factor; the superscripts *i* and *j* denote the *i*th *Q*-value matrix (i.e., the *i*th controlled variable) and the *j*th bee, respectively; *J* is the population size of the bees; $(s_k, a_k)$ is a state-action pair at the *k*th iteration; $R(s_k, s_{k+1}, a_k)$ is the reward function of the transition from state $s_k$ to $s_{k+1}$ under the selected action $a_k$; $a^i$ is an optional action of the *i*th controlled variable $x_i$; and $A_i$ is the action set of the *i*th controlled variable.
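The swarm-collaborative update of Equation (1) can be sketched as follows; this is a hedged illustration, not the authors' implementation, and the data layout (`transitions[i]` as a list of tuples) is our assumption.

```python
import numpy as np

def swarm_q_update(Q, transitions, alpha=0.8, gamma=0.05):
    """One TBO-style learning step, Equation (1): every bee j updates every
    per-variable matrix Q[i] using its own transition.
    Q            : list of n arrays, Q[i] of shape (num_states, num_actions)
    transitions  : transitions[i] holds (s, a, s_next, r) tuples for all J bees
    alpha, gamma : knowledge learning factor and discount factor (illustrative)."""
    for i, Qi in enumerate(Q):
        for (s, a, s_next, r) in transitions[i]:
            # TD target: reward plus discounted best value of the next state.
            td_target = r + gamma * Qi[s_next].max()
            Qi[s, a] += alpha * (td_target - Qi[s, a])
    return Q
```

Because every bee contributes one transition per variable at each iteration, up to $J \times n$ elements of the memory matrices are refreshed simultaneously, which is the source of the claimed learning speed-up over a single agent.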

**Figure 2.** The principle of knowledge learning of the TBO inspired by the nectar gathering of bees.

#### 2.2.2. Behavior transfer

In the initial learning process, the TBO needs to work through a series of source tasks to obtain their optimal *Q*-value matrices, so as to prepare prior knowledge for similar new tasks in the future; the relevant prior knowledge of the source tasks is shown in Figure 3. According to the similarities to the source tasks, the optimal *Q*-value matrix of the source tasks, $Q_{\rm S}^{*}$, is transferred to form the initial *Q*-value matrix of the new task, $Q_{\rm N}^{0}$, as

$$Q_{\rm N}^{0} = \begin{bmatrix} r_{11} & r_{12} & \cdots & r_{1h} & \cdots \\ r_{21} & r_{22} & \cdots & r_{2h} & \cdots \\ \vdots & \vdots & \ddots & \vdots & \cdots \\ r_{y1} & r_{y2} & \cdots & r_{yh} & \cdots \\ \vdots & \vdots & \cdots & \vdots & \ddots \end{bmatrix} Q_{\rm S}^{*} \tag{2}$$

with

$$Q_{\rm N}^{0} = \begin{bmatrix} Q_{\rm N1}^{10} & \cdots & Q_{\rm N1}^{i0} & \cdots & Q_{\rm N1}^{n0} \\ Q_{\rm N2}^{10} & \cdots & Q_{\rm N2}^{i0} & \cdots & Q_{\rm N2}^{n0} \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ Q_{\rm Ny}^{10} & \cdots & Q_{\rm Ny}^{i0} & \cdots & Q_{\rm Ny}^{n0} \\ \vdots & \vdots & \vdots & \vdots & \vdots \end{bmatrix}, \quad Q_{\rm S}^{*} = \begin{bmatrix} Q_{\rm S1}^{1*} & \cdots & Q_{\rm S1}^{i*} & \cdots & Q_{\rm S1}^{n*} \\ Q_{\rm S2}^{1*} & \cdots & Q_{\rm S2}^{i*} & \cdots & Q_{\rm S2}^{n*} \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ Q_{\rm Sh}^{1*} & \cdots & Q_{\rm Sh}^{i*} & \cdots & Q_{\rm Sh}^{n*} \\ \vdots & \vdots & \vdots & \vdots & \vdots \end{bmatrix}$$

where $Q_{\rm Ny}^{i0}$ is the *i*th initial *Q*-value matrix of the *y*th new task; $Q_{\rm Sh}^{i*}$ is the *i*th optimal *Q*-value matrix of the *h*th source task; and $r_{yh}$, with $0 \le r_{yh} \le 1$, is the similarity between the *h*th source task and the *y*th new task; a large $r_{yh}$ indicates that the *y*th new task can gain much knowledge from the *h*th source task.
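Equation (2) amounts to a similarity-weighted combination of the source matrices; a minimal sketch (assuming all tasks share matrix shapes, and with names of our choosing):

```python
import numpy as np

def transfer_initial_q(Q_source, r):
    """Equation (2): initialize the new task's Q-matrices as a
    similarity-weighted combination of the source tasks' optimal matrices.
    Q_source[h][i] : i-th optimal Q-matrix of the h-th source task
    r[h]           : similarity r_yh of the new task y to source task h."""
    n = len(Q_source[0])            # number of controlled variables
    Q0 = []
    for i in range(n):
        # Row y of the similarity matrix times column i of Q_S^*.
        Q0.append(sum(r[h] * Q_source[h][i] for h in range(len(Q_source))))
    return Q0
```

Each row of the similarity matrix in Equation (2) produces one new task's initial matrices; the sketch computes a single such row.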

**Figure 3.** Behavior transfer of the TBO.

#### *2.3. Exploration and Feedback*

#### 2.3.1. Action policy

There are two kinds of bees in Figure 2, i.e., scouts and workers, determined by their nectar amounts (fitness values) [19]; they are responsible for global search and local search, respectively. As a consequence, a bee colony can balance exploration and exploitation through the different action policies at a nectar source. In the TBO, the 50% of bees whose nectar amounts rank in the top half of all bees are designated as workers, while the others are scouts. On the basis of the ε-*greedy* rule [31], the scouts select their actions according to the proportion of *Q*-values in the current state. For the controlled variable $x_i$, the behavior of each scout is chosen as

$$a_{k+1}^{ij} = \begin{cases} \arg\max\limits_{a^i \in A_i} Q_{k+1}^{i}(s_{k+1}^{ij}, a^i), & \text{if } \varepsilon \le \varepsilon_0 \\ a_{\rm rg}, & \text{otherwise} \end{cases} \tag{3}$$

where ε is a random number uniformly distributed in [0, 1]; $\varepsilon_0$ is the exploration rate; and $a_{\rm rg}$ is a random global action determined by the state-action probability matrix $P^i$, which is updated by

$$\begin{cases} P^i(s^i, a^i) = \dfrac{e^i(s^i, a^i)}{\sum\limits_{a^i \in A_i} e^i(s^i, a^i)} \\[2ex] e^i(s^i, a^i) = \dfrac{1}{\max\limits_{a^i \in A_i} Q^i(s^i, a^i) - \beta Q^i(s^i, a^i)} \end{cases} \tag{4}$$

where β is the discrepancy factor, and $e^i$ is the evaluation matrix of the state-action pairs.
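The scout policy of Equations (3) and (4) can be sketched as below; this is an illustration under the assumption that the *Q*-values of the current state are positive (so the evaluation in Equation (4) is well defined), and the function name is ours.

```python
import numpy as np

def scout_action(Qi, s, eps0=0.9, beta=0.5, rng=np.random.default_rng()):
    """Scout policy, Equations (3)-(4): with probability eps0 take the greedy
    action; otherwise sample a global action from the state-action
    probability matrix P^i built from the evaluation e^i."""
    if rng.random() <= eps0:
        # Greedy branch of Equation (3).
        return int(np.argmax(Qi[s]))
    # Equation (4): evaluation of each candidate action; the greedy action
    # gets the largest e (smallest gap max Q - beta*Q), assuming Q > 0.
    e = 1.0 / (Qi[s].max() - beta * Qi[s])
    p = e / e.sum()
    return int(rng.choice(len(Qi[s]), p=p))
```

A larger β stretches the differences among the sampling probabilities, so better-valued actions are drawn more often even in the exploratory branch.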

On the other hand, the workers keep exploring new nectar sources near the current ones, which can be written as [8]

$$\begin{cases} a_{\rm new}^{ij} = a^{ij} + {\rm Rnd}(0,1)\,(a^{ij} - a_{\rm rand}^{ij}) \\ a_{\rm rand}^{ij} = a_{\min}^{i} + {\rm Rnd}(0,1)\,(a_{\max}^{i} - a_{\min}^{i}) \end{cases} \tag{5}$$

where $a_{\rm new}^{ij}$, $a^{ij}$, and $a_{\rm rand}^{ij}$ denote the new action, current action, and random action of the *i*th controlled variable selected by the *j*th worker, respectively; $a_{\max}^{i}$ and $a_{\min}^{i}$ are the maximum and minimum actions, respectively, of the *i*th controlled variable.
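A minimal sketch of the worker step in Equation (5); the clipping back into the action bounds is our assumption (Equation (5) itself does not state how out-of-range actions are handled), and the function name is illustrative.

```python
import numpy as np

def worker_action(a, a_min, a_max, rng=np.random.default_rng()):
    """Worker local search, Equation (5): perturb the current action a
    toward/away from a random action within the variable's bounds,
    then clip the result back into [a_min, a_max] (our assumption)."""
    a_rand = a_min + rng.random() * (a_max - a_min)   # random action
    a_new = a + rng.random() * (a - a_rand)           # local perturbation
    return float(np.clip(a_new, a_min, a_max))
```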

#### 2.3.2. Reward function

After each exploration, each bee gets an instant reward based on its fitness value. Since the goal of the TBO is to maximize the expected long-term reward of each state [28], the reward function is designed as

$$R^{ij}(s_k^{ij}, s_{k+1}^{ij}, a_k^{ij}) = \frac{C}{f_k^{j}} \tag{6}$$

where *C* is a positive multiplier, and $f_k^{j}$ is the fitness function value of the *j*th bee at the *k*th iteration, which is closely related to the objective function.

After each bee obtains its new reward, the scouts and workers swap their roles according to the rank of the obtained rewards; precisely, the 50% of bees with larger rewards become workers. As a result, a compromise is reached between exploration and exploitation, ensuring a global search capability for the TBO.

#### **3. Design of the TBO for RPO**

### *3.1. Mathematical Model of RPO*


As a subproblem of OPF, conventional RPO aims to minimize the active power loss or another appropriate objective function by optimizing different types of controlled variables (e.g., transformer tap ratios) under multiple equality and inequality constraints [35]. In this article, an objective function combining the active power loss and the deviation of the supply voltage is used, as follows [18]

$$\min f(x) = \mu P_{\rm loss} + (1 - \mu) V_{\rm d} \tag{7}$$

subject to

$$\begin{cases} P_{{\rm G}i} - P_{{\rm D}i} - V_i \sum\limits_{j \in N_i} V_j (g_{ij} \cos\theta_{ij} + b_{ij} \sin\theta_{ij}) = 0, & i \in N_0 \\ Q_{{\rm G}i} - Q_{{\rm D}i} - V_i \sum\limits_{j \in N_i} V_j (g_{ij} \sin\theta_{ij} - b_{ij} \cos\theta_{ij}) = 0, & i \in N_{\rm PQ} \\ Q_{{\rm G}i}^{\min} \le Q_{{\rm G}i} \le Q_{{\rm G}i}^{\max}, & i \in N_{\rm G} \\ V_i^{\min} \le V_i \le V_i^{\max}, & i \in N_i \\ Q_{{\rm C}i}^{\min} \le Q_{{\rm C}i} \le Q_{{\rm C}i}^{\max}, & i \in N_{\rm C} \\ T_k^{\min} \le T_k \le T_k^{\max}, & k \in N_{\rm T} \\ |S_l| \le S_l^{\max}, & l \in N_{\rm L} \end{cases} \tag{8}$$

where $x = [V_{\rm G}, T_k, Q_{\rm C}, V_{\rm L}, Q_{\rm G}]$ is the variable vector; $V_{\rm G}$ is the terminal voltage of the generators; $T_k$ is the transformer tap ratio; $Q_{\rm C}$ is the reactive power of the shunt capacitors; $V_{\rm L}$ is the load-bus voltage; $Q_{\rm G}$ is the reactive power of the generators; $P_{\rm loss}$ is the active power loss; $V_{\rm d}$ is the deviation of the supply voltage; $0 \le \mu \le 1$ is the weight coefficient; $P_{{\rm G}i}$ and $Q_{{\rm G}i}$ are the generated active and reactive power, respectively; $P_{{\rm D}i}$ and $Q_{{\rm D}i}$ are the demanded active and reactive power, respectively; $Q_{{\rm C}i}$ is the reactive power compensation; $V_i$ and $V_j$ are the voltage magnitudes of the *i*th and *j*th nodes, respectively; $\theta_{ij}$ is the voltage phase difference; $g_{ij}$ and $b_{ij}$ are the conductance and susceptance of transmission line *i–j*, respectively; $S_l$ is the apparent power of transmission line *l*; $N_i$ is the node set; $N_0$ is the set of slack buses; $N_{\rm PQ}$ is the set of active/reactive power (PQ) buses; $N_{\rm G}$ is the generator set; $N_{\rm C}$ is the set of compensation equipment; $N_{\rm T}$ is the set of transformer taps; and $N_{\rm L}$ is the branch set. In addition, the active power loss and the deviation of the supply voltage can be computed by [18]

$$P_{\rm loss} = \sum_{i,j \in N_{\rm L}} g_{ij} \left( V_i^2 + V_j^2 - 2 V_i V_j \cos\theta_{ij} \right) \tag{9}$$

$$V_{\rm d} = \sum_{i \in N_i} \left| \frac{2 V_i - V_i^{\max} - V_i^{\min}}{V_i^{\max} - V_i^{\min}} \right| \tag{10}$$
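Equations (9) and (10) are straightforward to evaluate from a solved power flow; a minimal sketch (data layout and function names are ours):

```python
import numpy as np

def active_power_loss(lines, V, theta):
    """Equation (9): total active power loss over the branch set.
    lines : list of (i, j, g_ij) branch tuples
    V     : bus-voltage magnitudes (p.u.)
    theta : bus-voltage angles (radians)."""
    return sum(g * (V[i]**2 + V[j]**2 - 2 * V[i] * V[j] * np.cos(theta[i] - theta[j]))
               for (i, j, g) in lines)

def voltage_deviation(V, V_min, V_max):
    """Equation (10): cumulative normalized deviation of bus voltages
    from the midpoint of their allowed range [V_min, V_max]."""
    V, V_min, V_max = map(np.asarray, (V, V_min, V_max))
    return float(np.abs((2 * V - V_max - V_min) / (V_max - V_min)).sum())
```

A bus sitting exactly at the midpoint of its range contributes zero to $V_{\rm d}$, while a bus at either limit contributes one.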

#### *3.2. Design of the TBO*

#### 3.2.1. Design of state and action

The terminal voltage of the generators, the transformer tap ratios, and the reactive power compensation of the shunt capacitors were chosen as the controlled variables of RPO, in which each controlled variable has its own *Q*-value matrix $Q^i$ and action set $A_i$, as shown in Figure 1. In addition, the action set of each controlled variable serves as the state set of the next controlled variable, i.e., $S_{i+1} = A_i$, where the state of the first controlled variable depends on the RPO task; thus, each task can be regarded as a specific state of $x_1$.

#### 3.2.2. Design of the reward function

As seen from Equation (6), the reward function is determined by the fitness function, which must represent the overall objective function (7) and satisfy all the constraints (8). Hence, the fitness function is designed as

$$f^{j} = \mu P_{\rm loss}^{j} + (1 - \mu) V_{\rm d}^{j} + \eta q^{j} \tag{11}$$

where $q^j$ is the sum of the inequality-constraint violations, and η is the penalty factor.
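Equation (11) can be sketched as a one-line penalized fitness; the parameter values below are illustrative placeholders, not the paper's settings.

```python
def fitness(p_loss, v_dev, violations, mu=0.5, eta=1e4):
    """Equation (11): penalized fitness of one bee.
    p_loss     : active power loss P_loss^j of the bee's solution
    v_dev      : voltage deviation V_d^j
    violations : sum of inequality-constraint violations q^j
    mu, eta    : weight coefficient and penalty factor (illustrative)."""
    return mu * p_loss + (1 - mu) * v_dev + eta * violations
```

Since the reward in Equation (6) is $C / f_k^j$, a constraint-violating solution (large $q^j$) receives a small reward and is naturally discouraged.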

#### 3.2.3. Behavior transfer for RPO

According to Equation (2), the transfer efficiency of the TBO mainly depends on the similarity between the source tasks and the new tasks [30]. In fact, RPO is principally determined by the power flow distribution, which is mainly influenced by the power demand, since the topology of a power system changes little from day to day. Therefore, the active power demand is divided into several load intervals, as follows

$$\left\{ \left[ P_{\rm D}^{1}, P_{\rm D}^{2} \right], \left[ P_{\rm D}^{2}, P_{\rm D}^{3} \right], \dots, \left[ P_{\rm D}^{h}, P_{\rm D}^{h+1} \right], \dots, \left[ P_{\rm D}^{H-1}, P_{\rm D}^{H} \right] \right\} \tag{12}$$

where $P_{\rm D}^{h}$ is the active power demand of the *h*th source task for RPO, with $P_{\rm D}^{1} < P_{\rm D}^{2} < \cdots < P_{\rm D}^{h} < \cdots < P_{\rm D}^{H}$.

Suppose the active power demand of the *y*th new task is $P_{\rm D}^{y}$, with $P_{\rm D}^{1} < P_{\rm D}^{y} < P_{\rm D}^{H}$; then, the similarity between the two tasks is computed by

$$r_{yh} = \frac{\left( W + \Delta P_{\rm D}^{\max} \right) - \left| P_{\rm D}^{h} - P_{\rm D}^{y} \right|}{\sum_{h=1}^{H} \left[ \left( W + \Delta P_{\rm D}^{\max} \right) - \left| P_{\rm D}^{h} - P_{\rm D}^{y} \right| \right]} \tag{13}$$

$$\Delta P_{\rm D}^{\max} = \max_{h=1,2,\dots,H} \left( \left| P_{\rm D}^{h} - P_{\rm D}^{y} \right| \right) \tag{14}$$

where *W* is the transfer factor, and $\Delta P_{\rm D}^{\max}$ is the maximum deviation of the active power demand; note that $\sum_{h=1}^{H} r_{yh} = 1$.

Note that a smaller deviation $\left| P_{\rm D}^{h} - P_{\rm D}^{y} \right|$ yields a larger similarity $r_{yh}$; therefore, the *y*th new task can exploit more knowledge from the *h*th source task.
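Equations (13) and (14) can be sketched as a short routine; the function name and the value of *W* are illustrative.

```python
import numpy as np

def task_similarity(P_src, P_new, W=100.0):
    """Equations (13)-(14): similarity r_yh between a new task with demand
    P_new and every source task with demands P_src, using transfer factor W.
    Returns an array of similarities that sums to 1."""
    P_src = np.asarray(P_src, dtype=float)
    dev = np.abs(P_src - P_new)        # |P_D^h - P_D^y| for each source task
    dP_max = dev.max()                 # Equation (14)
    score = (W + dP_max) - dev         # numerator of Equation (13), always > 0
    return score / score.sum()         # normalize so the r_yh sum to 1
```

The closest source task in demand receives the largest weight, and *W* > 0 keeps every score positive so that even the most distant source task contributes a small amount of knowledge.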

Therefore, the overall process of TBO behavior transfer is summarized as follows:

Step 1. Determine the source tasks according to a typical daily load curve by Equation (12);

Step 2. Complete the source tasks in the initial learning process and store their optimal *Q*-value matrices;

Step 3. Calculate the similarity between the source tasks and the new task according to the deviation of the power demand by Equations (13) and (14);

Step 4. Obtain the initial *Q*-value matrices of the new task by Equation (2).

#### 3.2.4. Parameters setting

For the TBO, eight parameters, α, γ, *J*, ε0, β, *C*, η, and *W* are important and need to be set following the general guidelines below [18,26,32,35].

Among these, α, with 0 < α < 1, is the knowledge learning factor, which determines the rate of knowledge acquisition of the bees. A larger α accelerates knowledge learning but may lead to a local optimum, while a smaller value improves the algorithm's stability [35].

The discount factor γ, with 0 < γ < 1, exponentially discounts the future rewards accumulated in the *Q*-value matrix. Since the future return is insignificant for RPO, γ should take a value close to zero [18].

*J* is the number of bees, with *J* ≥ 1; it determines the convergence rate and the solution quality. A larger *J* lets the algorithm approximate the global optimum more accurately but results in a larger computational burden [26].

The exploration rate $\varepsilon_0$, with $0 < \varepsilon_0 < 1$, balances the exploration and exploitation of a nectar source by the scouts. A larger $\varepsilon_0$ encourages the scouts to pick the greedy action rather than a random action drawn from the state-action probability matrix.

The discrepancy factor, 0 < β < 1, increases the differences among the elements of each row in *Q*-value matrices.

The positive multiplier *C* > 0 scales the fitness function within the reward function. A larger *C* encourages the bees to pursue larger rewards, while the differences in rewards among the bees become smaller.

The penalty factor η > 0 ensures the satisfaction of the inequality constraints; a smaller η may lead to infeasible solutions [18]. Finally, *W* ≥ 0 is the transfer factor, which scales the similarity between two tasks.

The parameters were selected through trial and error, as shown in Table 1.


**Table 1.** Parameters used in TBO.

#### 3.2.5. Execution procedure of the TBO for RPO

Finally, the overall implementation of the TBO is shown in Figure 4, where $\left\| Q_{k+1}^{i} - Q_{k}^{i} \right\|_2 < \sigma$ is the convergence criterion on the *Q*-value change (matrix 2-norm), with precision factor σ = 0.001.
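The stopping rule can be sketched in a few lines (function name is ours; the criterion is applied to every controlled variable's matrix, which is our reading of Figure 4):

```python
import numpy as np

def has_converged(Q_prev, Q_next, sigma=0.001):
    """Stopping rule of the TBO: the matrix 2-norm of the Q-value change
    must fall below the precision factor sigma for every controlled
    variable's memory matrix."""
    return all(np.linalg.norm(Qn - Qp, 2) < sigma
               for Qp, Qn in zip(Q_prev, Q_next))
```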

**Figure 4.** Flowchart of TBO for reactive power optimization (RPO).

#### **4. Case Studies**

The TBO for RPO was assessed on the IEEE 118-bus system and the IEEE 300-bus system and compared with ABC [8], GSO [11], ACS [12], PSO [10], GA [9], the quantum genetic algorithm (QGA) [36], and ant colony based Q-learning (Ant-Q) [37]. The main parameters of the other algorithms were obtained through trial and error and set according to reference [38], while the weight coefficient μ in Equation (7) balances the active power loss against the deviation of the supply voltage. The simulations were executed in Matlab 7.10 on a personal computer with an Intel(R) Xeon(R) E5-2670 v3 CPU at 2.3 GHz and 64 GB of RAM.

The IEEE 118-bus system consists of 54 generators and 186 branches and has three voltage levels, i.e., 138 kV, 161 kV, and 345 kV. The IEEE 300-bus system consists of 69 generators and 411 branches, with 13 voltage levels, i.e., 0.6 kV, 2.3 kV, 6.6 kV, 13.8 kV, 16.5 kV, 20 kV, 27 kV, 66 kV, 86 kV, 115 kV, 138 kV, 230 kV, and 345 kV [39–41]. The number of controlled variables was 25 in the IEEE 118-bus system and 111 in the IEEE 300-bus system. More specifically, the reactive power compensation was discretized into five levels around the rated level, [−40%, −20%, 0%, 20%, 40%], the transformer taps into three levels, [0.98, 1.00, 1.02], and the terminal voltage of the generators was uniformly discretized into [1.00, 1.01, 1.02, 1.03, 1.04, 1.05, 1.06].

According to the typical daily load curves given in Figure 5, the active power demand was discretized into 20 and 22 load intervals of 125 MW and 500 MW each, respectively, i.e., {[3500, 3625), [3625, 3750), ..., [5875, 6000]} MW and {[19,000, 19,500), [19,500, 20,000), ..., [28,500, 29,000]} MW. Moreover, the execution period of RPO was set at 15 min; hence, the number of new tasks per day was 96, while the number of source tasks was 21 for the IEEE 118-bus system and 23 for the IEEE 300-bus system.

**Figure 5.** Typical daily load curves of the IEEE 118-bus system and IEEE 300-bus system.

#### *4.1. Study of the Pre-Learning Process*

The TBO requires a pre-learning process to obtain the optimal *Q*-value matrices of all the source tasks, which are then transferred to form the initial *Q*-values. Figures 6 and 7 illustrate that the TBO converges to the optimal *Q*-value matrices of the source tasks as the optimal objective function is obtained. Furthermore, the convergence of all the *Q*-value matrices was consistent, since the same feedback rewards from the same bees were used to update every *Q*-value matrix at each iteration.


**Figure 6.** Convergence of the seventh source task of the IEEE 118-bus system obtained in the pre-learning process.

**Figure 7.** Convergence of the eighth source task of the IEEE 300-bus system obtained in the pre-learning process.

#### *4.2. Study of Online Optimization*

#### 4.2.1. Study of behavior transfer

After the pre-learning process, the TBO is ready for the online optimization of RPO under different scenarios (different new tasks) with prior knowledge. The convergence of the objective functions obtained by different algorithms in the online optimization is given in Figures 8 and 9. Compared with the pre-learning process, the convergence speed of the TBO in the online optimization was approximately 10 to 20 times higher, which verifies the effectiveness of knowledge transfer. Furthermore, the convergence rate of the TBO was much faster than that of the other algorithms owing to the exploitation of prior knowledge.

**Figure 8.** Convergence of the second new task of the IEEE 118-bus system obtained in the online optimization.

**Figure 9.** Convergence of the fourth new task of the IEEE 300-bus system obtained in the online optimization.

#### 4.2.2. Comparative results and discussion

Tables 2 and 3 list the performance indices and statistical results obtained by these algorithms in 10 runs, in which the convergence time is the average of each scenario, and the other indices are averages over one day. The variance, standard deviation (Std. Dev.), and relative standard deviation (Rel. Std. Dev.) [42–44] were introduced to assess stability. The convergence rate of the TBO was clearly faster than that of the other algorithms, as illustrated in Figure 10: about 4 to 68 times faster, indicating the benefit of cooperative exploration and behavior transfer. In addition, the optimal objective function obtained by the TBO was much smaller than that of the other algorithms, which verifies the advantage of self-learning and global search. Note that the TBO obtained a better solution, closer to the global optimum, than the other algorithms in most new tasks (72.92% of the new tasks on the IEEE 118-bus system and 89.58% on the IEEE 300-bus system), as shown in Figures 11 and 12.


**Figure 10.** Comparison of performance of different algorithms obtained in 10 runs.

**Figure 11.** Optimal objective function of the IEEE 118-bus system obtained by different algorithms in 10 runs.

**Figure 12.** Optimal objective function of the IEEE 300-bus system obtained by different algorithms in 10 runs.

**Table 2.** Performance indices results of different algorithms on the IEEE 118-bus system obtained in 10 runs. ABC: artificial bee colony, GSO: group search optimizer, ACS: ant colony system, PSO: particle swarm optimization, GA: genetic algorithm, QGA: quantum genetic algorithm, Ant-Q: ant colony based Q-learning


**Table 3.** Performance indices results of different algorithms on the IEEE 300-bus system obtained in 10 runs.


On the other hand, Tables 2 and 3 show that the convergence stability of the TBO was the highest, as the values of all of its stability indices were the lowest; its Rel. Std. Dev. was only 17.39% of that of PSO on the IEEE 300-bus system and at most 75.79% of that of ABC on the IEEE 118-bus system. This results from the exploitation of prior knowledge by the scouts and workers, which beneficially reduces the randomness of the global search and thus achieves a higher search efficiency.

#### **5. Conclusions**

In this article, a novel TBO incorporating behavior transfer was proposed for RPO. Like other AI algorithms, the TBO does not depend on an accurate system model and has a much stronger global search ability and higher application flexibility than traditional mathematical optimization methods. Compared with methods based on simplified network models (e.g., external equivalent models), the TBO also converges rapidly to an optimum of RPO, but it obtains a higher-quality optimum via global optimization. By introducing Q-learning-based optimization, the TBO can learn, store, and transfer knowledge between different optimization tasks, in which the state-action chain significantly reduces the size of the *Q*-value matrices, and the cooperative exploration between scouts and workers dramatically accelerates knowledge learning. Compared with general AI algorithms, the greatest advantage of the TBO is that it significantly accelerates the convergence for a new task by reusing prior knowledge from the source tasks. In simulation comparisons on the IEEE 118-bus system and the IEEE 300-bus system, the convergence rate of the TBO was 4 to 68 times faster than that of existing AI algorithms for RPO, while the quality and convergence stability of the optimal solutions were ensured. Thanks to its superior optimization performance, the TBO can readily be applied to other nonlinear programming problems of large-scale power systems.

**Author Contributions:** Preparation of the manuscript was performed by H.C., T.Y., X.Z., B.Y., and Y.W.

**Funding:** This research was funded by [National Natural Science Foundation of China] grant number [51777078], and [Yunnan Provincial Basic Research Project-Youth Researcher Program] grant number [2018FD036], and The APC was funded by [National Natural Science Foundation of China].

**Conflicts of Interest:** The authors declare no conflict of interest.
