#### *4.2.3. Action Selection*

As the individuals explore and learn, each of them repeatedly faces an action selection. When individual *j* prepares to determine the variable $x_i$, its action is selected according to the following rule [41]:

$$a_{k+1}^{ij} = \begin{cases} \operatorname*{argmax}_{a^i \in A_i} \mathbb{Q}_{k+1}^i\left(s_{k+1}^{ij}, a^i\right), & \text{if } q \le q_0 \\ a_S, & \text{otherwise} \end{cases} \tag{18}$$

where *q* is a random number; $q_0$ is a positive constant that determines the probability of pseudo-random selection; and $a_S$ denotes the action determined by the pseudo-random selection. In this paper, the roulette wheel selection method is adopted to determine $a_S$ according to the distribution given by the action probability matrix $P_k^i$, which is calculated as follows:

$$P_{k+1}^{i}\left(s_{k+1}^{ij}, a_{k+1}^{i}\right) = \frac{\mathbb{Q}_{k+1}^{i}\left(s_{k+1}^{ij}, a_{k+1}^{i}\right)}{\sum_{a^{i} \in A_{i}} \mathbb{Q}_{k+1}^{i}\left(s_{k+1}^{ij}, a^{i}\right)} \tag{19}$$
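
The following is a minimal Python sketch of this selection rule; the function name `select_action`, the array `Q_row`, and the non-negativity of the Q-values (needed for Equation (19) to define a valid probability distribution) are illustrative assumptions, not details from the paper:

```python
import numpy as np

def select_action(Q_row, q0, rng=None):
    """Pseudo-random action selection following Eqs. (18)-(19).

    Q_row -- 1-D array of Q-values Q(s, a) over the action set A_i for
             the current state (assumed non-negative).
    q0    -- threshold constant deciding greedy vs. roulette selection.
    """
    if rng is None:
        rng = np.random.default_rng()
    q = rng.random()                      # the random number q in Eq. (18)
    if q <= q0:
        return int(np.argmax(Q_row))      # greedy branch: argmax_a Q(s, a)
    # Roulette-wheel branch: sample an action according to Eq. (19),
    # P(s, a) = Q(s, a) / sum_{a'} Q(s, a').
    probs = Q_row / Q_row.sum()
    return int(rng.choice(len(Q_row), p=probs))
```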

When an individual finds the best objective function value so far, the probability of the corresponding state-action pair is increased, which attracts other individuals to perform the same action. When the algorithm converges, all individuals select the same state-action pair for every variable from the first to the last.

#### **5. OCECF Based on MCR-Q(λ) Learning**

#### *5.1. Design of State and Action*

As mentioned above, the action space of each variable is designed to be the state space of the next variable, and the state space of the first variable is the state set of the environment (i.e., the power grid). For OCECF, the power grid load scenario can be designed as the state of the first variable: the load profile is divided into 15-min scenarios, and scenarios with similar loads are assigned to the same state. For example, the load scenarios at 11:00 a.m. and 11:15 a.m., which have different loads, are regarded as two different states.
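
As a hedged illustration of this state design, the sketch below bins a 15-min load sample into one of a few discrete load levels; the bin edges and load values are hypothetical placeholders, not values from the paper:

```python
import numpy as np

def load_to_state(load_mw, bin_edges):
    """Map a 15-min grid-load sample (MW) to a discrete state index;
    scenarios whose loads fall in the same bin share a state."""
    return int(np.digitize(load_mw, bin_edges))

# Hypothetical discretization into five load levels (MW thresholds).
bin_edges = np.array([200.0, 400.0, 600.0, 800.0])
s1 = load_to_state(530.0, bin_edges)   # e.g., load at 11:00 a.m. -> state 2
s2 = load_to_state(610.0, bin_edges)   # e.g., load at 11:15 a.m. -> state 3
# Different load levels yield different states for the first variable.
```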

In addition, OCECF mainly optimizes the carbon emissions on the power grid side, and the variables in the model fall into two categories: (a) the reactive power compensation devices and (b) the OLTC ratio. Thus, the action set of each variable is a discrete set of optional reactive power compensation amounts or transformer tap-changer ratios, as sketched below.
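
A brief sketch of such discrete action sets follows; the compensation step size and the tap-ratio range are illustrative assumptions only:

```python
import numpy as np

# Hypothetical discrete actions for each variable type.
capacitor_mvar = np.arange(0.0, 5.1, 0.5)      # reactive compensation steps (Mvar)
oltc_ratio = 1.0 + 0.0125 * np.arange(-4, 5)   # nine OLTC tap positions

# Per the chained design above, the action space of variable i also
# serves as the state space of variable i + 1.
action_sets = [capacitor_mvar, capacitor_mvar, oltc_ratio]
```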

#### *5.2. Design of Reward Function*

As shown in Equation (17), $L_{\text{Best}}$ represents the optimal objective function value over all individuals. According to the OCECF model described by Equation (5), the inequality constraints are incorporated into the objective function as a penalty term, and the objective function value obtained by individual *j* becomes [41]

$$L^{j} = \mu_1 f_1(\mathbf{x}^{j}) + \mu_2 f_2(\mathbf{x}^{j}) + (1 - \mu_1 - \mu_2)V_d^{j} + N^{j} \tag{20}$$

$$L_{\text{Best}} = \min_{j = 1, \dots, J} L^j \tag{21}$$

where $N^j$ denotes the number of unsatisfied inequality constraints found by the power flow calculation after individual *j* determines its variables, and *J* is the number of groups.
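
For concreteness, a small Python sketch of Equations (20) and (21) is given below; the weights, objective values, and violation counts are placeholder numbers used only to show the computation:

```python
def fitness(f1_val, f2_val, vd, n_violations, mu1, mu2):
    """Penalized objective L^j of Eq. (20).

    f1_val, f2_val -- the two objective values f1(x^j), f2(x^j);
    vd             -- voltage-deviation term V_d^j;
    n_violations   -- N^j, the count of inequality constraints violated
                      in the power-flow result (the penalty term).
    """
    return mu1 * f1_val + mu2 * f2_val + (1.0 - mu1 - mu2) * vd + n_violations

# Eq. (21): L_Best is the minimum penalized objective over all individuals.
# Each tuple holds (f1, f2, Vd, N) for one individual (illustrative numbers).
population = [(1.20, 0.85, 0.05, 0), (1.10, 0.95, 0.07, 2), (1.05, 0.90, 0.04, 1)]
L_best = min(fitness(*p, mu1=0.4, mu2=0.4) for p in population)
```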
