#### 6.1.2. Convergence Analysis

Figure 3 illustrates the convergence of the *Q*-value deviation for Q(λ) learning and MCR-Q(λ) learning under scenario 1, where the *Q*-value deviation is defined as the 2-norm of the matrix (*Q*<sub>*k*+1</sub> − *Q*<sub>*k*</sub>), that is, ‖*Q*<sub>*k*+1</sub> − *Q*<sub>*k*</sub>‖<sub>2</sub>. As shown in Figure 3a, the *Q* matrix of single-objective Q(λ) learning is large and its updating speed is slow; the algorithm can still converge to the optimal *Q*\* matrix after extensive trial-and-error exploration, but the convergence time is about 530 s. In contrast, after the dimension of the solution space is reduced in MCR-Q(λ) learning, the *Qi* matrix corresponding to each variable is very small and 20 objectives are updated simultaneously. Its optimization speed is therefore more than 100 times that of Q(λ) learning, and it converges after about 3.5 s, as shown in Figure 3b. Moreover, the convergence of the objective function values in Figure 4 confirms that MCR-Q(λ) learning optimizes much faster, while both algorithms converge to the global optimal solution.

**Figure 3.** *Q*-value difference convergence: (**a**) Q(λ) learning; (**b**) MCR-Q(λ) learning.

**Figure 4.** Convergence process of the objective function value.
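For reference, the deviation metric above can be computed directly from two successive *Q* tables. The following is a minimal sketch assuming the tables are stored as NumPy arrays; the function names, the tolerance value, and the stopping rule are illustrative rather than taken from the original implementation.

```python
import numpy as np

def q_deviation(q_prev: np.ndarray, q_next: np.ndarray) -> float:
    """Matrix 2-norm of the change between successive Q tables, i.e. ||Q_{k+1} - Q_k||_2."""
    return float(np.linalg.norm(q_next - q_prev, ord=2))

def has_converged(q_prev: np.ndarray, q_next: np.ndarray, tol: float = 1e-4) -> bool:
    """Hypothetical stopping rule: convergence is declared once the deviation drops below tol."""
    return q_deviation(q_prev, q_next) < tol
```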

When MCR-Q(λ) learning converges, the value function matrix *Qi* and probability matrix *Pi* of each variable settle on a preferred state-action pair, and all individuals tend to select the same action consistently, as demonstrated in Figure 5.

**Figure 5.** Convergent results of state-action pairs by MCR-Q(λ) learning.
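The consistency shown in Figure 5 can be checked numerically: once a probability matrix *Pi* concentrates on one action, independently sampled individuals almost always pick that action. The sketch below assumes *Pi* is available as a probability vector over actions for the current state; the function name and the number of individuals are illustrative.

```python
import numpy as np

def action_consistency(p_i: np.ndarray, n_individuals: int = 20, seed: int = 0) -> float:
    """Sample one action per individual from the state's action-probability vector p_i
    and return the share of individuals that pick the modal action.
    Values close to 1.0 reflect the consistent selection observed after convergence."""
    rng = np.random.default_rng(seed)
    actions = rng.choice(len(p_i), size=n_individuals, p=p_i)
    counts = np.bincount(actions, minlength=len(p_i))
    return counts.max() / n_individuals
```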

#### 6.1.3. Comparative Analysis of Simulation Results

To evaluate the optimization capability of MCR-Q(λ) learning, this section applies all the algorithms to the OCECF model for 10 independent repetitions. For each method, the objective function value is used directly to assess solution quality during the search, since it is the most important index of optimization performance.
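As a simple illustration of this protocol, the per-run objective values of one algorithm could be summarized as below; this is a generic sketch, and the function and field names are not from the original study.

```python
import numpy as np

def summarize_runs(objective_values) -> dict:
    """Mean, best, worst, and sample standard deviation of the objective
    function value over repeated independent runs (e.g., 10 repetitions)."""
    vals = np.asarray(objective_values, dtype=float)
    return {
        "mean": float(vals.mean()),
        "best": float(vals.min()),   # smaller objective is better in the OCECF model
        "worst": float(vals.max()),
        "std": float(vals.std(ddof=1)),
    }
```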

Table 5 indicates the average convergence results of 10 repetitions for the different algorithms, and it can be found that:



**Table 5.** Average results of different algorithms on the IEEE 118-bus system in 10 runs.

Figure 6 compares the results of the different methods, where each value is the average over the 10 runs of the sum over the five scenarios. It is obvious that the result obtained by GA is the worst among all the methods due to its premature convergence. On the other hand, although the proposed MCR-Q(λ) learning achieves only a slight improvement on each index compared with the other methods, it still obtains the lowest total carbon flow loss and the lowest objective function value. This verifies that the proposed method can effectively satisfy the low-carbon requirement from the viewpoint of the power network.


**Figure 6.** Comparison of results obtained by different methods in the IEEE 118-bus system.
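The aggregation used for Figure 6 (summing each index over the five scenarios, then averaging over the 10 runs) can be expressed compactly as follows; this is a minimal sketch with an assumed runs-by-scenarios array layout, not the authors' code.

```python
import numpy as np

def scenario_sum_run_average(values: np.ndarray) -> float:
    """values[r, s] holds one index (e.g., total carbon flow loss) for run r and scenario s.
    Sum over the five scenarios for each run, then average over the 10 runs."""
    per_run_sum = values.sum(axis=1)   # sum over scenarios
    return float(per_run_sum.mean())   # average over runs
```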

Lastly, Table 6 gives the statistical convergence results of the 10 repetitions for the different algorithms, and it can be found that:


**Table 6.** Distribution statistics of the objective function under different algorithms in the IEEE 118-bus system in 10 runs.


#### *6.2. Case Study of the IEEE 300-Bus System*
