### *2.2. PSC Effect*

In general, a PV system needs to ensure a certain output voltage; however, a single PV cell can only output a very low voltage (approximately 0.6 V). Hence, PV cells are connected in series to form a string so as to raise the output voltage. When the array is partially shaded, the output voltage of the shaded PV cells becomes lower than that of the unshaded cells, due to the reduced solar irradiation they receive. Consequently, the shaded PV cells consume part of the generated power. This phenomenon causes a large loss of output power in the PV string. In addition, it leads to hot spots at the locations of the shaded PV cells, which greatly shortens their service life [27].

To solve this issue, the shaded PV cells are usually bypassed by bypass diodes. Figure 1a illustrates the operation of a PV array with parallel strings. Although bypass diodes effectively mitigate the above issues of shaded PV cells, they introduce a new problem: they distort the original *P–V* characteristic curve of the string and produce a curve with two peaks. The situation becomes even more complicated when several PV strings are connected in parallel to obtain a larger output current. In general, when the number of shaded PV cells differs from string to string, each string generates a different *P–V* curve. Because of the parallel connection, these multi-peak curves combine into the multi-peak curve illustrated in Figure 1b. Hence, to harvest the maximum solar energy from the PV array, the PV system ought to operate at the GMPP at all times; otherwise, a considerable amount of energy is lost at an LMPP.
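For illustration, the multi-peak behaviour can be reproduced with a simplified series-string model in which each module follows an ideal single-diode characteristic and is protected by a bypass diode. The sketch below is indicative only: the module parameters, photo-currents, and the 0.5 V bypass drop are assumed values, not data from this work.

```python
import numpy as np

# Minimal sketch of a series PV string with per-module bypass diodes under PSC.
# Hypothetical module parameters (not taken from the paper).
N_CELLS = 36        # series cells per module
V_T = 0.026         # thermal voltage per cell [V]
I_0 = 1e-9          # diode saturation current [A]
R_S = 0.2           # series resistance per module [ohm]
V_BYPASS = -0.5     # forward drop of a conducting bypass diode [V]

def module_voltage(i_string, i_ph):
    """Voltage of one module at string current i_string (ideal single-diode model)."""
    arg = (i_ph - i_string) / I_0 + 1.0
    v = np.where(arg > 1.0,
                 N_CELLS * V_T * np.log(np.maximum(arg, 1.0)) - i_string * R_S,
                 V_BYPASS)
    # The bypass diode conducts whenever the module voltage would drop below -0.5 V.
    return np.maximum(v, V_BYPASS)

# Three series modules: one fully irradiated, two partially shaded (photo-currents in A).
i_ph_modules = [5.0, 3.0, 1.5]
i_string = np.linspace(0.0, max(i_ph_modules), 2000)

v_string = sum(module_voltage(i_string, i_ph) for i_ph in i_ph_modules)
p_string = v_string * i_string

# The resulting P-V curve has one local peak per distinct irradiation level, so a
# conventional MPPT may lock onto an LMPP instead of the GMPP.
print(f"GMPP ~ {p_string.max():.1f} W at V ~ {v_string[p_string.argmax()]:.1f} V")
```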

**Figure 1.** Partial shading conditions (PSC) effect. (**a**) Power–voltage (*P*–*V*) curve under uniform solar irradiation and temperature and (**b**) *P*–*V* curve under PSC.

### **3. Transfer Reinforcement Learning with Space Decomposition**

The proposed TRL mainly contains two operators, i.e., RL via continuous interaction with the environment and knowledge transfer between the previous and new tasks, as illustrated in Figure 2.

**Figure 2.** Principle of transfer reinforcement learning (TRL) with space decomposition.

### *3.1. Space Decomposition Based Reinforcement Learning*

RL is a widely used machine learning technique that can acquire new knowledge in a dynamic environment via interaction. Here, the well-known RL algorithm Q-learning is adopted to learn the MPPT knowledge. However, if a system requires high control accuracy, the searching space of a continuous control variable must be divided into a large number of selectable actions (e.g., 10<sup>6</sup> actions for a continuous control variable between 0 and 1). As a result, conventional Q-learning [28] easily encounters the curse of dimensionality and low learning efficiency when selecting an optimal action for a continuous control variable [29].
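For reference, tabular Q-learning updates its action-value table with the standard Watkins rule. The sketch below uses placeholder hyper-parameters and an illustrative state/action discretisation (none of these values come from the paper) to make the dimensionality issue concrete.

```python
import numpy as np

# Minimal sketch of the standard tabular Q-learning update (Watkins' Q-learning).
# ALPHA, GAMMA, EPSILON are illustrative hyper-parameters, not values from the paper.
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1   # learning rate, discount factor, exploration rate

def q_update(q_table, s, a, reward, s_next):
    """One step: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = reward + GAMMA * np.max(q_table[s_next])
    q_table[s, a] += ALPHA * (td_target - q_table[s, a])

def epsilon_greedy(q_table, s, rng):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < EPSILON:
        return int(rng.integers(q_table.shape[1]))
    return int(np.argmax(q_table[s]))

# Discretising a duty cycle in [0, 1] at 1e-6 resolution would require 1e6 actions,
# i.e. a Q-table with num_states x 1e6 entries -- the dimensionality problem noted above.
num_states, num_actions = 50, 10          # tractable sizes in the spirit of the decomposition
q_table = np.zeros((num_states, num_actions))

rng = np.random.default_rng(0)
a = epsilon_greedy(q_table, s=0, rng=rng)
q_update(q_table, s=0, a=a, reward=1.0, s_next=1)
```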

To handle this problem, space decomposition is introduced to decompose the large original searching space into multi-layered, smaller searching subspaces. As illustrated in Figure 3, the optimization space of the *i*th controllable variable *x<sub>i</sub>* is decomposed into *J* smaller searching subspaces in each layer. If the *j*th action *a<sub>i</sub><sup>1j</sup>* is selected in the first layer's searching space, the agent then searches within the corresponding, more accurate searching space of the second layer. Therefore, the optimization accuracy of the control variable *x<sub>i</sub>* can be calculated as

$$OA_{i} = \frac{x_{i}^{\mathrm{ub}} - x_{i}^{\mathrm{lb}}}{J^{c}} \tag{4}$$

where *c* represents the number of decomposition layers; *J* is the number of subspaces (i.e., selectable actions) in each layer; and *x<sub>i</sub><sup>lb</sup>* and *x<sub>i</sub><sup>ub</sup>* are the lower and upper bounds of the *i*th controllable variable, respectively.

**Figure 3.** Knowledge transfer of TRL between two adjacent tasks for maximum power point tracking (MPPT) with PSC.

Based on Equation (4), if the number of actions in each layer is set to 10 (i.e., *J* = 10), then the same accuracy (10<sup>−6</sup>) can be achieved for a continuous control variable between 0 and 1 when *c* = 6. This means that the number of selectable actions can be significantly reduced from 10<sup>6</sup> to 10 per layer. Therefore, the learning rate and control accuracy of Q-learning can be considerably improved through space decomposition.
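As a quick numerical check of Equation (4) under the notation above (assuming *J* actions per layer and *c* layers):

```python
# Quick check of Equation (4): resolution achieved by c layers of J subspaces each.
x_lb, x_ub = 0.0, 1.0           # bounds of the continuous control variable
J, c = 10, 6                    # 10 actions per layer, 6 decomposition layers

accuracy = (x_ub - x_lb) / J**c
print(accuracy)                 # 1e-06, the same resolution as a flat set of 10^6 actions
```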

After selecting all the actions in all the layers, the solution of the controllable variable can be identified as

$$x_{i} = x_{i}^{c,\mathrm{lb}} + a_{i}^{cj} \left( x_{i}^{c,\mathrm{ub}} - x_{i}^{c,\mathrm{lb}} \right) / J \tag{5}$$

$$x_{i}^{l,\mathrm{lb}} = \begin{cases} x_{i}^{\mathrm{lb}}, & \text{if } l = 1\\ x_{i}^{l-1,\mathrm{lb}} + a_{i}^{l-1,j} \cdot \dfrac{x_{i}^{l-1,\mathrm{ub}} - x_{i}^{l-1,\mathrm{lb}}}{J}, & \text{otherwise} \end{cases} \tag{6}$$

$$x_{i}^{l,\mathrm{ub}} = \begin{cases} x_{i}^{\mathrm{ub}}, & \text{if } l = 1\\ x_{i}^{l-1,\mathrm{ub}} + a_{i}^{l-1,j} \cdot \dfrac{x_{i}^{l-1,\mathrm{ub}} - x_{i}^{l-1,\mathrm{lb}}}{J}, & \text{otherwise} \end{cases} \tag{7}$$

where *x<sub>i</sub><sup>l,lb</sup>* and *x<sub>i</sub><sup>l,ub</sup>* are the lower and upper bounds of the *l*th layer's searching space, respectively, while *a<sub>i</sub><sup>lj</sup>* is the *j*th action in the *l*th layer's searching space.
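To make Equations (5)–(7) concrete, the sketch below decodes one action index per layer into a value of *x<sub>i</sub>*, assuming actions indexed from 0 to *J* − 1 and each selected subspace spanning 1/*J* of its parent interval; the function name and the example action sequence are illustrative only, not part of the paper.

```python
def decode_actions(actions, x_lb, x_ub, J):
    """Map one action index per layer (each in 0..J-1) to a value in [x_lb, x_ub].

    Each layer splits the current interval into J equal subspaces (cf. Eqs. (6)-(7));
    the last layer's action places the solution inside the final subspace (cf. Eq. (5)).
    """
    lb, ub = x_lb, x_ub
    for a in actions[:-1]:                      # narrow the interval layer by layer
        width = (ub - lb) / J
        lb, ub = lb + a * width, lb + (a + 1) * width
    return lb + actions[-1] * (ub - lb) / J     # final placement within the last subspace

# Example: J = 10 actions per layer, c = 6 layers -> 1e-6 resolution on [0, 1].
x = decode_actions([3, 7, 0, 5, 9, 2], x_lb=0.0, x_ub=1.0, J=10)
print(f"{x:.6f}")   # 0.370592
```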
