*3.2. Knowledge Update*

According to the learning mechanism of Q-learning, the knowledge matrix is updated from the executed state–action pair and the resulting feedback reward. Combined with the space decomposition, the knowledge matrix of each searching-space layer can be updated as [28]:

$$\begin{split} \mathbf{Q}_{i,k}^{l}\left\{ s_{i,k}^{l}, a_{i,k}^{l} \right\} &= \mathbf{Q}_{i,k}^{l}\left\{ s_{i,k}^{l}, a_{i,k}^{l} \right\} \\ &\quad + \alpha \left[ R_{i,k}^{l}\left\{ s_{i,k}^{l}, s_{i,k+1}^{l}, a_{i,k}^{l} \right\} + \gamma \max_{a \in \mathcal{A}_{i}^{l}} \mathbf{Q}_{i,k}^{l}\left\{ s_{i,k+1}^{l}, a \right\} - \mathbf{Q}_{i,k}^{l}\left\{ s_{i,k}^{l}, a_{i,k}^{l} \right\} \right] \end{split} \tag{8}$$

where $\mathbf{Q}_{i}^{l}$ represents the knowledge matrix of the *l*th searching-space layer for the *i*th controllable variable; $\left( s_{i,k}^{l}, a_{i,k}^{l} \right)$ is the state–action pair executed at the *k*th iteration, with $k = 1, 2, \dots, k_{\max}$; $k_{\max}$ represents the maximum iteration number; $\alpha$ is the knowledge learning factor, with $\alpha \in (0, 1)$; $\gamma$ denotes the discount factor, with $\gamma \in (0, 1)$; $R_{i}^{l}$ is the reward function; and $\mathcal{A}_{i}^{l}$ denotes the action space of the *l*th layer's searching space.
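For illustration, the following is a minimal Python sketch of the single-agent update in Equation (8), assuming a tabular knowledge matrix over discretized states and actions; the identifiers (`q_update`, `n_states`, `n_actions`) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def q_update(Q, s, a, s_next, reward, alpha=0.1, gamma=0.9):
    """One-element update of the knowledge matrix Q for a single
    executed state-action pair (s, a), as in Equation (8)."""
    td_target = reward + gamma * np.max(Q[s_next, :])  # greedy one-step lookahead
    Q[s, a] += alpha * (td_target - Q[s, a])           # temporal-difference step
    return Q

# Example: one knowledge matrix for one searching-space layer
# (state/action space sizes are placeholder values).
n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))
Q = q_update(Q, s=2, a=1, s_next=3, reward=0.5)
```

Note that each call touches exactly one entry `Q[s, a]`, which is the bottleneck motivating the swarm extension below.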

It can be seen from Equation (8) that only one element of each knowledge matrix is updated at each iteration, since conventional Q-learning employs a single RL agent for exploration and exploitation of the dynamic environment. This results in a slow learning rate, so a high-quality optimal solution cannot be obtained rapidly enough for real-time control of PV systems. Hence, a cooperative swarm is employed to accelerate the learning rate, as it can simultaneously update multiple elements of each knowledge matrix with multiple state–action pairs. Similar to Equation (8), each knowledge matrix of TRL can be updated by [30]

$$\begin{split} \mathbf{Q}_{i,k}^{l}\left\{ s_{i,k}^{l,m}, a_{i,k}^{l,m} \right\} &= \mathbf{Q}_{i,k}^{l}\left\{ s_{i,k}^{l,m}, a_{i,k}^{l,m} \right\} \\ &\quad + \alpha \left[ R_{i,k}^{l,m}\left\{ s_{i,k}^{l,m}, s_{i,k+1}^{l,m}, a_{i,k}^{l,m} \right\} + \gamma \max_{a \in \mathcal{A}_{i}^{l}} \mathbf{Q}_{i,k}^{l}\left\{ s_{i,k+1}^{l,m}, a \right\} - \mathbf{Q}_{i,k}^{l}\left\{ s_{i,k}^{l,m}, a_{i,k}^{l,m} \right\} \right], \\ &\qquad m = 1,\, 2,\, \dots,\, M \end{split} \tag{9}$$

where *M* represents the population size of the cooperative swarm.
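Under the same assumptions as the previous sketch, a minimal illustration of the cooperative-swarm update in Equation (9): *M* agents each contribute one state–action transition per iteration against the shared knowledge matrix, so up to *M* elements are refreshed at once. The randomly generated transitions below are placeholders standing in for the agents' actual exploration.

```python
import numpy as np

def swarm_q_update(Q, transitions, alpha=0.1, gamma=0.9):
    """transitions: list of M tuples (s, a, s_next, reward), one per agent.
    Each agent applies the same TD rule as Equation (8) to the shared Q."""
    for s, a, s_next, r in transitions:
        td_target = r + gamma * np.max(Q[s_next, :])
        Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Example with a swarm of M = 5 agents (placeholder sizes and transitions).
M = 5
n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)
transitions = [
    (rng.integers(n_states), rng.integers(n_actions),
     rng.integers(n_states), rng.random())
    for _ in range(M)
]
Q = swarm_q_update(Q, transitions)
```

Because all agents write into the same knowledge matrix, one iteration of the swarm propagates roughly *M* times as much feedback as the single-agent rule, which is the acceleration claimed for Equation (9).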
