*4.2. Reward Function*

For a given output voltage $V_{\text{pv}}$, the PV system generates the corresponding power under the current solar irradiation, temperature, and PSC. In TRL, the higher the quality of a solution, the larger the reward its individual receives. Based on this rule, the reward function can be designed as [30]:

$$R_{i,k}^{l,m}\left(s_{i,k}^{l,m}, s_{i,k+1}^{l,m}, a_{i,k}^{l,m}\right) = \begin{cases} \max\limits_{m=1,2,\dots,M} f\left(V_{\text{PV}}^{m}\right), & \text{if } \{s_{i,k}^{l,m}, a_{i,k}^{l,m}\} \in \text{SA}_k^{\text{best}} \\ 0, & \text{otherwise} \end{cases} \tag{12}$$

where $V_{\text{pv}}^{m}$ is the solution obtained by the $m$th individual and $\text{SA}_k^{\text{best}}$ denotes the set of state–action pairs explored by the best individual, i.e., the individual with the maximum power output at the $k$th iteration.
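The rule in Eq. (12) can be illustrated with a minimal Python sketch. It is not the implementation used in [30]. It assumes that $f$ returns the PV output power for a given voltage, that each state–action pair can be represented as a hashable tuple, and that ties for the best individual are broken by taking the first one. The names `rewards_for_iteration` and `toy_power`, and the quadratic power curve, are hypothetical and serve only as an example.

```python
from typing import Callable, Hashable, List, Sequence, Tuple

# Assumption: a state-action pair is a hashable (state, action) tuple.
StateAction = Tuple[Hashable, Hashable]


def rewards_for_iteration(
    voltages: Sequence[float],
    explored_pairs: Sequence[Sequence[StateAction]],
    power: Callable[[float], float],
) -> List[List[float]]:
    """Assign rewards for one iteration k, following the rule of Eq. (12).

    voltages[m]       -- solution obtained by the m-th individual
    explored_pairs[m] -- state-action pairs explored by the m-th individual
    power             -- fitness function f (assumed: PV output power)
    """
    if not voltages or len(voltages) != len(explored_pairs):
        raise ValueError("need one non-empty list of pairs per individual")

    powers = [power(v) for v in voltages]
    best_power = max(powers)
    # Assumption: on ties, the first individual with the best power is used.
    best_index = powers.index(best_power)
    best_set = set(explored_pairs[best_index])  # plays the role of SA_k^best

    # A pair earns the best power if it belongs to SA_k^best, else 0.
    return [
        [best_power if pair in best_set else 0.0 for pair in pairs]
        for pairs in explored_pairs
    ]


def toy_power(v: float) -> float:
    """Hypothetical single-peak power curve with its maximum at v = 30."""
    return 900.0 - (v - 30.0) ** 2
```

With three individuals at 20 V, 30 V, and 35 V, the second one is the best under `toy_power`, so every pair that also appears in its explored set receives the best power as reward, including pairs visited by the other individuals. All remaining pairs receive zero.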
