**Milad Hooshyar 1, S. Jamshid Mousavi 2, Masoud Mahootchi <sup>3</sup> and Kumaraswamy Ponnambalam 4,\***


Received: 13 June 2020; Accepted: 23 September 2020; Published: 25 September 2020

**Abstract:** Stochastic dynamic programming (SDP) is a widely-used method for reservoir operations optimization under uncertainty but suffers from the dual curses of dimensionality and modeling. Reinforcement learning (RL), a simulation-based stochastic optimization approach, can nullify the curse of modeling that arises from the need to calculate a very large transition probability matrix. RL mitigates the curse of dimensionality but cannot resolve it completely, as it remains computationally intensive in complex multi-reservoir systems. This paper presents a multi-agent RL approach combined with an aggregation/decomposition (AD-RL) method for reducing the curse of dimensionality in multi-reservoir operation optimization problems. In this model, each reservoir is individually managed by a specific operator (agent) while cooperating with the other agents to find a near-optimal operating policy for the whole system. Each agent makes a decision (release) based on its current state and the feedback it receives from the states of all upstream and downstream reservoirs. The method, along with an efficient artificial neural network-based robust procedure for tuning the Q-learning parameters, has been applied to a real-world five-reservoir problem, i.e., the Parambikulam–Aliyar Project (PAP) in India. We demonstrate that the proposed AD-RL approach helps to derive operating policies that are better than or comparable with the policies obtained by other stochastic optimization methods, with less computational burden.

**Keywords:** multireservoir operations; optimization; multi-agent reinforcement learning; aggregation–decomposition; neural networks

#### **1. Introduction**

Multi-reservoir optimization models are generally non-linear, non-convex, and large-scale in terms of the number of variables and constraints. Moreover, uncertainties in stochastic variables such as inflows, evaporation, and demands make it difficult to find even a sub-optimal operating policy. In general, two types of stochastic programming approaches are used to optimize multi-reservoir systems operations under uncertainty: implicit and explicit. In implicit stochastic optimization (ISO), a large number of historical or synthetically generated sequences of random variables, such as streamflows, serve as the inputs to a deterministic optimization model. The generated results represent different aspects of the underlying stochastic process, such as spatial or temporal correlations among the random variables involved in the process. Optimal operation policies are then acquired by performing post-processing analysis on the outputs of the deterministic optimization model solved for the different input sequences (samples).

Different optimization models have been used in water reservoir problems through ISO, including linear programming (LP) [1,2], successive LP [3], successive quadratic programming [4], the method of multipliers [5], and dynamic programming (DP) [6]. ISO is computationally intensive in large multi-reservoir systems, especially highly non-linear ones, because solving a large-scale non-linear optimization problem is often tedious and time-consuming. More importantly, this approach does not directly lead to optimal operating rules as functions of the hydrological variables available at the time of decision making; therefore, a post-processing technique such as multiple linear regression [7], neural networks [8], or fuzzy rule-based modeling [9–11] may be used to derive operating policies.

In explicit stochastic optimization (ESO), probability distributions of random variables are used to derive a transition probability matrix (TPM) required for modeling the dynamics of the underlying stochastic process. Several stochastic optimization methods have been applied to optimal multi-reservoir operations problems, such as chance-constrained programming [12,13], reliability programming [14–16], the Fletcher–Ponnambalam (FP) method that does not require discretization or inflow scenarios [17–19], stochastic LP [20], stochastic dynamic programming (SDP) with some approximations for dimensionality reduction [21–35], stochastic dual DP (SDDP) [36–40], sampling SDP (SSDP) [41], and reinforcement learning (RL) [6,42,43].

SDP is one of the most widely-used explicit stochastic methods in reservoir operations. It requires a model to derive the TPM, and the optimal steady-state policy it yields is guaranteed to be globally optimal for a given discretization. However, the application of SDP is limited since perfect knowledge of the underlying stochastic model is needed, and the computational effort increases exponentially with the size of the system or the number of state variables (the curse of dimensionality). More details on SDP are presented later in the corresponding section.

In order to address the drawback of deriving the TPM for SDP, RL can be applied in a framework similar to SDP but in a simulation-based setup. In RL, an agent (operator) learns to take proper actions through independent interaction with a stochastic environment. In one of the first applications of RL to water management in the literature, Bhattacharya et al. [44] developed a controller for a complex water system in the Netherlands using a coupled artificial neural network (ANN) and RL model, where RL is used only to mitigate the error of the ANN. They also stated that RL and its combinations could be very helpful in water management problems such as reservoir operations, but such works did not appear in the literature until 2007. Lee and Labadie [6] compared the performance of RL with some other optimization techniques in a two-reservoir case study. They used all possible actions as the set of admissible actions. This choice can make the learning method inefficient, as some action-state pairs are practically impossible (infeasible), an issue that is dealt with in detail later. Mahootchi et al. [42] applied a method called opposition-based RL (OBRL) in a reservoir case study, where the agent takes an action and its opposite action simultaneously in order to speed up the convergence of RL. Castelletti et al. [45] developed a new method based on RL and tree-based regression, called tree-based RL (TBRL), which uses operational data through the learning process. They applied TBRL to a system consisting of one reservoir and nine run-of-the-river hydro plants. In comparison with SDP, TBRL is computationally more tractable and has better performance, especially during floods. More recent applications of RL to multireservoir operations optimization can be seen in [46,47].

Although RL is capable of eliminating the need for prior perfect knowledge of the underlying stochastic model of the environment, as the learning can be performed using historical operational data, and also provides some computational advantages, it still suffers from the curse of dimensionality for large-scale problems. As a remedy, this paper presents an RL-based model combined with an aggregation–decomposition (AD) approach (referred to as AD-RL) that reduces the dimensionality problem to efficiently solve a stochastic multi-reservoir operation optimization problem.

#### **2. Aggregation–Decomposition Methods and Reinforcement Learning**

#### *2.1. Reservoir Operation Optimization Model*

In the following, the objective function and constraints for the optimization model of the reservoir operations are described.

#### 2.1.1. Objective Function

In reservoir problems, releases and/or storage values of reservoirs are the decision variables and the objective function may be defined as the maximization of net benefit or minimization of a penalty function. A general form of the objective function in a multi-reservoir application can be written as:

$$\max F = \sum\_{i=1}^{N} \sum\_{j=1}^{N} \sum\_{t=1}^{T} f\_{ij}^{t}\left(s\_{i}^{t}, R\_{ij}^{t}, D\_{i}^{t}\right) \tag{1}$$

where $f\_{ij}^{t}$ is the cost or revenue function, which can depend on $s\_{i}^{t}$ (the storage volume of reservoir *i* at the beginning of time period *t*), $R\_{ij}^{t}$ (the release from reservoir *i* to reservoir *j* in time period *t*), and $D\_{i}^{t}$ (the demand to be met by reservoir *i* in time period *t*); *N* denotes the number of reservoirs and *T* the number of time periods.

#### 2.1.2. Constraints

There are three types of constraint in reservoir operations. The first type is balance equations which are in fact equality constraints referring to the conservation of mass (in appropriate units) with respect to inflow to and outflow from the reservoirs as expressed below:

$$s\_{i}^{t+1} = s\_{i}^{t} + I\_{i}^{t} - v\_{i}^{t} - \sum\_{l=1,\, l \neq i}^{N\_{D}} R\_{il}^{t} \times \delta\_{il} + \sum\_{l=1,\, l \neq i}^{N\_{U}} R\_{li}^{t} \times \delta\_{li} \quad \forall\, i = 1 \ldots N \text{ \& } t = 1 \ldots T \tag{2}$$

where $I\_{i}^{t}$ is the amount of incremental natural inflow to reservoir *i* in period *t*, $v\_{i}^{t}$ is the loss (e.g., evaporation) from reservoir *i* in period *t*, $R\_{il}^{t}$ is the release from reservoir *i* to a downstream reservoir *l* in period *t* (likewise, $R\_{li}^{t}$ is the inflow to reservoir *i* from an upstream reservoir *l*), $N\_{D}$ is the number of downstream reservoirs, and $N\_{U}$ is the number of upstream reservoirs. $\delta\_{ij}$ is an element of the routing matrix, equal to 1 if the *i*th reservoir is physically connected to the *j*th reservoir and zero otherwise.
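To make the mass balance of Equation (2) concrete, the following minimal Python sketch applies it to every reservoir at once; the function and variable names are ours, not from the paper.

```python
def next_storage(s, R, inflow, loss, delta):
    """Mass balance (Equation (2)) for all reservoirs over one period.

    s, inflow, loss : per-reservoir lists of storages, inflows I_i^t, losses v_i^t
    R[i][l]         : release from reservoir i to reservoir l
    delta[i][l]     : 1 if reservoir i is physically connected to l, else 0
    """
    n = len(s)
    s_next = []
    for i in range(n):
        # Water leaving reservoir i toward its connected downstream reservoirs
        outflow = sum(R[i][l] * delta[i][l] for l in range(n) if l != i)
        # Water arriving at reservoir i from its connected upstream reservoirs
        received = sum(R[l][i] * delta[l][i] for l in range(n) if l != i)
        s_next.append(s[i] + inflow[i] - loss[i] - outflow + received)
    return s_next
```

For a two-reservoir cascade where reservoir 0 releases 3 units to reservoir 1, `next_storage([10, 5], [[0, 3], [0, 0]], [2, 1], [1, 0], [[0, 1], [0, 0]])` returns `[8, 9]`.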

The second set of constraints consists of upper and lower bounds on the reservoir storage variables. The upper bounds may reflect the flood control objective or the physical reservoir storage capacities, while the lower bounds may take the objectives of sedimentation, recreation, and functionality of power generation into account. The constraints can be expressed as:

$$Smin\_{i}^{t} \leq s\_{i}^{t} \leq Smax\_{i}^{t} \tag{3}$$

where $Smin\_{i}^{t}$ and $Smax\_{i}^{t}$ are the minimum and maximum storage volumes of reservoir *i* in time period *t*, respectively.

The last set of constraints are upper and lower bounds of reservoir releases. The purpose of these constraints is to provide a minimum instream flow for water quality and ecosystem services and to supply water to meet different demands while considering capacities of outlet works and preventing downstream flooding. These constraints are often modeled as:

$$Rmin\_{i}^{t} \leq \sum\_{l=1}^{N} R\_{il}^{t} \times \delta\_{il} \leq Rmax\_{i}^{t} \tag{4}$$

where $Rmin\_{i}^{t}$ and $Rmax\_{i}^{t}$ are the minimum and maximum total releases from reservoir *i* in time period *t*, respectively.

#### *2.2. Stochastic Dynamic Programming (SDP)*

Discrete SDP is a popular method for stochastic optimization of reservoir operations [45]. The solution of SDP for long-term operations is typically a steady-state operating policy representing optimal decisions (e.g., the volume of releases) for all possible combinations of states (e.g., storages of reservoirs), obtained by solving Bellman's optimality equation iteratively. The state of the system is usually discretized into a set of specific values, and the recursive function (a function of the state and decision vectors) is updated in every iteration based on the estimated value function, $V^{t}(i)$, defined as the maximum accumulated reward from period *t* to a termination point in time for a given state *i*. The value function is obtained from the optimality equation proposed by Bellman [48] as:

$$V^{t}(i) = \max\_{a \in A(i)} \sum\_{j} P\_{ij}^{t}(a) \times \left[ r\_{ij}^{t}(a) + \gamma V^{t+1}(j) \right] \tag{5}$$

where $P\_{ij}^{t}(a)$ is the probability of transition from state *i* to state *j* when action *a* is taken in period *t*, $r\_{ij}^{t}(a)$ is the reward pertinent to action *a* for that transition in period *t*, $A(i)$ is the set of admissible actions for state *i*, and $\gamma$ is the discount factor. Equation (5) is solved backward, i.e., $t: T, T-1, \cdots, 1$, where *T* is the last time period.
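As a sketch of how Equation (5) is solved backward over a finite horizon, the following Python fragment performs the recursion on discretized states; the data layout (nested lists indexed by period, action, and state) and all names are our own assumptions, not the paper's notation.

```python
def sdp_backward(P, r, A, gamma, T, n_states):
    """Backward sweep of Bellman's optimality equation (Equation (5)).

    P[t][a][i][j] : transition probability from state i to j under action a
    r[t][a][i][j] : one-step reward r_ij^t(a)
    A[i]          : list of admissible actions for state i
    Returns the value table V[t][i] and a greedy policy pi[t][i].
    """
    V = [[0.0] * n_states for _ in range(T + 1)]   # V[T] holds terminal values
    pi = [[None] * n_states for _ in range(T)]
    for t in range(T - 1, -1, -1):                 # solve backward: T-1, ..., 0
        for i in range(n_states):
            best, best_a = None, None
            for a in A[i]:
                # Expected reward plus discounted continuation value
                q = sum(P[t][a][i][j] * (r[t][a][i][j] + gamma * V[t + 1][j])
                        for j in range(n_states))
                if best is None or q > best:
                    best, best_a = q, a
            V[t][i], pi[t][i] = best, best_a
    return V, pi
```

In a toy one-state, two-action, one-period instance where action 1 pays twice as much as action 0, the sweep selects action 1 everywhere.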

SDP suffers from a dual curse which makes it unsuitable to cope with large-scale problems [45]. First, the dimension of the optimization problem grows exponentially with the number of the state and decision variables (the curse of dimensionality). Therefore, the SDP algorithm is not computationally tractable in systems where the number of reservoirs is more than a few. Second, prior knowledge about the underlying Markov decision process (explicit model) of inflow variables, including state TPM and rewards, is required (curse of modeling). This prior knowledge may be difficult to access due to the complexities of multi-reservoir systems or insufficient data [6], considering the spatial and temporal dependence structure of inflow stochastic variables.

The attempts to overcome the curse of dimensionality can be categorized into two main classes: methods based on function approximation and methods based on aggregation/decomposition. In function approximation methods, a coarse discretization is combined with an approximation of the value function in order to preserve the quality of the extracted policy. Different methods have been applied to approximate the value function, including Hermite polynomials [27], cubic piecewise polynomials [24], and ANNs [22].

In the aggregation/decomposition-based methods, the original problem is broken down into tractable sub-problems solvable by SDP. Each sub-problem is related to one specific reservoir (or, in some cases, more than one) which is connected to the other reservoirs according to the designed configuration. The main goal in each sub-problem is to find the best release policy for that reservoir based on two different types of state variables: actual and virtual states. Turgeon [34] developed a simple aggregation/decomposition method for serial or parallel configurations, called one-at-a-time, in which an N-reservoir problem is converted into N one-reservoir sub-problems solved successively by SDP. Turgeon [35] also modified this method by considering the potential hydropower energy of the downstream reservoirs in addition to the corresponding state of the actual reservoir.

The technique of state aggregation may also be performed in a different way, in which some unimportant action-state pairs are systematically eliminated. For instance, Mousavi and Karamouz [26] eliminated infeasible action-state pairs in order to speed up the convergence of DP-based methods in multi-reservoir problems. Saad and Turgeon [29] also developed a method to eliminate some components of the state vector based on principal component analysis, which Saad et al. [31] later modified by applying a censored data algorithm.

Some researchers have also combined aggregation and function approximation techniques to alleviate the curse of dimensionality in SDP. Saad et al. [30] aggregated a five-reservoir problem into one-reservoir problems which were then solved using SDP.

In this study, we use the aggregation/decomposition methods proposed by Archibald et al. [21] and by Ponnambalam and Adams [28]. Detailed descriptions of these methods are presented in the following sections.

#### *2.3. Aggregation–Decomposition Dynamic Programming (AD-DP)*

The aggregation-decomposition method proposed by Archibald et al. [21], referred to as aggregation–decomposition dynamic programming (AD-DP), decomposes the original problem into as many sub-problems as there are reservoirs. All sub-problems can be solved in parallel using SDP. The state variables for each sub-problem are the beginning storage of the actual (focus) reservoir, the sum of the beginning storages of all upstream reservoirs (the virtual upstream reservoir), and the sum of the beginning storages of all downstream reservoirs (the virtual downstream reservoir). The decisions at each period are the next state of the virtual upstream reservoir, the release from the focus (actual) reservoir, and the next state of the virtual non-upstream reservoir. The main limitation of AD-DP is that the total storage of the virtual upstream and non-upstream reservoirs in each iteration of SDP must be distributed proportionally among the respective reservoirs based on their capacities. Given the end-of-period storage for a virtual reservoir, one can analogously find the end-of-period storages of its constituent reservoirs. Archibald et al. [21] note that the proposed aggregation/decomposition method in a multi-reservoir system ends up with a near-optimal release policy; however, it is still more restrictive than the method that we describe next.

#### *2.4. Multilevel Approximation-Dynamic Programming (MAM-DP)*

Ponnambalam and Adams [28] proposed a decomposition method to overcome the main restriction of Turgeon's model [35], in which only serial or parallel configurations can be tackled. The number of reservoirs considered was only two (one of them virtual), and the objective function was separable by reservoir. In multilevel approximation-dynamic programming (MAM-DP), the state variables for each reservoir are defined similarly to AD-DP [21]; however, the virtual reservoir simply aggregates all characteristics (inflow, capacity, minimum storage, etc.) of the remaining reservoirs (e.g., the capacity of the virtual reservoir is the sum of the capacities of all corresponding reservoirs). The way of decomposition generally depends on the system configuration (i.e., the decomposition might differ from one problem to another; in Ponnambalam and Adams [49] the algorithm proceeded from upstream to downstream, so the virtual reservoir only considered downstream reservoirs).

#### *2.5. Reinforcement Learning (RL)*

RL [50] is a computational method based on learning through interaction with the environment. Unlike SDP, RL does not presume knowledge of the underlying model, as knowledge about the environment is gained through real-world experience (on-line) or simulation (off-line). Using simulation, RL is able to overcome the curse of modeling; however, RL only mitigates the curse of dimensionality to some extent, as it searches the feasible action-state pairs heuristically.

The basic idea of RL can simply be described as a learning agent interacting with its environment to achieve a goal [50]. In reservoir operations, the agent is the operator of a reservoir who makes the decisions over the release (as the action) based on a policy while the state space consists of a set of discrete values of storage.

Beyond the agent and the environment, there are four main sub-elements in RL: a policy, a reward function, a value function, and optionally a model of the environment [50]. The model components of RL determine the next state and the reward of the environment based on a mathematical function. Figure 1 illustrates a schematic perspective of RL.

**Figure 1.** The schematic view of reinforcement learning.

#### 2.5.1. Action-Taking Policy

The chance of taking an action in a specific state reflects a trade-off between exploration (taking an action randomly) and exploitation (taking the best action), which leads to four common action policies in the literature: random, greedy, ε-greedy, and Softmax. In the random policy, there is no action preference. In the greedy policy, the agent picks the best among all admissible actions in each iteration, with respect to the latest estimate of the action-value function in the respective state. In the ε-greedy policy, the greedy action ($a^{*}$) in each state is chosen most of the time; however, once in a while, the agent chooses an action from the set of admissible actions with probability $\varepsilon/|A(i)|$, where $\varepsilon$ is the probability of taking non-greedy actions and $A(i)$ is the set of admissible actions for state *i*. In contrast, the Softmax policy, derived from the Boltzmann distribution, gradually shifts the proportion of exploration versus exploitation as the learning process continues.
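A minimal Python sketch of the ε-greedy rule described above; the dictionary-based Q representation and the names are our own assumptions.

```python
import random

def epsilon_greedy(q_values, admissible, eps):
    """Pick an action from the admissible set A(i) with the epsilon-greedy rule.

    q_values   : dict mapping action -> current Q estimate for this state
    admissible : list of admissible actions A(i)
    eps        : probability of exploring instead of exploiting
    """
    if random.random() < eps:
        return random.choice(admissible)               # explore: uniform over A(i)
    return max(admissible, key=lambda a: q_values[a])  # exploit: greedy action
```

With `eps = 0` this always returns the greedy action; with `eps = 1` it samples uniformly, so the greedy action still has total probability $1-\varepsilon+\varepsilon/|A(i)|$ for intermediate values, matching the description in the text.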

#### 2.5.2. Admissible Actions

The agent should choose an action from a set of candidate actions, called possible actions. Evaluating all state-action pairs is computationally expensive and makes the learning process inefficient. Owing to the stochastic environment of reservoir operations, some actions might be infeasible under certain conditions of the stochastic variables (e.g., minimum inflow and maximum evaporation), and some are always infeasible. One may define the set of admissible actions based on the worst condition of the stochastic variables. However, adopting such a pessimistic approach [42] limits the number of admissible actions, eliminating some actions which are infeasible only in rare conditions; it is therefore not efficient in terms of finding the optimal policy. Considering instead the best possible conditions of the stochastic variables, the set of admissible actions can be defined using an optimistic approach.
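The pessimistic and optimistic sets can both be produced by the same feasibility filter, fed with the worst-case or best-case values of the stochastic variables. A hypothetical single-reservoir sketch (all names are ours):

```python
def admissible_releases(s, releases, smin, smax, inflow, loss):
    """Keep the candidate releases whose end-of-period storage stays within
    [smin, smax] for the given inflow/loss scenario.

    Passing the worst-case (minimum) inflow yields the pessimistic set;
    passing the best-case (maximum) inflow yields the optimistic set.
    """
    feasible = []
    for R in releases:
        s_next = s + inflow - loss - R   # single-reservoir mass balance
        if smin <= s_next <= smax:
            feasible.append(R)
    return feasible
```

The optimistic set always contains the pessimistic one, which is why the optimistic definition retains actions that are infeasible only in rare conditions.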

#### 2.5.3. Q-Learning

Q-learning [51] is an RL formulation derived from the formulation of SDP. The value function in SDP is replaced with an action-value function, defined for every action-state pair instead of every state. Since the learning process is implemented through direct interaction with the environment, the value functions have to be updated after each interaction. To update a value function based on the Bellman equation, all admissible actions in the respective state and period must be evaluated, and the best value is chosen as the new value function. We therefore introduce the action-value function, Q(*i*, *a*), denoting the expected accumulated reward when the decision maker starts from state *i* and takes action *a*. Using these values, the SDP formulation of the Bellman equation can be written as follows:

$$Q^{t}(i, a) = \sum\_{j} P\_{ij}^{t}(a) \left[ r\_{ij}^{t}(a) + \gamma \max\_{b \in A(j)} Q^{t+1}(j, b) \right] \tag{6}$$

*Water* **2020**, *12*, 2688

where $r\_{ij}^{t}(a)$ is the immediate reward in period *t* when action *a* is taken for the transition from state *i* to state *j*, and *b* ranges over the admissible actions in the next state *j*.

As Equation (6) shows, the action-value function is the expected value of samples of the form $r\_{ij}^{t}(a) + \gamma \max\_{b \in A(j)} Q^{t+1}(j, b)$. Using the Robbins–Monro algorithm [52], this expectation can be estimated incrementally as follows:

$$Q^{t}(i, a) \leftarrow Q^{t}(i, a) + \alpha \times \left[ r\_{ij}^{t}(a) + \gamma \max\_{b \in A(j)} Q^{t+1}(j, b) - Q^{t}(i, a) \right] \tag{7}$$

where α is called the learning parameter. Q-learning is a model-free algorithm in which the transition probabilities are not used for updating the action values.
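Equation (7) amounts to a one-line incremental update. The sketch below assumes a finite horizon and a dictionary keyed by (period, state, action); these representation choices are ours, not the paper's.

```python
def q_update(Q, t, i, a, r, j, next_actions, alpha, gamma, T):
    """One Robbins-Monro style update of the action-value table (Equation (7)).

    Q            : dict keyed by (period, state, action)
    next_actions : the admissible set A(j) for the next state j
    """
    if t + 1 < T:
        target = r + gamma * max(Q[(t + 1, j, b)] for b in next_actions)
    else:
        target = r                       # no successor value beyond the horizon
    Q[(t, i, a)] += alpha * (target - Q[(t, i, a)])
```

Note that no transition probabilities appear anywhere in the update, which is exactly what makes Q-learning model-free.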

#### *2.6. Aggregation/Decomposition Reinforcement Learning (AD-RL)*

In this section, a new method, namely aggregation/decomposition RL (AD-RL), is proposed in which Q-learning is used jointly with the AD-DP method. The aggregation/decomposition scheme is derived from [21] and was generalized in [25].

As mentioned above, Archibald et al. [21] proposed an aggregation/decomposition method in which the original problem is decomposed into *n* sub-problems, where *n* is the number of reservoirs. All these sub-problems can be individually solved using SDP or Q-learning. The release of each actual reservoir in each sub-problem is a function of the states: the beginning storage of the focus (actual) reservoir and the total beginning storages of the upstream and non-upstream reservoirs. This means that two virtual reservoirs are assumed to be connected to the actual reservoir, one upstream of it and the other downstream. As explained in Archibald's technique [21], to decompose the original problem, the upstream and non-upstream reservoirs must be properly defined. In that decomposition technique, the upstream reservoirs are specified first, and the remaining reservoirs are therefore non-upstream reservoirs. By their definition, the upstream reservoirs of an actual reservoir in a sub-problem are those whose releases directly or indirectly reach the focus (actual) reservoir. We instead define the non-upstream reservoirs as those which directly or indirectly receive the release from the focus (actual) reservoir; the remaining reservoirs are upstream ones. Based on some numerical examples, we found that this new definition of the virtual reservoirs ends up with a superior release policy in SDP or Q-learning.

Indeed, there are different interrelated sub-problems in AD-RL, connected to each other through the releases from the actual reservoirs. That is, the action (release) taken for the focus (actual) reservoir of one sub-problem in a period is used as an input to the actual reservoir of the next sub-problem in the same period. It can therefore be assumed that every sub-problem is conducted by a specific agent, and all these agents share information through their releases in every interaction of the learning process. AD-RL can thus be interpreted as a multi-agent RL algorithm in which every agent makes a decision using its own information and the information received from the other agents, including their releases and the beginning storages of their respective actual reservoirs. Moreover, the feedback for each agent in the next stage (e.g., the next period) comes from the respective agent and the other agents after they make their decisions. This feedback includes the next end-of-period storages and the immediate rewards for all actual reservoirs in the sub-problems. Each agent is trained individually until it converges to a steady-state situation in which an optimal policy for that agent can be obtained. The schematic way of training is illustrated in Figure 2.

**Figure 2.** Schematic of aggregation/decomposition reinforcement learning (AD-RL) algorithm.

As illustrated in Figure 2, each agent interacts directly with a specific environment; however, it interacts indirectly with the environments pertinent to the other agents. In other words, AD-RL can be mapped onto traditional RL with multiple agents (the upper dotted-line box in Figure 2) and multiple environments (the lower dotted-line box in Figure 2) instead of a single agent and a single environment. Therefore, at the time of decision making, each agent should have access to the beginning storages of all reservoirs. The training of each agent (updating of the action-value functions) is accomplished individually. The immediate rewards used in training one specific agent come from the reservoir related to that agent and from the non-upstream reservoirs related to the other agents. The developed algorithm (AD-RL) can be summarized in the following steps:

Step 1—The original problem is decomposed into sub-problems, where the first sub-problem corresponds to the most upstream reservoir, to which no releases flow. The last sub-problem is, therefore, the one with no downstream reservoirs. As previously mentioned, for each sub-problem there are two virtual reservoirs: the upstream and the non-upstream reservoir. The non-upstream reservoirs are defined first; the remaining reservoirs are upstream reservoirs.

Step 2—For each sub-problem, the states for the respective agent should be defined: the beginning storage of the virtual upstream reservoir ($Su\_{i}^{t}$), the beginning storage of the focus (actual) reservoir ($s\_{i}^{t}$), and the beginning storage of the virtual non-upstream reservoir ($Snu\_{i}^{t}$).
Step 3—All admissible actions (releases), $A^{i}(s\_{i}^{t})$, should be properly defined (based on the optimistic or the pessimistic procedure).

Step 4—Initialize the Q-factors for each agent *i* for all possible state-action pairs.

Step 5—Start with initial beginning storages for all reservoirs ($s\_{i}^{1}$) and set *t* = 1.

Step 6—For all agents, select a release (action) according to the ε-greedy probabilities:


$$P\_{i}^{t}(R\_{i}^{t}) = \begin{cases} 1 - \varepsilon + \frac{\varepsilon}{|A^{i}(s\_{i}^{t})|} & R\_{i}^{t} = Rg\_{i}^{t} \\ \frac{\varepsilon}{|A^{i}(s\_{i}^{t})|} & R\_{i}^{t} \neq Rg\_{i}^{t} \end{cases} \tag{8}$$

where $Rg\_{i}^{t}$ is the best (greedy) release (action) for the given state up to the current period for agent *i*. Note that $Rg\_{i}^{t}$ might be a vector including multiple actions (releases), because there could be more than one outlet (decision variable) for a focus (actual) reservoir in a sub-problem.

Step 7—The next state of the actual reservoir in every sub-problem is calculated for all agents using the respective balance equation as the dynamic of the system.

$$s\_{i}^{t+1} = s\_{i}^{t} + I\_{i}^{t} + Ru\_{i}^{t} - R\_{i}^{t} \tag{9}$$

where $Ru\_{i}^{t}$ is the total release into focus (actual) reservoir *i* from its upstream reservoirs. The decision taken by an agent ($R\_{i}^{t}$) in a sub-problem might be infeasible because the storage bounds are not satisfied. In this situation, the next state (the end-of-period storage) is replaced with the respective maximum or minimum storage, and the release is revised using the balance equation. Note that the final releases after the revision process are used for computing the immediate rewards, while the original actions (releases before revision) are used for training (i.e., updating the Q-factors).
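Step 7's revision rule can be sketched as follows for a single reservoir: clipping the tentative storage to its bounds and absorbing the violation back into the release keeps Equation (9) satisfied (the names are ours):

```python
def revise_release(s, inflow, Ru, R, smin, smax):
    """Step 7: clip an infeasible end-of-period storage to its bounds and
    revise the release via the balance equation (Equation (9)).

    Returns (s_next, revised_R).  Per the text, the original R is still the
    action used to update Q-factors; the revised one computes the rewards.
    """
    s_next = s + inflow + Ru - R          # tentative storage from Equation (9)
    s_clipped = min(max(s_next, smin), smax)
    revised_R = R + (s_next - s_clipped)  # absorb the violation into the release
    return s_clipped, revised_R
```

For example, an over-release that would drive the storage to -2 is cut back so the reservoir ends the period exactly at its minimum storage.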

Step 8—Update the Q-factors for all agents as follows:

$$\begin{aligned} Q\_{i}^{t}\left(Su\_{i}^{t}, s\_{i}^{t}, Snu\_{i}^{t}, R\_{i}^{t}\right) \leftarrow{} & Q\_{i}^{t}\left(Su\_{i}^{t}, s\_{i}^{t}, Snu\_{i}^{t}, R\_{i}^{t}\right) + \\ & \alpha \times \left[ r\_{i}^{t} + \gamma \max\_{b \in A^{i}(s\_{i}^{t+1})} Q\_{i}^{t+1}\left(Su\_{i}^{t+1}, s\_{i}^{t+1}, Snu\_{i}^{t+1}, b\right) - Q\_{i}^{t}\left(Su\_{i}^{t}, s\_{i}^{t}, Snu\_{i}^{t}, R\_{i}^{t}\right) \right] \end{aligned} \tag{10}$$

where $r\_{i}^{t}$ is the reward used to update the Q-factors for each action-state pair. There are two types of immediate reward for each agent in every sub-problem: actual and virtual. The actual reward is the immediate reward of the agent's own reservoir. Because the release from one reservoir may be used multiple times downstream, the virtual reward credits the agent with the benefits that its release generates in all the reservoirs which directly or indirectly receive this flow. For example, where all reservoirs generate power, the release from the most upstream reservoir can contribute to power generation multiple times, with different benefit functions, in all downstream reservoirs. The total immediate reward for each agent in the respective sub-problem is the sum of the actual and virtual immediate rewards.

Step 9—If the stopping criterion is not satisfied, increment t and repeat steps 6–8; otherwise, go to the next step.

Step 10—Find the decision policy using the following equation:

$$Po\_{i}^{t}\left(Su\_{i}^{t}, s\_{i}^{t}, Snu\_{i}^{t}\right) = \operatorname\*{arg\,max}\_{b \in A^{i}(s\_{i}^{t})} Q\_{i}^{t}\left(Su\_{i}^{t}, s\_{i}^{t}, Snu\_{i}^{t}, b\right) \tag{11}$$
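Equation (11) extracts the greedy policy from the learned Q-factors. A minimal dictionary-based sketch (the representation is assumed by us, with each state being the aggregate tuple (Su, s, Snu)):

```python
def extract_policy(Q, states, actions_of):
    """Equation (11): map each aggregate state (Su, s, Snu) to the admissible
    action with the highest learned Q-factor.

    Q          : dict keyed by (state, action)
    actions_of : function returning the admissible set A^i for a state
    """
    policy = {}
    for state in states:
        policy[state] = max(actions_of(state), key=lambda b: Q[(state, b)])
    return policy
```

Once converged, this lookup table is what each agent uses on-line: at decision time it reads its own storage and the two virtual storages, then applies the stored greedy release.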

#### **3. Problem Settings and Results**

#### *3.1. Case Study: Parambikulam–Aliyar Project (PAP)*

We consider the Parambikulam–Aliyar Project (PAP) in India, as this multi-reservoir system has been studied using the AD-DP and FP methods in [28,49,53]. We provide only the essential details of the system here; note that the numbering of the reservoirs differs from that in [53]. Detailed explanations of these reservoirs, their data, and the corresponding benefits and policies can be found in the above works. For the purpose of comparing results with those of four other methods reported in the literature, we use the same problem and objective functions, although this is not a restriction of the AD-RL method. In other words, any other highly non-linear or even discontinuous objective function can easily be handled by the proposed model, as RL is basically a simulation-based technique.

PAP, as studied, is presented in Figure 3. The system comprises two series of reservoirs, in which the left-side reservoirs are more important in terms of the volume of inflows and demands (i.e., their demands and inflows are remarkably high compared to the other side). The number inside the triangle depicting each reservoir in Figure 3 represents its live capacity, which can be considered the maximum storage volume; the minimum (live) storage is zero. The subscript of each inflow represents the index of the reservoir and is used later in the explanations.

**Figure 3.** Parambikulam–Aliyar Project (PAP) system as studied.

The main purpose of this project is to convey water from the western slopes of the Anamalai Mountains to irrigate the arid eastern areas of two states (Tamilnadu and Kerala); hydropower and fishing are the secondary benefits of the project. Many operating constraints, agreed upon in the inter-state agreement between these two states, should be taken into consideration when constructing any optimization model; the details are not provided here and are available in the original papers. The objective function is defined as:

$$\text{Max } Z(R) = \text{Max} \sum_{i=1}^{N=5} \sum_{j=1}^{N=5} \sum_{t=1}^{T=12} b_i^t \times R_{ij}^t \tag{12}$$

where $b_i^t$ is the benefit per unit release, given in Table 1, which is also useful later for understanding MAM-DP's objective function. Note that the above simple linear objective function has been chosen to be exactly the same as the objective function used in the literature for the other methods to which we compare our proposed model; it is not a restriction.
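Equation (12) is a triple sum over reservoirs, destinations, and months, and can be evaluated directly. The sketch below uses random placeholder values for $b_i^t$ and $R_{ij}^t$; the actual values come from Table 1 and the simulated releases.

```python
import numpy as np

N, T = 5, 12
rng = np.random.default_rng(0)
# Placeholder benefit per unit release (cf. Table 1) and releases
# from reservoir i to destination j in month t.
b = rng.uniform(0.5, 2.0, size=(N, T))
R = rng.uniform(0.0, 10.0, size=(N, N, T))

# Eq. (12): Z = sum_i sum_j sum_t b[i, t] * R[i, j, t]
Z = np.einsum('it,ijt->', b, R)
```

The `einsum` call is simply a vectorized form of the triple summation; an explicit triple loop gives the same value.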


**Table 1.** The benefit per unit release.

It is worth noting that a 20% loss should be considered for all releases flowing from the third reservoir (Parambikulam) to the fourth reservoir (Tirumurthy) because of the long tunnel conveying water between these reservoirs.

The lower release bounds are set to zero for all periods. Table 2 gives the upper bounds for all connections in the PAP case study (i.e., the release from reservoir *i* to reservoir *j*). These bounds are assumed to be the same for all 12 months. The diagonal elements in this table indicate the total maximum release for each of the five reservoirs.


**Table 2.** The upper bounds of release.

The hydrological data, available on a monthly basis for the five reservoirs, are shown in Figure 4. The average monthly inflows to the first, second, third and fifth reservoirs are almost the same, with the rainy season lasting from December to May. However, the main proportion of rainfall at the fourth reservoir occurs during the monsoon period (from May to September).

**Figure 4.** The average monthly inflows in million cubic meter (MCM).

Given the available inflows for each month and their non-normal, highly skewed nature, the Kumaraswamy distribution was found to be the best fit; a few examples are shown in Figure 5.
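Sampling synthetic inflows from a fitted Kumaraswamy distribution is straightforward via inverse-CDF sampling, since its CDF $F(x) = 1 - (1 - x^a)^b$ on $[0, 1]$ inverts in closed form. The sketch below is a generic illustration; the shape parameters and inflow range are placeholders, not the fitted PAP values.

```python
import numpy as np

def sample_kumaraswamy(a, b, size, lo=0.0, hi=1.0, rng=None):
    """Draw samples from a Kumaraswamy(a, b) distribution by inverting
    its CDF F(x) = 1 - (1 - x**a)**b on [0, 1], then rescaling the
    result to the inflow range [lo, hi]."""
    rng = rng or np.random.default_rng()
    u = rng.uniform(size=size)
    # Closed-form inverse CDF of the Kumaraswamy distribution.
    x = (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)
    return lo + (hi - lo) * x
```

For a = b = 1 the distribution reduces to the uniform, which provides a quick sanity check of the sampler.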

**Figure 5.** Examples for the probability density functions of inflows to reservoirs [19].

Furthermore, based on the nature of the PAP case study, there are high correlations among the natural inflows to the first reservoir (Tamilnadu Sholayar), the second reservoirs (Parambikulam and Tunacadavu-Peruvarippallam), the third reservoir (Lower Aliyar) and the fifth reservoir (Kerala Sholayar) (Table 3). The inflow to reservoir 4 is independent of the other reservoirs' inflows.


**Table 3.** The cross-correlation between inflows to reservoirs.

To verify the performance of the proposed AD-RL, the PAP project is also solved using three other optimization methods: MAM-DP, AD-DP and FP. The two different aggregation/decomposition schemes used in MAM-DP and AD-DP are explained in the following sub-sections.

#### *3.2. MAM-DP Method Applied to Parambikulam–Aliyar Project (PAP)*

As observed in Figure 6, the PAP problem can be decomposed into four two-reservoir sub-problems for the 2-level MAM-DP [28]. The values $R_{12}$, $R_{13}$ and $R_{24}+R_{25}$ are the releases whose conditional probabilities should be calculated from the respective sub-problems. Having solved all sub-problems, the optimal policy for the whole PAP case study can be obtained using the results of all sub-problems' policies ($R_{13}$ and $R_3$ from sub-problem 1; $R_{12}$ from sub-problem 2; $R_{25}$ from sub-problem 3; $R_{24}$, $R_4$ and $R_5$ from sub-problem 4).

**Figure 6.** The sub-problems of PAP using the multilevel approximation-dynamic programming (MAM-DP) method [28].

Furthermore, defining an objective function for each sub-problem is a challenging issue. Here, it is assumed that the maximum possible benefit for each sub-problem is achieved from the releases in the downstream reservoirs. For instance, in the first period the benefit per unit release for $R_{12}$ in the first sub-problem is 1.95 (the sum of the benefits per unit release from reservoirs 1, 2 and 4).

#### *3.3. AD-DP Method Applied to PAP Project*

All sub-problems of the PAP project under the AD-DP [21] aggregation method are shown in Figure 7. The reservoir with a thick borderline in each sub-problem is a virtual reservoir whose release is not what actually occurs in reality. For instance, the release out of virtual reservoirs 1 and 3 in the second sub-problem does not really flow into reservoir 2; it is divided between reservoirs 2 and 3.

**Figure 7.** The sub-problems of PAP using the aggregation–decomposition dynamic programming (AD-DP) method [28].

As illustrated in Figure 7, the first sub-problem consists of two state variables (the storage of reservoir 1 and the storage of the virtual reservoir aggregating reservoirs 2, 3, 4 and 5) and three decision variables (the next storage of the virtual reservoir, the release from reservoir 1 to reservoir 2, and the release from reservoir 1 to reservoir 3). The second sub-problem comprises three state variables and four decision variables. The other sub-problems have two state variables and two decision variables. Furthermore, for these sub-problems, one more assumption (in addition to the one explained in the section on AD-DP regarding the division of the virtual reservoirs' storages) should be taken into consideration: the release from each virtual reservoir with more than one downstream outlet (e.g., reservoirs 1 and 2) should be divided proportionally to the capacities of the outlets.

#### *3.4. Fletcher–Ponnambalam (FP) Method Applied to PAP*

In the FP [17] method, having assumed a linear decision rule, the first and the second moments of storages are available in analytical forms. Substituting these new moments in the objective function produces an analytical expression for the first and second moments of releases, spills, and deficits of each reservoir in each period. All these expressions are dependent on a single parameter for each reservoir in each period. A non-linear optimization is solved directly for the objective function while the statistical moments of storages, releases and probabilities of containments, spills and deficits are also available for all reservoirs. More details can be seen in [53].

#### *3.5. AD-RL Method*

To find a near-optimal policy with the Q-learning algorithm, a proper action-taking policy should be chosen. The discount factor (γ) for the learning step is 0.9, while the remaining parameters, including the learning factor (α) and ε (in the ε-greedy policy) or τ (in the Softmax policy), should be accurately set.

As previously mentioned, the main advantage of the Softmax policy is that it explores more at the beginning of learning while increasing exploitation as learning continues. This behavior is controlled by the values of the Q-factors as they converge to the steady-state situation; that is, an action with a greater Q-factor should have a higher chance of being chosen in every interaction. However, when applying this policy, we observed that the values of all Q-factors become almost equal as they converge to the steady state. In other words, Softmax might end up with almost identical probabilities for the admissible actions. This action-taking policy may therefore lead to poor performance by converging to a far-from-optimal operating policy, a result that our numerical experiments confirmed.

Unlike the Softmax policy, in the ε-greedy policy the rate of exploration is constant over the learning process, which may not guarantee reaching a near-optimal policy. To tackle this issue, the whole learning process can be implemented episodically, starting with a large ε in the first episode and decreasing the rate in subsequent episodes. An episode comprises a predefined number of years over which the learning (simulation) is run. The initial state at the beginning of learning in every episode can be selected randomly. It is worth mentioning that the value of ε within each episode is constant. To implement this form of learning in the PAP case study, we considered two parameters: ε1 and ε2. The value of ε1 equals one in the first episode, so all admissible actions have the same probability of being selected. Parameter ε2 is the rate of exploration in the last episode; the rate of exploration for the other episodes is then determined by a linear function of these two parameters.
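The episodic ε schedule described above can be sketched as a linear interpolation between ε1 and ε2; the default values below are illustrative, not the tuned PAP settings.

```python
def epsilon_schedule(episode, n_episodes, eps1=1.0, eps2=0.05):
    """Linear interpolation from eps1 (first episode) to eps2 (last
    episode); epsilon is held constant within each episode."""
    if n_episodes == 1:
        return eps1
    frac = episode / (n_episodes - 1)
    return eps1 + frac * (eps2 - eps1)
```

With ε1 = 1, the first episode is pure exploration; exploration then decays linearly until the last episode, where it equals ε2.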

The learning factor (α) is also an important parameter that should be precisely specified. There are different methods for setting this parameter [54]. Here, it is specified based on the number of updates of each state-action pair using the following equation:

$$\alpha = \frac{\text{B}}{\text{NOV} \, (i, a)} \tag{13}$$

where B is a predefined parameter smaller than one. Its role is to cope with the asynchronous updating error in the learning process [54].
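Equation (13) makes the learning factor decay with the number of visits to each state-action pair, which can be sketched as follows (the value of B below is a placeholder, not the tuned value):

```python
from collections import defaultdict

class VisitCountLearningRate:
    """alpha = B / NOV(i, a), Eq. (13): the learning factor decays with
    the number of updates of each state-action pair."""
    def __init__(self, B=0.7):
        self.B = B                      # predefined constant, B < 1
        self.nov = defaultdict(int)     # NOV(i, a): update counter
    def alpha(self, state, action):
        self.nov[(state, action)] += 1
        return self.B / self.nov[(state, action)]
```

Each call increments the pair's counter and returns the corresponding step size, so frequently updated pairs receive progressively smaller corrections.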

We use a robust ANN-based approach to tune the above-mentioned parameters. The input data for training the corresponding multilayer perceptron ANN are obtained by combining different values chosen for parameters ε2 and B. To obtain a space-filling design, the values considered for each parameter should be uniformly distributed over its domain (for example, B takes values of 0.1, 0.3, 0.5, 0.7 and 0.9). Given the sample values chosen for these parameters, presented in Table 4, there are 25 different input data points representing all possible combinations of these parameter values. AD-RL is then implemented 10 times for each individual input combination. Given the operating policy from each implementation of Q-learning, one can run the respective simulation to obtain the expected value of the benefit. The mean (μ) and the semi-variance (σ*semi*) of the 10 expected benefit values obtained from the simulations are calculated using the following equations:

$$\overline{\mu} = \frac{\sum_{i=1}^{N_R} \mu_i}{N_R} \tag{14}$$

$$\sigma_{semi} = \frac{\sum_{i=1}^{N_R} \left[ (\mu_i - \overline{\mu})^2 \,\middle|\, \mu_i < \overline{\mu} \right]}{N_R - 1} \tag{15}$$

where $N_R$ is the number of runs (10 in our experiments), $\mu_i$ is the expected objective function value obtained in the *i*th run, and $\overline{\mu}$ and $\sigma_{semi}$ are the mean and semi-variance over all runs.
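Equations (14) and (15), together with the performance criterion μ − σ*semi* used for tuning, can be computed as follows (a direct sketch of the formulas above):

```python
import numpy as np

def performance_criterion(mu_runs):
    """Mean (Eq. 14) and semi-variance (Eq. 15) over N_R run results;
    returns the mean and the tuning criterion mu_bar - sigma_semi."""
    mu_runs = np.asarray(mu_runs, dtype=float)
    n = len(mu_runs)
    mu_bar = mu_runs.mean()
    # Only runs below the mean contribute to the semi-variance.
    below = mu_runs[mu_runs < mu_bar]
    sigma_semi = ((below - mu_bar) ** 2).sum() / (n - 1)
    return mu_bar, mu_bar - sigma_semi
```

Penalizing only downside deviations rewards parameter sets whose runs rarely fall below the average benefit, i.e., robust tunings.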

**Table 4.** Values of B, ε1 and ε2 used for making the input data.


Recall that Q-learning is a simulation-based technique. It can therefore end up with different operating policies at the end of the learning process because different sequences of synthetic data are generated over the learning phase. More robust results across different runs of Q-learning (less variation of the expected values at the end of simulation for different Q-learning implementations) are a good sign of well-tuned parameter values. To take the robustness of the results into account in the ANN training task, one can use a function of the performance criterion (μ − σ*semi* in our experiments) as the desired output.

To demonstrate how efficient the proposed tuning procedure is, one of the five reservoirs in the PAP case study (Tamilnadu Sholayar) is selected, which can be treated as a one-reservoir problem. Similarly to what was done for the training data, Q-learning is implemented for a test data set consisting of 100 data points, and the respective performance criterion is obtained for each test point. The outputs for these test data are then computed using the trained network. Finally, the best outputs obtained from Q-learning and from the trained network are compared with each other. If all or most of these two sets of outputs coincide, we conclude that the training phase for tuning the parameters has been performed appropriately.

Table 5 reports the top 3, 4, 5, 10, and 15 best sets of parameters in terms of the defined performance criterion (μ − σ*semi*). For instance, 3 of the 5 best sets of parameters based on the output of the trained network are among the 5 best sets based on the Q-learning approach. Figure 8 compares the performance obtained from the neural network for the top 20 sets of parameters with the respective values based on Q-learning. It shows that the trained function maps the input data to the desired output appropriately. Such a training procedure can, therefore, be used in cases with a larger number of reservoirs.

**Table 5.** The number of identical parameter sets in the top-*N* according to the estimated (artificial neural network, ANN) and actual (obtained by Q-learning) performances.

**Figure 8.** The estimated and real performance of the top 20 sets, ranked by estimated performance.

Having trained the network using the 25 examples built from the parameter values in Table 4 for the five-reservoir case study (PAP project), one can select and sort a number of the best parameter sets using 100 test data points (Table 6). We used the first set of parameters to implement AD-RL.


**Table 6.** Top three parameter sets obtained by the ANN-based parameter tuning approach for AD-RL algorithm applied to PAP.

The AD-RL algorithm was implemented 10 times, each with one hundred episodes for the learning process; each episode comprises 1000 years. Table 7 reports the average and the standard deviation of the annual benefit for all of the techniques applied to the PAP case study. It is worth noting that the average and standard deviation reported for the AD-RL method are the mean values obtained from running ten simulations. The AD-DP, MAM-DP, and AD-RL results are determined by applying these methods to the same problem (PAP). Finally, the FP1 and FP2 results are from Mahootchi et al. [53], where FP1 and FP2 correspond to different approximations for estimating releases from upstream reservoirs for Kumaraswamy-distributed inflows. See [53] for details.

**Table 7.** The average expected value and the standard deviation of annual benefit.


\* Results from the methods implemented in this paper.

Figure 9 compares the average benefit for all 10 runs of Q-learning. This is a good verification of how robust the Q-learning algorithm is. It also shows that the policy derived by Q-learning in each of the 10 runs outperforms the policies derived by FP and AD-DP.

**Figure 9.** The average expected value of the objective function for 10 different runs.

#### **4. Discussion**

In the section above, we presented the results of the proposed aggregation decomposition-reinforcement learning (AD-RL) approach for optimizing the PAP multi-reservoir system operations and compared them with those of three other stochastic optimization methods: MAM-DP, FP, and AD-DP. In terms of the average optimality criterion (objective function value), the AD-RL solution was the second best after that of MAM-DP. Because both the mean and the standard deviation are important, it is clear from Table 7 that the solutions of MAM-DP, AD-RL and FP1 are non-inferior and dominate the solutions of FP2 and AD-DP. Although computational run times are not always the best way to compare the complexity of these methods, since the number of simulations can easily be changed without much loss in optimality, the AD-RL method was three times faster than MAM-DP for the results presented above. This is important, as MAM-DP and AD-RL were the top two ranked methods by mean objective function value. Moreover, as AD-RL was performed using the parameters tuned by a trained ANN, the solutions found by the proposed method in the PAP case study were reasonably robust; in other words, performing the AD-RL technique with different synthetic sequences of data leads to almost identical results. In this regard, assessing the performances of the investigated methods against the different reservoir inflow scenarios associated with different runs, the standard deviation of the objective function for the AD-RL method compared favorably with those of the other stochastic methods investigated; this risk measure for the AD-RL approach was considerably lower than that of the MAM-DP approach.

Overall, compared to other stochastic optimization methods, the results obtained by the proposed AD-RL approach were promising in terms of the optimality and robustness of the solutions found and the required computational burden.

#### **5. Conclusions**

In this paper, an aggregation/decomposition model combined with RL was developed for optimizing multi-reservoir system operations, in which the whole system is controlled by multiple co-operating agents. This method addresses the so-called dual curses of modeling and dimensionality in SDP. Each agent (operator) finds the best decision (release) for its own reservoir based on its current state and the feedback it receives from the other agents, represented by two virtual reservoirs: one for the upstream reservoirs and one for the non-upstream reservoirs. An efficient approach based on neural networks was also proposed for tuning the model parameters in order to achieve a robust solution methodology. The developed methodology, along with the original methods that did not use RL, was applied to the Parambikulam–Aliyar Project (PAP), for which results from other methods were available for comparison.

For the PAP case studied herein, the average performance of the model (based on multiple runs) was compared with the performances of other stochastic optimization techniques, including multilevel approximation dynamic programming (MAM-DP), aggregation–decomposition dynamic programming (AD-DP), and the Fletcher–Ponnambalam (FP) method (reported from the literature). The policies obtained by AD-RL revealed that the proposed approach outperformed the FP and AD-DP methods in terms of the objective function criterion while performing comparably with MAM-DP but with less computational time, which is promising for large-scale problems.

As the proposed method is simulation-based, any other objective function, including non-linear or discontinuous ones, can be considered; handling such functions is a challenging issue for most other stochastic methods and is left for future studies.

**Author Contributions:** The work presented is mainly based on author M.H.'s MSc. thesis, supervised by authors S.J.M. and M.M. and the following contributions: conceptualization, K.P., M.M. and S.J.M.; methodology, S.J.M. and M.M.; software, M.H., M.M. and K.P.; validation, M.M., S.J.M., M.H. and K.P.; formal analysis, M.H., S.J.M. and M.M.; investigation, M.H., S.J.M. and M.M.; resources, K.P., M.M. and S.J.M.; data curation, K.P. and M.M.; writing—original draft preparation, M.M., S.J.M. and M.H.; writing—review and editing, K.P. and S.J.M.; visualization, M.H.; supervision, S.J.M., M.M. and K.P. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Conflicts of Interest:** The authors declare no conflict of interest.
