1. Introduction
Reinforcement learning [1] has been successful in areas like Atari games [2] and the game of Go [3]. The learning processes of these applications take place in simulated environments rather than the real world, and the sole objective is to find policies that maximize the return without having to consider any constraints. However, there are also problems with constraints. For example, imagine a recycling robot whose objective is to figure out a route from the origin to the destination to collect as much garbage as possible. Apart from the objective, the robot must keep the battery from running out before it reaches the destination [4]. Another example is the cellular network, where the objective is the maximum throughput and the constraints are transmission delay, service level, packet loss rate, etc. [5]. Also, in the problem of energy management for hybrid electric vehicles, apart from the objective of minimum fuel consumption, the physical characteristics of motors and engines should be enforced as constraints [6]. Zhang et al. [7] considered an energy efficiency maximization problem with the power budget at the transmitter and the quality of service as constraints and tackled it using the proximal policy optimization framework. For heterogeneous networks, the achievable sum information rate is to be maximized with the achievable information rate requirements and the energy harvesting requirements as constraints [8]. In short, there are many circumstances in practical engineering projects where objectives and constraints must be considered simultaneously.
Decision-making problems with constraints are typically modeled and solved under the framework of the constrained Markov decision process (CMDP). There are two kinds of constraints: instantaneous and cumulative. The former requires that the action taken must be a member of an admissible set, which may depend on the current state. The latter can be divided into two groups: probabilistic and expected. In cases of probabilistic constraints, the probability that the cumulative costs violate a constraint is required to be within a certain threshold. Expected constraints, on the other hand, pose requirements on the cumulated/averaged values of the costs. They can be further divided into two categories: discounted sum and mean value. Liu et al. [9] provided a summary and classification of RL problems with constraints. In this paper, the problem studied is restricted to discounted sum constraints in an episodic setting. Details are provided in Section 2.
MDPs with cumulative constraints (both discounted sum and mean value) were first studied in [10], where it was found that if the model is completely known, the CMDP problem can be transformed into a linear programming problem and solved. However, in practical problems, transition dynamics are seldom known in advance, making the theoretical solution inapplicable. Among other methods, Lagrangian relaxation is a popular one that turns the original constrained learning problem into an unconstrained one by adding the constraint functions, weighted by corresponding Lagrange multipliers, to the original objective function [11,12,13,14]. Drawbacks of Lagrangian relaxation include sensitivity to the initialization of the multipliers and the learning rate, large performance variation during learning, no guarantee of constraint satisfaction during learning, and a slow learning pace [9]. Furthermore, to derive the adaptive Lagrange multipliers, one has to solve a saddle point problem iteratively, which may be numerically unstable [15].
Lyapunov-based methods are also popular. Originally, Lyapunov functions were scalar functions used to describe the stability of a system [16]. They can also represent the steady-state performance of a Markov process [17] and serve as a tool to transform the global properties of a system into local ones and vice versa [18]. The first attempts to utilize Lyapunov functions to tackle CMDP problems can be found in [18], where an algorithm based on linear programming is proposed to construct the Lyapunov functions for the constraints; it is a value-function-based algorithm and is not suitable for continuous action spaces. Another Lyapunov-based algorithm, designed specifically for large and continuous action spaces and using policy gradients (PG) to update the policies, is proposed in [19]. The idea is to use the state-dependent linearized Lyapunov constraints to derive the set of feasible solutions and then project the policy parameters or the actions onto it. Compared with Lagrangian relaxation methods, Lyapunov-based methods ensure constraint satisfaction both during and after learning. The drawbacks of Lyapunov methods are twofold. First, to derive the Lyapunov functions at each policy evaluation step, a linear programming problem has to be solved, which may be numerically intractable if the state space is large [19]. Although it is possible to use heuristic constant Lyapunov functions depending only on the initial state and the horizon, theoretical guarantees are then lost [20]. Second, Lyapunov methods require the initial policy to be feasible, whereas in some problems, feasible initial policies are unavailable, and it is usually more desirable to start with random policies [19].
Constrained Policy Optimization (CPO) [21] is an extension of the popular trust region policy optimization (TRPO) [22] that makes it applicable to problems with discounted sum constraints. It respects the constraints both during and after learning and ensures monotonic performance improvement. It uses conjugate gradients to approximate the Fisher information matrix and backtracking line search to determine a feasible update, which makes it computationally expensive and susceptible to approximation error [9,20]. CPO does not support mean-valued constraints and is difficult to extend to cases of multiple constraints [23]. Finally, the methodology of CPO can hardly be applied to other RL algorithms outside the category of proximal policy gradient methods [18].
Interior-point policy optimization (IPO), proposed in [23], is a promising algorithm for RL problems with cumulative constraints. It is a first-order policy optimization algorithm inspired by the interior-point method [24]. The core idea of IPO is to augment the objective function with logarithmic barrier functions whose values go to negative infinity if the corresponding constraint is violated and to zero if it is satisfied. IPO has many merits: applicability to general types of cumulative constraints, including both discounted sum and mean-valued ones; easy extension to handle multiple constraints; easy tuning of hyperparameters; and robustness in stochastic environments. It is also noteworthy that IPO is one of the few algorithms that provides simulation results for multiple constraints. The main drawback of IPO is that the initial policy must be feasible [9]. This issue is addressed in later works by dividing the learning process into two phases [25,26]. In the first phase, the objective is ignored entirely, and the cumulative costs are successively optimized to obtain a feasible policy. In the second phase, the original IPO algorithm is initiated with the feasible policy found at the end of the first phase. However, it is still not clear what should be done if the agent gets stuck on an infeasible policy during the learning process of the second phase.
Although IPO demonstrates promising performance in empirical results, it does not provide theoretical guarantees beyond the performance bound. Comparatively, Triple-Q [27] is the first model-free and simulator-free RL algorithm for CMDPs with proofs of sublinear regret and zero constraint violation. It has the same low computational complexity as SARSA [28]. Although it is claimed that Triple-Q can be extended to accommodate multiple constraints, the corresponding simulation results are not provided in the paper. Triple-Q is designed for episodic CMDPs with discounted sum constraints only. In later work, it was integrated with optimistic Q-learning [29] to obtain another model-free algorithm, named Triple-QA, for infinite-horizon CMDPs with mean-valued constraints. Triple-QA also achieves sublinear regret and zero constraint violation. In general, thorough performance bounds are usually provided by model-based methods like [30,31]; Triple-Q and Triple-QA are among the few exceptions.
Projection-based Constrained Policy Optimization (PCPO) [32] is an algorithm for expected cumulative constraints. It learns optimal and feasible policies iteratively in two steps. In the first step, it uses TRPO to learn an intermediate policy, which is better in terms of the objective but may be infeasible. In the second step, it projects the intermediate policy back onto the constraint set to get the nearest feasible policy. This projection scheme ensures improvement of the policy as well as satisfaction of the constraints. The main drawbacks of PCPO are expensive computation and limited generality, which are similar to those of CPO since both use TRPO to perform policy updates [9].
Backward value functions (BVF) are another useful tool for solving CMDP problems. In typical RL settings [1], value functions are “forward,” representing the expected discounted cumulative rewards from the current state to the terminal state or over an infinite horizon. Comparatively, a BVF describes the expected sum of rewards or costs collected by the agent so far. It builds upon the concept of the backward Markov chain, which was first discussed in [33]. Pankayaraj and Varakantham [34] employed BVFs to tackle safety in hierarchical RL problems. Satija et al. [20] proposed a method for translating trajectory-level constraints into instantaneous state-dependent ones. This approach respects constraints both during and after learning. It requires fewer approximations than other methods, the only approximation error coming from function approximation; as a result, it is computationally efficient. One problem that has not been addressed well by [20], but is critical to practical application, as discussed before, is the recovery mechanism from infeasible policies in the case of multiple constraints. This paper aims to fill this gap.
State augmentation is another promising approach to CMDP problems. Calvo-Fullana et al. [35] proposed a systematic procedure to augment the state with Lagrange multipliers to solve RL problems with constraints. They also demonstrated that CMDPs and regularized RL problems are not equivalent, meaning that there exist constrained RL problems that cannot be solved by using a weighted linear combination of rewards (a method referred to as lumped performances in this paper). McMahan and Zhu [36] proposed augmenting the state space to take constraints into consideration. They emphasized anytime constraint satisfaction, which requires the agent to never violate the constraints, either during or after the learning process.
Primal-dual approaches are also popular. Bai et al. [37] proposed a conservative stochastic primal-dual algorithm that is able to achieve an $\epsilon$-optimal cumulative reward with zero constraint violations. However, it has also been demonstrated that classic primal-dual methods cannot solve all constrained RL problems [35].
Model error may significantly influence the ability of the agent to satisfy the constraints. Ma et al. [38] proposed a model-based safe RL framework named Conservative and Adaptive Penalty (CAP), which accounts for model uncertainty by estimating it and adaptively using it to trade off optimality and feasibility.
For safe RL applications, learning from offline data is also attractive, since it avoids dangerous trial-and-error actions online. Xu et al. [39] proposed constraints penalized Q-learning (CPQ) to address the distributional shift problem in offline RL.
Gaps: In RL problems with multiple cumulative constraints, the final learned policy should have two properties: optimality and feasibility. In other words, the return should be maximized while the constraints are satisfied. The two requirements usually pull in opposite directions, however, meaning that purely pursuing one would cause the other to fail. The learning process thus consists of two kinds of components, namely optimization and recovery: the former drives the policy towards a larger return, and the latter makes it more feasible. One point that has not gained much attention in the existing literature, but is vital to practical applications of these algorithms, is the mechanism of recovery from infeasible policies. Most algorithms are expected to work with feasible policies; they operate under the assumption that updating the current feasible policy will result in another feasible one. This property is called consistent feasibility [18,20]. For example, it is theoretically proven that CPO, Lyapunov-based, and BVF-based algorithms all maintain the feasibility of the policy upon updates once the base policies being updated are feasible [18,20,23]. However, the problem remains: what should be done if the initial policy is infeasible, or if it is feasible at the beginning but turns infeasible in the middle of learning due to effects like function approximation error? In these cases, a mechanism to recover the infeasible policy back to a feasible one is important. The design of the recovery mechanism has not been the focus of the existing literature, being treated instead as an implementation issue. A recovery method was originally proposed along with CPO in [21], which performs policy updates to purely optimize the constraints, ignoring the objective temporarily. This strategy is also adopted by the Lyapunov-based algorithm [19] and the BVF-based one [20]. However, the recovery method originally proposed with CPO covers only the case of a single constraint, and it is unclear how to extend it to accommodate multiple constraints. Chow et al. [19] suggest extending this recovery update to the multiple-constraint scenario by performing gradient descent over the constraint with the worst violation, but provide simulation results for the single-constraint case only. This paper aims to fill this gap by proposing a systematic mechanism for policy recovery that is applicable to the case of multiple cumulative constraints and is accompanied by corresponding simulation results.
Contributions: A simple method and algorithm named Q-sorting are proposed for CMDP problems with discounted sum constraints in a tabular and episodic setting with deterministic environments and policies. It is similar to the BVF-based algorithm in the way it predicts whether a certain action potentially violates a constraint, but it additionally provides a systematic mechanism for recovering from infeasible policies. Compared with the existing recovery methods used in CPO, Lyapunov-based, and BVF-based algorithms, it covers cases of multiple constraints. It also makes it possible to rank the constraints according to their importance and to specify the order in which they are considered, enabling finer control and configuration of the learning process. It is model-free and can be applied online. It pursues constraint satisfaction both during and after learning. Although Q-sorting was originally developed in a tabular and episodic setting, it can be extended to methods with function approximation and discounted settings, as long as they are value-based. By using the BVF to estimate cumulative costs incurred so far, it can also be extended to accommodate stochastic environments and policies.
The rest of this paper is organized as follows. Section 2 introduces the problem. Section 3 describes the proposed Q-sorting algorithm. Section 4 presents simulation results of Q-sorting on the problems of Gridworld and motor speed synchronization control with one and two constraints and compares it with the conventional method of lumped performances. Section 5 concludes the paper.
3. Q-Sorting
RL problems with one objective and multiple cumulative constraints are analogous to those with multiple objectives. The core of the learning algorithm is to allocate learning resources, for example, computing time, among the objective and the different constraints. Due to safety requirements, it is also desirable that constraints be violated as rarely as possible, both during and after learning. Such problems can be addressed by imposing predefined rules that specify, at each time step, which objective/constraint should be solely considered.
The idea becomes clearer when considering a value-based RL algorithm like Monte Carlo or Q-learning. Naturally, one Q table can be learned for each objective/constraint. If no constraints are imposed, the action is typically produced according to an $\epsilon$-greedy mechanism:

$$a_t = \begin{cases} \text{a random action from } \mathcal{A}, & \text{if } \xi < \epsilon \\ \arg\max_{a \in \mathcal{A}} Q_{obj}(s_t, a), & \text{otherwise} \end{cases} \quad (3)$$

where $\xi$ represents a uniform random number in $[0, 1]$, $\epsilon$ is the exploration rate, and $\mathcal{A}$ is the set of all possible actions. The subscript in $Q_{obj}$ emphasizes that the Q table being used corresponds to the objective, namely, the return that we are seeking to maximize.
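For concreteness, a minimal tabular sketch of this unconstrained $\epsilon$-greedy selection is given below; the array layout and names such as `q_obj` are illustrative assumptions rather than the paper's implementation:

```python
import numpy as np

def epsilon_greedy(q_obj, state, n_actions, epsilon, rng):
    """Unconstrained epsilon-greedy selection over the objective Q table.

    q_obj: array of shape (n_states, n_actions) holding Q values
           for the objective (the return to be maximized).
    """
    if rng.random() < epsilon:            # explore: uniform random action
        return int(rng.integers(n_actions))
    return int(np.argmax(q_obj[state]))   # exploit: greedy w.r.t. the objective
```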
Now consider the problem with one objective and multiple cumulative constraints. To predict the effects of a certain action on satisfying or violating the constraints, it is necessary to record the rewards “up until now” and keep them summed/accumulated. For example, suppose that the cumulative constraint requires that the fuel consumption on a trip stay within a certain amount. At each time step, to predict whether a future route satisfies the constraint, one should first check how much fuel has already been consumed. By subtracting the fuel already consumed from the total available amount (the constraint), one gets the surplus quota. By comparing the surplus quota to the predicted fuel consumption from now until the end, one gets a (predicted) conclusion on whether a certain route (action) violates the constraint.
To make this clear, suppose that only one cumulative constraint exists. When making decisions (choosing actions), two circumstances are possible. First, there is at least one action satisfying the constraint (in terms of prediction rather than reality). To maximize the return of the objective, one simply filters all actions violating the constraint out of $\mathcal{A}$ to get $\mathcal{A}_f$, the set of all feasible actions, and then replaces $\mathcal{A}$ with $\mathcal{A}_f$ in the greedy component of Equation (3) to get $a_t$. Next, consider the second circumstance, where no actions satisfy the constraint. In this case, the greedy action regarding the objective violates the constraint and thus cannot be used. Rather, if the “constraint-first” principle is adopted, the greedy action regarding the constraint should be used, which means that $Q_{obj}$ in Equation (3) should be replaced with the constraint Q table $Q_c$. In other words, the focus of optimization is switched from the objective to the constraint when no actions are feasible. This seems natural if one observes Equation (2): the requirement states that the cumulated constraint reward be greater than or equal to some threshold, and not satisfying the constraint implies that this cumulative value is too small. To move the policy in the direction of satisfying the constraint, it is reasonable to pick the action maximizing $Q_c$.
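A rough sketch of this single-constraint rule follows; the threshold `d`, the running total `rtn` of constraint rewards collected so far, and the table names are illustrative assumptions:

```python
import numpy as np

def select_action_one_constraint(q_obj, q_c, state, rtn, d, n_actions):
    """Greedy step with one cumulative constraint of the form
    'cumulated constraint reward >= d'.

    rtn: constraint reward accumulated so far in the episode ("return till now").
    """
    # An action is predicted feasible if the reward collected so far plus the
    # predicted future reward under that action still reaches the threshold d.
    feasible = [a for a in range(n_actions) if rtn + q_c[state, a] >= d]
    if feasible:
        # Optimize the objective over the feasible actions only.
        return max(feasible, key=lambda a: q_obj[state, a])
    # No feasible action: switch the focus of optimization to the constraint.
    return int(np.argmax(q_c[state]))
```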
In the presence of multiple cumulative constraints, however, things get complicated. On each time step, one has to decide the “focus of optimization”, not between one objective and one constraint but among one objective and multiple constraints.
Figure 1 illustrates the idea, using an example with one objective and four constraints. In a specific state $s_t$, suppose that there are five action candidates. The Q values of each candidate are queried for the different constraints/objective. The satisfaction of a certain action candidate regarding a certain constraint is evaluated using the following equation:

$$RTN_i + Q_i(s_t, a_j) \geq d_i \quad (4)$$

where $RTN_i$ stands for “return till now”, that is, the cumulated rewards of the $i$th constraint up until now, and $d_i$ is the corresponding threshold. Equation (4) is called a “test” for a certain $j$th action candidate regarding a certain $i$th constraint on the time step $t$.
The table in Figure 1 shows a possible case of the test results, where a check mark is for satisfying the constraint and a cross mark is for violating it. Each column (except the last one) corresponds to a specific constraint, and each row corresponds to an action candidate. The last column corresponds to the objective.
The procedure is to test and filter all action candidates with each of the constraints, one by one, starting from the first. For example, for the first constraint, four of the five candidates pass the test, whereas one fails and is filtered out right away. The four survivors are then tested against the second constraint with Equation (4); three pass the second test, whereas one fails and is abandoned. The process is repeated until no candidates pass a test or the last column (the objective) is reached. The column where the surviving candidates settle becomes the focus of optimization, and the survivors become the candidates to pick from. In this example, the focus is constraint 3, with three surviving candidates. Among the three, the action that maximizes $Q_3$ is ultimately picked: the corresponding values of $Q_3$ are sorted in descending order, and the action candidate corresponding to the first is picked. As the learning process goes on, the focus of optimization moves from constraint 1 to constraint 2, constraint 3, and so on, and finally settles on the objective. The agent focuses on one constraint/objective at a time and strives to find a policy that maximizes the objective performance while satisfying all the cumulative constraints.
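A compact sketch of this test-and-filter selection for several constraints is given below; the list-of-tables layout, the thresholds `d`, and the running totals `rtn` are illustrative assumptions:

```python
def select_action_q_sorting(q_obj, q_cons, state, rtn, d, n_actions):
    """Pick an action by filtering candidates constraint by constraint.

    q_cons: list of Q tables, one per constraint (in order of importance).
    rtn:    list of constraint rewards accumulated so far in this episode.
    d:      list of constraint thresholds (cumulated reward must be >= d[i]).
    """
    candidates = list(range(n_actions))
    for i, q_i in enumerate(q_cons):
        # Equation (4)-style test: past reward + predicted future reward >= threshold.
        passed = [a for a in candidates if rtn[i] + q_i[state, a] >= d[i]]
        if not passed:
            # All remaining candidates fail constraint i: it becomes the focus.
            return max(candidates, key=lambda a: q_i[state, a])
        candidates = passed  # keep only the survivors and move to the next test
    # At least one candidate satisfies every constraint: optimize the objective.
    return max(candidates, key=lambda a: q_obj[state, a])
```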
A typical optimization process for the policy is illustrated in Figure 2.
The pseudocode of the Q-sorting algorithm is summarized in Algorithm 1.
Algorithm 1. Q-sorting
Algorithm parameter: small exploration rate $\epsilon > 0$
Initialize $Q(s, a)$ for the objective and each constraint arbitrarily, for all states and actions, except that the values of terminal states are set to zero
Loop for each episode:
    Initialize the starting state
    Initialize an empty trajectory array
    Loop for each step of the episode:
        Generate a uniform random number $\xi \in [0, 1]$
        IF $\xi < \epsilon$:
            Randomly pick $a_t$ from $\mathcal{A}$
        ELSE:
            1. Test and filter the action candidates, starting from the first constraint, until no candidates pass a specific test or all tests are passed. If all candidates fail on a constraint, that constraint becomes the focus of optimization; on the other hand, if there is at least one candidate satisfying all constraints (passing all the tests), the objective becomes the focus of optimization. Record the index of the focus.
            2. Record the indices of the candidates that reach the focus.
            3. Sort the Q values of these candidates under the focus in descending order and pick the action corresponding to the first as $a_t$. If multiple actions attain the maximum value at the same time, randomly pick one from them.
        Take $a_t$, observe the next state and the rewards of the objective and of each constraint
        Append the state, action, and rewards to the trajectory array
        Update the current state
    until the terminal state is reached
    Update the Q tables with Monte Carlo, according to the trajectory recorded
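For readers who prefer code, a minimal tabular sketch of one Q-sorting episode with an every-visit, constant-step-size Monte Carlo update is shown below, reusing the selection sketch above. The environment interface (`env.reset()`, `env.step()` returning the next state, the objective reward, a list of constraint rewards, and a done flag), the undiscounted returns, and all names are illustrative assumptions rather than the paper's implementation:

```python
def run_episode(env, q_obj, q_cons, d, epsilon, alpha, rng):
    """One episode of Q-sorting with an every-visit Monte Carlo update.

    q_obj, q_cons[i]: tabular Q arrays of shape (n_states, n_actions).
    d[i]: threshold of the i-th cumulative constraint.
    """
    n_actions = q_obj.shape[1]
    state = env.reset()
    rtn = [0.0] * len(q_cons)      # constraint rewards collected so far
    trajectory = []                 # (state, action, r_obj, r_cons) tuples

    done = False
    while not done:
        if rng.random() < epsilon:  # exploration
            action = int(rng.integers(n_actions))
        else:                       # Q-sorting selection (see sketch above)
            action = select_action_q_sorting(q_obj, q_cons, state, rtn, d, n_actions)
        next_state, r_obj, r_cons, done = env.step(action)
        trajectory.append((state, action, r_obj, r_cons))
        for i, r in enumerate(r_cons):
            rtn[i] += r
        state = next_state

    # Monte Carlo update: propagate returns backwards through the trajectory.
    g_obj, g_cons = 0.0, [0.0] * len(q_cons)
    for state, action, r_obj, r_cons in reversed(trajectory):
        g_obj += r_obj
        q_obj[state, action] += alpha * (g_obj - q_obj[state, action])
        for i, r in enumerate(r_cons):
            g_cons[i] += r
            q_cons[i][state, action] += alpha * (g_cons[i] - q_cons[i][state, action])
```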