1. Introduction
Many real-world problems are selfish multi-agent systems with shared facilities, referred to as common pool resource problems [1]. The agents choose from a number of available actions to use the shared facility and accordingly earn “rewards” from the environment. Because each agent uses the common resource to maximize its own reward, its goal is in potential conflict with those of its fellows. Examples include water resource management [2], real-time traffic flow control [3], coordination of autonomous vehicles [4], and multi-player video games [5]. In the autonomous vehicle problem, streets, intersections, and highways are shared common resources. Each vehicle plans a path to reach its given destination and is rewarded based on how fast it gets there. Ideally, each vehicle should choose the shortest path, but this may lead to congestion. So, to maximize their individual rewards, they need to cooperate. Such situations raise a type of Sequential Social Dilemma (SSD) [6], where the most rewarding strategy for each agent (in the short term) is to “defect” and exploit the common resource as much as possible, resulting in mutual defection among all agents; however, this leads to the depletion of the common resource, and each agent earns (in the long term) much less than if all agents had “cooperated” and used the common resource up to a certain limit. In SSDs, cooperative or defective behaviors exist not only as atomic actions but also as being temporally extended over each agent’s decision-making “policy”, that is, a strategy that determines what action to choose at each environment state. The question, then, is how to design the agents’ policies to obtain cooperative agents that maximize the collective reward, i.e., to maximize the total reward while minimizing the unfairly low rewards earned by cooperators who facilitate the high rewards earned by the “free-riders” who defect.
The many, often stochastic, factors impacting the environment, together with their unknown underlying mechanisms, challenge the use of traditional modeling techniques for solving SSDs. For example, in the autonomous vehicle problem, the number of vehicles in each lane varies, and great uncertainty is involved in the drivers’ behavior. Moreover, the observation and action spaces of each agent are too large to search exhaustively or near-exhaustively for the optimal policies. Even if the optimal policies are found, they are not generalizable: a new environment would require new policies.
To tackle these modeling complexities, in addition to model-based structures, Multi-Agent Reinforcement Learning (MARL) offers a model-free structure in which intelligent agents require no prior knowledge about the environment and “learn” the optimal policies directly by interacting with the environment. During the interactions, each agent observes the environment state and executes one of the available actions based on its policy. Consequently, the environment transitions to a new state and provides the agent an extrinsic reward indicating the immediate desirability of the environment state. The agent uses these interactions to estimate a state value function that assigns to each state the long-term expected reward that the agent obtains when starting from that state and following its policy. According to this value function, the agent improves its policy to choose those actions that result in the highest-value states [7]. The learned policies may be generalized to, or sometimes readily used in, unseen environments of the problem.
However, learning policies in SSDs by using only extrinsic rewards has limited performance because each extrinsic reward is agent-specific (rather than collective) and is typically maximized selfishly by exploiting other agents, which can lead to mutual defection. This is because a cooperative equilibrium either does not exist or is difficult to find [8,9]. One way to mitigate social dilemmas is to reshape the rewards so that defecting behaviors are no longer an equilibrium. To this end, researchers have enriched extrinsic rewards with so-called intrinsic rewards [10] that capture aspects of the agents’ behavior that are not necessarily encoded by the extrinsic reward. For each agent $k$ at time step $t$, this results in its reshaped reward $\hat{r}^k_t$, defined as a linear combination of its extrinsic reward $r^k_t$ and intrinsic reward $i^k_t$, that is, $\hat{r}^k_t = c_1\, r^k_t + c_2\, i^k_t$ for constant scalars $c_1$ and $c_2$.
Insights into the nature of human social behavior have led to the introduction of two main categories of intrinsic rewards. The first focuses on self-preferencing attributes, including empowerment [11], social influence [12], and curiosity [13,14]. The second is motivated by social welfare emotions, including envy and guilt [15], and empathy [16]. For example, in the inequity aversion (IA) model [15], which is based on envy and guilt, each agent compares its own and its fellows’ extrinsic rewards to detect inequities and balances its selfish desire for earning rewards with keeping the differences as small as possible. A disadvantageous inequity occurs when agent $k$ experiences “envy” by earning a lower reward than another agent $j$, that is, if $r^j_t > r^k_t$. Conversely, an advantageous inequity happens when the agent experiences “guilt” by earning a higher reward than another agent $j$, that is, if $r^k_t > r^j_t$. To incorporate its feelings of envy and guilt, agent $k$ averages these comparisons over all other agents, resulting in the following intrinsic reward:

$$i^k_t = -\frac{\alpha}{N-1} \sum_{j \neq k} \max\big(r^j_t - r^k_t,\, 0\big) - \frac{\beta}{N-1} \sum_{j \neq k} \max\big(r^k_t - r^j_t,\, 0\big), \qquad (1)$$

where $N$ is the total number of agents and parameters $\alpha$ and $\beta$ control the agent’s aversion to disadvantageous and advantageous inequities, respectively. Regardless of whether it earns more or less than others, the agent decreases its obtained reward.
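For concreteness, the following minimal NumPy sketch illustrates how Equation (1) and the reward-reshaping step could be computed for one agent. The function and variable names are illustrative only, and the aversion parameters are placeholder values, not the tuned hyperparameters of the IA model.

```python
import numpy as np

def ia_intrinsic_reward(k, rewards, alpha=5.0, beta=0.05):
    """Inequity-aversion intrinsic reward of agent k, as in Equation (1).

    rewards: extrinsic rewards of all N agents at the current time step.
    alpha, beta: aversion to disadvantageous/advantageous inequity (placeholders).
    """
    r = np.asarray(rewards, dtype=float)
    N = len(r)
    others = np.delete(r, k)
    envy = np.maximum(others - r[k], 0.0).sum()    # agents earning more than k
    guilt = np.maximum(r[k] - others, 0.0).sum()   # agents earning less than k
    return -(alpha * envy + beta * guilt) / (N - 1)

def reshaped_reward(r_ext, r_int, c1=1.0, c2=1.0):
    """Linear combination of extrinsic and intrinsic rewards."""
    return c1 * r_ext + c2 * r_int

# Example: agent 0 earns less than its fellows and receives a negative intrinsic reward.
rewards = [1.0, 3.0, 2.0]
i0 = ia_intrinsic_reward(0, rewards)
print(reshaped_reward(rewards[0], i0))
```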
Here, the feeling of guilt can be interpreted as a feeling of social responsibility; that is, the agent wants to revise its own policy to reduce the reward inequity, rather than expecting the “worse-performing” others to change. However, the IA model measures the agents’ performance only by their obtained rewards and ignores their roles in establishing the current environment state that resulted in those rewards. This may lead to equal treatment of defectors and cooperators that earn the same reward despite their different contributions to the obtained rewards. Namely, rather than only for those who earned less than itself, each agent should also “feel responsible” for those who earned more but did not contribute to reaching the current rewarding environment state. Even among others who earned equally less, a higher social responsibility should be dedicated to those who also contributed less to reaching the current state. Similarly, compared to other higher-earning agents, the agent feels that it should work “harder” (by finding a better policy) to earn as much as they do, and this feeling should be stronger toward those who played a more effective role in changing the state.
To take into account the agents’ roles in reaching the environment states, we propose an environmental impact of each agent, defined as the difference between the current local state and the hypothetical one in which that agent would have been absent in the previous state. In the autonomous vehicle example, a vehicle’s impact can be the difference in the current congestion in the presence and absence of that vehicle. When performing the comparisons in reshaping the reward function, each agent $k$ computes the impact of every other agent $j$ and scales that agent’s reward by its impact. We demonstrate the effectiveness of the proposed impact criterion through experiments in the Cleanup and Harvest environments [15,17,18]. The results demonstrate that agents trained by the proposed EMuReL method learn to cooperate more effectively than with the IA and Social Influence (SI) [12] methods, that is, they earn a higher collective reward and have a slightly higher cooperation level.
The rest of this paper is organized as follows. The MARL setting in an SSD environment and the basic formulation of the proposed Environmental Impact-based Multi-Agent Reinforcement Learning (EMuReL) approach are explained in Section 2 and Section 3, respectively. Related work is presented in Section 4. Section 5 describes the experiments and comparison results. The discussion is given in Section 6.
2. Background: MARL and Markov Games
A Markov game [19] is a standard framework for modeling MARL problems such as SSDs. It is defined by a state space $S$, a collection of the $N$ agents’ action sets $A^1, \dots, A^N$, and a state transition distribution preserving the Markov property, that is, the next state of the environment is independent of the past states and depends only on the current environment state and the agents’ applied actions [20,21]. The global state of the environment at time step $t$ is denoted by $s_t \in S$. Each agent $k$ observes the environment to some extent as a local state $s^k_t$ and selects an action $a^k_t \in A^k$ based on its policy $\pi^k(a^k_t \mid s^k_t)$, that is, the probability distribution of selecting each of the available actions. A joint action $\mathbf{a}_t = (a^1_t, \dots, a^N_t)$ is the stacking of all agents’ actions. Applying the joint action $\mathbf{a}_t$ at the global state $s_t$ causes a transition in the environment state according to the transition distribution $T(s_{t+1} \mid s_t, \mathbf{a}_t)$. Each agent $k$ then receives a reward $\hat{r}^k_t$, which is either the agent’s original extrinsic reward $r^k_t$ or a reshaped reward obtained by utilizing some intrinsic reward $i^k_t$. Using a Reinforcement Learning (RL) algorithm, the agent evaluates its policy by calculating its associated value function $V^{\pi^k}(s^k_t) = \mathbb{E}\big[\sum_{l=0}^{\infty} \gamma^l\, \hat{r}^k_{t+l} \mid s^k_t\big]$ over every state $s^k_t$, that is, the expected cumulative discounted future reward starting from the current local state $s^k_t$ and following policy $\pi^k$, where $\gamma \in [0,1)$ is the discount factor. Then, agent $k$ learns to improve its policy such that in each state $s^k_t$, it chooses the action that transfers the environment to the next state $s^k_{t+1}$ for which $V^{\pi^k}(s^k_{t+1})$ is maximized.
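As a concrete illustration of the value function above, the short sketch below estimates $V^{\pi^k}(s^k_t)$ by Monte Carlo averaging of discounted returns collected while following a fixed policy. The environment and policy interfaces (`env.reset_to`, `env.step`, `policy.sample`) are hypothetical placeholders standing in for any Markov game simulator, not part of the paper’s code base.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Cumulative discounted reward: sum_l gamma^l * r_{t+l} for one rollout."""
    return sum(gamma**l * r for l, r in enumerate(rewards))

def monte_carlo_value(env, policy, state, episodes=100, horizon=1000, gamma=0.99):
    """Estimate V^pi(state) by averaging discounted returns over several rollouts."""
    returns = []
    for _ in range(episodes):
        s, rewards = env.reset_to(state), []
        for _ in range(horizon):
            a = policy.sample(s)        # action drawn from pi(a | s)
            s, r, done = env.step(a)    # environment transition and reward
            rewards.append(r)
            if done:
                break
        returns.append(discounted_return(rewards, gamma))
    return float(np.mean(returns))
```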
3. The EMuReL Approach
To define intrinsic rewards, we use the same equation as in the IA model (Equation (1)) with two main differences. First, as in [15], to take into account the temporally distributed, rather than the single-instant, reward, we replace the extrinsic reward $r^k_t$ with the temporally smoothed extrinsic reward $e^k_t$ defined by

$$e^k_t = \gamma \lambda\, e^k_{t-1} + r^k_t, \qquad (2)$$

where $\lambda \in [0,1]$ is a hyperparameter. Second, when performing the IA comparisons, agent $k$ scales the reward of every agent $j$ by $E^{kj}_t$, that is, agent $j$’s environmental impact in establishing agent $k$’s local state at time $t$. Hence, agent $k$’s intrinsic reward is

$$i^k_t = -\frac{\alpha}{N-1} \sum_{j \neq k} \max\big(E^{kj}_t\, e^j_t - e^k_t,\, 0\big) - \frac{\beta}{N-1} \sum_{j \neq k} \max\big(e^k_t - E^{kj}_t\, e^j_t,\, 0\big), \qquad (3)$$

where $E^{kj}_t \in [0,1]$. The definition of the impact $E^{kj}_t$ lies in the answer to the following question:
Agent $k$: “How impactful was the presence of agent $j$ in reaching my current local environment state $s^k_t$?”
So, agent $k$ needs to compare its current local state $s^k_t$ in the presence and absence of agent $j$. This requires estimating the typically large and detailed state $s^k_t$ in the absence of agent $j$ with reasonable accuracy, which may not be practical. Moreover, $s^k_t$ is often an image with many features that may not have been caused by the agents’ actions. Therefore, rather than $s^k_t$, we use a reduced-dimension feature encoding function $\phi(s^k_t)$, briefly denoted as $\phi^k_t$, that encodes the raw state $s^k_t$ into a feature vector by using a deep neural network (Figure A2). This function encodes only those features of the environment state that are influenced by the agents’ actions [14].
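As an illustration of what such a feature encoder might look like, the sketch below maps an image-like local state to a compact feature vector. The input size, channel counts, and feature dimension are arbitrary assumptions for illustration, not the network of Figure A2.

```python
import torch
import torch.nn as nn

class FeatureEncoder(nn.Module):
    """phi: raw local state (an image) -> low-dimensional feature vector."""

    def __init__(self, in_channels=3, feat_dim=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Flatten())
        self.head = nn.LazyLinear(feat_dim)   # infers the flattened input size

    def forward(self, state):                 # state: [batch, C, H, W]
        return self.head(self.conv(state))

# Example: encode a batch of 15x15 RGB local observations (sizes are assumptions).
phi = FeatureEncoder()
features = phi(torch.zeros(4, 3, 15, 15))     # -> tensor of shape [4, 32]
```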
Now, inspired by the dropout technique used in neural network regularization [22], to be able to extract the features of the local state $s^k_t$ in the absence of agent $j$, we need to estimate $\phi^k_t$ when agent $j$ is omitted. Thus, we first define the estimated feature encoding function $\hat{\phi}$ that, instead of $s^k_t$, takes the previous joint action $\mathbf{a}_{t-1}$ and features $\phi^k_{t-1}$ as its input and estimates $\phi^k_t$ as its output, i.e.,

$$\hat{\phi}^k_t = \hat{\phi}\big(\phi^k_{t-1}, \mathbf{a}_{t-1}\big). \qquad (4)$$

Then, by omitting $a^j_{t-1}$, resulting in the reduced joint action denoted by $\mathbf{a}^{-j}_{t-1}$, we obtain the estimated features $\hat{\phi}^{k,-j}_t$ at state $s^k_t$ in the absence of agent $j$ (Figure 1). Taking the norm of the difference of the two estimated features results in the impact of agent $j$ in view of agent $k$:

$$E^{kj}_t = \big\| \hat{\phi}^k_t - \hat{\phi}^{k,-j}_t \big\|_2, \qquad (5)$$

where $\|\cdot\|_2$ is the Euclidean norm and $\hat{\phi}^{k,-j}_t$ is the same as $\hat{\phi}^k_t$ but when agent $j$ is eliminated. These disparities are scaled using unity-based normalization to bring all values into the range $[0,1]$. We emphasize that the elimination of an agent is different from the agent being present but performing no operation (NOOP).
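The following NumPy sketch puts Equations (2), (3), and (5) together for a single agent: it smooths the extrinsic rewards, measures each fellow agent’s impact as the normalized distance between feature estimates computed with and without that agent’s action, and plugs the impact-scaled rewards into the inequity comparisons. All function and variable names (including the feature estimator `phi_hat` and the choice to normalize across agents) are illustrative assumptions rather than the actual EMuReL implementation.

```python
import numpy as np

def smooth_reward(e_prev, r_t, gamma=0.99, lam=0.95):
    """Temporally smoothed extrinsic reward, Equation (2); gamma and lam are placeholders."""
    return gamma * lam * e_prev + r_t

def impacts(phi_hat, phi_prev, joint_action, k):
    """Impacts E^{kj}_t of every agent j on agent k's local state, Equations (4)-(5)."""
    full = phi_hat(phi_prev, joint_action)       # features with all agents present
    raw = np.zeros(len(joint_action))
    for j in range(len(joint_action)):
        reduced = list(joint_action)
        reduced[j] = None                        # eliminate agent j (not a NOOP)
        raw[j] = np.linalg.norm(full - phi_hat(phi_prev, reduced))
    span = raw.max() - raw.min()
    return (raw - raw.min()) / span if span > 0 else np.zeros_like(raw)  # unity-based normalization

def emurel_intrinsic(k, e, E, alpha=5.0, beta=0.05):
    """Impact-scaled inequity-aversion intrinsic reward of agent k, Equation (3)."""
    N, envy, guilt = len(e), 0.0, 0.0
    for j in range(N):
        if j == k:
            continue
        envy += max(E[j] * e[j] - e[k], 0.0)     # disadvantageous comparison
        guilt += max(e[k] - E[j] * e[j], 0.0)    # advantageous comparison
    return -(alpha * envy + beta * guilt) / (N - 1)
```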
To learn the feature encoding function $\phi$ and its estimation $\hat{\phi}$, inspired by the neural network structure of the Social Curiosity Module (SCM) [8] for multi-agent systems, we extend the Intrinsic Curiosity Module (ICM) [14] from the single-agent setup to our multi-agent RL system. We propose the Extended ICM (EICM) by embedding two associated predictors inside the architecture of each agent $k$ (Figure A2).
The first is the forward model, representing $\hat{\phi}$ by a neural network function $f$ with parameters $\theta_F$ that predicts the encoded state $\phi^k_t$ based on the previous joint action $\mathbf{a}_{t-1}$ and encoded state $\phi^k_{t-1}$. However, the state $s^k_t$ is the result of applying the agents’ actions $\mathbf{a}_{t-1}$ to the previous global state $s_{t-1}$, not the local state $s^k_{t-1}$. Hence, the forward model needs the global state $s_{t-1}$ or an estimation of it. To this end, we exploit the Model of Other Agents (MOA) [12], which is a neural network embedded in each agent’s architecture, to predict the other agents’ actions based on their previous actions and the previous state, i.e., it models $p(\mathbf{a}^{-k}_t \mid \mathbf{a}_{t-1}, s^k_{t-1})$, where $\mathbf{a}^{-k}_t$ denotes the joint action of all agents other than $k$. Via its internal LSTM state $h^k_t$, the MOA implicitly models the state transition function $T$ to estimate the global state $s_t$ (the green nodes in the gray rectangle in Figure A2), or $s_{t-1}$ when the time is shifted negatively by one unit, i.e., via $h^k_{t-1}$. By passing this internal LSTM state to the forward model, we implicitly provide it with an estimate of the global state $s_{t-1}$. Therefore, the estimation of the encoded state $\hat{\phi}^k_t$ can be written as follows in terms of the neural network function $f$:

$$\hat{\phi}^k_t = f\big(\phi^k_{t-1}, \mathbf{a}_{t-1}, h^k_{t-1}; \theta_F\big). \qquad (6)$$
During the training of the forward model, the parameters $\theta_F$ and the feature encoding parameters $\theta_\phi$ are learned to minimize the loss $L_F$, defined as the discrepancy between the predicted and actual encoded local states:

$$L_F = \tfrac{1}{2}\, \big\| \hat{\phi}^k_t - \phi^k_t \big\|_2^2. \qquad (7)$$
Second, to lead the forward model to extract those features that are more relevant to the agents’ actions, the actions are estimated using a so-called inverse model, and the feature encoding parameters $\theta_\phi$ are tuned by minimizing the loss of this action estimation. The inverse model is a neural network function $g$ with parameters $\theta_I$ that predicts the applied joint action $\mathbf{a}_{t-1}$ given $\phi^k_{t-1}$ and $\phi^k_t$ as follows:

$$\hat{\mathbf{a}}_{t-1} = g\big(\phi^k_{t-1}, \phi^k_t; \theta_I\big). \qquad (8)$$

Then, the parameters $\theta_I$ and $\theta_\phi$ are learned to minimize the loss $L_I$, where $L_I$ is a cross entropy over the predicted and actual actions of the agents:

$$L_I = \sum_{j=1}^{N} \mathrm{CE}\big(\hat{a}^{\,j}_{t-1}, a^j_{t-1}\big), \qquad (9)$$

where $\mathrm{CE}(\cdot,\cdot)$ denotes the cross entropy between the predicted distribution over agent $j$’s actions and agent $j$’s actual action.
The inverse model assists in learning a feature space that encodes only the information relevant for predicting the agents’ actions, and the forward model makes the representation of these learned features more predictable. Thus, the feature space has no incentive to encode environmental features that are not affected by the agents’ actions.
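To make the EICM concrete, the sketch below wires a feature encoder, the forward model, and the inverse model together in PyTorch and returns both losses. The layer sizes, the MLP encoder, and the way the MOA LSTM state is fed to the forward model are simplifying assumptions for illustration, not the architecture of Figure A2.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EICM(nn.Module):
    """Sketch of the Extended Intrinsic Curiosity Module for a single agent k."""

    def __init__(self, obs_dim, n_agents, n_actions, feat_dim=32, lstm_dim=64):
        super().__init__()
        self.n_agents, self.n_actions = n_agents, n_actions
        # Feature encoding phi with parameters theta_phi (an MLP stands in for the CNN).
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                     nn.Linear(128, feat_dim))
        # Forward model f with parameters theta_F: (phi_{t-1}, a_{t-1}, MOA LSTM state) -> phi_t.
        self.forward_model = nn.Sequential(
            nn.Linear(feat_dim + n_agents * n_actions + lstm_dim, 128), nn.ReLU(),
            nn.Linear(128, feat_dim))
        # Inverse model g with parameters theta_I: (phi_{t-1}, phi_t) -> action logits of all agents.
        self.inverse_model = nn.Sequential(
            nn.Linear(2 * feat_dim, 128), nn.ReLU(),
            nn.Linear(128, n_agents * n_actions))

    def forward(self, obs_prev, obs_curr, joint_action_prev, moa_hidden_prev):
        # joint_action_prev: LongTensor [batch, n_agents]; moa_hidden_prev: [batch, lstm_dim].
        phi_prev, phi_curr = self.encoder(obs_prev), self.encoder(obs_curr)
        a_onehot = F.one_hot(joint_action_prev, self.n_actions).float()
        a_flat = a_onehot.view(a_onehot.size(0), -1)
        phi_hat = self.forward_model(torch.cat([phi_prev, a_flat, moa_hidden_prev], dim=-1))
        logits = self.inverse_model(torch.cat([phi_prev, phi_curr], dim=-1))
        # L_F: squared error between predicted and actual encoded states (Equation (7)).
        loss_f = 0.5 * (phi_hat - phi_curr).pow(2).sum(dim=-1).mean()
        # L_I: cross entropy over each agent's predicted previous action (Equation (9)).
        loss_i = F.cross_entropy(logits.view(-1, self.n_actions),
                                 joint_action_prev.reshape(-1))
        return phi_hat, loss_f, loss_i
```

In training, the two losses would be combined with the policy objective; the weighting of $L_F$, $L_I$, and the RL loss is left out of this sketch.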
4. Related Work
Several MARL approaches exist in the literature for designing intrinsic rewards in order to improve cooperation in social dilemmas. An emotional intrinsic reward is defined based on each agent’s perception of the cooperativeness of its neighbors in social dilemmas [23,24]. This approach assumes the agents’ actions can be partitioned into cooperative and non-cooperative groups.
The IA model [15] is based on inequity aversion: each agent is allowed to know the others’ rewards and penalizes itself when its reward is much lower or higher than theirs. Causal influence in the SI method [12] measures the influence of each agent’s actions by computing the policy change of one agent with respect to the action of another agent. An intrinsic reward based on episodic memory is defined to learn exploratory policies [25]: each agent is encouraged to revisit all states in its environment by applying k-nearest neighbors over the agent’s recent experience, stored in a memory of all controllable states visited in the current episode. ICM [14] is a curiosity-driven exploration model that predicts the feature representation of the next environment state and uses the prediction error in the feature space as a curiosity-based intrinsic reward.
Other types of environment “impacts”, such as auxiliary reward functions [26] and measures of deviation from some baseline state [27], have been defined in the literature to avoid undesired irreversible changes to the environment.
Two influence-based methods are proposed by Wang et al. [28] to solve the multi-agent exploration problem. These methods exploit the interactions among agents and measure the amount of one agent’s influence on another agent’s exploration process by using the mutual information between the agents’ trajectories.
6. Discussion
Researchers have developed several MARL methods to train cooperative agents in SSD problems [32,33]. They introduced intrinsic rewards as a stimulus that is internally computed by agents to accelerate and improve their learning process [34]. IA and SI are two state-of-the-art examples of reward reshaping methods, based on the social concepts of inequity aversion and social influence, respectively [12,15]. We propose the EMuReL method based on the concept of social responsibility. To this end, we define the environmental impact as a criterion that measures the role of each agent in reaching the current environment state and, in turn, makes the agents continuously assess the cooperativeness of their fellows. We incorporate these impacts into the reward function of the IA method. So, in the advantageous case, the more impactful a role the other agents play in reaching the current environment state, the less each agent feels socially responsible for them and, as a result, the less it penalizes itself with the negative intrinsic rewards induced by the IA method.
To compute the agents’ impacts, inspired by the SCM method [8], we propose the EICM structure, which extends the single-agent curiosity setup of the ICM method [14] to the MARL setting. The EICM structure utilizes the representation function $\phi$ to isolate the impact of each agent’s action on the current environment state and to make the agents “curious” about the behavior of others by assessing their environmental impacts. The EMuReL method achieved better results than the IA and SI methods in the Cleanup and Harvest environments.
In the Cleanup environment, as explained in [15], the (advantageous) IA method encourages the agents to contribute and clean up the river. Cleaning up the waste causes more apples to be produced in the field and makes the other agents more successful in collecting apples, thereby reducing the negative rewards induced by the IA method. Here, incorporating the environmental impacts into the IA method improves the agents’ policies. Depending on which action each agent chooses, its impact on the environment differs remarkably in this environment. If the agent chooses a movement action without collecting an apple, it creates the minimum change in the environment state, namely the change of its own location. Such agents are subject to social responsibility by the other agents. If, however, the movement action also collects an apple, the field conditions change as well. The most impactful agents are those who clean the river. Such an agent, in addition to changing the condition of the river on a large scale, changes the condition of the field by producing several apples, depending on the growth rate of the apples.
After completing the learning phase of the EMuReL method in the Cleanup environment, the behavior of the agents was investigated in the rendered videos for the policies learned in one of the last episodes. It was observed that one of the agents learned a circular movement pattern in the environment: it traveled the entire length of the river and then entered the apple field, collecting apples on its way back to the river. This agent also followed a specific pattern to clean up the river: it cleaned the maximum possible width of the river whenever it applied the cleaning beam and cleaned the whole width of the river every time it crossed it. When returning to the river area, it chose the shortest path according to the position of the apples in the field. The other agents maneuvered in the apple field and collected apples. None of the agents used the fining beam in their behavior pattern, nor did they prevent each other from moving in the direction of the apples.
Limitations and Future Work
The proposed algorithm is distributed in the sense that no central agent is needed to learn the policies; each agent has its own MOA model. Nevertheless, the agents require access to all agents’ actions. Relaxing this constraint to the case where the state in the absence of a certain agent is predicted using only the actions of local agents is the subject of future work. Other limitations include the lack of comparisons with more recent MARL algorithms and environments, the long training time required, and the restriction to a discrete (rather than continuous) action space.