Article

Environmental-Impact-Based Multi-Agent Reinforcement Learning

by Farinaz Alamiyan-Harandi 1,* and Pouria Ramazi 2
1 Department of Electrical & Computer Engineering, Isfahan University of Technology, Isfahan 84156-83111, Iran
2 Department of Mathematics & Statistics, Brock University, St. Catharines, ON L2S 3A1, Canada
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(15), 6432; https://doi.org/10.3390/app14156432
Submission received: 10 June 2024 / Revised: 4 July 2024 / Accepted: 17 July 2024 / Published: 24 July 2024
(This article belongs to the Special Issue Bio-Inspired Collective Intelligence in Multi-Agent Systems)

Abstract

To promote cooperation and strengthen the individual impact on the collective outcome in social dilemmas, we propose the Environmental-impact Multi-Agent Reinforcement Learning (EMuReL) method, where each agent estimates the "environmental impact" of every other agent, that is, the difference between the current environment state and the hypothetical state that would have arisen in the absence of that other agent. Inspired by the inequity aversion model, the agent then compares its own reward with that of each of its fellows multiplied by their environmental impacts. If its reward exceeds the scaled reward of one of its fellows, the agent takes "social responsibility" toward that fellow by reducing its own reward. Therefore, the less influential an agent is in reaching the current state, the more social responsibility is taken by other agents. Experiments in the Cleanup (resp. Harvest) test environment demonstrated that agents trained with EMuReL learned to cooperate more effectively and obtained 54% (39%) and 20% (44%) more total rewards, while preserving the same cooperation levels, compared to when they were trained with the two state-of-the-art reward reshaping methods: inequity aversion and social influence.

1. Introduction

Many real-world problems involve selfish multi-agent systems with shared facilities, referred to as common pool resource problems [1]. The agents choose from a number of available actions to use the shared facility and accordingly earn "rewards" from the environment. Because each agent uses the common resource to maximize its own reward, its goal is potentially in conflict with those of its fellows. Examples include water resource management [2], real-time traffic flow control [3], coordination of autonomous vehicles [4], and multi-player video games [5]. In the autonomous vehicle problem, streets, intersections, and highways are shared common resources. Each vehicle plans a path to reach its given destination and is rewarded based on how fast it gets there. Ideally, each vehicle would choose the shortest path, but this may lead to congestion; so, to maximize their individual rewards, the vehicles need to cooperate. Such situations raise a type of Sequential Social Dilemma (SSD) [6] where the most rewarding strategy for each agent (in the short term) is to "defect" and exploit the common resource as much as possible, resulting in mutual defection among all agents; however, this depletes the common resource, and each agent earns (in the long term) much less than if they had all "cooperated" and used the common resource only up to a certain limit. In SSDs, cooperative or defective behaviors exist not only as atomic actions but also as behaviors temporally extended over each agent's decision-making "policy", that is, a strategy that determines what action to choose at each environment state. The question, then, is how to design the agents' policies to obtain cooperative agents that maximize the collective reward, i.e., to maximize the total reward while minimizing the unfairly low rewards earned by cooperators who facilitate the high rewards of the "free-riders" who defect.
The many, often stochastic, factors impacting the environment and their unknown underlying mechanisms challenge traditional modeling techniques for solving SSDs. For example, in the autonomous vehicle problem, the number of vehicles in each lane varies, and great uncertainty is involved in the drivers' behavior. Moreover, the observation and action spaces of each agent are too large to search exhaustively, or even near exhaustively, for the optimal policies. Even if the optimal policies are found, they are not generalizable: a new environment would require new policies.
To tackle these modeling complexities, in addition to model-based structures, Multi-Agent Reinforcement Learning (MARL) offers a model-free structure where intelligent agents require no prior knowledge about the environment and "learn" the optimal policies directly from interacting with it. During these interactions, each agent observes the environment state and executes one of the available actions based on its policy. Consequently, the environment transitions to a new state and provides the agent with an extrinsic reward indicating the immediate desirability of that state. The agent uses these interactions to estimate a state value function that assigns to each state the long-term expected reward the agent obtains when starting from that state and following its policy. According to this value function, the agent improves its policy to choose the actions that lead to the highest-value states [7]. The learned policies may be generalized to, or sometimes readily used in, unseen environments of the problem.
However, learning policies in SSDs by using only extrinsic rewards has limited performance because each extrinsic reward is agent-specific (rather than collective) and typically maximized selfishly by exploiting other agents, which can lead to mutual defection. This is because a cooperative equilibrium either does not exist or is difficult to find [8,9]. One way to mitigate social dilemmas is to reshape the rewards so that defecting behaviors are no longer an equilibrium. To this end, researchers have enriched extrinsic rewards with so-called intrinsic rewards [10] that capture aspects of the agents' behavior not necessarily encoded by the extrinsic reward. For each agent $k$ at time step $t$, this results in its reshaped reward $r_t^k$, defined as a linear combination of its extrinsic reward $e_t^k$ and intrinsic reward $i_t^k$:
$$r_t^k = \alpha e_t^k + \beta i_t^k,$$
for constant scalars $\alpha$ and $\beta$.
Insights into the nature of human social behavior have led to the introduction of two main categories of intrinsic rewards. The first focuses on self-preferencing attributes, including empowerment [11], social influence [12], and curiosity [13,14]. The second is motivated by social welfare emotions, including envy and guilt [15], and empathy [16]. For example, in the inequity aversion (IA) model [15], which is based on envy and guilt, each agent compares its own and fellows' extrinsic rewards to detect inequities and balances its selfish desire for earning rewards by keeping the differences as small as possible. A disadvantageous inequity occurs when agent $k$ experiences "envy" by earning a lower reward than another agent $j$, that is, if $e_t^j - e_t^k > 0$. Conversely, an advantageous inequity happens when the agent experiences "guilt" by earning a higher reward than another agent $j$, that is, if $e_t^k - e_t^j > 0$. To incorporate its feelings of envy and guilt, agent $k$ averages these comparisons over all other agents, resulting in the following intrinsic reward:
$$i_t^k = -\frac{\alpha_k}{N-1}\sum_{j \neq k}\max\big(e_t^j - e_t^k,\, 0\big) - \frac{\beta_k}{N-1}\sum_{j \neq k}\max\big(e_t^k - e_t^j,\, 0\big), \tag{1}$$
where $N$ is the total number of agents and parameters $\alpha_k, \beta_k \in \mathbb{R}$ control the agent's aversion to disadvantageous and advantageous inequities, respectively. Regardless of whether it earns more or less than others, the agent decreases its obtained reward.
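To make the reward reshaping concrete, the following minimal NumPy sketch computes the IA intrinsic reward (1) and the reshaped reward; the function names and the example reward values are illustrative assumptions of this sketch, not taken from the released code.

```python
import numpy as np

def ia_intrinsic_reward(e, k, alpha_k, beta_k):
    """Inequity-aversion intrinsic reward of agent k, Equation (1).

    e       : array of extrinsic rewards e_t^j of all N agents at time t
    alpha_k : aversion to disadvantageous inequity ("envy")
    beta_k  : aversion to advantageous inequity ("guilt")
    """
    N = len(e)
    others = np.delete(e, k)                       # rewards of all agents j != k
    envy = np.maximum(others - e[k], 0.0).sum()    # fellows who earned more than k
    guilt = np.maximum(e[k] - others, 0.0).sum()   # fellows who earned less than k
    return -alpha_k / (N - 1) * envy - beta_k / (N - 1) * guilt

def reshaped_reward(e_k, i_k, alpha=1.0, beta=1.0):
    """Linear combination r_t^k = alpha * e_t^k + beta * i_t^k."""
    return alpha * e_k + beta * i_k

# Example: 5 agents; agent 0 earned more than its fellows, so an advantageous-IA
# agent (alpha_k = 0, beta_k = 0.05, as in the Cleanup experiments) feels "guilt".
e_t = np.array([3.0, 1.0, 0.0, 2.0, 1.0])
i_0 = ia_intrinsic_reward(e_t, k=0, alpha_k=0.0, beta_k=0.05)
print(reshaped_reward(e_t[0], i_0))    # 2.9: slightly below the extrinsic reward 3.0
```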
Here, the feeling of guilt can be interpreted as a feeling of social responsibility; that is, the agent prefers to revise its own policy to reduce the reward inequity rather than expecting the "worse-performing" others to change. However, the IA model measures the agents' performance only by their obtained rewards and ignores their roles in establishing the current environment state that produced those rewards. This may lead to equal treatment of defectors and cooperators who earn the same reward despite their different contributions to the obtained rewards. Namely, rather than only those who earned less than itself, each agent should also "feel responsible" for those who earned more but did not contribute to reaching the current rewarding environment state. Even among others who earned equally less, a higher social responsibility should be dedicated to those who also contributed less to reaching the current state. Similarly, compared to other higher-earning agents, the agent feels that it should work "harder" (by finding a better policy) to earn as much as they do, and this feeling should be stronger toward those who played a more effective role in changing the state.
To take into account the agents' roles in reaching the environment states, we propose an environmental impact of each agent, defined as the difference between the current local state and the hypothetical one in which that agent would have been absent in the previous state. In the autonomous vehicle example, a vehicle's impact can be the difference in the current congestion in the presence and absence of that vehicle. When performing the comparisons in reshaping the reward function, each agent $k$ computes the impact $d_t^{k,j}$ of every other agent $j$ and scales that agent's reward $e_t^j$ by its impact $d_t^{k,j}$. We demonstrate the effectiveness of the proposed impact criterion through experiments in the Cleanup and Harvest environments [15,17,18]. The results demonstrate that agents trained by the EMuReL method can learn to cooperate more effectively compared with the IA and Social Influence (SI) [12] methods, that is, they earn a higher collective reward and have a slightly higher cooperation level.
The rest of this paper is organized as follows: the MARL setting in an SSD environment and the basic formulation of the proposed Environmental Impact-based Multi-Agent Reinforcement Learning (EMuReL) approach are explained in Section 2 and Section 3, respectively. Related work is presented in Section 4. Section 5 describes the experiments and comparison results. The discussion is given in Section 6.

2. Background: MARL and Markov Games

A Markov game [19] is a standard framework for modeling MARL problems such as SSDs. It is defined on a state space $\mathcal{S}$, a collection of $N$ agents' action sets $\mathcal{A} = \{A^1, \ldots, A^N\}$, and a state transition distribution $\mathcal{T}$ preserving the Markov property, that is, the next state of the environment is independent of the past states and depends only on the current environment state and the agents' applied actions [20,21]. The global state of the environment at time step $t$ is given by $s_t \in \mathcal{S}$. Each agent $k$ observes the environment to some extent as a local state $s_t^k$ and selects an action $a_t^k \in A^k$ based on its policy $\pi^k$, that is, the probability distribution of selecting each of the available actions. A joint action $a_t$ is the stacking of all agents' actions $[a_t^1, \ldots, a_t^k, \ldots, a_t^N]$. Applying the joint action $a_t$ at the global state $s_t$ causes a transition in the environment state according to the transition distribution $\mathcal{T}(s_{t+1} \mid s_t, a_t)$. Each agent $k$ then receives a reward $r_{t+1}^k$, which is originally the agent's extrinsic reward $e_{t+1}^k$ or can be reshaped by utilizing some intrinsic reward $i_{t+1}^k$. Using a Reinforcement Learning (RL) algorithm, the agent evaluates its policy by calculating its associated value function $V^{\pi^k}(s_t^k) = \mathbb{E}[R_t^k \mid s_t^k, \pi^k]$ over every state $s_t^k$, where $R_t^k = \sum_{i=0}^{\infty} \gamma^i r_{t+i+1}^k$ is the cumulative discounted future reward starting from the current local state $s_t^k$ and following policy $\pi^k$, and $\gamma \in [0,1]$ is the discount factor. Then, agent $k$ learns to improve its policy such that in each state $s_t^k$, it chooses the action that transfers the environment to the next state $s_{t+1}^k$ where $V^{\pi^k}$ is maximized.
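As a minimal illustration of the return $R_t^k$ that the value function estimates in expectation, the following sketch (ours, not part of the cited framework) computes discounted returns backwards over a finite episode of rewards:

```python
def discounted_returns(rewards, gamma=0.99):
    """Return a list whose t-th entry approximates R_t^k = sum_i gamma^i * r_{t+i+1}^k
    for a finite episode, computed by a single backward pass."""
    R = 0.0
    returns = []
    for r in reversed(rewards):
        R = r + gamma * R
        returns.append(R)
    return list(reversed(returns))

# Example: rewards collected by one agent over a short episode.
print(discounted_returns([0.0, 1.0, 0.0, 1.0], gamma=0.9))  # [1.629, 1.81, 0.9, 1.0]
```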

3. The EMuReL Approach

To define intrinsic rewards, we use the same equation as in the IA model (1) with two main differences. First, as in [15], to account for temporally distributed rather than single-instant rewards, we replace the extrinsic reward $e_t^j$ with the temporally smoothed extrinsic reward $w_t^j$ defined by
$$w_t^j = \gamma \lambda\, w_{t-1}^j + e_t^j, \quad t \geq 1, \qquad w_0^j = 0, \tag{2}$$
where $\lambda \in [0,1]$ is a hyperparameter. Second, when performing the IA comparisons, agent $k$ scales the reward of every agent $j$ by $d_t^{k,j} \in [0,1]$, that is, agent $j$'s environmental impact in establishing agent $k$'s local state at time $t$. Hence, agent $k$'s intrinsic reward is
$$i_t^k = -\frac{\alpha_k}{N-1}\sum_{j \neq k}\max\big(d_t^{k,j} w_t^j - w_t^k,\, 0\big) - \frac{\beta_k}{N-1}\sum_{j \neq k}\max\big(w_t^k - d_t^{k,j} w_t^j,\, 0\big), \tag{3}$$
where $\alpha_k, \beta_k \in \mathbb{R}$. The definition of the impact $d_t^{k,j}$ lies in the answer to the following question:
Agent $k$: "How impactful was the presence of agent $j$ in reaching my current local environment state $s_t^k$?"
So, agent $k$ needs to compare its current local state $s_t^k$ in the presence and absence of agent $j$. This requires estimating the typically large and detailed state $s_t^k$ in the absence of agent $j$ with reasonable accuracy, which may not be practical. Moreover, $s_t^k$ is often an image with many features that may not have been caused by the agents' actions. Therefore, rather than $s_t^k$, we use a reduced-dimension feature encoding function $\phi(s_t^k; \theta_\phi)$, briefly denoted $\phi(s_t^k)$, that encodes the raw state $s_t^k$ into a feature vector by using a deep neural network (Figure A2). This function encodes only those features of the environment state that are influenced by the agents' actions [14].
Now, inspired by the dropout technique used for neural network regularization [22], to extract the features of the local state $s_t^k$ in the absence of agent $j$, we need to estimate $\phi(s_t^k)$ when agent $j$ is omitted. Thus, we first define the estimated feature encoding function $\hat{\phi}$ that, instead of $s_t^k$, takes the previous joint action $a_{t-1}$ and features $\phi(s_{t-1}^k)$ as its input and estimates $\phi(s_t^k)$ as its output, i.e.,
$$\hat{\phi}\big(\phi(s_{t-1}^k), a_{t-1}\big) \approx \phi(s_t^k).$$
Then, by omitting $a_{t-1}^j$, resulting in the reduced joint action denoted by $a_{t-1}^{-j}$, we obtain the estimated features of state $s_t^k$ in the absence of agent $j$ (Figure 1). Taking the norm of the difference of the two estimated feature vectors yields the impact of agent $j$ in the view of agent $k$:
$$d_t^{k,j} = \frac{1}{2}\left\|\hat{\phi}\big(\phi(s_{t-1}^k), a_{t-1}\big) - \hat{\phi}^{-j}\big(\phi(s_{t-1}^k), a_{t-1}^{-j}\big)\right\|_2^2,$$
where $\|\cdot\|_2$ is the Euclidean norm and $\hat{\phi}^{-j}$ is the same as $\hat{\phi}$ but when agent $j$ is eliminated. These disparities are scaled using unity-based normalization to bring all values into the range $[0,1]$. We emphasize that the elimination of an agent is different from when the agent is present but performs no operation (NOOP).
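The following sketch outlines how agent $k$ could compute the smoothed rewards (2), the impacts $d_t^{k,j}$, and the intrinsic reward (3) under simplifying assumptions: the learned forward model is abstracted as a generic callable, $\hat{\phi}^{-j}$ is realized by zeroing agent $j$'s one-hot action slice at the input (as in Figure 1), and min-max (unity-based) normalization is applied per time step over the fellows' disparities. The function names, hyperparameter defaults, and normalization detail are ours.

```python
import numpy as np

def smooth(w_prev, e_t, gamma=0.99, lam=0.975):
    """Temporally smoothed extrinsic reward, Equation (2); the default gamma and
    lambda values are placeholders, not the experimental settings."""
    return gamma * lam * w_prev + e_t

def environmental_impacts(forward_model, phi_prev, u_prev, joint_action, k):
    """Impacts d_t^{k,j} of every other agent j from agent k's viewpoint.

    forward_model : callable (phi_prev, u_prev, joint_action) -> predicted features,
                    standing in for the learned network f (an assumption of this sketch)
    phi_prev      : encoded previous local state phi(s_{t-1}^k)
    u_prev        : agent k's MOA internal LSTM state u_{t-1}^k
    joint_action  : (N, A) matrix of one-hot previous actions a_{t-1}
    """
    N = joint_action.shape[0]
    full_pred = forward_model(phi_prev, u_prev, joint_action)
    d = np.zeros(N)
    for j in range(N):
        if j == k:
            continue
        reduced = joint_action.copy()
        reduced[j, :] = 0.0                        # eliminate agent j (not a NOOP action)
        pred_without_j = forward_model(phi_prev, u_prev, reduced)
        d[j] = 0.5 * np.sum((full_pred - pred_without_j) ** 2)
    # Unity-based normalization of the fellows' disparities into [0, 1].
    others = [j for j in range(N) if j != k]
    lo, hi = d[others].min(), d[others].max()
    if hi > lo:
        d[others] = (d[others] - lo) / (hi - lo)
    return d

def emurel_intrinsic_reward(w, d, k, alpha_k, beta_k):
    """EMuReL intrinsic reward of agent k, Equation (3): the fellows' smoothed rewards
    w_t^j are scaled by their impacts d_t^{k,j} before the IA comparison."""
    N = len(w)
    envy = sum(max(d[j] * w[j] - w[k], 0.0) for j in range(N) if j != k)
    guilt = sum(max(w[k] - d[j] * w[j], 0.0) for j in range(N) if j != k)
    return -alpha_k / (N - 1) * envy - beta_k / (N - 1) * guilt
```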
To learn the feature encoding function $\phi$ and its estimation $\hat{\phi}$, inspired by the neural network structure of the Social Curiosity Module (SCM) [8] for multi-agent systems, we extend the Intrinsic Curiosity Module (ICM) [14] from the single-agent setup to our multi-agent RL system. We propose the Extended ICM (EICM) by embedding two associated predictors inside the architecture of each agent $k$ (Figure A2).
The first is the forward model representing $\hat{\phi}$ by a neural network function $f$ with parameters $\theta_F$ that predicts the encoded state $\phi(s_t^k)$ based on the previous joint action $a_{t-1}$ and encoded state $\phi(s_{t-1}^k)$. However, the state $s_t^k$ is the result of applying the agents' actions $a_{t-1}$ to the previous global state $s_{t-1}$, not the local state $s_{t-1}^k$. Hence, the forward model needs the global state $s_{t-1}$ or an estimate of it. To this end, we exploit the Model of Other Agents (MOA) [12], a neural network embedded in each agent's architecture, to predict the other agents' actions based on their previous actions and the previous state, i.e., $P(a_t \mid s_{t-1}^k, a_{t-1})$. Via its internal LSTM state $u_{t-1}^k$, the MOA implicitly models the state transition function $\mathcal{T}(s_t \mid s_{t-1}, a_{t-1})$ to estimate the global state $s_t$ (the green nodes in the gray rectangle in Figure A2), or $s_{t-1}$ when the time is shifted negatively by one unit, i.e., $\mathcal{T}(s_{t-1} \mid s_{t-2}, a_{t-2})$. By passing this internal LSTM state to the forward model, we implicitly provide an estimate of the global state $s_{t-1}$. Therefore, the estimate of the encoded state $\phi(s_t^k)$ can be written as follows in terms of the neural network function $f$:
$$\hat{\phi}\big(\phi(s_{t-1}^k), a_{t-1}\big) = f\big(\phi(s_{t-1}^k), u_{t-1}^k, a_{t-1}; \theta_F\big).$$
During the training of the forward model, the parameters $\theta_F$ and $\theta_\phi$ are learned to minimize $\mathcal{L}_F$, defined as the discrepancy between the predicted and actual encoded local states:
$$\mathcal{L}_F\Big(\phi(s_t^k), \hat{\phi}\big(\phi(s_{t-1}^k), a_{t-1}\big)\Big) = \frac{1}{2}\left\|\hat{\phi}\big(\phi(s_{t-1}^k), a_{t-1}\big) - \phi(s_t^k)\right\|_2^2.$$
Second, to lead the forward model to extract the features that are most relevant to the agents' actions, the actions are estimated using the so-called inverse model, and the parameters $\theta_\phi$ are tuned by minimizing the loss of this action estimation. The inverse model is a neural network function $g$ with parameters $\theta_I$ that predicts the applied joint action $a_{t-1}$ given $\phi(s_{t-1}^k)$ and $\phi(s_t^k)$ as follows:
$$\hat{a}_{t-1} = g\big(\phi(s_{t-1}^k), \phi(s_t^k), u_{t-1}^k; \theta_I\big).$$
Then, the parameters $\theta_I$ and $\theta_\phi$ are learned to minimize $\mathcal{L}_I(\hat{a}_{t-1}, a_{t-1})$, where $\mathcal{L}_I$ is a cross-entropy over the predicted and actual actions of the agents:
$$\mathcal{L}_I(\hat{a}_{t-1}, a_{t-1}) = -\sum_{j=1}^{N} a_{t-1}^j \log\big(\hat{a}_{t-1}^j\big).$$
The inverse model assists in learning a feature space that encodes only the relevant information to predict the actions of the agents, and the forward model makes the representation of these learned features more predictable. Thus, the feature space has no incentive to encode environmental features that are not affected by the agents’ actions.
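A rough PyTorch sketch of how the forward and inverse models and their losses could be wired together is shown below. The use of small fully connected networks, the layer sizes, and the flattened one-hot joint-action input are assumptions of this sketch; the actual EICM follows the convolutional and LSTM structure of Figure A2.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ForwardModel(nn.Module):
    """f(phi(s_{t-1}^k), u_{t-1}^k, a_{t-1}; theta_F) -> estimate of phi(s_t^k)."""
    def __init__(self, feat_dim, lstm_dim, n_agents, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + lstm_dim + n_agents * n_actions, hidden),
            nn.ReLU(),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, phi_prev, u_prev, joint_action):
        # joint_action: (n_agents, n_actions) one-hot matrix, flattened for the MLP.
        return self.net(torch.cat([phi_prev, u_prev, joint_action.flatten()], dim=-1))

class InverseModel(nn.Module):
    """g(phi(s_{t-1}^k), phi(s_t^k), u_{t-1}^k; theta_I) -> per-agent action logits."""
    def __init__(self, feat_dim, lstm_dim, n_agents, n_actions, hidden=64):
        super().__init__()
        self.n_agents, self.n_actions = n_agents, n_actions
        self.net = nn.Sequential(
            nn.Linear(2 * feat_dim + lstm_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_agents * n_actions),
        )

    def forward(self, phi_prev, phi_curr, u_prev):
        x = torch.cat([phi_prev, phi_curr, u_prev], dim=-1)
        return self.net(x).view(self.n_agents, self.n_actions)

def eicm_losses(fwd, inv, phi_prev, phi_curr, u_prev, joint_action):
    """Forward loss L_F (squared feature error) and inverse loss L_I (cross-entropy)."""
    phi_hat = fwd(phi_prev, u_prev, joint_action)
    loss_f = 0.5 * torch.sum((phi_hat - phi_curr) ** 2)
    log_probs = F.log_softmax(inv(phi_prev, phi_curr, u_prev), dim=-1)
    loss_i = -(joint_action * log_probs).sum()
    return loss_f, loss_i

# Example with 5 agents, 9 actions, 32-dim features, and a 128-dim MOA LSTM state.
fwd, inv = ForwardModel(32, 128, 5, 9), InverseModel(32, 128, 5, 9)
a = torch.zeros(5, 9)
a[torch.arange(5), torch.randint(0, 9, (5,))] = 1.0   # one-hot previous joint action
lf, li = eicm_losses(fwd, inv, torch.randn(32), torch.randn(32), torch.randn(128), a)
```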

4. Related Work

The literature contains several MARL approaches that design intrinsic rewards to improve cooperation in social dilemmas. An emotional intrinsic reward has been defined based on each agent's perception of the cooperativeness of its neighbors in social dilemmas [23,24]. This approach assumes that the agents' actions can be divided into cooperative and non-cooperative groups.
The IA model [15] is based on inequity aversion: each agent is allowed to know the others' rewards and penalizes itself when its reward is much lower or higher than theirs. Causal influence in the SI method [12] measures the influence of each agent's actions by computing the policy change of one agent with respect to the action of another agent. An intrinsic reward based on episodic memory has been defined to learn exploratory policies [25]: each agent is encouraged to revisit all states in its environment by applying k-nearest neighbors over the agent's recent experience, stored in a memory of all controllable states visited in the current episode. ICM [14] is a curiosity-driven exploration model that predicts the feature representation of the next environment state and uses the prediction error in the feature space as the curiosity-based intrinsic reward.
Other types of environment "impacts", such as auxiliary reward functions [26] and measures of deviation from some baseline state [27], have been defined in the literature to avoid undesired irreversible changes to the environment.
Two influence-based methods are proposed by Wang et al. [28] to solve the multi-agent exploration problem. These methods exploit the interactions among agents and measure the amount of one agent’s influence on another agent’s exploration processes by using mutual information between agents’ trajectories.

5. Experiments

This section presents two SSD games, the results of applying the proposed EMuReL method, and its comparison with several baseline methods.

5.1. Experimental Setup

Multi-agent sequential social dilemmas are divided into two main categories: (1) public goods dilemmas, where providing a shared resource requires each agent to pay a personal cost, and (2) commons dilemmas, where defecting causes the depletion of a shared resource [29]. The Cleanup and Harvest games are examples of these dilemmas, respectively (see Figure A1). In the Cleanup game, there are two geographically separated areas on a two-dimensional grid environment: an apple field and a river. A group of agents moves inside the field and collects apples to obtain rewards. For apples to be produced, the agents must use their cleaning beam to clean up some of the waste that accumulates in the river over time; a higher waste level means a lower apple reproduction rate in the field. Similar to the Cleanup game, the Harvest game includes a discrete grid area as an apple field with a number of agents that collect apples to earn rewards. However, the growth rate of new apples is determined by the apple density in each area of the field: the higher the apple density in an area, the faster new apples grow there. When all apples in an area are harvested, none will ever grow back. So, agents should choose an appropriate harvesting rate to both maximize their reward and maintain an ongoing apple reproduction rate. In both games, the extrinsic reward function is the same: +1 is the reward for collecting each apple. The agents are equipped with a punishment beam that they can fire at others at a cost of −1; an agent hit by this beam loses 50 reward. The Cleanup and Harvest environments were developed by Vinitsky et al. [18] as open-source Python code.
As illustrated in the Schelling diagrams by Hughes et al. [15], these games are SSDs. In the Cleanup game, if an agent defects by staying in the apple field longer without cleaning up the river, it can obtain a higher reward. However, if too many agents defect, the apple field becomes depleted, resulting in lower future rewards; thus, increasing the number of cooperative agents improves the long-term reward of an individual cooperator. In the Harvest game, if an agent defects and collects all nearby apples quickly, it can receive a higher reward. However, continuing in this manner leads to an unproductive empty field in the long term.
To learn the policy and value function, a decentralized MARL setting with no communication access is used. Each agent uses one neural network for the policy and one for the value function, based on the structure of the SI method [12], where each network consists of (i) a convolutional layer to handle image inputs (15 × 15 pixel images), (ii) some fully connected layers to mix signals of information between each input dimension and each output, (iii) a Long Short-Term Memory (LSTM) recurrent layer to create an internal memory, and (iv) some linear layers to provide the full range of output values. To learn the parameters of the two neural networks, the policy gradient algorithm Proximal Policy Optimization (PPO) [30] and the Asynchronous Advantage Actor-Critic (A3C) algorithm [31] were used. Both take an actor–critic structure that optimizes the parameters of the policy network (actor) and the value-function network (critic) by gradient ascent, with the critic providing the gradient estimate for the actor.
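A rough PyTorch sketch of such a network (a Conv layer, fully connected and LSTM layers, and linear output heads) is given below. The layer sizes are illustrative assumptions, and for brevity a single backbone with two output heads is used here, whereas the setup described above uses a separate network for the policy and for the value function.

```python
import torch
import torch.nn as nn

class PolicyValueNet(nn.Module):
    """Conv -> FC -> LSTM -> linear heads producing action logits and a state value."""
    def __init__(self, n_actions, in_channels=3, conv_filters=6, fc_dim=32, lstm_dim=128):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, conv_filters, kernel_size=3)   # 15x15 image input
        self.fc = nn.Linear(conv_filters * 13 * 13, fc_dim)
        self.lstm = nn.LSTM(fc_dim, lstm_dim, batch_first=True)
        self.policy_head = nn.Linear(lstm_dim, n_actions)   # actor output (action logits)
        self.value_head = nn.Linear(lstm_dim, 1)            # critic output (state value)

    def forward(self, obs, hidden=None):
        # obs: (batch, time, channels, 15, 15) sequence of local observations.
        b, t = obs.shape[:2]
        x = torch.relu(self.conv(obs.flatten(0, 1)))         # merge batch and time dims
        x = torch.relu(self.fc(x.flatten(1)))
        x, hidden = self.lstm(x.view(b, t, -1), hidden)      # internal memory over time
        return self.policy_head(x), self.value_head(x), hidden

# Example forward pass with a dummy observation sequence of length 4.
net = PolicyValueNet(n_actions=9)
logits, value, _ = net(torch.zeros(1, 4, 3, 15, 15))
```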
Here, the results of four methods were compared: (i) the baseline method that uses only the extrinsic rewards; (ii) the IA method [15] that utilizes the reshaped rewards based on the IA intrinsic rewards (1) and the temporally smoothed extrinsic rewards (2), i.e., the same as (3) but with $d_t^{k,j} = 1$; (iii) the SI method [12] that enriches the extrinsic rewards with intrinsic rewards derived from social empowerment, i.e., having a causal influence on other agents that makes them change their policies; and (iv) our proposed EMuReL method that enriches the IA intrinsic rewards with the environmental impacts. The SCM algorithm with the same parameter settings as in Heemskerk [8] was adapted to the proposed EMuReL method in the Cleanup and Harvest environments. Two commonly used algorithms in these environments are PPO and A3C. As the baseline and SI methods with the PPO algorithm are reported to achieve higher collective rewards than with the A3C algorithm in the Cleanup environment [8,12], we chose PPO for this environment. For the Harvest environment, we tested both the PPO and A3C algorithms.
For the hyperparameters, the default values recommended by the authors of the PPO algorithm were used [30]. The training batch size and the PPO minibatch size, two effective hyperparameters, were set to 96,000 and 24,000, respectively. According to the results reported by Hughes et al. [15], advantageous-IA agents are more effective in the Cleanup game, and having more agents who are averse to inequity facilitates cooperation. Thus, in this environment, for the IA and EMuReL methods, we considered 5 advantageous-IA agents with $\beta = 0.05$ and set $\alpha = 0$. We conducted 15 experiments for each method with random seeds and without optimizing the hyperparameters. For the Harvest environment, we evaluated both advantageous-IA and disadvantageous-IA agents and conducted 5 experiments for each method. We considered disadvantageous-IA agents with $\alpha = 5$ and set $\beta = 0$. Every experiment was performed on a Linux server with 3 CPUs, a P100 Pascal GPU, and 100 GB of RAM and took between 10 and 28 days, depending on the method and environment.

5.2. Experimental Results

The total reward received by all agents is considered as a measure of how well the agents learned to cooperate [15]. According to Table 1, when trained based on the EMuReL method, the agents earned 53.6% and 38.6% more total rewards than when they were trained based on the IA model, 19.7% and 44.2% more than when they were trained based on the SI method, and 36.9% and 10.2% more than when they were trained by using only the extrinsic rewards in the Cleanup and Harvest environments, respectively (percentage increase = (new value − original value)/original value × 100). According to subplot (a) in Figure 2, the EMuReL method consistently outperformed the other methods starting from step $2.15 \times 10^8$ and had the minimum performance variance. Moreover, the EMuReL method improved the agents' cooperation while maintaining the equality between the agents' obtained rewards better than the other methods (subplot (b) in Figure 2). The mean equality of all methods is calculated by using the Gini coefficient as
$$\mathrm{Equality} = 1 - \frac{\sum_{i=1}^{N}\sum_{j=1}^{N} |R_i - R_j|}{2N\sum_{i=1}^{N} R_i}$$
[15].
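For reference, the two quantities above can be computed as in the following sketch (our own implementation of the stated formulas; the per-agent reward values in the equality example are made up for illustration):

```python
import numpy as np

def percentage_increase(new_value, original_value):
    return (new_value - original_value) / original_value * 100.0

def equality(R):
    """Equality = 1 - Gini coefficient of the agents' total rewards R_1, ..., R_N."""
    R = np.asarray(R, dtype=float)
    N = len(R)
    pairwise = np.abs(R[:, None] - R[None, :]).sum()
    return 1.0 - pairwise / (2.0 * N * R.sum())

print(percentage_increase(663.2, 431.7))              # ~53.6, EMuReL vs. IA in Cleanup
print(equality([120.0, 140.0, 130.0, 135.0, 138.2]))  # close to 1: nearly equal rewards
```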
All four methods performed almost identically in the Harvest environment using the PPO algorithm (see Figure A4 and Figure A5 in Appendix A.1). The same result was reported by Heemskerk [8] when comparing the baseline, SI, and SCM methods under the PPO algorithm. Since the results reported by Jaques et al. [12] and Hughes et al. [15] for agents trained based on the A3C algorithm achieved higher collective rewards than our PPO results, we also used A3C instead of PPO to examine the performance of the EMuReL method. We compare the results with four agents in Figure 3. This plot shows that the EMuReL method with advantageous-IA agents (called the advantageous EMuReL method) outperformed the other methods.

5.3. Ablation Results

An ablation study was also conducted to test the contribution of each key ingredient of the EMuReL method. The blue curve in Figure 4 shows the result of the EMuReL method where the internal LSTM state $v_t^k$ of agent $k$'s actor–critic structure was used instead of the internal LSTM state $u_t^k$ of its MOA structure. The red curve depicts the mean collective rewards of the EMuReL method where $e_t^k$ (resp. $e_t^j$) was used instead of $w_t^k$ (resp. $w_t^j$) in (3) (Figure A7 in Appendix A.1 shows the result of each individual experiment). According to this reward comparison, each key ingredient of the EMuReL approach, especially the MOA structure, is necessary and contributes to the obtained performance.

6. Discussion

Researchers have developed several MARL methods to train cooperative agents in SSD problems [32,33]. They introduced intrinsic rewards as a stimulus that is internally computed by agents to accelerate and improve their learning process [34]. IA and SI are two state-of-the-art reward reshaping methods based on two social concepts: inequity aversion and social influence [12,15]. We propose the EMuReL method based on the concept of social responsibility. To this end, we define the environmental impact as a criterion that measures the role of each agent in reaching the current environment state and, in turn, makes the agents continuously measure the cooperativeness of their fellows. We incorporate these impacts into the reward function of the IA method. So, in the advantageous case, the more impactful a role the fellow agents play in reaching the current environment state, the less each agent feels socially responsible for them and, as a result, the less it penalizes itself with negative intrinsic rewards induced by the IA method.
To compute the agents’ impacts, inspired by the SCM method [8], we propose the EICM structure, which extends the single-agent curiosity setup of the ICM method [14] to the MARL setting. The EICM structure utilizes the representation function ϕ to isolate the impact of each agent’s action on the current environment state and to make the agents “curious” about the behavior of others by assessing their environmental impacts. The EMuReL method achieved better results compared to the IA and SI methods in the Cleanup and Harvest environments.
In the Cleanup environment, as [15] explains, the (advantageous) IA method encourages the agents to contribute and clean up the river. Cleaning up the waste causes more apples to be produced in the field and makes the other agents more successful in collecting apples, thereby reducing the negative rewards induced by the IA method. Here, incorporating the environmental impacts in the IA method improves the agents' policies. Depending on which action each agent chooses, its impact on this environment differs remarkably. If the agent chooses one of the movement actions without collecting an apple, it creates the minimum change in the environment state, namely, the change of its own location. Such agents are subject to social responsibility by the other agents. If, however, the movement action also collects an apple, the field conditions change additionally. The most impactful agents are those who clean the river. Such an agent, in addition to changing the condition of the river on a large scale, changes the condition of the field by enabling the production of several apples, depending on the growth rate of the apples.
After completing the learning phase of the EMuReL method in the Cleanup environment, the behavior of the agents was investigated in the rendered videos for the policies learned in one of the last episodes. It was observed that one of the agents learned a circular movement pattern in the environment so that it traveled the entire length of the river and then entered the apple field, collecting apples on its way back to the river. This agent also followed a specific pattern to clean up the river: it cleaned the maximum possible width of the river whenever it applied the cleaning beam and cleaned the whole width of the river every time it crossed it. When returning to the river area, it chose the shortest path according to the position of the apples in the field. The other agents maneuvered in the apple field and collected apples. None of the agents used the punishment beam, nor did they prevent each other from moving toward the apples.

Limitations and Future Work

The proposed algorithm is distributed in the sense that no central agent is needed to learn the policies; each agent has its own MOA model. Nevertheless, the agents require access to all agents' actions. Relaxing this constraint to the case where the state in the absence of a certain agent is predicted using the actions of the local agents only is the subject of future work. Other limitations include the lack of comparisons with more recent MARL algorithms and environments, the long training time required, and the restriction to a discrete (rather than continuous) action space.

Author Contributions

F.A.-H. contributed to the idea, algorithm, experiments, and analysis and took the lead in the writing. P.R. contributed to the idea, analysis, and writing. All authors have read and agreed to the published version of the manuscript.

Funding

Pouria Ramazi acknowledges an NSERC Discovery Grant (RGPIN-2022-05199).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The code presented in this study is openly available in a GitHub repository at https://github.com/farinazAH/sequential_social_dilemma_games, accessed on 4 July 2024.

Acknowledgments

We would like to thank the Digital Research Alliance of Canada for providing the computational resources that facilitated our experiments.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
EMuReL   Environmental-impact Multi-Agent Reinforcement Learning
SSD      Sequential Social Dilemma
MARL     Multi-Agent Reinforcement Learning
IA       Inequity Aversion
SI       Social Influence
RL       Reinforcement Learning
NOOP     No operation
SCM      Social Curiosity Module
ICM      Intrinsic Curiosity Module
EICM     Extended ICM
MOA      Model of Other Agents
Conv     Convolutional layer
LSTM     Long Short-Term Memory
PPO      Proximal Policy Optimization
A3C      Asynchronous Advantage Actor-Critic

Appendix A

Appendix A.1

Figure A1. The SSD environments. (a) Cleanup and (b) Harvest games.
Figure A2. The EICM network structure of the EMuReL method. The EICM of each agent k has two inputs: the previous local state $s_{t-1}^k$, for example an image (represented on its top), and the previous joint action $a_{t-1}$. The EICM includes five distinctive networks: (i) the actor–critic structure that learns the policy and value function, (ii) the MOA network, (iii) the feature extraction network that contains a convolutional layer (Conv) as an encoder to represent the local state $s_{t-1}^k$ as q features, (iv) the forward model that learns $\hat{\phi}$, and (v) the inverse model that predicts the applied actions. The Conv layer represented by a transparent gray rectangle inside the purple dashed rectangle indicates that the feature encodings of the local states $s_{t-1}^k$ and $s_t^k$ are performed by using the same Conv layer but at two consecutive time steps $t-1$ and $t$. The parameters k and f in Conv denote the convolution kernel size and number of filters, respectively. Parameter u is the number of neurons in the FC layers. Each agent j applies an action $a_t^j$ based on its local state $s_t^j$ and the internal LSTM state $v_t^j$ of its actor–critic structure, which serves as a memory of the previous states. Hence, to predict the action $a_t^j$ of each agent j, agent k's MOA implicitly models the local state $s_t^j$ and internal LSTM state $v_t^j$ of all agents, which are implicitly captured in its internal LSTM state $u_{t-1}^k$ (the green nodes in the gray rectangle). As the aggregation of the local states $s_t^j$ forms the global state $s_t$, the internal LSTM state $u_{t-1}^k$ can provide an estimate of the global state. The actor–critic and MOA structures are taken from Jaques et al. [12]. The forward and inverse models are based on the structures presented by Heemskerk [8].
Figure A3. The results for the Cleanup environment. (a–d) The mean reward obtained by 5 agents over individual experiments of the baseline, IA, SI, and EMuReL methods using the PPO algorithm. Each point of these curves shows the average collective reward over at least 96,000 environment steps (96 episodes of 1000 steps). The opaque curve is the mean of the results of 15 experiments. The shadows are the bands of the confidence interval obtained by estimating the unbiased variance of the collective rewards.
Figure A4. The results for the Harvest environment using the PPO algorithm. The same setup as that of Figure A3 is used. The opaque curve is the mean of the results of 5 experiments.
Figure A5. The comparison of the results for the Harvest environment using the PPO algorithm. The final results of all methods are almost identical in this experiment.
Figure A6. The results for the Harvest environment using the A3C algorithm. The same setup as that of Figure A3 is used, with the difference of using the A3C rather than the PPO algorithm and 4 instead of 5 agents. The opaque curve is the mean of the results of 5 experiments. The curves of all methods become almost flat after $0.2 \times 10^8$ steps. The advantageous EMuReL method outperforms the other methods after $0.1 \times 10^8$ steps and has the best overall result.
Figure A7. The results of the ablation studies for the Cleanup environment. (a–c) The mean reward obtained by 5 agents over individual experiments of the EMuReL and its ablation system using the PPO algorithm. Each point of these curves shows the average collective reward over at least 96,000 environment steps (96 episodes of 1000 steps). The opaque curve is the mean of the results obtained by the best 4 experiments of each study. The shadows are the bands of the confidence interval obtained by estimating the unbiased variance of the collective rewards.

References

  1. Gardner, R.; Ostrom, E.; Walker, J.M. The nature of common-pool resource problems. Ration. Soc. 1990, 2, 335–358. [Google Scholar] [CrossRef]
  2. Pretorius, A.; Cameron, S.; van Biljon, E.; Makkink, T.; Mawjee, S.; Plessis, J.d.; Shock, J.; Laterre, A.; Beguir, K. A game-theoretic analysis of networked system control for common-pool resource management using multi-agent reinforcement learning. arXiv 2020, arXiv:2010.07777. [Google Scholar]
  3. Chu, T.; Chinchali, S.; Katti, S. Multi-agent reinforcement learning for networked system control. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  4. Sallab, A.E.; Abdou, M.; Perot, E.; Yogamani, S. Deep reinforcement learning framework for autonomous driving. Electron. Imaging 2017, 29, 70–76. [Google Scholar] [CrossRef]
  5. Kempka, M.; Wydmuch, M.; Runc, G.; Toczek, J.; Jaśkowski, W. Vizdoom: A doom-based ai research platform for visual reinforcement learning. In Proceedings of the 2016 IEEE Conference on Computational Intelligence and Games (CIG), Santorini, Greece, 20–23 September 2016; pp. 1–8. [Google Scholar]
  6. Leibo, J.Z.; Zambaldi, V.; Lanctot, M.; Marecki, J.; Graepel, T. Multi-agent reinforcement learning in sequential social dilemmas. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, International Foundation for Autonomous Agents and Multiagent Systems, Sao Paulo, Brazil, 8–12 May 2017; pp. 464–473. [Google Scholar]
  7. Sutton, R.S.; Barto, A.G. Introduction to Reinforcement Learning; MIT Press: Cambridge, MA, USA, 1998; Volume 135. [Google Scholar]
  8. Heemskerk, H. Social Curiosity in Deep Multi-Agent Reinforcement Learning. Master’s Thesis, Utrecht University, The Netherlands, 2020. [Google Scholar]
  9. Hernandez-Leal, P.; Kartal, B.; Taylor, M.E. Is multiagent deep reinforcement learning the answer or the question? A brief survey. Learning 2018, 21, 22. [Google Scholar]
  10. Barto, A.G.; Simsek, O. Intrinsic motivation for reinforcement learning systems. In Proceedings of the Thirteenth Yale Workshop on Adaptive and Learning Systems; Yale University Press: New Haven, CT, USA, 2005; pp. 113–118. [Google Scholar]
  11. Klyubin, A.S.; Polani, D.; Nehaniv, C.L. Empowerment: A universal agent-centric measure of control. In Proceedings of the 2005 IEEE Congress on Evolutionary Computation, Edinburgh, UK, 2–5 September 2005; Volume 1, pp. 128–135. [Google Scholar]
  12. Jaques, N.; Lazaridou, A.; Hughes, E.; Gulcehre, C.; Ortega, P.; Strouse, D.; Leibo, J.Z.; De Freitas, N. Social influence as intrinsic motivation for multi-agent deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 3040–3049. [Google Scholar]
  13. Burda, Y.; Edwards, H.; Pathak, D.; Storkey, A.; Darrell, T.; Efros, A.A. Large-Scale Study of Curiosity-Driven Learning. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  14. Pathak, D.; Agrawal, P.; Efros, A.A.; Darrell, T. Curiosity-driven exploration by self-supervised prediction. In Proceedings of the International Conference on Machine Learning, PMLR, Sydney, NSW, Australia, 6–11 August 2017; pp. 2778–2787. [Google Scholar]
  15. Hughes, E.; Leibo, J.Z.; Phillips, M.; Tuyls, K.; Dueñez-Guzman, E.; Castañeda, A.G.; Dunning, I.; Zhu, T.; McKee, K.; Koster, R.; et al. Inequity aversion improves cooperation in intertemporal social dilemmas. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; pp. 3330–3340. [Google Scholar]
  16. Salehi-Abari, A.; Boutilier, C.; Larson, K. Empathetic decision making in social networks. Artif. Intell. 2019, 275, 174–203. [Google Scholar] [CrossRef]
  17. Perolat, J.; Leibo, J.Z.; Zambaldi, V.; Beattie, C.; Tuyls, K.; Graepel, T. A multi-agent reinforcement learning model of common-pool resource appropriation. Adv. Neural Inf. Process. Syst. 2017, 30. Available online: https://proceedings.neurips.cc/paper_files/paper/2017/hash/2b0f658cbffd284984fb11d90254081f-Abstract.html (accessed on 1 January 2024).
  18. Vinitsky, E.; Jaques, N.; Leibo, J.; Castenada, A.; Hughes, E. An Open Source Implementation of Sequential Social Dilemma Games. GitHub Repository. 2019. Available online: https://github.com/eugenevinitsky/sequential_social_dilemma_games/issues/182 (accessed on 1 January 2024).
  19. Littman, M.L. Markov games as a framework for multi-agent reinforcement learning. In Machine Learning Proceedings 1994; Elsevier: Amsterdam, The Netherlands, 1994; pp. 157–163. [Google Scholar]
  20. Van Der Wal, J. Stochastic Dynamic Programming: Successive Approximations and Nearly Optimal Strategies for Markov Decision Processes and Markov Games; Mathematical Centre Tracts, Mathematisch Centrum: Amsterdam, The Netherlands, 1981. [Google Scholar]
  21. Markov, A.A. The theory of algorithms. Tr. Mat. Instituta Im. Steklova 1954, 42, 3–375. [Google Scholar]
  22. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958. [Google Scholar]
  23. Yu, C.; Zhang, M.; Ren, F. Emotional Multiagent Reinforcement Learning in Social Dilemmas. In Proceedings of the Prima, Dunedin, New Zealand, 1–6 December 2013. [Google Scholar]
  24. Yu, C.; Zhang, M.; Ren, F.; Tan, G. Emotional Multiagent Reinforcement Learning in Spatial Social Dilemmas. IEEE Trans. Neural Netw. Learn. Syst. 2015, 26, 3083–3096. [Google Scholar] [CrossRef] [PubMed]
  25. Badia, A.P.; Sprechmann, P.; Vitvitskyi, A.; Guo, Z.D.; Piot, B.; Kapturowski, S.; Tieleman, O.; Arjovsky, M.; Pritzel, A.; Bolt, A.; et al. Never Give Up: Learning Directed Exploration Strategies. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  26. Turner, A.M.; Hadfield-Menell, D.; Tadepalli, P. Conservative agency via attainable utility preservation. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, New York, NY, USA, 7–8 February 2020; pp. 385–391. [Google Scholar]
  27. Krakovna, V.; Orseau, L.; Kumar, R.; Martic, M.; Legg, S. Penalizing side effects using stepwise relative reachability. arXiv 2018, arXiv:1806.01186. [Google Scholar]
  28. Wang, T.; Wang, J.; Wu, Y.; Zhang, C. Influence-based multi-agent exploration. arXiv 2019, arXiv:1910.05512. [Google Scholar]
  29. Kollock, P. Social dilemmas: The anatomy of cooperation. Annu. Rev. Sociol. 1998, 24, 183–214. [Google Scholar] [CrossRef]
  30. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar]
  31. Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, PMLR, New York, NY, USA, 19–24 June 2016; pp. 1928–1937. [Google Scholar]
  32. Foerster, J.; Farquhar, G.; Afouras, T.; Nardelli, N.; Whiteson, S. Counterfactual multi-agent policy gradients. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
  33. Canese, L.; Cardarilli, G.C.; Di Nunzio, L.; Fazzolari, R.; Giardino, D.; Re, M.; Spanò, S. Multi-Agent Reinforcement Learning: A Review of Challenges and Applications. Appl. Sci. 2021, 11, 4948. [Google Scholar] [CrossRef]
  34. Singh, S.; Barto, A.G.; Chentanez, N. Intrinsically motivated reinforcement learning. In Proceedings of the 17th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 13–18 December 2004; pp. 1281–1288. [Google Scholar]
Figure 1. The elimination process of computing impacts by agent k. In the iterative procedure of this process, each time, an agent is removed from the neural network computations of the forward model by setting the corresponding network weights of that agent in the input to zero. The disparity between the outputs of the forward model in the presence and absence of agent j is employed as a measure of agent j’s impact in predicting the features of the current local state from agent k’s viewpoint.
Figure 2. The results for the Cleanup environment. The mean reward obtained by 5 agents over individual experiments of the baseline, IA, SI, and EMuReL methods using the PPO algorithm. Each point of these curves shows the average collective reward over at least 96,000 environment steps (96 episodes of 1000 steps). The opaque curve is the mean of the results of 15 experiments. The shadows are the bands of the confidence interval obtained by estimating the unbiased variance of the collective rewards. (a) The comparison between the mean collective rewards of all methods computed by removing the best and worst results of each method. The IA and baseline methods have a sharp initial rise but then slow down, whereas EMuReL and SI have a relatively constant growth rate (with EMuReL the smoothest), and EMuReL grows with a greater slope than SI. (b) The comparison between the mean equality of all methods.
Figure 3. The results for the Harvest environment using the A3C algorithm. The same setup as that of Figure 2 is used, with the difference of using the A3C rather than the PPO algorithm and 4 instead of 5 agents. The opaque curve is the mean of the results of 5 experiments. The curves of all methods become almost flat after $0.2 \times 10^8$ steps. The advantageous EMuReL method outperforms the other methods after $0.1 \times 10^8$ steps and has the best overall result.
Figure 4. The result of the ablation experiment for the Cleanup environment. The mean reward is obtained by 5 agents over individual experiments of the EMuReL and its ablation system using the PPO algorithm. Each point of these curves shows the average collective reward over at least 96,000 environment steps (96 episodes of 1000 steps). The opaque curve is the mean of the results obtained by the best 4 experiments of each study. The shadows are the bands of the confidence interval obtained by estimating the unbiased variance of the collective rewards.
Table 1. Collected rewards in Cleanup and Harvest environments. The collective reward obtained by agents in the last 192,000 and 1,600,000 steps is averaged over 13 and 5 experiments in the Cleanup and Harvest environments, respectively. The results of the IA and EMuReL methods are reported for the advantageous-IA agents. The largest number in each column is in boldface.
Methods     Environment (Algorithm)
            Cleanup (PPO)    Harvest (A3C)
Baseline    484.4            622.2
IA          431.7            494.7
SI          554.2            475.5
EMuReL      663.2            685.5