1. Introduction
Context, Literature and Limitations: With the continuous growth of electricity demand, the power grid must serve an increasingly diversified market while maintaining safe, reliable, and stable operation, which poses great challenges to the traditional power grid [1,2,3]. Therefore, the next-generation power grid system, the smart grid, is designed to meet these challenges [4]. It relies on more intelligent power distribution methods to ensure stable electricity supply to users [5], and requires the simultaneous cooperation of multiple distributors on the network to meet customers' needs [6]. To achieve this, researchers have begun to use multi-agent reinforcement learning algorithms to solve collaborative decision-making in smart grid power systems [7,8,9,10]. In this framework, the distribution nodes in the smart grid are modeled as reinforcement learning agents that learn a cooperative distribution strategy [11].
Several collaborative reinforcement learning algorithms have been proposed recently to address the collaboration problem in power grids [12,13,14,15]. These algorithms model nodes on the smart grid, such as power distributors, charge and discharge controllers, and switch controllers, as agents, and improve the performance of the overall system by optimizing the cooperative strategy of the multiple nodes (agents). A distributed reinforcement learning algorithm based on a cross-node distributed value function representation was proposed and demonstrated good results in simulations of distributed power grid control [12]. A multi-agent system (MAS) was proposed for managing distribution networks connected to regional substations; the MAS manages the orderly connection and disconnection of resources to minimize disruption to the balance of supply and demand within the grid network [13]. A multi-agent algorithm formulated within the framework of dynamic programming was proposed to solve the power flow control problem, where a group of individual agents autonomously learns to cooperate and communicate through a global optimization objective [14]. A fuzzy Q-learning approach was proposed that treats microgrid components as independent learners while sharing state variables to coordinate their behavior; experiments showed that a single agent can effectively control system components, and that coordination can systematically ensure a stable power supply and improve the reliability of microgrid systems [15]. In these methods, there must be sufficient information exchange between agents, including but not limited to each agent's local observation and received reward, so that coordination of the agent system can be realized. However, in an actual smart grid system, the number of nodes (agents) is large, resulting in a complex grid network that requires considerable communication bandwidth [16]. As a result, there has been renewed interest in the academic community in reducing the communication bandwidth between agents [17].
In reality, information exchange through communication enables humans to form teams, allowing everyone to perceive, identify, and coordinate actions in the physical world to complete complex tasks [18]. This mechanism also applies to systems where multiple intelligent agents need to cooperate [19]. In a multi-agent system, agents only have local observations, and communication allows them to share information and spread experience. In embodied-agent navigation tasks [19,20,21], the navigator agent needs additional communication from an "oracle agent", which provides detailed information about the map. Many algorithms have been proposed recently to study communication among agents, including DIAL [22], CommNet [23], BiCNet [24], TarMAC [25], NDQ [26], and CoMON [19]. However, these methods require continuous communication among agents. As the number of agents and the communication frequency increase, valuable information may be submerged in the ocean of information. In some practical applications, communication is expensive, and the increase in the amount of information causes an increase in bandwidth, computational complexity, and communication delays.
To allow agents to identify valuable knowledge in this ocean of information, some studies applied the attention mechanism to learn weights for the communication among agents and filter out redundant information [25,27]. However, time-series observation data make the learned attention model unstable [28]. Besides, the attention model does not essentially reduce the communication bandwidth among agents but only imposes a weight on the information [9]. Another option for reducing communication bandwidth is a selective sending mechanism. The key to designing a selective sending mechanism is enabling the agent to identify the critical states and moments at which to send a message. If we can predict the outcomes of sending and of not sending a message, then the agent can selectively send messages based on the predicted outcomes. Fortunately, in causal inference, whether or not to send a message can be viewed as an intervention variable (a binary variable), and the effect of the intervention variable on the outcome can be predicted by estimating the individual treatment effect (ITE) [29,30,31]. Therefore, we propose a causal inference communication model to address the communication bandwidth problem among agents. By employing causal inference, we regard whether or not to communicate as an intervention variable. Based on the intervention variable, we can estimate the effect of communication on the outcome, determine the intervention variable, and decide whether the agents need to communicate. The learned causal model allows agents to send messages selectively, which reduces the amount of information exchanged between agents and thus the communication bandwidth.
This paper proposes a collaborative multi-agent model with causal inference, referred to as the Causal Inference Communication Model (CICM), to help each reinforcement learning agent in the system decide whether to communicate or not. Specifically, we implement CICM in the following steps. First, to connect reinforcement learning with the causal model, CICM embeds the control problem into a graphical model via the Bayesian formula, which is simultaneously a causal model. Then, we parameterize the causal graph as a latent variable model based on variational autoencoders (VAE) [32]. By evaluating the ITE [30], we can determine whether communication is beneficial. Thus, agents communicate only at critical moments and in necessary states, reducing the communication bandwidth in the smart grid system.
Therefore, a summary of our objectives is listed below:
- (1) To apply the causal model to optimize communication selection in the smart grid problem, as well as in collaborative agent tasks.
- (2) To formulate a graphical model that connects the control problem and causal inference and decides the necessity of communication.
- (3) To prove empirically that the proposed CICM model effectively reduces communication bandwidth in smart grid networks and collaborative agent tasks while improving task completion.
Contributions of this Paper: This paper makes the following original contributions to the literature:
- (1) This is the first study to apply the causal model to optimize communication selection in collaborative agent tasks in the smart grid.
- (2) A new graphical model is developed that connects the control problem and causal inference and decides the necessity of communication.
- (3) Innovative numerical evidence that the proposed CICM model effectively reduces communication bandwidth is provided.
Structure of the Paper: Section 2 presents the existing literature on the smart grid, communication among cooperative agents, and causal inference, along with their limitations. Section 3 discusses the concepts and structure of the graphical model adopted in this study, while Section 4 presents the causal graphical model and the inference method, the Causal Inference Communication Model (CICM). Datasets, the environments, and the computational experiments are reported in Section 5, and the results of these experiments are analyzed in Section 6. Research limitations and threats to validity are discussed in Section 7. Conclusions and further research are stated in Section 8.
4. Method: The Causal Inference Communication Model (CICM)
The challenge of distributor cooperation in smart grids is a natural multi-agent collaboration problem. To reduce the communication frequency in smart grids, and thereby the communication bandwidth, we propose the CICM. This section describes CICM in detail. We first discuss reinforcement learning with the causal model and establish a graphical model, which offers a flexible framework. Then, based on the graphical model, we introduce the ITE to determine whether agents need to communicate or not.
4.1. Connecting Causal Model and Reinforcement Learning
Strategies for distributors in a smart grid can be learned using reinforcement learning, and the communication strategies between distributors can be obtained using causal inference. To connect them, we integrate the graphical models of reinforcement learning and causal inference into a single graphical model. Reinforcement learning embedded into the graphical model is shown in Figure 1b; the objective function can be obtained with maximum entropy [59] through approximate inference. To integrate with the causal model, this paper extends the graphical model of reinforcement learning by introducing a latent variable $z$ and an intervention variable $h$, as shown in Figure 2. In smart grid systems, the intervention variable $h$ indicates whether the distributor node accepts external messages. For the latent variable $z$, we adopt a variational autoencoder (VAE) to learn a state representation vector for the control problem. Through the VAE, we obtain an informative supervision signal, and a latent vector $z$ representing any uncertain state variables is quickly learned during training. The intervention variable $h$ controls the presence or absence of communication: agent $i$ accepts the messages $m^{-i}$ from the oracle agent when $h = 1$ and rejects $m^{-i}$ when $h = 0$, where $m$ is the outgoing communication of agent $i$ and $m^{-i}$ is the communication from the other agents. The intervention variable allows us to employ a causal model to estimate the impact of communication on the outcome distribution.
Figure 2 presents the probabilistic graphical model, containing the latent variable $z$, the agent's observation data $x$, the intervention variable $h$, the outgoing communication $m$, the communication from other agents $m^{-i}$, and the action $a$. We first use the chain rule to express the relationship between the variables with the Bayesian formula:
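One plausible sketch of this factorization, assuming the dependency structure described above (the observation $x$ and the outgoing message $m$ generated from the latent $z$, the incoming message $m^{-i}$ gated by $h$, and the action $a$ depending on $z$ and the accepted message), is:

```latex
% Hypothetical reconstruction of Equation (4): chain-rule (Bayesian)
% factorization of the joint distribution over the variables in Figure 2.
p(x, z, h, m, m^{-i}, a)
  = p(z)\, p(x \mid z)\, p(m \mid z)\, p(h)\, p(m^{-i})\, p(a \mid z, m^{-i}, h)
```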
The variational distribution of the graphical model can be written as the product of recognition terms and policy terms:
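A plausible form, assuming a recognition network $q(z \mid x)$ and a policy term over actions, is:

```latex
% Hypothetical reconstruction of Equation (5): variational distribution as
% the product of a recognition term and a policy term.
q(z, a \mid x, m^{-i}, h) = \underbrace{q(z \mid x)}_{\text{recognition}}\;
                            \underbrace{\pi\big(a \mid z, m^{-i}, h\big)}_{\text{policy}}
```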
Optimizing the evidence lower bound (ELBO) yields the maximum marginal likelihood [59]. From the posterior of the variational distribution (Equation (5)) and the likelihood of the joint distribution (Equation (4)), the marginal likelihood can be bounded using Jensen's inequality. The ELBO is:
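A plausible form, matching the description below that its first term is the KL divergence of the approximate from the true posterior while the left-hand side is fixed, is the standard decomposition:

```latex
% Hypothetical reconstruction of Equation (6): marginal likelihood split
% into a KL-divergence term and the evidence lower bound.
\log p(x) = D_{\mathrm{KL}}\big(q(z \mid x)\,\|\,p(z \mid x)\big) + \mathcal{L}_{\mathrm{ELBO}}
```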
The first term of the above equation (Equation (6)) is the KL-divergence of the approximate posterior from the true posterior. Since the marginal likelihood is fixed and the KL-divergence is non-negative, we convert the problem into optimizing the ELBO [32]. We rewrite the ELBO as follows, and present the complete derivation of the ELBO in Appendix A.
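A plausible form of the rewritten ELBO, assuming the two-term structure described below (a latent variable model term and a maximum entropy term), is:

```latex
% Hypothetical reconstruction of Equations (8)-(9): ELBO split into a
% latent-variable-model term and a maximum-entropy RL term.
\mathcal{L}_{\mathrm{ELBO}}
  = \underbrace{\mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z)\big]
      - D_{\mathrm{KL}}\big(q(z \mid x)\,\|\,p(z)\big)}_{\text{Equation (8): latent variable model}}
  + \underbrace{\mathbb{E}\Big[\textstyle\sum_t r(x_t, a_t)
      + \mathcal{H}\big(\pi(a_t \mid z_t, m_t^{-i})\big)\Big]}_{\text{Equation (9): maximum entropy objective}}
```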
where $\log p(\mathcal{O}_t \mid x_t, a_t) = r(x_t, a_t)$ in the control-as-inference framework. For simplicity, we omit the constant term, i.e., the uniform action prior, from the ELBO. The first term of the ELBO (Equation (8)) is the latent variable model for the latent variable $z$; it comprises the generative (decoder) models and the inference (encoder) model $q(z \mid x)$. The second term (Equation (9)) is the maximum entropy objective function [59].
Figure 3 shows the architecture of the model and inference network, which includes the encoder and decoder in the VAE.
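To make the latent variable model concrete, the following is a minimal PyTorch-style sketch of an encoder/decoder pair of the kind described above; all layer sizes, module names, and the Gaussian choices are illustrative assumptions, not the exact architecture of Figure 3.

```python
import torch
import torch.nn as nn

class LatentStateVAE(nn.Module):
    """Minimal sketch: encoder q(z|x) and decoder p(x|z) for the latent state model."""
    def __init__(self, obs_dim: int, latent_dim: int, hidden_dim: int = 128):
        super().__init__()
        # Inference model q(z|x): maps observations to Gaussian latent parameters.
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden_dim), nn.ReLU())
        self.mu_head = nn.Linear(hidden_dim, latent_dim)
        self.logvar_head = nn.Linear(hidden_dim, latent_dim)
        # Generative model p(x|z): reconstructs the observation from z.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, obs_dim)
        )

    def forward(self, x: torch.Tensor):
        h = self.encoder(x)
        mu, logvar = self.mu_head(h), self.logvar_head(h)
        # Reparameterization trick: z = mu + sigma * eps.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.decoder(z), mu, logvar

def elbo_loss(x, recon, mu, logvar):
    """Negative ELBO: reconstruction term plus KL(q(z|x) || N(0, I))."""
    recon_term = ((recon - x) ** 2).sum(dim=-1)
    kl_term = -0.5 * (1 + logvar - mu ** 2 - logvar.exp()).sum(dim=-1)
    return (recon_term + kl_term).mean()
```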
4.2. Estimating Individual Treatment Effect
ITE is used to compute the difference in outcomes caused by the intervention variable h. In practice, only the outcome of the treatment or of the control can be observed, and the counterfactual outcome of the unimplemented intervention is always missing. Similarly, during the training of RL agents, only the outcome of the specific communication choice that was made is observed. We cannot obtain the individual-level effect directly from the observed trajectories of the agent. The immediate challenge is therefore to estimate the missing counterfactual outcomes from the historical trajectories and then estimate the ITE.
According to Definition 2, the ITE of an agent with respect to the intervention variable $h$ is measured as the difference between the expected treatment effect when $h = 1$ (accept $m^{-i}$ from the other agents) and when $h = 0$ (reject $m^{-i}$ from the other agents), which can be written as:
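With $Y$ denoting the outcome of interest (an assumed symbol), the standard ITE expression matching this description is:

```latex
% Reconstruction of the ITE definition described above.
\mathrm{ITE} = \mathbb{E}\big[\,Y \mid do(h = 1)\,\big] - \mathbb{E}\big[\,Y \mid do(h = 0)\,\big]
```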
The $do(\cdot)$ operator in the above formula denotes the intervention condition. According to the backdoor criterion in Definition 3, we apply the rules of the do-calculus to Figure 2 and obtain a chain of equalities (Equations (12)-(14)): the transition from (12) to (13) follows from the rules of do-calculus applied to the causal graph in Figure 2, and the conditional independencies implied by the graph, given the adjustment variables, transform (13) into (14). The expression for $do(h = 0)$ follows similarly.
We can thus obtain the ITE in probabilistic form. The following expression can be calculated using the data distribution before the intervention; we use the backdoor criterion to estimate the ITE in the following form:
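Assuming the latent variable $z$ serves as the backdoor adjustment set, the standard adjustment formula gives:

```latex
% Hypothetical reconstruction: backdoor-adjusted ITE, adjusting for the
% latent variable z (assumed adjustment set).
\mathrm{ITE} = \sum_{z} p(z)\,\Big(\mathbb{E}\big[Y \mid h = 1, z\big]
             - \mathbb{E}\big[Y \mid h = 0, z\big]\Big)
```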
The value of the binary variable $h$ is then determined from the predicted ITE; it indicates whether the agent needs to communicate. We add a term to the latent variable model to help us predict the outcomes under both interventions. Here, the inputs are the actual observations, and we use the relationship between the optimality variable and the reward $r$ in the control-as-inference framework to calculate the label value corresponding to the distribution.
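The resulting decision rule can be summarized in a short sketch; the estimator interface, names, and threshold below are illustrative assumptions rather than the paper's exact implementation.

```python
def should_communicate(outcome_model, z, threshold: float = 0.0) -> bool:
    """Decide the intervention h by comparing predicted outcomes.

    outcome_model(z, h) is assumed to predict the expected outcome
    (e.g., expected return/optimality) under intervention h in {0, 1}.
    """
    y_comm = outcome_model(z, h=1)      # predicted outcome if the message is accepted
    y_no_comm = outcome_model(z, h=0)   # predicted (counterfactual) outcome otherwise
    ite = y_comm - y_no_comm            # individual treatment effect estimate
    return ite > threshold              # communicate only when it is predicted to help
```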
5. Experiments, Datasets and the Environment
We first introduce the power grid problem, a distributed-distributors environment in the smart grid. In addition, to fully demonstrate the effectiveness of our method, we introduce StarCraft II and 3D Habitat environment experiments. Both have high-dimensional observation spaces and can validate the generalization ability of our model.
5.1. Datasets and Environment
We test our algorithm in power grid experiments [12]. To facilitate modeling and a preliminary investigation of the problem, the simulation environment is not a common alternating current (AC) power grid but a direct current (DC) variant. Although the physical behavior of AC grids is quite different from that of DC grids, the method serves as a general learning scheme for distributed control problems.
As shown in Figure 4, the regulation system involves three parts: voltage sources, cities, and distributors. The grey circles are the distributors, which we model as the agents in the reinforcement learning algorithm. Each distributor interacts with the environment, receives observations, performs actions that adjust voltages based on those observations, and then receives reward feedback from the environment. The reward fed back to an agent measures the degree to which the distributor satisfies the city's electricity consumption. If the voltage received by a city node is lower than the expected value, the environment feeds back a penalty to the distributors connected to that city node. The cooperation of multiple distributors is required to divide the voltage reasonably among the cities and meet the urban electricity demand. A distributor that is not connected to a city receives a reward of 0.
The simulated power grid problem is solved using the reinforcement learning algorithm, where the action, state, and reward values are as follows:
(1) Local actions: The power grid system controls the current by controlling variable resistances. Each distributor node makes a choice for each power line (variable resistance) connected to it. A distributor can perform three actions on a resistance: Same, Halved, and Doubled. If a line is connected to two distributors, it is affected by both simultaneously, and the final selection is made according to Table 1.
(2) Local state: A distributor receives state information from the lines connected to it. There are three types of power line connections: distributor-distributor, distributor-city, and distributor-voltage source. (1) Distributor-distributor: ➀ whether its voltage is higher than the neighbor's; ➁ how the neighbor's voltage changes (increasing, decreasing, or unchanged); ➂ the state of the resistance (maximum, minimum, or intermediate value). (2) Distributor-city: ➀ whether its voltage is higher than the neighbor's; ➁ whether the city needs a voltage increase; ➂ the state of the resistance. (3) Distributor-voltage source: ➀ whether its voltage is higher than the neighbor's; ➁ the state of the resistance.
(3) Local reward: When the voltage of the city connected to a distributor node is lower than the expected level, the environment feeds back a negative reward to the distributor, equal to the difference between the actual city voltage and the expected voltage; otherwise, the reward is 0. A distributor that is not connected to a city has a reward of 0.
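A minimal sketch of this local reward rule, with hypothetical variable names:

```python
def local_reward(connected_to_city: bool, v_city: float, v_expected: float) -> float:
    """Local reward for a distributor node, following the rule above."""
    if not connected_to_city:
        return 0.0
    # Negative reward equal to the voltage shortfall when the city voltage
    # is below the expected level; zero otherwise.
    return min(0.0, v_city - v_expected)
```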
We use the multi-object navigation (multiON) dataset [60], which is adopted by the AI Habitat simulator [61]. This dataset is designed for navigation tasks with the following essential elements: agent start position, direction, and target position. Eight cylindrical target objects with different colours are set. The agents' episodes are generated based on Matterport3D [62] scenes, and the data are split according to the standard scene-based Matterport3D train/val/test split [60]. Based on the multiON dataset, a multi-target sequential discovery task is constructed: the agent needs to navigate to the designated target locations in sequence within an episode. The FOUND action is defined as finding the target and should be taken when the agent is within a distance threshold of the target; if FOUND is called beyond the threshold, the episode fails. If the target is not found within a specified step limit, the episode is also judged a failure. We use m-ON to denote an episode with m sequential objects. In the task, we define two heterogeneous agents: an oracle agent with a god's-eye perspective, and an embodied navigator that performs the specific navigation task and interacts with the environment. The oracle's observations are the navigator's position and orientation, the global map information, and the target position. The navigator only observes self-centered visual images with depth information; if obstacles block the target position, the navigator cannot perceive it and needs additional information from the oracle. The communication bandwidth for guidance information between the oracle and the navigator is limited. The two agents share the same reward, so they must learn to cooperate to optimize their mutual reward.
We design an experiment with two scenarios based on the StarCraft II environment [63], as shown in Figure 5: a maze task (Figure 5a) and a tree-search task (Figure 5b). In the maze task, the navigator agent starts from the bottom centre point and aims to navigate to a target point whose position is randomly initialized on a road of the maze. Note that the navigator agent does not know the target position. The oracle agent, on the contrary, has a god's-eye perspective that captures the target position and can send the relative target position (i.e., the target's position relative to the navigator agent) to the navigator. In the tree environment, the target point is initialized at a random leaf node of the map; similarly, the oracle agent can pass the relative position of the target point to the navigator. We set two different depths in this scenario, 3 and 4; an increase in depth increases the game's difficulty. For convenience, an enemy agent represents the target point. This enemy agent is inactive and has very little health, so a single attack can kill it; its death indicates that the navigator agent has successfully arrived at the target point, for which the navigator receives an additional reward.
5.2. Reward Structure and Evaluation Indicators
In the Habitat and StarCraft II environments, the reward is designed as $r_t = I_t \cdot r_{goal} + \Delta_t + p_t$, where $I_t$ is an indicator variable that takes value 1 when the navigator agent finds the current goal at time $t$; if the target is found, the agent receives the reward $r_{goal}$. $\Delta_t$ is the difference in the distance to the target position between time steps $t$ and $t-1$, and $p_t$ is the penalty received by the agent at time step $t$; a communication penalty is incurred for a message sent at step $t$. To compare our results with previous studies, the communication penalty is only used in training and is excluded from the total reward in testing. In the power grid problem, the reward is defined as the degree to which a distributor satisfies the city's electricity consumption: if the voltage obtained by a city node is lower than the expected value, the environment feeds back a penalty to the distributors connected to that city node. As in the previous two environments, a communication penalty is applied in the power grid experiments.
We use the evaluation metrics from [64] for navigation tasks. In multiON [60], these metrics are extended to navigation tasks with sequential targets. We adopt two metrics in our experiments: PROGRESS, the fraction of target objects found in an episode, and PPL, the progress weighted by path length.
5.3. Network Architecture and Baseline Algorithm
Similar to CoMON [19], CICM adopts the network structure of TBONE [65,66]. In the Habitat and StarCraft II environments, the oracle agent encodes its information into a vector containing the navigator's location, the map information, and the target location. During the encoding process, the oracle in the Habitat environment crops and rotates the map to construct a self-centered map, implicitly encoding the navigator's orientation into the cropped map. Then, the oracle's initial belief is obtained through CNN and linear layers. This belief is encoded as a communication vector and sent to the navigator [65,66]. In StarCraft II, the oracle encodes the terrain surrounding inaccessible areas and related information; this information contains the target agent's position and is sent to the navigator.
For the Habitat environment, we use the algorithms in CoMON [19] as our comparison baselines: NoCom (i.e., no communication), U-Comm (i.e., unstructured communication), and S-Comm (i.e., structured communication). We build our causal inference communication model on top of these baselines, yielding U-Comm&CIC (i.e., U-Comm with the causal inference model) and S-Comm&CIC (i.e., S-Comm with the causal inference model).
For the StarCraft II environment, we design the following algorithms. To realize the maximum entropy term in the ELBO, we use the SAC [67] algorithm. Inspired by the SLAC algorithm [68], the latent variable is used in the critic to compute the Q function, while the agent's raw state input is used to compute the policy during execution. We design the following variants: SACwithoutComm, the model without communication; SACwithComm, the SAC algorithm with communication only; SACwithLatentAndComm, the SAC algorithm with communication plus the VAE-based latent variable model; and CICM (our method), which leverages causal inference and the VAE model.
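A minimal sketch of this SLAC-style split, where the critic consumes the latent sample while the actor consumes the raw observation and message; all module shapes and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LatentCritic(nn.Module):
    """Q(z, a): the critic operates on the VAE latent, as in SLAC-style training."""
    def __init__(self, latent_dim: int, act_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.q = nn.Sequential(
            nn.Linear(latent_dim + act_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, z: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        return self.q(torch.cat([z, a], dim=-1))

class StateActor(nn.Module):
    """pi(a | x, m): the actor uses the raw state and message at execution time."""
    def __init__(self, obs_dim: int, msg_dim: int, act_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.pi = nn.Sequential(
            nn.Linear(obs_dim + msg_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, act_dim),
        )

    def forward(self, x: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
        return self.pi(torch.cat([x, m], dim=-1))
```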
For the power grid environment, unlike the previous two environments, the power network involves communication among many agents: when an agent sends information, it is also a receiver of information. We encode the information sent by the distributors connected to a receiver and take the average as the received information. The algorithm design for the power grid network is otherwise the same as in the StarCraft II environment.
6. Analysis of Experiment Results
In this section, we first analyze the computational complexity of CICM, and then analyze the performance of the algorithm in three experimental environments (power grid, StarCraft II, and Habitat).
6.1. Complexity Analysis
We theoretically analyze the complexity of our algorithm, CICM. In the smart grid environment, we consider that all agents can communicate with each other, forming a complete graph. If there are $N$ agents, the communication complexity is $O(N^2)$. If each agent has no more than $k$ neighbor nodes, the complexity is less than $O(Nk)$. Therefore, the computational complexity is acceptable.
In the neural network, the computational complexity of our algorithm is determined by the parameter updates during training. Let $U$ represent the total number of training episodes, each consisting of $T$ steps. We denote the computational complexity of the ITE module by $M$ and that of the reinforcement learning (SAC) module by $W$. During training, an update is made every $C$ steps, so the computational complexity of our algorithm is $O(UT(M+W)/C)$. We define $d_s$ as the dimension of the states, $d_m$ as the dimension of the communication, $d_h$ as the dimension of the hidden layer, $d_z$ as the latent variable dimension, and $d_a$ as the action dimension; the binary variable $h$ has dimension 1. The ITE module contains two parts, an encoder and a decoder, whose complexities are dominated by the matrix multiplications between the input dimensions ($d_s$, $d_m$), the hidden dimension $d_h$, and the latent dimension $d_z$. The reinforcement learning (SAC) module also includes two parts: the critic, which involves two Q networks operating on the latent and action dimensions through the hidden layer, and the actor, which maps the state and communication dimensions through the hidden layer to the action dimension.
6.2. Power Grid Environment
Our algorithm incurs a communication penalty in the reward during training; for a fair comparison, we do not count the penalty during testing. Figure 6 shows the penalty values under two different grid structures, and Table 2 shows the communication probability of our algorithm CICM, calculated by dividing the number of time steps with communication by the number of communication steps of the full-communication algorithm (which communicates at every time step). In Figure 6a,b, we can see that the algorithm without communication, SACwithoutComm, receives significantly more penalty than the other three algorithms, which communicate. Among the latter, SACwithComm, which uses communication directly, performs better than the algorithm without communication but not as well as SACwithLatentAndComm, which combines the latent variable model with the communication information. Our algorithm CICM, which combines the latent variable model and causal inference for communication judgment, shows the best performance. The communication judgment helps the agent filter unnecessary information, which reduces the penalty caused by the distributed voltage while reducing the communication cost.
We further analyze the communication probability in Table 2. Since we impose a communication penalty, whenever the agents communicate with each other the system receives the penalty, which is included in the system's feedback reward. The communication penalty reduces the feedback reward and thus the probability of the system obtaining the optimal feedback. To increase the probability of optimal feedback, the system needs to reduce the communication probability, thereby reducing the communication penalty. We test our model on grid a (Figure 4a) and grid b (Figure 4b) and obtain communication probabilities of 37.4% and 32.9%, respectively. The experiment shows that our model uses a small amount of communication to reduce the penalty value. However, the communication probability will not drop to zero, since the power grid system requires a certain amount of communication to ensure cooperation among the distributors.
6.3. StarCraft Experiment
Figure 7 shows the rewards in our StarCraft II environment. Our algorithm incurs a communication penalty in the reward during training, and for a fair comparison we do not count the penalty generated by communication costs during testing. All the graphs show that the reward learned by the algorithms without communication is significantly lower than that of the algorithms with communication, because the oracle agent provides critical navigation information, including the target position. SACwithComm converges quickly, rising fast at the beginning in all three graphs. In contrast, the models with latent variables (CICM and SACwithLatentAndComm) learn slowly at first, because a certain amount of data is required to train a stable VAE model. After obtaining a stable VAE, SACwithLatentAndComm rises rapidly, surpassing the performance of SACwithComm in the maze and slightly exceeding it in the tree environment. This shows that the latent model improves the performance of the algorithm.
CICM integrates the latent model and causal inference for communication judgment. Even though our algorithm learns slowly at the beginning (Figure 7b), with the help of the latent model it achieves the highest and most stable performance among all algorithms in the end. CICM's final performance exceeds the others because of the introduction of the communication strategy: it allows the agent to reduce unnecessary communication and memorized information in the RNN network, and it thus obtains the smallest variance in all three experiments.
Table 3 shows the communication probability and test result of our algorithm under different single-step communication penalties in the maze environment. From the table, we can see that an increase in the communication penalty decreases the communication frequency. The performance differences between the communication probabilities are not very large, and CICM achieves the best performance when the single-step communication penalty is −0.5. Therefore, we adopt −0.5 as the default value in the experiments.
We further analyze Table 3. The smaller the communication penalty, the smaller its impact on the communication probability. This is because the system regards the communication penalty as part of the feedback: when the penalty is small, it plays little role in the probability of obtaining the optimal feedback. But when the communication penalty reaches −0.5, the communication probability drops significantly, to 38.0%. The reason is that the penalty now noticeably affects the probability of the system obtaining the optimal feedback, so the communication probability must be reduced to keep the feedback optimal. At the same time, the communication penalty cannot be increased indefinitely: although a larger penalty reduces the communication probability, the lack of communication also harms the cooperation between agents. As the table shows, when the communication penalty reaches −2, the communication volume is reduced to 27.4%, but the test result also drops to 1.75.
6.4. Habitat Experiment
Below we analyze the algorithm's performance in Habitat. Our algorithm is first trained on the 3-ON setting to obtain a stable model; we then merge the counterfactual communication model with the trained model to obtain our overall model. We test this model on 1-ON, 2-ON, and 3-ON, respectively, and present the final results in Table 4. We also test the effect of different hyperparameter values on the communication probability of our algorithm S-Comm&CIC, as shown in Table 5.
NoCom provides a lower bound on what our algorithm can achieve without communication. Our algorithm adds the causal inference communication model to U-Comm and S-Comm (which we name U-Comm&CIC and S-Comm&CIC). U-Comm&CIC and S-Comm&CIC perform close to the original algorithms, and at 3-ON our algorithm slightly exceeds the original on the PPL indicator. A higher PPL metric indicates that our algorithm can complete navigation over shorter path lengths.
8. Conclusions
As electricity demand grows, it is necessary to develop a power grid system that can meet higher and more diversified market demands while offering safe, reliable, and stable performance. The emerging smart grid can meet the current challenges faced by the traditional power grid system, as it is based on more intelligent power distribution agents and methods that increase electricity generation and ensure safety, reliability, and stability. However, the distributed control of the smart grid system requires a large amount of bandwidth to meet the communication needs between distributor nodes. To keep the system performing well while reducing the communication bandwidth, we propose CICM, which adopts a causal inference model to help agents decide whether communication is necessary. The causal inference is constructed on a graphical model with the Bayesian formula, and this graphical model connects optimal control and causal inference. Estimating the counterfactual outcomes of the intervention variable and then evaluating the ITE helps the agent choose the best communication strategy. We conduct experiments on smart grid environment tasks, which show that CICM reduces the communication bandwidth while improving performance in the power grid environment. In addition, we conduct experiments on StarCraft II navigation tasks and the 3D Habitat environment, and the results again prove the effectiveness of the algorithm. This study serves as a first attempt to combine optimal control and reinforcement learning with causal inference models. We believe that causal inference frameworks will play an increasingly significant role in reinforcement learning problems.
Future research will focus on extending the current model in several directions, such as modeling multiple distributors, explicitly modeling the game-theoretic aspects of the graph model, distributed computing, causal modeling of other reinforcement learning agents, model-based multi-agent reinforcement learning, and variational inference for accelerating offline multi-agent reinforcement learning.