Article

Mutual Information and Multi-Agent Systems

1 Naval Research Laboratory, Code 5580, Washington, DC 20375, USA
2 2022 SEAP Summer Intern at the Naval Research Laboratory, Washington, DC 20375, USA
3 Jackson Health System, 1500 NW, 12th Ave, Miami, FL 33136, USA
* Author to whom correspondence should be addressed.
Entropy 2022, 24(12), 1719; https://doi.org/10.3390/e24121719
Submission received: 21 October 2022 / Revised: 17 November 2022 / Accepted: 19 November 2022 / Published: 24 November 2022

Abstract

We consider the use of Shannon information theory and its various entropic terms to aid in reaching optimal decisions in a multi-agent/Team scenario. Our method is to model how various agents interact, including power allocation. Our metric for agents passing information is the classical Shannon channel capacity. Our results are mathematical theorems showing how combining agents influences the channel capacity.

1. Introduction

Advances in machine intelligence have led to an increase in human-agent teaming. In this context, one or more machines act as semi-autonomous or autonomous agents interacting with other machine teammates and/or their human proxies. This phenomenon has led to cooperative work models where the role of an agent can be, interchangeably, a human, or machine, support system. Human counterparts that interact with automation become less like operators, supervisors, or monitors, and more like equal-authority peers.
Critical to the success of any team is efficient and effective communication. Multi-agent systems are no different. Information sharing is a key element in building collective cognition, and it enables agents to cooperate and ultimately achieve shared goals successfully. Information sharing, or communication, provides the foundation for a team’s success. In complex multi-agent engagements, information is not always universally available to all agents. Such engagements are often characterized by distributed entities with limited communication channels among them, where no agent has a complete view of the solution space, and information relevant to team goals only becomes available to team members in spontaneous, unpredictable and even unanticipated ways. Moreover, there is always a resource cost to inter-agent communication. Finding highly efficient and effective communication patterns is a recurring problem in any multi-agent system, particularly if the system agents are distributed.
We are concerned with how a Multi-agent System (MAS) [1], or Team, sends information between agents or teammates. By “how” we mean “how” in an information-theoretic [2] sense; in particular, we do not concentrate on the mechanics or physics of the transmission other than how it impacts information theory. We are concerned with what strategy an agent can use to maximize its information flow to another agent. From an information-geometric standpoint, we only use a simple metric in this article, but lay the groundwork for more complex Riemannian metrics. We are concerned with a transmitting agent sending a small number of distinct symbols in a fixed time. In fact, we restrict ourselves to two symbols to develop our theory (a list of notation is at the end of the article). We are using a mathematical approach to model the communication between two agents. The equations we present are based on a series of assumptions that we will explain.
We assume that an agent sends two symbols to another agent. We refer to the symbols as “0” or “1”. We are concerned with the fidelity of how the symbols are passed. All symbols take the same time to pass. We will be looking at the (Shannon) capacity as one agent attempts to send a symbol to another agent.
Our scenario is illustrated in Figure 1 and Figure 2. The first agent $A_X$ sends a 0 or 1 to the second agent $A_Y$. We have a clock and the unit of time is $t$. Every $t$, $A_X$ transmits the symbol to $A_Y$. We assume that the symbol is received within the same time unit (i.e., we assume instantaneous transmission speeds during each interval $t$). There is no feedback from $A_Y$ to $A_X$ (which, for the channels we analyze, would not change the capacity anyway (p. 520 [3])), and the transmission is considered to be memoryless (quoting [4], “…channel is memoryless if the probability distribution of the output depends only on the input at that time and is conditionally independent of previous channel inputs or outputs”). Furthermore, it is implicit that the channel statistics never change (sometimes the literature refers to this as a “stationary” condition).
To summarize the above, we have a Discrete Memoryless Channel (DMC) between $A_X$ and $A_Y$. This channel measures information flow in terms of bits per symbol (since $t$ does not vary). We let $X$ represent the input distribution to this DMC, and we let $Y$ denote the output random variable.
The probability for the random variable $X$ is given by $P(X = i)$, $i = 0, 1$; it is the probability that $A_X$ inputs symbol $i$, and $P(Y = j)$, $j = 0, 1$, is the probability that $A_Y$ received symbol $j$. The input distribution $X$ is determined by the transmission fidelity of $A_X$. In particular,
$$P(X = 0) = x, \qquad \bar{x} := P(X = 1) = 1 - x. \tag{1}$$
The output distribution $Y$, in contrast, is determined by the (assumed to be well-defined) conditional distribution between $X$ and $Y$, and the input distribution. Thus,
$$P(Y = j) = \sum_{i} P(Y = j \mid X = i) \cdot P(X = i). \tag{2}$$
The approach presented in this paper follows from [2,5,6,7].
The conditional probabilities of the DMC are given by a $2 \times 2$ matrix $M_1$, where (please keep in mind the swapping of the indices and, as we had above for $\bar{x}$, that notationally $\bar{\cdot} := 1 - \cdot$. Furthermore, the convention is that a conditional probability is fixed for all $P(X = i)$, even if that probability is 0. In the next footnote, we address the impact of this with respect to (w.r.t.) information theory)
$$m_{i,j} := P(Y = j \mid X = i)$$
and
$$M_1 = \begin{pmatrix} m_{0,0} & m_{0,1} \\ m_{1,0} & m_{1,1} \end{pmatrix} = \begin{pmatrix} P(Y = 0 \mid X = 0) & P(Y = 1 \mid X = 0) \\ P(Y = 0 \mid X = 1) & P(Y = 1 \mid X = 1) \end{pmatrix} =: \begin{pmatrix} a & \bar{a} \\ b & \bar{b} \end{pmatrix}. \tag{4}$$
Note that $(a, b) \in [0, 1] \times [0, 1]$.
Before we continue with the mathematics let us put this research into some more perspective. Von Neumann’s [8] seminal work had no concept of “Teamwork”, which is at the core of what we are discussing. Sliwa’s [9] review suggests that minimum communication channels are more important when context is understood during teamwork, a suggestion opposite to our work in this article which we hope to test in the future. Lawless [10] suggests that maximized channels become more important when Teams confront uncertainty in their environment. Schölkopf et al. [11] suggest that i.i.d. data are insufficient to reconstruct whatever social event is being captured, that something is missing and a new approach must be innovated, our goal in this article. Our results will be discussed in situ for maximum effect.

1.1. Entropy and Mutual Information

We extend our random variables to allow more than two possible outcomes, and give the following definitions with the most generality possible. We now have $I + 1$ possible inputs and $J + 1$ possible outcomes.
Given a discrete random variable $V$, we define the entropy of $V$ as
$$H(V) := -\sum_{j} P(V = v_j) \log P(V = v_j).$$
By convention, $\log$ is the base 2 logarithm and $\ln$ is the natural logarithm. Furthermore, we are able to extend the definitions (p. 19 [4]), as is standard, so that $0 \log(0) = 0 \log(0/0) = 0$. These conventions allow the most general derivation of (8) from (7).
If $z \in [0, 1]$, then we define the binary entropy function of $z$ as
$$h(z) := -z \log(z) - (1 - z) \log(1 - z).$$
Note that if $B$ is a binary random variable taking the values 0 or 1, then $H(B) = h\big(P(B = 0)\big)$. In fact, we simplify the notation and express the probability of the event $\{V = v_k\}$ as
$$p_v(v_k) = P(V = v_k).$$
Furthermore, when it is clear which distribution we are using, we further simplify the notation and just write $p(v_k)$. Thus,
$$H(B) = h\big(p(0)\big).$$
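The two entropy definitions above can be evaluated numerically. The following is a minimal sketch, not code from the article: the function names `entropy` and `binary_entropy` are illustrative, logarithms are base 2, and the convention $0 \log 0 = 0$ is applied by skipping zero-probability terms.

```python
import math

def entropy(probs):
    """Shannon entropy H(V) in bits; zero-probability terms contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

def binary_entropy(z):
    """Binary entropy h(z), i.e. the entropy of the distribution (z, 1 - z)."""
    if not 0.0 <= z <= 1.0:
        raise ValueError("z must lie in [0, 1]")
    return entropy((z, 1.0 - z))

# h is symmetric about z = 1/2, where it attains its maximum of 1 bit.
print(binary_entropy(0.5))
print(binary_entropy(0.2), binary_entropy(0.8))
```

For a binary random variable $B$, `binary_entropy(p0)` with `p0` set to $P(B = 0)$ plays the role of $H(B) = h(p(0))$.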
Given two discrete random variables $V, W$, we define [2] the conditional entropy of $V$ given $W$ as
$$H(V \mid W) := -\sum_{i} p_w(w_i) \sum_{j} p_{v|w}(v_j \mid w_i) \log p_{v|w}(v_j \mid w_i), \tag{5}$$
where, as in the $2 \times 2$ case,
$$P(v_j \mid w_i) := m_{i,j}, \quad i = 0, 1, \ldots, I; \ j = 0, 1, \ldots, J,$$
forming the channel matrix
$$M = \begin{pmatrix} p(v_0 \mid w_0) & p(v_1 \mid w_0) & \cdots & p(v_J \mid w_0) \\ p(v_0 \mid w_1) & p(v_1 \mid w_1) & \cdots & p(v_J \mid w_1) \\ \vdots & \vdots & \ddots & \vdots \\ p(v_0 \mid w_I) & p(v_1 \mid w_I) & \cdots & p(v_J \mid w_I) \end{pmatrix}. \tag{6}$$
(Of course, as in the $2 \times 2$ case, conditional probability is only defined when $p(w_i) \neq 0$. However, as we note below, such terms are dealt with by using the limiting value of the constant conditional probability term, which makes our mutual information calculations consistent, keeping in mind that $0 \log(\cdot)$ is always taken to be 0. Furthermore, keep in mind that a distribution that achieves capacity for a 2-input channel (the subject of this paper) never has either probability value as zero (Ref. [12] gives better bounds). There are, however, $3 \times 2$ channels for which this does not hold, for example $\begin{pmatrix} 1 & 0 \\ 0.8 & 0.2 \\ 0 & 1 \end{pmatrix}$, which has an optimizing input distribution of $(0.5, 0, 0.5)$.)
We define the mutual information between $V$ and $W$ by [2]
$$I(V, W) := H(V) - H(V \mid W) = H(W) - H(W \mid V) =: I(W, V). \tag{7}$$
Using (5) and (7), and some substitutions [4] (again, division by 0 is taken care of in the usual way by using limiting values ([Section 2.3] [4])), we find that
$$I(V, W) = \sum_{j, i} p(v_j, w_i) \log \frac{p(v_j, w_i)}{p(v_j)\, p(w_i)}. \tag{8}$$
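The double-sum form of the mutual information translates directly into code. The sketch below is illustrative only: the function name `mutual_information` and the list-of-rows matrix layout (row $i$ holds $p(v_j \mid w_i)$) are assumptions made here, not notation from the article. Terms with zero joint probability are skipped, which implements the $0 \log 0 = 0$ convention.

```python
import math

def mutual_information(p_in, channel):
    """I(V, W) in bits via the double-sum formula.

    p_in[i]       : P(W = w_i), the input distribution
    channel[i][j] : P(V = v_j | W = w_i), a row-stochastic matrix
    """
    n_out = len(channel[0])
    # Output distribution: p(v_j) = sum_i p(w_i) p(v_j | w_i)
    p_out = [sum(p_in[i] * channel[i][j] for i in range(len(p_in)))
             for j in range(n_out)]
    total = 0.0
    for i, p_i in enumerate(p_in):
        for j, m_ij in enumerate(channel[i]):
            joint = p_i * m_ij
            if joint > 0.0:
                # p(v, w) / (p(v) p(w)) simplifies to m_ij / p(v_j)
                total += joint * math.log2(m_ij / p_out[j])
    return total

noiseless = [[1.0, 0.0], [0.0, 1.0]]   # output always equals input
useless = [[0.3, 0.7], [0.3, 0.7]]     # output independent of input
print(mutual_information([0.5, 0.5], noiseless))
print(mutual_information([0.5, 0.5], useless))
```

The two test channels bracket the possible range for a binary input: a noiseless channel with uniform input carries one bit, and a channel with identical rows carries none.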
We now give Shannon’s definition [2] of (channel) capacity. It has been well-studied since its inception. We will not delve into the Noisy Coding Theorem, or any of the other results which showcase its importance. Rather, we will assume in this paper that capacity is a standard measure of how much information a channel can transmit in an essentially noise-free manner [2,4]. The traditional units of capacity and mutual information are accepted in this article; they are bits per channel usage, which in our scenario is equivalent to bits per t.
Definition 1.
We consider $W$ to be the input random variable to a DMC. The capacity $C$ of the DMC is
$$C := \sup_{\{p(w_i)\}} I(V, W). \tag{9}$$
The optimization is taken over all possible distributions of $W$ with its fixed values $w_i$. The supremum is actually achieved and can be taken as a maximum [2,4]. Note that when trying to compare the magnitude of the channel capacity (with the same number of inputs), it suffices to compare the mutual information for all $x$ values. Of course the two channels may have different optimizing distributions. Note the principle (and similar principles) that if $\forall x$, $I(CH_1, x) \le I(CH_2, x)$ and if $CH_1$ achieves capacity at $x^*$, then $C(CH_1) = I(CH_1, x^*) \le I(CH_2, x^*) \le C(CH_2)$.
Of course swapping rows, or swapping columns from the channel matrix (6) is just notational and leaves capacity unchanged. However, we end this subsection with some interesting results in information theory—some obvious, some not so obvious.
Property 1.
Removing a row from the channel matrix (6) never increases the capacity.
Proof. 
Not using a channel input cannot increase mutual information. This is equivalent to using input probability distributions which are always zero for a particular index; therefore, the capacity can never be greater since capacity is the maximum over all input distributions. □
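Property 1 can be illustrated with the $3 \times 2$ channel mentioned earlier, whose optimizing input distribution is $(0.5, 0, 0.5)$. The sketch below is a numerical illustration under stated assumptions, not a proof: it approximates the maximum of the mutual information by a coarse grid search over input distributions (the helper names and the grid resolution are choices made here), and compares the full channel with the channel obtained by removing the middle row.

```python
import math

def mutual_information(p_in, channel):
    """I(V, W) in bits for input distribution p_in and row-stochastic channel."""
    p_out = [sum(p * row[j] for p, row in zip(p_in, channel))
             for j in range(len(channel[0]))]
    return sum(p * m * math.log2(m / p_out[j])
               for p, row in zip(p_in, channel)
               for j, m in enumerate(row) if p * m > 0.0)

full = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]]   # 3 x 2 channel from the text
reduced = [full[0], full[2]]                  # middle row (input) removed

def grid_max_3(channel, steps=20):
    """Largest mutual information found on a grid over the 3-point simplex."""
    best, best_p = -1.0, None
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            p = (i / steps, j / steps, (steps - i - j) / steps)
            value = mutual_information(p, channel)
            if value > best:
                best, best_p = value, p
    return best, best_p

c_full, p_full = grid_max_3(full)
c_reduced = max(mutual_information((k / 100, 1 - k / 100), reduced)
                for k in range(101))
print(c_full, p_full)   # grid maximum and where it is attained
print(c_reduced)        # never larger than c_full
```

In this particular example the removed input is one that the optimizing distribution never uses, so the two values coincide; in general the reduced channel can only do as well or worse.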
Property 2.
A. For any input probability, combining two columns of a channel matrix (by adding two columns to form one column, hence reducing the channel matrix from $n \times m$ to $n \times (m - 1)$, as illustrated below with $Q$, $Q'$) will never increase mutual information.
B. For input probabilities with all terms non-zero, the mutual information will stay the same iff one of the combined columns is a multiple of the other. Otherwise, the uncombined channel has a larger mutual information and hence a larger capacity. (Note that for a 2-input channel, [12] has shown that the capacity-achieving distribution has both probabilities in the interval $[\frac{1}{e}, 1 - \frac{1}{e}]$, so we can apply this property to the capacity directly.)
Proof. 
A:
$$\begin{pmatrix} a & b & c \\ d & e & f \\ g & h & i \end{pmatrix} \cdot \begin{pmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} a + b & c \\ d + e & f \\ g + h & i \end{pmatrix}$$
The Data-Processing Inequality (Cascade of Channels) [3] shows that the capacity of the third channel above cannot be greater than that of the first channel. That is, processing one channel into another can never increase the information sent. The actual statement of the inequality is for mutual information. However, we use the input probability that maximizes the mutual information of the third channel (which gives its capacity); at that input probability, this value is less than or equal to the mutual information of the first channel, which in turn is less than or equal to the first channel’s capacity. This argument holds for any initial channel matrix (with adjustments to the second matrix), not just the $3 \times 3$ matrix, or the columns we chose, for simplicity, above.
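B: Without loss of generality (WLOG), combine the first two columns of the $n$ by $m$ channel matrix (note how the indices are reversed as compared to (6))
$$Q = \begin{pmatrix} q_{11} & q_{12} & \cdots & q_{1m} \\ q_{21} & q_{22} & \cdots & q_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ q_{n1} & q_{n2} & \cdots & q_{nm} \end{pmatrix} \quad (\text{uncombined})$$
to make
$$Q' = \begin{pmatrix} q_{11} + q_{12} & \cdots & q_{1m} \\ q_{21} + q_{22} & \cdots & q_{2m} \\ \vdots & \ddots & \vdots \\ q_{n1} + q_{n2} & \cdots & q_{nm} \end{pmatrix} \quad (\text{combined}).$$
For $Q$, the output symbols are $y_j$, where $j$ goes from 1 to $m$. For $Q'$, they are the same, but with $y_1$ and $y_2$ replaced by $y_1 \cup y_2$. For both channels, the input symbols are $x_i$, with input probability vector $p$ defined as
$$p_i := p(x_i).$$
Therefore,
$$p(y_1) = \sum_{i=1}^{n} p_i q_{i1}$$
and
$$p(y_2) = \sum_{i=1}^{n} p_i q_{i2}.$$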
If either of these last two quantities is 0, WLOG we assume $p(y_1) = 0$. This assumption means column 1 of $Q$ must be a 0 column (since the input probabilities are positive), so it contributes 0 to the mutual information. Therefore, the mutual informations are equal, and one column is a constant multiple of the other. Now that we have dealt with this case, we can assume $p(y_1)$ and $p(y_2)$ are positive for the remainder of this proof. For fixed $p$, the mutual information of an $n$ by $m$ channel is
$$I = \sum_{i=1}^{n} \sum_{j=1}^{m} p(x_i)\, p(y_j \mid x_i) \log \frac{p(y_j \mid x_i)}{p(y_j)}.$$
Columns 3 through $m$ of $Q$ and 2 through $m - 1$ of $Q'$ are the same, so their contributions to mutual information are the same. Therefore, we only need to consider columns 1 and 2 of $Q$ and column 1 of $Q'$. Let $\bar{I}$ be these columns’ mutual information, that is,
$$\bar{I}(Q) = \sum_{i=1}^{n} p_i q_{i1} \log \frac{q_{i1}}{p(y_1)} + \sum_{i=1}^{n} p_i q_{i2} \log \frac{q_{i2}}{p(y_2)}, \text{ and}$$
$$\bar{I}(Q') = \sum_{i=1}^{n} p_i (q_{i1} + q_{i2}) \log \frac{q_{i1} + q_{i2}}{p(y_1) + p(y_2)}, \text{ since } p(y_1 \cup y_2) = p(y_1) + p(y_2), \text{ etc.}$$
Note that $\bar{I}(Q)$ can also be written as
$$\bar{I}(Q) = \sum_{i=1}^{n} p_i \left[ q_{i1} \log \frac{q_{i1}}{p(y_1)} + q_{i2} \log \frac{q_{i2}}{p(y_2)} \right].$$
The log sum inequality [4] states that, for a series of non-negative numbers $a_k$ and $b_k$ with sums $a$ and $b$, respectively, where $k$ goes from 1 to $K$,
$$\sum_{k=1}^{K} a_k \log \frac{a_k}{b_k} \ge a \log \frac{a}{b},$$
with equality iff the ratios $\frac{a_k}{b_k}$ are equal for all $k$. By applying this inequality to the above terms in square brackets, we have that $\bar{I}(Q) \ge \bar{I}(Q')$, with equality iff $\frac{q_{i1}}{p(y_1)} = \frac{q_{i2}}{p(y_2)}$ for all $i$. Since $p(y_1)$ and $p(y_2)$ are nonzero and independent of $i$, this is true iff column 1 of $Q$ is a constant multiple $p(y_1)/p(y_2)$ of column 2. In fact, this also shows that $p(y_1)$ is a constant multiple of $p(y_2)$, regardless of the positive input probabilities. □
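Both parts of Property 2 can be checked numerically on small examples. The sketch below is illustrative: the two $2 \times 3$ matrices and the helper `merge_first_two_columns` are made up for this purpose and do not come from the article. In the first matrix the merged columns are not proportional, so the mutual information drops; in the second, column 2 is exactly twice column 1, so it is preserved.

```python
import math

def mutual_information(p_in, channel):
    """I(V, W) in bits for input distribution p_in and row-stochastic channel."""
    p_out = [sum(p * row[j] for p, row in zip(p_in, channel))
             for j in range(len(channel[0]))]
    return sum(p * m * math.log2(m / p_out[j])
               for p, row in zip(p_in, channel)
               for j, m in enumerate(row) if p * m > 0.0)

def merge_first_two_columns(channel):
    """Combine output symbols 1 and 2 into one symbol (Q -> Q')."""
    return [[row[0] + row[1]] + list(row[2:]) for row in channel]

p = [0.5, 0.5]                                        # all input terms non-zero
generic = [[0.5, 0.3, 0.2], [0.1, 0.6, 0.3]]          # columns not proportional
proportional = [[0.2, 0.4, 0.4], [0.1, 0.2, 0.7]]     # column 2 = 2 * column 1

i_generic = mutual_information(p, generic)
i_generic_merged = mutual_information(p, merge_first_two_columns(generic))
i_prop = mutual_information(p, proportional)
i_prop_merged = mutual_information(p, merge_first_two_columns(proportional))

print(i_generic, i_generic_merged)   # strictly smaller after merging
print(i_prop, i_prop_merged)         # equal up to rounding
```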

1.2. Back to Our Binary-Input Binary-Output DMC, the (2,2) Channel

Restating (1) and following the approach of [13]:
$$x = P(X = 0), \quad \bar{x} = P(X = 1), \quad \text{and we define } y := P(Y = 0), \text{ thus } \bar{y} = P(Y = 1).$$
The above expressions simplify for our DMC under investigation. Using (1) and (2), we have that the distribution of $Y$ is
$$(y, \bar{y}) = (x, \bar{x}) \begin{pmatrix} a & \bar{a} \\ b & \bar{b} \end{pmatrix} = \big( (a - b)x + b, \ 1 - [(a - b)x + b] \big).$$
We now define a differentiable function $f(x)$, $x \in [0, 1]$, by
$$f(x) := (a - b)x + b = ax + b\bar{x},$$
which gives us
$$(y, \bar{y}) = \big( f(x), \overline{f(x)} \big).$$
Thus,
$$H(Y) = h(y) = h(f(x)).$$
From (5), we have that
$$H(Y \mid X) = -x \left[ a \log a + \bar{a} \log \bar{a} \right] - \bar{x} \left[ b \log b + \bar{b} \log \bar{b} \right] = x \cdot h(a) + \bar{x} \cdot h(b).$$
Putting the above together gives us
$$I(Y, X) = h(f(x)) - x \cdot h(a) - \bar{x} \cdot h(b).$$
Using (9), we have that the capacity of the (2,2) channel is
$$C_{2,2} = \max_{x} I(Y, X) = \max_{x} \left[ h(f(x)) - x \cdot h(a) - \bar{x} \cdot h(b) \right].$$
So, for the (2,2) channel, the capacity calculation boils down to a (not so simple) calculus problem. Silverman [14] was the first to express the closed form result (see also [5,13] and ([Equation (5)] [7]) for derivations and alternate expressions).
$$C_{2,2}(a, b) = \log \left( 2^{\frac{\bar{a} \cdot h(b) - \bar{b} \cdot h(a)}{a - b}} + 2^{\frac{b \cdot h(a) - a \cdot h(b)}{a - b}} \right), \quad \text{where } C(a, a) := 0,$$
which is a continuous function on the unit square $[0, 1] \times [0, 1]$. It is trivial to show that capacity is continuous on the unit square without the main diagonal $a = b$. However, to prove continuity on the entire unit square requires some work and uses the fact that (15) is continuous in $a$, $b$, and $x$; see ([Section 2.4] [15]).
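The closed form can be compared against a direct maximization of $I(Y, X) = h(f(x)) - x \cdot h(a) - \bar{x} \cdot h(b)$ over $x$. The sketch below is a sanity check under assumptions made here: the function names are illustrative, and the maximization is a plain grid search over $x$ (adequate because the mutual information is concave in $x$), not the derivation used in the cited references.

```python
import math

def h(z):
    """Binary entropy in bits, with 0 log 0 = 0."""
    return -sum(p * math.log2(p) for p in (z, 1.0 - z) if p > 0.0)

def capacity_22(a, b):
    """Closed-form capacity of the (2,2) channel (a, b); C(a, a) := 0."""
    if a == b:
        return 0.0
    e1 = ((1 - a) * h(b) - (1 - b) * h(a)) / (a - b)
    e2 = (b * h(a) - a * h(b)) / (a - b)
    return math.log2(2.0 ** e1 + 2.0 ** e2)

def mutual_information_22(a, b, x):
    """I(Y, X) = h(f(x)) - x h(a) - (1 - x) h(b), with f(x) = a x + b (1 - x)."""
    return h(a * x + b * (1 - x)) - x * h(a) - (1 - x) * h(b)

def capacity_22_grid(a, b, steps=20000):
    """Approximate max over x of I(Y, X) on a uniform grid."""
    return max(mutual_information_22(a, b, k / steps) for k in range(steps + 1))

a, b = 0.6, 0.2
print(capacity_22(a, b), capacity_22_grid(a, b))   # should agree closely
print(capacity_22(0.9, 0.1), 1 - h(0.1))           # binary symmetric channel
```

Two familiar special cases fall out of the same formula: the binary symmetric channel $(1 - \epsilon, \epsilon)$ gives $1 - h(\epsilon)$, and the noiseless channel $(1, 0)$ gives one bit.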
One can easily show that (see Figure 3)
$$C_{2,2}(a, b) = C_{2,2}(b, a) = C_{2,2}(\bar{a}, \bar{b}),$$
by simple algebraic substitution. Additionally, this tells us that $C_{2,2}(a, b) = C_{2,2}(\bar{b}, \bar{a})$ also.
$C_{2,2}(a, b) = C_{2,2}(b, a)$ is equivalent to capacity being symmetric across the line $b = a$, and $C_{2,2}(a, b) = C_{2,2}(\bar{b}, \bar{a})$ is equivalent to capacity being symmetric across the line $b = -a + 1$ (simple geometry proves this: reflecting $(a, b)$ across that line gives $(1 - b, 1 - a)$). This result is illustrated in Figure 3. Thus, capacity has a quadrant of the unit square as its principal domain (see ([Figure 1] [14])).
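The symmetries are easy to confirm numerically from the closed form. This is a minimal spot check at a few arbitrarily chosen points, with illustrative function names; it is not a substitute for the algebraic argument.

```python
import math

def h(z):
    """Binary entropy in bits, with 0 log 0 = 0."""
    return -sum(p * math.log2(p) for p in (z, 1.0 - z) if p > 0.0)

def capacity_22(a, b):
    """Closed-form capacity of the (2,2) channel (a, b); C(a, a) := 0."""
    if a == b:
        return 0.0
    e1 = ((1 - a) * h(b) - (1 - b) * h(a)) / (a - b)
    e2 = (b * h(a) - a * h(b)) / (a - b)
    return math.log2(2.0 ** e1 + 2.0 ** e2)

points = [(0.6, 0.2), (0.9, 0.35), (0.15, 0.7)]
max_gap = 0.0
for a, b in points:
    c = capacity_22(a, b)
    images = [capacity_22(b, a),            # reflection across b = a
              capacity_22(1 - a, 1 - b),    # (a, b) -> (a-bar, b-bar)
              capacity_22(1 - b, 1 - a)]    # reflection across b = 1 - a
    max_gap = max(max_gap, max(abs(c - v) for v in images))
    print((a, b), c, images)
print(max_gap)   # largest discrepancy, expected to be at rounding level
```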

1.3. Power/Fidelity Constraints of C 2 , 2

We consider the situation where we attempt to increase the capacity by adjusting the terms $a$ and $b$. Ideas like this for a Team’s interdependence, with a different measurement and no mention of information theory, were discussed in [1]. However, the values of $a, b$ are a function of the transmitting environment from $A_X$ to $A_Y$. If the agents were all-powerful, they could simply adjust $a$ to be 1 and $b$ to be 0 (or vice versa) to achieve a channel of maximal capacity $C_{2,2} = 1$.

1.3.1. Positive Channels

Let us start by considering positive channels [6], that is, $a > b$. Note that if $a < b$, we have a negative channel, and if $a = b$, we have a 0-capacity channel. Of course, no matter what, $C(a, b) \ge 0$. However, if we are at a point $(a, b)$, is it better to increase $a$, decrease $b$, or some combination thereof? Implicit in this question is that we stay in the domain of positive channels (under the line $b = a$).
Definition 2.
We say that we have a power constraint $P$ when we are at the channel given by $(a, b)$ and the most we can adjust the channel is to $(a', b')$, where the standard Euclidean distance (its $l_2$ norm) between $(a, b)$ and $(a', b')$ is no more than $P$.
In terms of Information Geometry [1], our distance is obtained from the Riemannian metric
$$ds^2 = da^2 + db^2.$$
Of course we can generalize this to a more general metric of the form
$$ds^2 = E\, da^2 + F\, da \cdot db + G\, db^2,$$
which would put us in a non-Euclidean situation. This non-trivial situation may be necessary if $a$ and $b$ relate differently to various transmission characteristics.
It is shown in ([Theorem 4.9] [6]) that if we restrict ourselves to positive channels, the capacity increases as $a$ increases and decreases as $b$ increases. This result makes physical sense in terms of adding or decreasing noise. Now consider the (closed) disk of radius $r$ about the point $(a, b)$, denoted as $D_r(a, b)$. We assume that $r$ is small enough so that $D_r(a, b)$ is composed only of positive channels.
Example 1.
We illustrate this situation in Figure 4 by the channels that are in the disk of radius 0.15 about the point $(0.6, 0.2)$.
Theorem 1.
Given a closed disk $D_r(a, b)$ consisting of positive channels, the maximum capacity is achieved and occurs on the boundary circle $\partial D_r(a, b)$.
Proof. 
Since $C_{2,2}(a, b)$ is a continuous function on the compact set $D_r(a, b)$, it has a maximum, denoted as $C_M$. Assume that the maximum is achieved at an interior point $(a', b') \in D_r(a, b)$. Because the point is interior, $(a' + \epsilon, b')$ also lies in the disk for some small $\epsilon > 0$. By ([Theorem 4.9] [6]) we know that increasing $a$ increases capacity, which contradicts $C_M$ being achieved at the interior point $(a', b')$. □
We note that the above theorem still holds for non-positive channels by a simple adjustment of the proof.
Example 1 is illustrated in Figure 4 and is examined again in Figure 5 and Figure 6, where we can see the level sets of $C_{2,2}$ and the surface plot of capacity. Furthermore, numerical calculations show that the maximum of capacity for the closed disk is obtained at the boundary point $(0.68, 0.07)$ and has a value of approximately 0.32.
Of course, as the center of the disk and the radius vary, so does the relative position of the point on the circle at which the maximum capacity is achieved. What is interesting is that it is not obvious where this point should be. We will explain this further. For a positive channel, increasing $a$ brings increased capacity, and decreasing $b$ also results in increased capacity. So, considering our example of the disk centered at $(0.6, 0.2)$ with radius 0.15, one might think that this critical point is where $b$ is decreased by the amount that $a$ is increased, this being the point on the boundary circle at $2\pi - \frac{\pi}{4} \approx 5.50$ radians, which only gives us a capacity of 0.31. However, numerical methods tell us that the actual maximum occurs at 5.25 radians with a value, as noted, of 0.32. Of course, for this example, the difference is not much, but this result is relative to the size of the disk. What is important is that the actual critical point depends on the disk’s position relative to the two lines $b = a$ and $b = 1 - a$. We do note that when the disk is centered on the line $b = 1 - a$, $2\pi - \frac{\pi}{4}$ radians is the correct position for the critical point. One can also see this by examining the capacity level sets in Figure 5.
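The location of the maximum on the boundary circle can be explored with a direct scan over the angle. The sketch below is an illustration under assumptions made here: it uses the closed-form capacity, parameterizes the circle as $(0.6 + r\cos\theta,\ 0.2 + r\sin\theta)$ with $r = 0.15$, and scans a uniform grid of angles; the helper names and the grid size are arbitrary choices, and the article does not state which numerical method it used.

```python
import math

def h(z):
    """Binary entropy in bits, with 0 log 0 = 0."""
    return -sum(p * math.log2(p) for p in (z, 1.0 - z) if p > 0.0)

def capacity_22(a, b):
    """Closed-form capacity of the (2,2) channel (a, b); C(a, a) := 0."""
    if a == b:
        return 0.0
    e1 = ((1 - a) * h(b) - (1 - b) * h(a)) / (a - b)
    e2 = (b * h(a) - a * h(b)) / (a - b)
    return math.log2(2.0 ** e1 + 2.0 ** e2)

def capacity_on_circle(center, r, theta):
    """Capacity at the boundary point of the disk at angle theta (radians)."""
    return capacity_22(center[0] + r * math.cos(theta),
                       center[1] + r * math.sin(theta))

center, r, steps = (0.6, 0.2), 0.15, 3600
best_value, best_theta = -1.0, None
for k in range(steps):
    theta = 2 * math.pi * k / steps
    value = capacity_on_circle(center, r, theta)
    if value > best_value:
        best_value, best_theta = value, theta

diag_value = capacity_on_circle(center, r, 2 * math.pi - math.pi / 4)
print(best_theta, best_value)   # angle and capacity of the best boundary point
print(diag_value)               # capacity at the "equal trade" angle 7*pi/4
```

Changing `center` to a point on the line $b = 1 - a$, for example `(0.7, 0.3)`, is a quick way to probe the remark that the equal-trade angle becomes the critical point in that case.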
Of course, we are using an $l_2$ metric, whose metric ball is a disk. If, for example, we used an $l_1$ metric, the ball would be a square rotated by 45 degrees.

1.3.2. Power

We assume that the transmitting agent $A_X$ has adjustable power $P$. This power allows the transmission capabilities of $A_X$ to vary. By way of example, say that $A_X$ transmits with fidelity $a = 0.6$, $b = 0.2$. Now, $A_X$ is given an increase in its transmitting power that allows it to change $(a, b)$ to $(a', b')$ such that the “distance” between the two points is less than $P$. Consider that we use the $l_2$ Euclidean norm and set $P = 0.15$. This tells us that all such points $(a', b')$ are in the disk of radius 0.15 about the center $(0.6, 0.2)$. We note that this is a rudimentary concept of power. Power helps a transmission when we are restricted to the bottom quarter of this disk, where $a$ is increasing (giving more transmission fidelity) and $b$ is decreasing (more transmission fidelity; recall that $b$ gives us the probability of a 1 going to the opposite symbol 0). However, the conclusion is still the same point that we made and illustrated above.

1.3.3. Results and Discussion

We end this section with a brief summary. We have discussed how one agent can pass Shannon information to another and how changing the transmission characteristics can increase or decrease this information transfer. We have used capacity as our metric for information transfer. Let us now progress to multiple agents. We have also proven some information theoretic properties for the reader (Properties 1 & 2).
In the situation that we discussed in this section, where there is one transmitting agent and one receiving agent, we denote the channel as $M_1$, which is given by the channel matrix $M_1$ described earlier in (4). (In this article, we freely identify a channel with its matrix. Furthermore, for a $2 \times 2$ channel, we also identify the channel with the ordered 2-tuple $(a, b)$.) We denote that channel capacity as $C(M_1)$, which we have analyzed as $C_{2,2}$ in this section.

2. Two Transmitting Agents

Say we have two transmitting agents, $A_{X_1}$ and $A_{X_2}$, acting independently with respect to each other. Assume they have the same transmitting characteristics; that is, the channel matrices are the same. The receiving agent $A_Y$ gets symbols from both transmitting agents. How does this impact the information flow to $A_Y$?
In our scenario, $A_{X_1}$ and $A_{X_2}$ both sense the same environment. That is, they both wish to send a 0 or they both wish to send a 1. So, as before, the possible inputs are 0 or 1, but the outputs are of the form
$$(0, 0), \ (0, 1), \ (1, 0), \ (1, 1)$$
since we are assuming that the noise affects each transmitting agent independently. Keep in mind that $A_{X_1}$ and $A_{X_2}$ are both attempting to transmit the same symbol.
The output that $A_Y$ uses is given by the random variable $Y$:
$$\begin{aligned} (0, 0) &\text{ is taken to be the symbol } Y = O_{0,0}, \\ (0, 1) &\text{ is taken to be the symbol } Y = O_{0,1}, \\ (1, 0) &\text{ is taken to be the symbol } Y = O_{1,0}, \\ (1, 1) &\text{ is taken to be the symbol } Y = O_{1,1}. \end{aligned}$$
We denote $P(Y = O_{i,j}) =: y_{i,j}$. Our channel matrix is $2 \times 4$ and is
$$M_2 = \begin{pmatrix} P(Y = O_{0,0} \mid X = 0) & P(Y = O_{0,1} \mid X = 0) & P(Y = O_{1,0} \mid X = 0) & P(Y = O_{1,1} \mid X = 0) \\ P(Y = O_{0,0} \mid X = 1) & P(Y = O_{0,1} \mid X = 1) & P(Y = O_{1,0} \mid X = 1) & P(Y = O_{1,1} \mid X = 1) \end{pmatrix} = \begin{pmatrix} a^2 & a\bar{a} & \bar{a}a & \bar{a}^2 \\ b^2 & b\bar{b} & \bar{b}b & \bar{b}^2 \end{pmatrix}.$$
We note that the second and third columns of the above channel matrix are identical. This has implications for the mutual information and, of course, the capacity of the channel.
Let us look at this in more generality. Say we have two channel matrices
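$$M_3 = \begin{pmatrix} \alpha & 2\epsilon & \delta \\ \beta & 2\gamma & \phi \end{pmatrix} \quad \text{and} \quad M_4 = \begin{pmatrix} \alpha & \epsilon & \epsilon & \delta \\ \beta & \gamma & \gamma & \phi \end{pmatrix}.$$
Both channels have the same input random variable $X$ as above. The output random variables are $Y_3$ and $Y_4$, respectively.
Let us consider the $M_3$ channel first. $Y_3$ has probability values $y_i := P(Y_3 = i)$ as follows:
$$(y_1, y_2, y_3) = (\alpha x + \beta \bar{x}, \ 2\epsilon x + 2\gamma \bar{x}, \ \delta x + \phi \bar{x}).$$
So,
$$H(Y_3) = -\left[ (\alpha x + \beta \bar{x}) \log(\alpha x + \beta \bar{x}) + (2\epsilon x + 2\gamma \bar{x}) \log(2\epsilon x + 2\gamma \bar{x}) + (\delta x + \phi \bar{x}) \log(\delta x + \phi \bar{x}) \right],$$
$$H(Y_3 \mid X) = -x \left[ \alpha \log(\alpha) + 2\epsilon \log(2\epsilon) + \delta \log(\delta) \right] - \bar{x} \left[ \beta \log(\beta) + 2\gamma \log(2\gamma) + \phi \log(\phi) \right].$$
The mutual information is $I(Y, X) = H(Y) - H(Y \mid X)$. We expand the mutual information into the sum of two functions. The first function is from the first and last columns, and the second function is from the middle column. That is,
$$I(Y_3, X) = F_1^3(\alpha, \beta, \delta, \phi, x) + F_2^3(\epsilon, \gamma, x), \text{ where}$$
$$\begin{aligned} F_2^3 &= -2\epsilon x \log(2\epsilon x + 2\gamma \bar{x}) - 2\gamma \bar{x} \log(2\epsilon x + 2\gamma \bar{x}) + 2\epsilon x \log(2\epsilon) + 2\gamma \bar{x} \log(2\gamma) \\ &= 2\epsilon x \log \frac{2\epsilon}{2\epsilon x + 2\gamma \bar{x}} + 2\gamma \bar{x} \log \frac{2\gamma}{2\epsilon x + 2\gamma \bar{x}} \\ &= 2\epsilon x \log \frac{\epsilon}{\epsilon x + \gamma \bar{x}} + 2\gamma \bar{x} \log \frac{\gamma}{\epsilon x + \gamma \bar{x}}. \end{aligned}$$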
Now let us consider the M 4 channel. As above
$$(y_1, y_2, y_3, y_4) = (\alpha x + \beta \bar{x}, \ \epsilon x + \gamma \bar{x}, \ \epsilon x + \gamma \bar{x}, \ \delta x + \phi \bar{x}).$$
$$\begin{aligned} H(Y_4) &= -\left[ (\alpha x + \beta \bar{x}) \log(\alpha x + \beta \bar{x}) + (\epsilon x + \gamma \bar{x}) \log(\epsilon x + \gamma \bar{x}) + (\epsilon x + \gamma \bar{x}) \log(\epsilon x + \gamma \bar{x}) + (\delta x + \phi \bar{x}) \log(\delta x + \phi \bar{x}) \right] \\ &= -\left[ (\alpha x + \beta \bar{x}) \log(\alpha x + \beta \bar{x}) + 2(\epsilon x + \gamma \bar{x}) \log(\epsilon x + \gamma \bar{x}) + (\delta x + \phi \bar{x}) \log(\delta x + \phi \bar{x}) \right]. \end{aligned}$$
$$\begin{aligned} H(Y_4 \mid X) &= -x \left[ \alpha \log(\alpha) + \epsilon \log(\epsilon) + \epsilon \log(\epsilon) + \delta \log(\delta) \right] - \bar{x} \left[ \beta \log(\beta) + \gamma \log(\gamma) + \gamma \log(\gamma) + \phi \log(\phi) \right] \\ &= -x \left[ \alpha \log(\alpha) + 2\epsilon \log(\epsilon) + \delta \log(\delta) \right] - \bar{x} \left[ \beta \log(\beta) + 2\gamma \log(\gamma) + \phi \log(\phi) \right]. \end{aligned}$$
As above, we express the mutual information as
$$I(Y_4, X) = F_1^4(\alpha, \beta, \delta, \phi, x) + F_2^4(\epsilon, \gamma, x)$$
and we have that
$$\begin{aligned} F_2^4 &= -2\epsilon x \log(\epsilon x + \gamma \bar{x}) - 2\gamma \bar{x} \log(\epsilon x + \gamma \bar{x}) + 2\epsilon x \log(\epsilon) + 2\gamma \bar{x} \log(\gamma) \\ &= 2\epsilon x \log \frac{\epsilon}{\epsilon x + \gamma \bar{x}} + 2\gamma \bar{x} \log \frac{\gamma}{\epsilon x + \gamma \bar{x}} = F_2^3. \end{aligned}$$
A quick inspection tells us that $F_1^4 = F_1^3$; thus, the mutual information of both channels is the same. This result is not surprising because if we combine output symbols where the channel matrix has identical columns, we lose nothing as far as the output information is concerned; there is no extra value in looking at the output symbols separately. This makes sense, and is also what our mathematics have shown.
Let us keep in mind that we wish to find $C(M_2)$, the capacity of the Shannon channel when there are two transmitting agents. (To keep our notation consistent, $C(a, b)$ is the capacity given by the corresponding $2 \times 2$ channel matrix as in (4), whereas $C(\ast)$ is the capacity of the channel given by $\ast$.)
Theorem 2.
$C(M_2) \ge C(M_1)$.
Proof. 
$M_2$ has four output symbols which are in essence 2-vectors. We ignore the second component of the vector. Therefore, we collapse the first and second symbols into one symbol with conditional probabilities $a^2 + a\bar{a} = a$ and $b^2 + b\bar{b} = b$, and the third and fourth symbols into one symbol with conditional probabilities $\bar{a}$ and $\bar{b}$. This results in $M_1$, and since using more output symbols never lowers capacity, by Property 2 (also, a code that works for $M_1$ works for $M_2$ as well by collapsing the symbols), we are done. (Later in the paper we do better than this result with Corollary 1 to Theorem 6.) □
We now form another channel related to what we discussed above. Say now that the receiving agent receives the symbols without any order. Therefore, instead of a 2-vector, the output is one of the three multisets $[0, 0]$, $[1, 0]$, $[1, 1]$ with, for input $X = 0$,
$$P(Y = [0, 0]) = a^2, \quad P(Y = [1, 0]) = 2\bar{a}a, \quad P(Y = [1, 1]) = \bar{a}^2,$$
and the same expressions with $b$ in place of $a$ for input $X = 1$.
We call this channel $M_2'$, and its channel matrix is
$$M_2' = \begin{pmatrix} a^2 & 2\bar{a}a & \bar{a}^2 \\ b^2 & 2\bar{b}b & \bar{b}^2 \end{pmatrix}.$$
From what we discussed above with $M_4$ and $M_3$, we see that
Theorem 3.
$C(M_2') = C(M_2)$.
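The relationships among $M_1$, $M_2$ and $M_2'$ can be checked pointwise in the input probability $x$. The sketch below is illustrative only; the builder `two_agent_channels` and the sample values $a = 0.6$, $b = 0.2$ are choices made here. For each $x$ it confirms that the ordered and unordered two-agent channels carry the same mutual information, and that neither carries less than the single-agent channel.

```python
import math

def mutual_information(p_in, channel):
    """I(Y, X) in bits for input distribution p_in and row-stochastic channel."""
    p_out = [sum(p * row[j] for p, row in zip(p_in, channel))
             for j in range(len(channel[0]))]
    return sum(p * m * math.log2(m / p_out[j])
               for p, row in zip(p_in, channel)
               for j, m in enumerate(row) if p * m > 0.0)

def two_agent_channels(a, b):
    """Return (M2, M2') for two independent agents sending the same symbol."""
    ordered, unordered = [], []
    for s in (a, b):                      # s = P(receive 0 | this input)
        t = 1 - s
        ordered.append([s * s, s * t, t * s, t * t])    # (0,0),(0,1),(1,0),(1,1)
        unordered.append([s * s, 2 * s * t, t * t])     # [0,0],[1,0],[1,1]
    return ordered, unordered

a, b = 0.6, 0.2
m1 = [[a, 1 - a], [b, 1 - b]]
m2, m2_prime = two_agent_channels(a, b)

results = []
for x in (0.1, 0.3, 0.5, 0.8):
    p = [x, 1 - x]
    results.append((mutual_information(p, m1),
                    mutual_information(p, m2),
                    mutual_information(p, m2_prime)))
    print(x, results[-1])
```

Since the equalities and inequalities hold for every $x$, they carry over to the capacities by the comparison principle stated after Definition 1.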
Let us examine the bound in Theorem 2 above. We will see that, not surprisingly, except for special cases, $C(M_2) > C(M_1)$. Figure 7 is a plot of $C(M_2) - C(M_1)$ as a function of $(a, b)$.
From Figure 7, we see that $C(M_2) > C(M_1)$ except on the line $b = a$ (where both channels $M_1$ and $M_2$ have 0 capacity) and at $(a, b) = (1, 0)$ or $(a, b) = (0, 1)$ (where both channels have capacity 1). We note that for $M_2$ and the other higher dimensional channels that we will discuss, there is to our knowledge no closed form as there is for $M_1$. Therefore, for our calculations of capacity, we rely upon numerical results from the Blahut-Arimoto capacity algorithm [16,17].
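For readers who want to reproduce such numbers, a compact version of the Blahut-Arimoto iteration is sketched below. This is a generic textbook-style implementation written for illustration, not the authors' code: the function name, the uniform starting distribution, the stopping rule (gap between a lower and an upper bound on capacity) and the tolerance are all assumptions made here.

```python
import math

def blahut_arimoto(channel, tol=1e-12, max_iter=100000):
    """Return (capacity estimate in bits, input distribution) for a DMC.

    channel[i][j] = P(output j | input i); rows must sum to 1.
    """
    n_in, n_out = len(channel), len(channel[0])
    p = [1.0 / n_in] * n_in                       # start from the uniform input
    lower = 0.0
    for _ in range(max_iter):
        q = [sum(p[i] * channel[i][j] for i in range(n_in))
             for j in range(n_out)]               # current output distribution
        # d[i] = relative entropy D(row_i || q) in bits
        d = [sum(m * math.log2(m / q[j]) for j, m in enumerate(row) if m > 0.0)
             for row in channel]
        lower = sum(p_i * d_i for p_i, d_i in zip(p, d))   # I(Y, X) at current p
        upper = max(d)                                     # upper bound on capacity
        if upper - lower < tol:
            break
        weights = [p_i * 2.0 ** d_i for p_i, d_i in zip(p, d)]
        total = sum(weights)
        p = [w / total for w in weights]          # multiplicative update
    return lower, p

a, b = 0.6, 0.2
m1 = [[a, 1 - a], [b, 1 - b]]
m2 = [[s * s, s * (1 - s), (1 - s) * s, (1 - s) ** 2] for s in (a, b)]
c_m1, p_m1 = blahut_arimoto(m1)
c_m2, p_m2 = blahut_arimoto(m2)
print(c_m1, p_m1)
print(c_m2, p_m2)
```

The returned value is the mutual information at the final input distribution, so it never overestimates the capacity; the loop stops once that value is within `tol` of the upper bound.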

Results and Discussion

In this section, we have laid the groundwork for n transmitting agents. We derived some capacity results. We concentrated on the effects of going from 1 to 2 transmitting agents. What happens as we go to three or more transmitting agents?

3. Multiple Transmitting Agents

We have the canonical representation for the channel of $n$ transmitting agents, and we denote this canonical channel matrix as $\underline{M_n}$, which is formed by taking the output of channel $\underline{M_{n-1}}$ and adding a 0 or a 1 to it. (Note, due to the simplicity of the construction for “small” channels, we have that $\underline{M_1} = M_1$, $\underline{M_2} = M_2$.) For $\underline{M_3}$ this results in
$$\underline{M_3} = \begin{pmatrix} a^3 & a^2\bar{a} & a^2\bar{a} & a\bar{a}^2 & a^2\bar{a} & a\bar{a}^2 & a\bar{a}^2 & \bar{a}^3 \\ b^3 & b^2\bar{b} & b^2\bar{b} & b\bar{b}^2 & b^2\bar{b} & b\bar{b}^2 & b\bar{b}^2 & \bar{b}^3 \end{pmatrix}.$$
This comes from taking the output for two agents as given in canonical form by (21) and extending it to
$$(0, 0, 0), \ (0, 0, 1), \ (0, 1, 0), \ (0, 1, 1), \ (1, 0, 0), \ (1, 0, 1), \ (1, 1, 0), \ (1, 1, 1).$$
Theorem 4.
Rearranging outputs/columns of a channel matrix does not affect capacity.
Proof. 
By looking at the expression for mutual information, we see that changing the order of arithmetic operations leaves it unchanged. This result follows, since capacity is the maximum of mutual information. □
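Therefore, we can permute the columns of $\underline{M_n}$ and obtain a new $2 \times 2^n$ matrix $M_n$, which has the same capacity, that is, $C(M_n) = C(\underline{M_n})$, and is given below.
$$M_n = \begin{pmatrix} a^n & a^{n-1}\bar{a} & \cdots & a^{n-1}\bar{a} & a^{n-2}\bar{a}^2 & \cdots & a\bar{a}^{n-1} & \bar{a}^n \\ b^n & b^{n-1}\bar{b} & \cdots & b^{n-1}\bar{b} & b^{n-2}\bar{b}^2 & \cdots & b\bar{b}^{n-1} & \bar{b}^n \end{pmatrix}.$$
Look at the above theorem in terms of the columns of $M_n$. Let us use $M_3$ as an example.
$$M_3 = \begin{pmatrix} a^3 & a^2\bar{a} & a^2\bar{a} & a^2\bar{a} & a\bar{a}^2 & a\bar{a}^2 & a\bar{a}^2 & \bar{a}^3 \\ b^3 & b^2\bar{b} & b^2\bar{b} & b^2\bar{b} & b\bar{b}^2 & b\bar{b}^2 & b\bar{b}^2 & \bar{b}^3 \end{pmatrix}.$$
Collapsing the output in this situation is equivalent to interchanging the 4th and 5th columns (which does not change capacity) and forming the matrix $M_3^c$.
$$M_3^c = \begin{pmatrix} a^3 & a^2\bar{a} & a^2\bar{a} & a\bar{a}^2 & a^2\bar{a} & a\bar{a}^2 & a\bar{a}^2 & \bar{a}^3 \\ b^3 & b^2\bar{b} & b^2\bar{b} & b\bar{b}^2 & b^2\bar{b} & b\bar{b}^2 & b\bar{b}^2 & \bar{b}^3 \end{pmatrix}.$$
As above when we looked at $M_3$ and $M_4$, we see that we may form the channel where we identify output symbols with the same conditional probabilities for both inputs. This gives us the $2 \times (n+1)$ channel $\mathcal{M}_n$, where
$$\mathcal{M}_n = \begin{pmatrix} a^n & n a^{n-1}\bar{a} & \binom{n}{2} a^{n-2}\bar{a}^2 & \cdots & n a\bar{a}^{n-1} & \bar{a}^n \\ b^n & n b^{n-1}\bar{b} & \binom{n}{2} b^{n-2}\bar{b}^2 & \cdots & n b\bar{b}^{n-1} & \bar{b}^n \end{pmatrix}.$$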
Theorem 5.
$C(\mathcal{M}_n) = C(M_n)$, where $\mathcal{M}_n$ is the collapsed $2 \times (n+1)$ channel and $M_n$ is the $2 \times 2^n$ channel.
Proof. 
As above for $M_2$ in Theorem 3; alternatively, we can just use Property 2 repeatedly. □
The reason we introduce $\mathcal{M}_n$ is that it is a cleaner way to express the channel, and the calculations are simpler than those for $M_n$. For example, $M_8$ is a $2 \times 256$ matrix, whereas $\mathcal{M}_8$ is a $2 \times 9$ matrix. This obviously makes the coding issues easier. Now we examine Figure 8, which is the difference between $C(\mathcal{M}_8)$ and $C(M_1)$.
When we compare Figure 8 to Figure 7, we easily see that $C(\mathcal{M}_n)$ grows as $n$ grows, except at the endpoints and on the line $b = a$ (where the plotted difference stays at 0).
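The size reduction and the growth of capacity with $n$ can be checked numerically. The sketch below is an illustration only, not the authors' code; the helper names `mutual_information`, `capacity`, `full_channel`, and `collapsed_channel` are assumptions. Because the input is binary and mutual information is concave in the input distribution, the capacity is found here with a simple ternary search over $x = P(x_0)$ rather than with Blahut-Arimoto.

```python
from itertools import product
from math import comb, log2

def mutual_information(rows, x):
    """I(X;Y) in bits for a binary-input channel; rows = (P(.|x0), P(.|x1)), P(x0) = x."""
    total = 0.0
    for p0, p1 in zip(*rows):
        q = x * p0 + (1 - x) * p1              # P(Y = y)
        for w, p in ((x, p0), (1 - x, p1)):
            if w > 0 and p > 0:                # convention: 0 log 0 = 0
                total += w * p * log2(p / q)
    return total

def capacity(rows, iters=100):
    """Maximise the concave function x -> I(X;Y) on [0, 1] by ternary search."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if mutual_information(rows, m1) < mutual_information(rows, m2):
            lo = m1
        else:
            hi = m2
    return mutual_information(rows, (lo + hi) / 2)

def full_channel(a, b, n):
    """2 x 2^n matrix: one column per output string of the n agents."""
    def row(p):
        return [p ** (n - sum(bits)) * (1 - p) ** sum(bits)
                for bits in product((0, 1), repeat=n)]
    return row(a), row(b)

def collapsed_channel(a, b, n):
    """2 x (n+1) matrix: outputs grouped by the number of 1s received."""
    def row(p):
        return [comb(n, i) * p ** (n - i) * (1 - p) ** i for i in range(n + 1)]
    return row(a), row(b)

if __name__ == "__main__":
    a, b = 0.8, 0.4
    for n in range(1, 9):
        print(n, round(capacity(collapsed_channel(a, b, n)), 4))
```

Comparing `capacity(full_channel(a, b, 3))` with `capacity(collapsed_channel(a, b, 3))` gives a quick numerical check of Theorem 5 for a small case.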
Nota Bene We now look at the prior illustrative results in terms of a more general encompassing theory. We included much of Section 2 so that the reader who is not familiar with some of the “tricks” will have a feel for why the more general results hold.
Theorem 6.
$C(M_{n+1}) \geq C(M_n)$ for any positive integer $n$.
Proof.
(The proof is the same as the one above for $n = 1$.) $M_n$ can be obtained from $M_{n+1}$ by combining certain columns together; the result follows from Property 2. □
Corollary 1.
$C(M_{n+1}) > C(M_n)$, except at $(a, b) = (1, 0)$ and $(a, b) = (0, 1)$, where they both have capacity 1, and on the line $b = a$, where they both have capacity 0.
Proof. 
We show the proof in three steps.
  • If $a = b$, then $C(M_n) = C(M_{n+1}) = 0$ since the rows are identical. In this case, it is trivial to show that $H(Y) = H(Y|X)$ (the output has no idea what the channel input was). One can see this by the fact that $x \cdot a^q \bar{a}^{n-q} + \bar{x} \cdot a^q \bar{a}^{n-q} = a^q \bar{a}^{n-q}$. In short, the capacities are equal.
  • If $(a, b) = (1, 0)$ or $(a, b) = (0, 1)$, then $M_n$ and $M_{n+1}$ are both the $2 \times 2$ identity matrix with zero columns added in; hence, $C(M_n) = C(M_{n+1}) = 1$. In short, the channel capacities are equal.
  • Now, excluding the special cases where $a = b$, $(a, b) = (1, 0)$, or $(a, b) = (0, 1)$, by Property 2, we only have to show that there are two combined columns that are not multiples of each other.
By excluding the special cases, we cannot use the endpoints of the unit square; therefore, $a$ or $b$ must be in $(0, 1)$. WLOG, we assume that $0 < a < 1$.
Consider a generic column of $M_n$; it is of the form $c = \left(a^e \bar{a}^{n-e},\; b^e \bar{b}^{n-e}\right)^T$, $e \in \{0, \ldots, n\}$. By construction, $M_{n+1}$ has two columns, $c_1 = \left(a \cdot a^e \bar{a}^{n-e},\; b \cdot b^e \bar{b}^{n-e}\right)^T$ and $c_2 = \left(\bar{a} \cdot a^e \bar{a}^{n-e},\; \bar{b} \cdot b^e \bar{b}^{n-e}\right)^T$, that when combined result in column $c$. If $c_1$ is not a constant multiple of $c_2$, we will have shown that $C(M_{n+1}) > C(M_n)$. Assume the opposite, that is, $c_1 = k \cdot c_2$; since neither $a$ nor $\bar{a}$ is 0, we have that $a = k\bar{a}$. Then $a = k\bar{a}$ is equivalent to $a = \frac{k}{k+1}$, $k \neq 0$. We now have three cases for $b$.
  • $b = 0$. In this case, $\bar{b} = 1$ and we only look at the last column of $M_n$, so we let $c = \left(\bar{a}^n,\; \bar{b}^n\right)^T = \left(\bar{a}^n,\; 1\right)^T$. Since we are assuming that $c_1 = k \cdot c_2$, we have that
    $0 = 0 \cdot 1 = b \cdot 1 = k \cdot \bar{b} \cdot 1 = k$, which is impossible.
  • $b = 1$. Using the same argument as above, just replace the last column of $M_n$ with the first. So again, it is impossible that the columns are multiples.
  • $0 < b < 1$. As above for $a$, we also have that $b = \frac{k}{k+1}$. This tells us that $a = b$, which has been ruled out.
Thus, we have shown the existence of two columns of $M_{n+1}$ that are not multiples of each other and that combine into a column of $M_n$. □
Theorem 7.
$\lim_{n \to \infty} C(\mathcal{M}_n) = 1$, except when $b = a$, in which case the channel capacity is 0.
Proof. 
WLOG, we assume $a > b$. We can do this because of the constraint $a \neq b$ and the fact that the rows of a channel matrix can be interchanged without affecting its capacity. Let a positive $\varepsilon \ll \frac{a - b}{2}$ be fixed. For a large enough $N$, we can always find a rational number $m(n)$ for any $n > N$ such that $\bar{a} + \varepsilon < m < \bar{b} - \varepsilon < 1$ and $nm \in \mathbb{Z}$. (The $\varepsilon$ padding prevents $m$ from converging to $\bar{a}$ or $\bar{b}$.) Such an $m$ is guaranteed to exist for sufficiently large $N$, as the following argument shows.
Given $0 \leq b < a \leq 1$, let $x = \bar{a} + \varepsilon$ and $y = \bar{b} - \varepsilon$, giving us $0 \leq x < y \leq 1$. Certainly there exists a positive integer $N$ such that $1/N < y - x$. Therefore, for any integer $n \geq N$, we have that $1/n < y - x$. Consider $(x, y)$ as a sub-interval of $[0, 1]$. For any $n \geq N$, consider the largest integer $W$ such that $W(1/n) \leq x$. Look at $(W+1)(1/n)$; by the definition of $W$, this must be greater than $x$. However, since $1/n < y - x$, we have that $(W+1)(1/n) < y$. We let $m = (W+1)(1/n)$. Keep in mind two characteristics of $m$ as a function of $n$:
  • Since $W$ is an integer, $mn \in \mathbb{Z}$, and,
  • $mn < n$, since $m < 1$.
Let $\mathcal{M}'_n$ be the collapsed $2 \times (n+1)$ channel matrix $\mathcal{M}_n$, but modified as follows: all outputs $y_k$ for $k \leq mn$ are combined into $y'_0$, and all of the other outputs are combined into $y'_1$. The channel matrix then looks like this:
$$\mathcal{M}'_n = \begin{pmatrix} P(y'_0 | x_0) & P(y'_1 | x_0) \\ P(y'_0 | x_1) & P(y'_1 | x_1) \end{pmatrix},$$
where
$$(Y' = y'_0) = (Y = y_0) \cup (Y = y_1) \cup \cdots \cup (Y = y_{mn}) \subsetneq (Y = y_0) \cup \cdots \cup (Y = y_n)$$ and
$$P(y'_0 | x_0) = \sum_{i=0}^{mn} P(y_i | x_0), \text{ with } P(y_i | x_0) = \binom{n}{i} a^{n-i} \bar{a}^i.$$
(Keep in mind that we are dealing with the binomial random variable $S_n$, where $i$ is the number of successes in $n$ Bernoulli trials, with the probability of success $\bar{a}$, so that $P(S_n = i) = \binom{n}{i} a^{n-i} \bar{a}^i$.)
$$P(y'_0 | x_0) = \sum_{i=0}^{mn} \binom{n}{i} a^{n-i} \bar{a}^i.$$
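If we let $\Phi(x)$ be the cumulative standard normal distribution function, the De Moivre-Laplace limit theorem [18] states that (when we take $c, d$ as integers)
$$P\left(c < \frac{S_n - n\bar{a}}{\sqrt{n a \bar{a}}} < d\right) \to \Phi(d) - \Phi(c) \text{ as } n \to \infty; \text{ thus,}$$
$$P\left(\frac{c - n\bar{a}}{\sqrt{n a \bar{a}}} < \frac{S_n - n\bar{a}}{\sqrt{n a \bar{a}}} < \frac{d - n\bar{a}}{\sqrt{n a \bar{a}}}\right) \to \Phi\left(\frac{d - n\bar{a}}{\sqrt{n a \bar{a}}}\right) - \Phi\left(\frac{c - n\bar{a}}{\sqrt{n a \bar{a}}}\right) \text{ as } n \to \infty, \text{ and}$$
$$P\left(c \leq S_n \leq d\right) \to \Phi\left(\frac{d - n\bar{a}}{\sqrt{n a \bar{a}}}\right) - \Phi\left(\frac{c - n\bar{a}}{\sqrt{n a \bar{a}}}\right) \text{ as } n \to \infty.$$
This step leaves us with
$$\sum_{i=c}^{d} \binom{n}{i} a^{n-i} \bar{a}^i \to \Phi\left(\frac{d - n\bar{a}}{\sqrt{n a \bar{a}}}\right) - \Phi\left(\frac{c - n\bar{a}}{\sqrt{n a \bar{a}}}\right) \text{ as } n \to \infty.$$
Thus, the De Moivre-Laplace limit theorem gives us (with $c = 0$, $d = mn$):
$$\lim_{n \to \infty} P(y'_0 | x_0) = \lim_{n \to \infty} \left[\Phi\left(\frac{mn - n\bar{a}}{\sqrt{n a \bar{a}}}\right) - \Phi\left(\frac{-n\bar{a}}{\sqrt{n a \bar{a}}}\right)\right] = \lim_{n \to \infty} \Phi\left(\sqrt{n}\,\frac{m - \bar{a}}{\sqrt{a \bar{a}}}\right) - \lim_{n \to \infty} \Phi\left(-\sqrt{n}\,\frac{\bar{a}}{\sqrt{a \bar{a}}}\right).$$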
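Since $a$ and $\bar{a}$ are positive, $\frac{-\bar{a}}{\sqrt{a \bar{a}}}$ is negative, giving
$$\lim_{n \to \infty} \sqrt{n}\,\frac{-\bar{a}}{\sqrt{a \bar{a}}} = -\infty, \text{ and}$$
$$\lim_{n \to \infty} \Phi\left(-\sqrt{n}\,\frac{\bar{a}}{\sqrt{a \bar{a}}}\right) = 0.$$
If $m < \bar{a}$, then $\frac{m - \bar{a}}{\sqrt{a \bar{a}}}$ is negative. However, if $m > \bar{a}$, it is positive. (Even though $m$ changes as $n$ changes, the value of $\sqrt{n}\,\frac{m - \bar{a}}{\sqrt{a \bar{a}}}$ remains greater than or equal to $\sqrt{n}\,\frac{\varepsilon}{\sqrt{a \bar{a}}}$ for $m > \bar{a} + \varepsilon$. Since $\sqrt{n}\,\frac{\varepsilon}{\sqrt{a \bar{a}}}$ approaches $\infty$, so does $\sqrt{n}\,\frac{m - \bar{a}}{\sqrt{a \bar{a}}}$. The same logic can also be used for the $m < \bar{a} - \varepsilon$ case.) This gives
$$\lim_{n \to \infty} \sqrt{n}\,\frac{m - \bar{a}}{\sqrt{a \bar{a}}} = \begin{cases} -\infty & \text{if } m < \bar{a} - \varepsilon \\ \infty & \text{if } m > \bar{a} + \varepsilon \end{cases}; \text{ and}$$
$$\lim_{n \to \infty} \Phi\left(\sqrt{n}\,\frac{m - \bar{a}}{\sqrt{a \bar{a}}}\right) = \begin{cases} 0 & \text{if } m < \bar{a} - \varepsilon \\ 1 & \text{if } m > \bar{a} + \varepsilon \end{cases}$$
Since $m > \bar{a} + \varepsilon$ by construction, we obtain
$$\lim_{n \to \infty} P(y'_0 | x_0) = 1 - 0 = 1.$$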
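Thus, we have that
$$\lim_{n \to \infty} P(y'_1 | x_0) = 0.$$
$P(y'_0 | x_1)$ behaves the same, but with $a$ replaced by $b$. Since $\bar{a} + \varepsilon < m < \bar{b} - \varepsilon$, we have $\lim_{n \to \infty} P(y'_0 | x_0) = 1$ and $\lim_{n \to \infty} P(y'_0 | x_1) = 0$; thus,
$$\lim_{n \to \infty} \mathcal{M}'_n = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix},$$
which has a channel capacity of 1. Since $\mathcal{M}'_n$ was formed by combining the outputs of $\mathcal{M}_n$, we have $C(\mathcal{M}'_n) \leq C(\mathcal{M}_n) \leq 1$. Therefore, by the squeeze theorem, $\lim_{n \to \infty} C(\mathcal{M}_n) = 1$. □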
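The thresholding step in the proof can be illustrated numerically. The sketch below is an assumption-laden illustration rather than part of the proof: the function names, the choice $\varepsilon = 0.05$, and the use of exact binomial sums (in log space, to avoid underflow for large $n$) instead of the normal approximation are all choices made for this example. It assumes $a > b$ and $\bar{a} + \varepsilon < \bar{b} - \varepsilon$, and builds the threshold $mn = W + 1$ as in the construction of $m$ above.

```python
from math import exp, floor, lgamma, log

def binom_cdf(k, n, p):
    """P(S_n <= k) for S_n ~ Binomial(n, p), summed in log space."""
    if p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 1.0 if k >= n else 0.0
    total = 0.0
    for i in range(0, min(k, n) + 1):
        log_pmf = (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                   + i * log(p) + (n - i) * log(1 - p))
        total += exp(log_pmf)
    return min(total, 1.0)

def thresholded_channel(a, b, n, eps=0.05):
    """2 x 2 channel obtained by merging all outputs y_k with k <= m*n.

    Assumes a > b and (1 - a) + eps < (1 - b) - eps (illustrative choices).
    """
    x = (1 - a) + eps                 # lower edge of the admissible interval
    k = floor(n * x) + 1              # k = m*n with m = (W + 1)/n
    p00 = binom_cdf(k, n, 1 - a)      # P(y'_0 | x_0); success probability is 1 - a
    p01 = binom_cdf(k, n, 1 - b)      # P(y'_0 | x_1); success probability is 1 - b
    return (p00, 1 - p00), (p01, 1 - p01)

if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, thresholded_channel(0.8, 0.4, n))
```

As $n$ grows, the printed matrix approaches the $2 \times 2$ identity, which is the limit used in the squeeze argument.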

Results and Discussion

The theorems presented in this section show what happens as the number of transmitters grows. The ultimate result of this section was Theorem 7, which used a rather non-trivial application of the Central Limit Theorem. At this point, we have proved the seemingly obvious but difficult result that, as the number of transmitting agents grows, so does the reliability of the channel in terms of its capacity. This result, of course, is in line with the similar result that if we have a code that consists of repeating a symbol many times, the error rate is small (the transmission rate may be low, but this does not apply to our agent examples).

4. Non-Identical Transmitting Agents

We now shift to the case where we start with only two transmitting agents, but their noise characteristics are different. Of course, keep in mind that in this situation, we have assumed that there is a master transmitter using the $X$ agents to communicate with $Y$. The master transmitter picks the input symbols and the transmitting agents do their best to communicate by forming one encompassing Shannon channel. We have shown above that, if all of the agents share the same values of $(a, b)$, the channel capacity increases as the number of agents increases. However, what happens if the $(a, b)$ are different for the various agents? Are we better off only using a subset of agents, or is it still best to use as many agents as possible? We partially answer those questions below.
Let $M_1^1$ be the channel matrix for agent 1, and $M_1^2$ be the channel matrix for agent 2.
$$M_1^1 = \begin{pmatrix} a & \bar{a} \\ b & \bar{b} \end{pmatrix},$$
$$M_1^2 = \begin{pmatrix} c & \bar{c} \\ d & \bar{d} \end{pmatrix}.$$
The output is such that the receiving agent uses the ordering of agent 1 first, then agent 2. If the agents wish to send a signal of 0, the possible outputs, expressed via their probabilities, are
$$P(0,0) = ac, \quad P(0,1) = a\bar{c}, \quad P(1,0) = \bar{a}c, \quad P(1,1) = \bar{a}\bar{c}.$$
If the agents wish to send a signal of 1 instead, we have
$$P(0,0) = bd, \quad P(0,1) = b\bar{d}, \quad P(1,0) = \bar{b}d, \quad P(1,1) = \bar{b}\bar{d}.$$
This gives us a combined channel matrix for both transmitting agents, $M_2^{1,2}$, where
$$M_2^{1,2} = \begin{pmatrix} ac & a\bar{c} & \bar{a}c & \bar{a}\bar{c} \\ bd & b\bar{d} & \bar{b}d & \bar{b}\bar{d} \end{pmatrix}.$$
We use our own notation to express the above channel as the tensor product,
$$(a, b) \otimes (c, d).$$
We know, by Property 2, that collapsing output symbols does not increase capacity. If we collapse $y_1$ and $y_2$ into $y'_1$, and $y_3$ and $y_4$ into $y'_2$, we have a channel matrix ${M_2^{1,2}}'$:
$${M_2^{1,2}}' = \begin{pmatrix} ac + a\bar{c} & \bar{a}c + \bar{a}\bar{c} \\ bd + b\bar{d} & \bar{b}d + \bar{b}\bar{d} \end{pmatrix} = \begin{pmatrix} a & \bar{a} \\ b & \bar{b} \end{pmatrix}.$$
Thus, $C(M_2^{1,2}) \geq C({M_2^{1,2}}') = C(M_1^1)$.
Now let us combine the first and third outputs of $M_2^{1,2}$ into $y''_1$ and the second and fourth outputs into $y''_2$. This gives us a channel matrix ${M_2^{1,2}}''$.
$${M_2^{1,2}}'' = \begin{pmatrix} ac + \bar{a}c & a\bar{c} + \bar{a}\bar{c} \\ bd + \bar{b}d & b\bar{d} + \bar{b}\bar{d} \end{pmatrix} = \begin{pmatrix} c & \bar{c} \\ d & \bar{d} \end{pmatrix}.$$
Thus, $C(M_2^{1,2}) \geq C({M_2^{1,2}}'') = C(M_1^2)$. This result leads us to the next theorem:
Theorem 8.
As the number of agents increases, no matter whether they have different channel noises, the total channel capacity is non-decreasing.
Proof. 
In the above discussion we have shown that
$$C(M_2^{1,2}) \geq C({M_2^{1,2}}') = C(M_1^1), \qquad C(M_2^{1,2}) \geq C({M_2^{1,2}}'') = C(M_1^2).$$
Therefore, by repeating the same argument, we see that as we add extra agents the capacity can never decrease. □
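The collapsing argument can be reproduced numerically. The following sketch is illustrative only; the helper names (`tensor`, `merge`, `single`, `capacity`) and the sample channels $(0.8, 0.4)$ and $(0.7, 0.3)$ are assumptions made for the example. It builds the combined $2 \times 4$ matrix, merges output columns to recover each single-agent channel, and compares capacities.

```python
from math import log2

def mutual_information(rows, x):
    """I(X;Y) in bits for a binary-input channel; rows = (P(.|x0), P(.|x1)), P(x0) = x."""
    total = 0.0
    for p0, p1 in zip(*rows):
        q = x * p0 + (1 - x) * p1
        for w, p in ((x, p0), (1 - x, p1)):
            if w > 0 and p > 0:
                total += w * p * log2(p / q)
    return total

def capacity(rows, iters=100):
    """Ternary search for the maximum of the concave map x -> I(X;Y)."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if mutual_information(rows, m1) < mutual_information(rows, m2):
            lo = m1
        else:
            hi = m2
    return mutual_information(rows, (lo + hi) / 2)

def single(a, b):
    """2 x 2 channel matrix of one agent with parameters (a, b)."""
    return [a, 1 - a], [b, 1 - b]

def tensor(ch1, ch2):
    """Combined 2 x 4 channel for two agents; outputs ordered 00, 01, 10, 11."""
    (a, b), (c, d) = ch1, ch2
    row0 = [a * c, a * (1 - c), (1 - a) * c, (1 - a) * (1 - c)]
    row1 = [b * d, b * (1 - d), (1 - b) * d, (1 - b) * (1 - d)]
    return row0, row1

def merge(rows, groups):
    """Collapse output symbols: each group of column indices becomes one symbol."""
    return tuple([sum(row[j] for j in g) for g in groups] for row in rows)

if __name__ == "__main__":
    rows = tensor((0.8, 0.4), (0.7, 0.3))
    print(merge(rows, [(0, 1), (2, 3)]))   # recovers agent 1's channel
    print(merge(rows, [(0, 2), (1, 3)]))   # recovers agent 2's channel
    print(capacity(rows), capacity(single(0.8, 0.4)), capacity(single(0.7, 0.3)))
```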
In fact, as before when the agents had identical characteristics, the channel capacity, except for special cases (dependent columns, a capacity 0 or 1, etc.), is greater than that for separate agents. One can see this by examining the channel matrix—if you unpack the outputs and find that the statistics are different, extra information is learned. Let us now look at the special case of combining a channel with a 0-channel.
Theorem 9.
For any zero channel given by $(e, e)$, $e \in [0, 1]$, we find that
$$C\big((a, b) \otimes (e, e)\big) = C(a, b).$$
Proof. 
If we can show that the mutual information of $(a, b) \otimes (e, e)$ is given by (15), we are done. The channel matrix for this situation is
$$\begin{pmatrix} ae & a\bar{e} & \bar{a}e & \bar{a}\bar{e} \\ be & b\bar{e} & \bar{b}e & \bar{b}\bar{e} \end{pmatrix}.$$
Let $u := ax + b\bar{x}$, and we find that $\bar{u} = \bar{a}x + \bar{b}\bar{x}$. Further,
$$Y = (y_1, y_2, y_3, y_4) = (aex + be\bar{x},\; a\bar{e}x + b\bar{e}\bar{x},\; \bar{a}ex + \bar{b}e\bar{x},\; \bar{a}\bar{e}x + \bar{b}\bar{e}\bar{x}) = (ue,\; u\bar{e},\; \bar{u}e,\; \bar{u}\bar{e}),$$
$$H(Y) = -\left[ue\log(ue) + u\bar{e}\log(u\bar{e}) + \bar{u}e\log(\bar{u}e) + \bar{u}\bar{e}\log(\bar{u}\bar{e})\right]$$
$$= -\left[ue(\log u + \log e) + u\bar{e}(\log u + \log\bar{e}) + \bar{u}e(\log\bar{u} + \log e) + \bar{u}\bar{e}(\log\bar{u} + \log\bar{e})\right]$$
$$= -\left[u\log u + \bar{u}\log\bar{u}\right] - \left[e\log e + \bar{e}\log\bar{e}\right] = h(u) + h(e), \text{ and}$$
$$H(Y|X) = -x\left[ae\log(ae) + a\bar{e}\log(a\bar{e}) + \bar{a}e\log(\bar{a}e) + \bar{a}\bar{e}\log(\bar{a}\bar{e})\right] - \bar{x}\left[be\log(be) + b\bar{e}\log(b\bar{e}) + \bar{b}e\log(\bar{b}e) + \bar{b}\bar{e}\log(\bar{b}\bar{e})\right].$$
Now again using the log of a product as the sum of the logs, then grouping like log terms, this results in
$$H(Y|X) = x\left[h(a) + h(e)\right] + \bar{x}\left[h(b) + h(e)\right] = x \cdot h(a) + \bar{x} \cdot h(b) + h(e),$$
and we see that $H(Y) - H(Y|X) = h(ax + b\bar{x}) - x \cdot h(a) - \bar{x} \cdot h(b)$. □
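The identity at the end of this proof can be checked pointwise. The sketch below is an illustration under assumptions (the helper names and the sampled values of $a$, $b$, $e$, and $x$ are not from the article): it computes the mutual information of $(a, b) \otimes (e, e)$ directly from the $2 \times 4$ matrix and compares it with $h(ax + b\bar{x}) - x \cdot h(a) - \bar{x} \cdot h(b)$.

```python
from math import log2

def h(p):
    """Binary entropy in bits, with h(0) = h(1) = 0."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def mutual_information(rows, x):
    """I(X;Y) in bits for a binary-input channel; rows = (P(.|x0), P(.|x1)), P(x0) = x."""
    total = 0.0
    for p0, p1 in zip(*rows):
        q = x * p0 + (1 - x) * p1
        for w, p in ((x, p0), (1 - x, p1)):
            if w > 0 and p > 0:
                total += w * p * log2(p / q)
    return total

def tensor_with_zero_channel(a, b, e):
    """Channel matrix of (a, b) combined with the zero channel (e, e)."""
    row0 = [a * e, a * (1 - e), (1 - a) * e, (1 - a) * (1 - e)]
    row1 = [b * e, b * (1 - e), (1 - b) * e, (1 - b) * (1 - e)]
    return row0, row1

def max_gap(a, b, es=(0.0, 0.1, 0.5, 0.9, 1.0), steps=50):
    """Largest difference between the two sides over a grid of x and e."""
    worst = 0.0
    for e in es:
        rows = tensor_with_zero_channel(a, b, e)
        for k in range(steps + 1):
            x = k / steps
            closed_form = h(a * x + b * (1 - x)) - x * h(a) - (1 - x) * h(b)
            worst = max(worst, abs(mutual_information(rows, x) - closed_form))
    return worst

if __name__ == "__main__":
    print(max_gap(0.8, 0.4))
```

Since the two expressions agree for every input distribution $x$, their maxima over $x$ agree as well, which is the statement of the theorem.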

Results and Discussion

In this section, we showed what happens when two transmitting agents with different noise characteristics are used. Our important result was that as the number of agents increases, no matter whether they have different channel noises, the total channel capacity is non-decreasing. As with many of our results, it relied upon the algebra of mutual information giving common sense answers. However, without proofs we just have intuition to rely upon.

5. Resource Allocation

We now concern ourselves with the physical limitations of the receiving agent. We assume that the receiving agent has a limited resource R that it can use to receive messages. To the extent possible, the receiving resource, R , may be measured in terms of various antennas or various allocations of frequencies, etc. It is not our goal in this article to discuss the engineering of the receiving agent in general. Rather, we accept it as a given.
Upon completion of the mathematics in this section, the results do not seem surprising. That is good! It shows that our intuition is correct and it lays a foundation for dealing with many agents and non-linear allocation schemes (where we lose elements of intuition). Furthermore, aside from linearity, we based our allocation scheme on a Euclidean metric; it is not at all clear whether an information geometric-style Riemannian metric should be used instead. That is beyond the scope of the article.
Let us take the simplest case where there are two transmitting agents $A_{X_1}$ and $A_{X_2}$. As before, $A_{X_i}$ has channel matrix $M_i$. We model noise affecting each channel in a linear manner. Suppose that an agent $A_X$ is given, as before, by its channel matrix
$$M_1 = \begin{pmatrix} a & \bar{a} \\ b & \bar{b} \end{pmatrix}.$$
How does noise, which results from the receiving agent not allocating enough of its resources to $A_X$, change this channel matrix? The channel $(a, b)$ is a point in $[0, 1] \times [0, 1]$. Consider the shortest path from $(a, b)$ to the main diagonal (which consists of zero-capacity channels). View $[0, 1] \times [0, 1]$ as sitting in $\mathbb{R}^2$ and consider the straight line $y = -x + (a + b)$. This line is orthogonal to the straight diagonal line of zero-capacity channels, goes through the point $(a, b)$, and intersects the line of zero-capacity channels at $\left(\frac{a+b}{2}, \frac{a+b}{2}\right)$. The line segment of interest is given parametrically for $t \in [0, 1]$ as
$$(1 - t)\,(a, b) + t\left(\frac{a+b}{2}, \frac{a+b}{2}\right).$$
We model noise as moving on this line segment from the point $(a, b)$ to the point $\left(\frac{a+b}{2}, \frac{a+b}{2}\right)$. No noise corresponds to $t = 0$, total noise to $t = 1$; that is, we use $t$ as a measure of the noise normalized in a linear manner between 0 and 1.
EXAMPLE: Let $(a, b) = (0.8, 0.4)$. If $t = 0$, the channel is given as $(0.8, 0.4)$ and the capacity is 0.12. If $t = 1$, the channel is given as $(0.6, 0.6)$ and the capacity is 0. Let $t = 0.9$; then the channel is given by $0.1\,(0.8, 0.4) + 0.9\,(0.6, 0.6) = (0.08, 0.04) + (0.54, 0.54) = (0.62, 0.58)$, which has a capacity of 0.001.
Now, let $t = 0.1$; then the channel is given by $0.9\,(0.8, 0.4) + 0.1\,(0.6, 0.6) = (0.72, 0.36) + (0.06, 0.06) = (0.78, 0.42)$, which has a capacity of 0.10. Note that, unsurprisingly, the cleaner channel has $C(0.8, 0.4) = 0.1246 > C(0.78, 0.42)$.
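The example can be reproduced with a few lines of code. This is a sketch under assumptions: the function names `degrade` and `capacity` are hypothetical, and the capacity is obtained by a ternary search over the input distribution rather than from the closed form for $M_1$.

```python
from math import log2

def mutual_information(rows, x):
    """I(X;Y) in bits for a binary-input channel; rows = (P(.|x0), P(.|x1)), P(x0) = x."""
    total = 0.0
    for p0, p1 in zip(*rows):
        q = x * p0 + (1 - x) * p1
        for w, p in ((x, p0), (1 - x, p1)):
            if w > 0 and p > 0:
                total += w * p * log2(p / q)
    return total

def capacity(a, b, iters=100):
    """Capacity in bits of the 2 x 2 channel (a, b), by ternary search over P(x0)."""
    rows = ([a, 1 - a], [b, 1 - b])
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if mutual_information(rows, m1) < mutual_information(rows, m2):
            lo = m1
        else:
            hi = m2
    return mutual_information(rows, (lo + hi) / 2)

def degrade(a, b, t):
    """Move (a, b) a fraction t of the way to the zero-capacity diagonal."""
    mid = (a + b) / 2
    return (1 - t) * a + t * mid, (1 - t) * b + t * mid

if __name__ == "__main__":
    for t in (0.0, 0.1, 0.9, 1.0):
        ch = degrade(0.8, 0.4, t)
        print(t, ch, round(capacity(*ch), 4))
```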
What we have been discussing motivates the following modeling definition.
Definition 3.
An agent $A_X$ with channel matrix $(a, b)$ requires the receiving resource $R$ for its channel matrix to be unchanged. If the receiving agent only allocates $A$, $0 \leq A \leq R$, to $A_X$, the channel matrix is modified from $(a, b)$ in the following manner,
$$(a_A, b_A) = \frac{A}{R}\,(a, b) + \left(1 - \frac{A}{R}\right)\left(\frac{a+b}{2}, \frac{a+b}{2}\right).$$
Thus, $A = R$ corresponds to $t = 0$ above, and $A = 0$ corresponds to $t = 1$ above. As $A$ decreases, the channel “travels” the shortest path in the Euclidean metric to the line of the 0-capacity channels. This is the essence of our modeling assumption.
Note that a channel is a 0-capacity channel iff $a = b$. However, if we let $b = a$, then $\forall A$, $(a_A, a_A) = (a, a)$.
Theorem 10.
For a non-zero channel $(a, b)$, that is, $a \neq b$, $C(a_A, b_A)$ decreases as $A$ decreases from $R$ to 0.
Proof. 
If $(a, b)$ is a positive channel, that is, if $a > b$, we have that $a_A$ decreases and $b_A$ increases as $A$ goes from $R$ to 0. This is easily shown with algebra, but even more simply by observation of the line segment. The decrease in capacity then follows from ([Theorem 4.9] [6]). If $(a, b)$ is a negative channel, the result follows by the symmetry of capacity about the line $b = a$. That completes the proof. □
Corollary 2.
If we have a 0-capacity channel ( a , b ) = ( e , e ) , then the C ( e A , e A ) is constant at 0 as A decreases.
Proof. 
Trivial, since the line segment reduces to the point $(e, e)$ in this situation. □

5.1. Resource Allocation Amongst Different Transmitters

Assume that there are two transmitting agents: $A_{X_1}$ with matrix $(a, b)$, and $A_{X_2}$ with matrix $(c, d)$. The difference from before is that the receiver can only allocate total resource $R$ to the reception by the agents and, further, each agent requires resource $R$ to prevent degradation to its channel matrix.
If $A_Y$ allocates $A$ to $A_{X_1}$, we have the resulting channel matrix of Equation (33) as given above. It then allocates the remainder $R - A$ to $A_{X_2}$, resulting in this channel matrix
$$(c_{R-A}, d_{R-A}) = \left(1 - \frac{A}{R}\right)(c, d) + \frac{A}{R}\left(\frac{c+d}{2}, \frac{c+d}{2}\right).$$
Note that
$$(a_R, b_R) = (a, b), \text{ with } C(a_R, b_R) = C(a, b), \text{ and } (a_0, b_0) = \left(\frac{a+b}{2}, \frac{a+b}{2}\right), \text{ with } C(a_0, b_0) = 0.$$
As we have shown in the previous section, we arrive at:
$$M_2^{1,2}|A = \begin{pmatrix} a_A \cdot c_{R-A} & a_A \cdot \overline{c_{R-A}} & \overline{a_A} \cdot c_{R-A} & \overline{a_A} \cdot \overline{c_{R-A}} \\ b_A \cdot d_{R-A} & b_A \cdot \overline{d_{R-A}} & \overline{b_A} \cdot d_{R-A} & \overline{b_A} \cdot \overline{d_{R-A}} \end{pmatrix}.$$
Consider the situation when all of the resource is allocated to one channel; then, without loss of generality, we let $A = R$, giving
$$M_2^{1,2}|A{=}R = \begin{pmatrix} a\,\frac{c+d}{2} & a\left(1 - \frac{c+d}{2}\right) & \bar{a}\,\frac{c+d}{2} & \bar{a}\left(1 - \frac{c+d}{2}\right) \\ b\,\frac{c+d}{2} & b\left(1 - \frac{c+d}{2}\right) & \bar{b}\,\frac{c+d}{2} & \bar{b}\left(1 - \frac{c+d}{2}\right) \end{pmatrix}.$$
Keep in mind that the above result is the channel matrix when we combine a 0-capacity channel with $(a, b)$. Intuitively, this should not change the capacity from $C(a, b)$. Looking at the channel matrix and thinking in terms of coding, we see that the first and second outputs are affected just as much as the third and fourth. Below, we present the mathematical details.
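Theorem 11.
$C\left(M_2^{1,2}|A{=}R\right) = C(a, b)$.
Proof. 
Let us calculate $C\left(M_2^{1,2}|A\right)$ for $A = R$. We let $\gamma := \frac{c+d}{2}$ and $q := ax + b\bar{x}$. Thus,
$$(y_1, y_2, y_3, y_4) = \left(\gamma(ax + b\bar{x}),\; \bar{\gamma}(ax + b\bar{x}),\; \gamma(\bar{a}x + \bar{b}\bar{x}),\; \bar{\gamma}(\bar{a}x + \bar{b}\bar{x})\right),$$ that is,
$$(y_1, y_2, y_3, y_4) = \left(\gamma q,\; \bar{\gamma} q,\; \gamma\bar{q},\; \bar{\gamma}\bar{q}\right), \text{ and we find that}$$
$$H(Y) = h(\gamma) + h(q).$$
Next we examine the conditional entropy:
$$H(Y|X) = -x\left[a\gamma\log(a\gamma) + a\bar{\gamma}\log(a\bar{\gamma}) + \bar{a}\gamma\log(\bar{a}\gamma) + \bar{a}\bar{\gamma}\log(\bar{a}\bar{\gamma})\right] - \bar{x}\left[b\gamma\log(b\gamma) + b\bar{\gamma}\log(b\bar{\gamma}) + \bar{b}\gamma\log(\bar{b}\gamma) + \bar{b}\bar{\gamma}\log(\bar{b}\bar{\gamma})\right].$$
Again use the rule that the log of a product is the sum of the logs to arrive at:
$$H(Y|X) = x\,h(a) + \bar{x}\,h(b) + h(\gamma), \text{ so that } H(Y) - H(Y|X) = h(ax + b\bar{x}) - x\,h(a) - \bar{x}\,h(b).$$
This result is the same as the mutual information of $(a, b)$. Thus, the maximum of the mutual information for both cases remains the same. □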
Corollary 3.
C M 2 1 , 2 A = 0 = C ( c , d ) .
Proof. 
If we swap the two transmitting agents we establish the proof (details are left to the reader). □
Note that any 0-capacity channel is some $(a, b)$ channel with a 0 resource allocation. Thus,
Corollary 4.
Combining ( a , b ) with a 0-capacity channel results in a channel with the same capacity as ( a , b ) .
We arrive at the question at hand—what happens with a partial allocation to each channel? That is, in general, how does C M 2 1 , 2 A compare to C ( a , b ) and C ( c , d ) ? Our answer follows.

Allocate Resources to ( a , b ) and a 0-Capacity Channel

In this situation, we know that $C\left(M_2^{1,2}|A{=}R\right) = C(a, b)$ and that $C\left(M_2^{1,2}|A{=}0\right) = C(c, d)$. What happens for $0 < A < R$? Not surprisingly, we get the following theorem:
Theorem 12.
If, through allocation, we combine $(a, b)$, the first channel, with $(e, e)$, the second channel, we find that $C\left(M_2^{1,2}|A\right) = C(a_A, b_A)$.
Proof. 
Trivial from Theorem 9. □

5.2. More Examples

We will find the capacity $C\left(M_2^{1,2}|A\right)$ by using (35) for various $A$ and agent matrices.
EXAMPLE: Given a 90/10 allocation.
The first agent $M_1^1 = (0.8, 0.4)$, the second agent $M_1^2 = (0.7, 0.3)$, $A = 0.9$.
$C(M_1^1) = 0.1246$, $C(M_1^2) = 0.1187$.
$(a_A, b_A) = (0.78, 0.42)$, $(c_{R-A}, d_{R-A}) = (0.52, 0.48)$.
$C\left(M_2^{1,2}|A\right) = 0.1012$.
$C\left(M_2^{1,2}|A\right) < C(M_1^1)$ and $C\left(M_2^{1,2}|A\right) < C(M_1^2)$.
EXAMPLE: Given a 10/90 allocation, with the same agents as above.
The first agent $M_1^1 = (0.8, 0.4)$, the second agent $M_1^2 = (0.7, 0.3)$, $A = 0.1$.
$C(M_1^1) = 0.1246$, $C(M_1^2) = 0.1187$.
$(a_A, b_A) = (0.62, 0.58)$, $(c_{R-A}, d_{R-A}) = (0.68, 0.32)$.
$C\left(M_2^{1,2}|A\right) = 0.0967$.
$C\left(M_2^{1,2}|A\right) < C(M_1^1)$ and $C\left(M_2^{1,2}|A\right) < C(M_1^2)$.
EXAMPLE: Given a 50/50 allocation, where the second agent has little noise.
The first agent $M_1^1 = (0.7, 0.3)$, the second agent $M_1^2 = (0.99, 0.01)$, $A = 0.5$.
$C(M_1^1) = 0.1187$, $C(M_1^2) = 0.9192$.
$(a_A, b_A) = (0.6, 0.4)$, $(c_{R-A}, d_{R-A}) = (0.745, 0.255)$.
$C\left(M_2^{1,2}|A\right) = 0.2030$.
$C\left(M_2^{1,2}|A\right) > C(M_1^1)$ and $C\left(M_2^{1,2}|A\right) < C(M_1^2)$.
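The worked examples can be recomputed with a short script. The sketch below is illustrative and rests on assumptions: the helper names are hypothetical, $R$ is normalised to 1, and the capacity is found by ternary search over the input distribution. For the third example it uses the equal split $A = R/2$, which is the allocation that reproduces the degraded channels $(0.6, 0.4)$ and $(0.745, 0.255)$ listed there.

```python
from math import log2

def mutual_information(rows, x):
    """I(X;Y) in bits for a binary-input channel; rows = (P(.|x0), P(.|x1)), P(x0) = x."""
    total = 0.0
    for p0, p1 in zip(*rows):
        q = x * p0 + (1 - x) * p1
        for w, p in ((x, p0), (1 - x, p1)):
            if w > 0 and p > 0:
                total += w * p * log2(p / q)
    return total

def capacity(rows, iters=100):
    """Ternary search for the maximum of the concave map x -> I(X;Y)."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if mutual_information(rows, m1) < mutual_information(rows, m2):
            lo = m1
        else:
            hi = m2
    return mutual_information(rows, (lo + hi) / 2)

def tensor(ch1, ch2):
    """Combined 2 x 4 channel for two agents; outputs ordered 00, 01, 10, 11."""
    (a, b), (c, d) = ch1, ch2
    row0 = [a * c, a * (1 - c), (1 - a) * c, (1 - a) * (1 - c)]
    row1 = [b * d, b * (1 - d), (1 - b) * d, (1 - b) * (1 - d)]
    return row0, row1

def allocate(ch, share):
    """Channel seen when only the fraction `share` of the required resource is given."""
    a, b = ch
    mid = (a + b) / 2
    return share * a + (1 - share) * mid, share * b + (1 - share) * mid

def allocated_capacity(ch1, ch2, A, R=1.0):
    """Capacity of the combined channel when agent 1 gets A and agent 2 gets R - A."""
    s = A / R
    return capacity(tensor(allocate(ch1, s), allocate(ch2, 1 - s)))

if __name__ == "__main__":
    print(round(allocated_capacity((0.8, 0.4), (0.7, 0.3), 0.9), 4))
    print(round(allocated_capacity((0.8, 0.4), (0.7, 0.3), 0.1), 4))
    print(round(allocated_capacity((0.7, 0.3), (0.99, 0.01), 0.5), 4))
```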
From these results, we see that both
$$C\left(M_2^{1,2}|A\right) < \min\left(C(M_1^1), C(M_1^2)\right), \text{ and } \min\left(C(M_1^1), C(M_1^2)\right) < C\left(M_2^{1,2}|A\right) < \max\left(C(M_1^1), C(M_1^2)\right)$$
are possible. In fact, equalities are also possible by using the special cases examined at the beginning of this section. However, $\max\left(C(M_1^1), C(M_1^2)\right) < C\left(M_2^{1,2}|A\right)$ is not possible. (We show this by a re-wording and then proving that $C\left(M_2^{1,2}|A\right)$ cannot be larger than both $C(M_1^1)$ and $C(M_1^2)$.) Thus, we need a lemma.
Lemma 1.
For channels $(a, b)$ and $(c, d)$, we find that
$$C\big((a, b) \otimes (c, d)\big) \leq C(a, b) + C(c, d),$$
with equality if $a = b$ or $c = d$.
Proof. 
The product channel $(a, b) \times (c, d)$ is given by channel matrix
$$\begin{pmatrix} ac & a\bar{c} & \bar{a}c & \bar{a}\bar{c} \\ ad & a\bar{d} & \bar{a}d & \bar{a}\bar{d} \\ bc & b\bar{c} & \bar{b}c & \bar{b}\bar{c} \\ bd & b\bar{d} & \bar{b}d & \bar{b}\bar{d} \end{pmatrix}.$$
The capacity of this product channel equals the sum of the capacities of its component channels $(a, b)$ and $(c, d)$ (p. 85 [5]). Removing the middle two rows gives us $(a, b) \otimes (c, d)$, and, since removing a row never increases capacity, we find that
$$C\big((a, b) \otimes (c, d)\big) \leq C\big((a, b) \times (c, d)\big) = C(a, b) + C(c, d). \;\square$$
Theorem 13.
If we combine through an allocation $(a, b)$, the first channel, with $(c, d)$, the second channel, then $C(M_2^{1,2}|A)$ cannot be greater than both of the individual channels' capacities.
Proof. 
Let
$$M_1^1|A = \begin{pmatrix} a_A & \overline{a_A} \\ b_A & \overline{b_A} \end{pmatrix},$$
$$M_1^2|R{-}A = \begin{pmatrix} c_{R-A} & \overline{c_{R-A}} \\ d_{R-A} & \overline{d_{R-A}} \end{pmatrix},$$
so that $M_2^{1,2}|A = M_1^1|A \otimes M_1^2|R{-}A$. For any input probability distribution held constant, the mutual information is convex with respect to the elements of the channel matrix ([Theorem 2.7.4] [4]). That is, for any given input probability distribution $x$, for all $a_1, a_2, b_1, b_2, t \in [0, 1]$,
$$I(t a_1 + \bar{t} a_2,\; t b_1 + \bar{t} b_2,\; x) \leq t \cdot I(a_1, b_1, x) + \bar{t} \cdot I(a_2, b_2, x),$$
where $I(\alpha, \beta, x)$ is the mutual information of channel $(\alpha, \beta)$ with input distribution $x$; thus,
$$C(\alpha, \beta) = \max_x I(\alpha, \beta, x), \text{ and}$$
$$\forall x, \; C(\alpha, \beta) \geq I(\alpha, \beta, x).$$
If we let $a_1 = a$, $b_1 = b$, $a_2 = b_2 = \frac{a+b}{2}$, and $t = \frac{A}{R}$, we have from convexity that
$$I(a_A, b_A, x) = I\left(\frac{A}{R}a + \left(1 - \frac{A}{R}\right)\frac{a+b}{2},\; \frac{A}{R}b + \left(1 - \frac{A}{R}\right)\frac{a+b}{2},\; x\right) \leq \frac{A}{R}\,I(a, b, x) + \left(1 - \frac{A}{R}\right) I\left(\frac{a+b}{2}, \frac{a+b}{2}, x\right) \quad (\text{this last term is } 0)$$
for any input probability distribution $x$, because $I(e, e, x)$ always equals 0. Now, we let $\chi$ be a capacity-achieving input probability distribution (unique except for 0-channels) for $(a_A, b_A)$, giving
$$C(a_A, b_A) = I(a_A, b_A, \chi) \leq \frac{A}{R}\,I(a, b, \chi) \leq \frac{A}{R}\,C(a, b).$$
Therefore,
$$C(M_1^1|A) \leq \frac{A}{R}\,C(M_1^1),$$
and by replacing $\frac{A}{R}$ with $1 - \frac{A}{R}$ and repeating the above convexity argument, we find that
$$C(M_1^2|R{-}A) \leq \frac{R - A}{R}\,C(M_1^2).$$
By Lemma 1,
$$C(M_2^{1,2}|A) = C\left(M_1^1|A \otimes M_1^2|R{-}A\right) \leq C(M_1^1|A) + C(M_1^2|R{-}A). \text{ Thus,}$$
$$C(M_2^{1,2}|A) \leq \frac{A}{R}\,C(M_1^1) + \frac{R - A}{R}\,C(M_1^2) \leq \left(\frac{A}{R} + \frac{R - A}{R}\right)\max\left(C(M_1^1), C(M_1^2)\right),$$
$$\text{resulting in } C(M_2^{1,2}|A) \leq \max\left(C(M_1^1), C(M_1^2)\right).$$
Thus, we have shown that $C\left(M_2^{1,2}|A\right) \leq \max\left(C(M_1^1), C(M_1^2)\right)$ and, by using Theorem 11 and Corollary 3, equality can be obtained by letting $A = R$ or $A = 0$, the choice depending on the underlying original channels. □
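Both inequalities used in this proof can be spot-checked numerically over a grid of allocations. The sketch below is an illustration, not a substitute for the proof; the helper names, the grid size, and the normalisation $R = 1$ are assumptions made for the example.

```python
from math import log2

def mutual_information(rows, x):
    """I(X;Y) in bits for a binary-input channel; rows = (P(.|x0), P(.|x1)), P(x0) = x."""
    total = 0.0
    for p0, p1 in zip(*rows):
        q = x * p0 + (1 - x) * p1
        for w, p in ((x, p0), (1 - x, p1)):
            if w > 0 and p > 0:
                total += w * p * log2(p / q)
    return total

def capacity(rows, iters=100):
    """Ternary search for the maximum of the concave map x -> I(X;Y)."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if mutual_information(rows, m1) < mutual_information(rows, m2):
            lo = m1
        else:
            hi = m2
    return mutual_information(rows, (lo + hi) / 2)

def single(ch):
    """2 x 2 channel matrix of one agent with parameters ch = (a, b)."""
    a, b = ch
    return [a, 1 - a], [b, 1 - b]

def tensor(ch1, ch2):
    """Combined 2 x 4 channel for two agents; outputs ordered 00, 01, 10, 11."""
    (a, b), (c, d) = ch1, ch2
    row0 = [a * c, a * (1 - c), (1 - a) * c, (1 - a) * (1 - c)]
    row1 = [b * d, b * (1 - d), (1 - b) * d, (1 - b) * (1 - d)]
    return row0, row1

def allocate(ch, share):
    """Channel seen when only the fraction `share` of the required resource is given."""
    a, b = ch
    mid = (a + b) / 2
    return share * a + (1 - share) * mid, share * b + (1 - share) * mid

def largest_violation(ch1, ch2, steps=20):
    """Largest amount by which either bound is exceeded over a grid of A/R values."""
    c1, c2 = capacity(single(ch1)), capacity(single(ch2))
    worst = 0.0
    for k in range(steps + 1):
        s = k / steps                                   # s = A / R
        d1, d2 = allocate(ch1, s), allocate(ch2, 1 - s)
        # Convexity step: C(a_A, b_A) <= (A/R) C(a, b).
        worst = max(worst, capacity(single(d1)) - s * c1)
        # Theorem 13: C(M|A) <= max(C(M_1^1), C(M_1^2)).
        worst = max(worst, capacity(tensor(d1, d2)) - max(c1, c2))
    return worst

if __name__ == "__main__":
    print(largest_violation((0.8, 0.4), (0.7, 0.3)))
    print(largest_violation((0.7, 0.3), (0.99, 0.01)))
```

A returned value at the level of floating-point rounding indicates that neither bound was exceeded on the grid; the endpoints of the grid correspond to the equality cases $A = R$ and $A = 0$.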

Results and Discussion

In this section, we showed what happens when we have limited transmission power and want to distribute it among two transmitting agents. The theorems of this section capture the physical properties of the power allocation and happily agree with intuition.

6. Conclusions

We considered the use of Shannon information theory and its various entropic terms to aid in reaching optimal decisions in a multi-agent/team scenario. Our metric for agents passing information is the classical Shannon channel capacity. Our results are the mathematical theorems in this article showing how combining agents influences the channel capacity.
We have put the idea forward of multi-agent communication on a firm information theoretic foundation. We examined simple scenarios in this paper to lay that strong foundation. We obtained results that may seem obvious, but are quite difficult to prove. We ask the reader to keep in mind that there is a big difference between “it is obvious” and “it has been shown”.
From our perspective we have shown that, except for certain boundary cases, one can achieve near perfect transmission of Shannon information, provided one has a large enough number of agents.
We have used the most information for a given resource (power) allocation as an optimizing criterion. With regard to resource allocation, our results tell us that the best thing to do is to just use the strongest channel. This result is not surprising. However, without the mathematics to prove it, we would be relying on intuition. Furthermore, note that we only used a simple linear allocation scheme in this article, and we only combined two agents. Future work will consider non-linear allocation schemes and multiple agents to continue what we have started in this paper. Going forward, this path is especially meaningful if we adjust the Riemannian metric to influence the power allocated to each channel. For example, a geometric region with high noise levels can be reflected in the Riemannian metric by acknowledging that the $E, F, G$ terms of the metric are functions of $a$ and $b$. We will explore this direction in future work.
In addition, in future work, we will also consider more than two agents competing for the available resources, non-Euclidean Riemannian metrics, and more complicated signaling alphabets and schemes. We are also interested in information flow in the Vicsek [19] bird flocking model.

7. Notation

We include some of the notation that is used repeatedly throughout the article. The other notation is variants of what we give here with changes to the indices and is made clear in its first usage.
MAS: Multi-agent System
$A_X$: Agent X
$M$: A channel matrix, that is, every row contains non-negative numbers that sum to 1
$M_n$: $2 \times 2^n$ channel matrix, representing $n$ (transmitting) agents
$H(V)$: Entropy of the (discrete) random variable $V$
$H(V|W)$: Conditional entropy of the random variable $V$ conditioned on $W$
$I(V, W)$: Mutual information between the random variables $V$ and $W$
$C$: Capacity of a generic channel
$C_{2,2}$: Specifically, the capacity of a 1 (transmitting) agent channel
$M_1^1$: A specific 1-agent channel $\begin{pmatrix} a & \bar{a} \\ b & \bar{b} \end{pmatrix}$. Note: $C(a, b) := C(M_1^1)$
$M_1^2$: Another 1-agent channel $\begin{pmatrix} c & \bar{c} \\ d & \bar{d} \end{pmatrix}$
$M_2^{1,2}$: The combined channel $(a, b) \otimes (c, d)$ with channel matrix $\begin{pmatrix} ac & a\bar{c} & \bar{a}c & \bar{a}\bar{c} \\ bd & b\bar{d} & \bar{b}d & \bar{b}\bar{d} \end{pmatrix}$
$M_2^{1,2}|A$: Combined power allocated channel with channel matrix $\begin{pmatrix} a_A \cdot c_{R-A} & a_A \cdot \overline{c_{R-A}} & \overline{a_A} \cdot c_{R-A} & \overline{a_A} \cdot \overline{c_{R-A}} \\ b_A \cdot d_{R-A} & b_A \cdot \overline{d_{R-A}} & \overline{b_A} \cdot d_{R-A} & \overline{b_A} \cdot \overline{d_{R-A}} \end{pmatrix}$
$\mathcal{M}_2$: The collapsed channel $\begin{pmatrix} a^2 & 2\bar{a}a & \bar{a}^2 \\ b^2 & 2\bar{b}b & \bar{b}^2 \end{pmatrix}$, formed from the $(a, b)$ channel

Author Contributions

Conceptualization, I.S.M.; Methodology, I.S.M. and S.R.; Software, I.S.M. and P.R.; Investigation, I.S.M., P.R. and S.R.; Writing, I.S.M., P.R. and S.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Acknowledgments

We thank Hans Haucke for his assistance. We are especially grateful to Ruth Irene for her helpful comments on the draft versions of this paper. A special thanks to the reviewers who encouraged us to expand the background literature citations and pointed out what was lacking in some of our explanations and discussions. We also thank them for catching typos and points that needed clarification. We thank Katarina Doctor for her discussions on domain focused interpretable machine learning. A very special thanks to the special issue editor William Lawless for his assistance.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Moskowitz, I.S. A Cost Metric for Team Efficiency. Front. Phys. Interdiscip. Phys. 2022, 212, 861633.
  2. Shannon, C.E. A Mathematical Theory of Communication. Bell Syst. Tech. J. 1948, 27, 379–423, 623–656.
  3. Gallager, R.G. Information Theory and Reliable Communication; Wiley: New York, NY, USA, 1968.
  4. Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; Wiley: New York, NY, USA, 2006.
  5. Ash, R.B. Information Theory; Dover Publications: New York, NY, USA, 1965.
  6. Martin, K.; Moskowitz, I.S.; Allwein, G. Algebraic Information Theory For Binary Channels. Electron. Notes Theor. Comput. Sci. 2006, 158, 289–306.
  7. Moskowitz, I.S.; Cotae, P.; Safier, P.N. Algebraic Information Theory and Stochastic Resonance for Binary-Input Binary-Output Channels. In Proceedings of the 46th Annual Conference on Information Science and Systems (CISS), Princeton, NJ, USA, 21–23 March 2012.
  8. Neumann, J.V. Theory of Self-Reproducing Automata; Burks, A.W., Ed.; University of Illinois Press: Urbana, IL, USA, 1966.
  9. Sliwa, J. Toward Collective Animal Neuroscience. Science 2021, 374, 397–398.
  10. Lawless, W.F. Risk Determination versus Risk Perception: A New Model of reality for Human–Machine Autonomy. Informatics 2022, 9, 30.
  11. Schölkopf, B.; Locatello, F.; Bauer, S.; Ke, N.R.; Kalchbrenner, N.; Goyal, A.; Bengio, Y. Toward Causal Representation Learning. Proc. IEEE 2021, 109, 612–634.
  12. Majani, E.E.; Rumsey, H. Two Results on Binary-Input Discrete Memoryless Channels. In Proceedings of the 1991 IEEE International Symposium on Information Theory, Budapest, Hungary, 24–28 June 1991.
  13. Martin, K.; Moskowitz, I.S. Noisy Timing Channels with Binary Outputs. In International Workshop on Information Hiding 2006; LNCS 4437; Springer: Berlin/Heidelberg, Germany, 2007; pp. 124–144.
  14. Silverman, R.A. On Binary Channels and their Cascades. IRE Trans. Inf. Theory 1955, 1, 19–27.
  15. Moskowitz, I.S.; Newman, R.E.; Crepeau, D.P.; Miller, A. A Detailed Mathematical Analysis of a Class of Covert Channels Arising in Certain Anonymizing Networks; Naval Research Laboratory Memorandum Report, NR/MR/5540–03-8691; Naval Research Laboratory: Washington, DC, USA, 2003.
  16. Arimoto, S. An Algorithm for Computing the Capacity of Arbitrary Discrete Memoryless Channels. IEEE Trans. Inf. Theory 1972, 18, 14–20.
  17. Blahut, R. Computation of Channel Capacity and Rate-Distortion Functions. IEEE Trans. Inf. Theory 1972, 18, 460–473.
  18. Ross, S. A First Course in Probability; Macmillan: New York, NY, USA, 1976.
  19. Vicsek, T.; Czirok, A.; Ben-Jacob, E.; Cohen, I.; Shochet, O. Novel type of Phase Transition in a System of Self-Driven Particles. Phys. Rev. Lett. 1995, 75, 1226–1229.
Figure 1. Heuristic figure of $A_X$ transmitting a bit to $A_Y$.
Figure 2. The noisy channel diagram corresponding to the first figure.
Figure 3. Plot of $C_{2,2}(a,b)$ along with its level set contours. This figure shows the symmetries (18) about the lines $y = x$ and $y = -x + 1$, as seen by how the contours can be folded onto each other across the two lines. $C$ is the capacity.
Figure 4. Closed disk $D$ of radius 0.15, about the point (0.6, 0.2), that consists only of positive channels. The boundary of the disk is the circle $\partial D$.
Figure 5. Example 1 illustrated with level sets of capacity with more detail than Figure 4.
Figure 6. Same as Figure 5, but with a 3D perspective.
Figure 7. Plot of $C(M_2) - C(M_1)$. The $C$ axis now measures the difference in the capacities (in units of bits per $t$).
Figure 8. Plot of $C(M_8) - C(M_1)$.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Moskowitz, I.S.; Rogers, P.; Russell, S. Mutual Information and Multi-Agent Systems. Entropy 2022, 24, 1719. https://doi.org/10.3390/e24121719
