1. Introduction
The goal of this paper is to quantify the notion of global correlations as it pertains to inductive inference. This is achieved by designing a set of functionals from first principles to rank entire probability distributions according to their correlations. Because correlations are relationships defined between different subspaces of propositions (variables), the ranking of any distribution, and hence the type of correlation functional one arrives at, depends on the particular choice of “split” or partitioning of the variable space. Each choice of “split” produces a unique functional for quantifying global correlations, which we call the n-partite information (NPI).
The term correlation may be defined colloquially as a relation between two or more “things”. While we have a sense of what correlations are, how do we quantify this notion more precisely? If correlations have to do with “things” in the real world, are correlations themselves “real?” Can correlations be “physical?” One is forced to address similar questions in the context of designing the relative entropy as a tool for updating probability distributions in the presence of new information (e.g., “What is information?”) [
1]. In the context of inference, correlations are broadly defined as being statistical relationships between propositions. In this paper we adopt the view that whatever correlations may be, their effect is to influence our beliefs about the natural world. Thus, they are interpreted as the information which constitutes statistical dependency. With this identification, the natural setting for the discussion becomes inductive inference.
When one has incomplete information, the tools one must use for reasoning objectively are probabilities [
1,
2]. The relationships between different propositions
x and
y are quantified by a joint probability density p(x, y), where the conditional distribution p(y|x) quantifies what one should believe about y given information about x, and vice versa for p(x|y). Intuitively, correlations should have something to do with these conditional dependencies.
In this paper, we seek to quantify a
global amount of correlation for an entire probability distribution. That is, we desire a scalar functional
for the purpose of ranking distributions
according to their correlations. Such functionals are not unique since many examples, e.g., covariance, correlation coefficient [
3], distance correlation [
4], mutual information [
5], total correlation [
6], maximal-information coefficient [
7], etc.,
measure correlations in different ways. What we desire is a principled approach to designing a family of measures
according to specific design criteria [
8,
9,
10].
The idea of designing a functional for
ranking probability distributions was first discussed in Skilling [
9]. In his paper, Skilling designs the relative entropy as a tool for ranking posterior distributions with respect to a prior in the presence of new information that comes in the form of constraints (
15) (see
Section 2.1.3 for details). The ability of the relative entropy to provide a
ranking of posterior distributions allows one to choose the posterior that is
closest to the prior while still incorporating the new information that is provided by the constraints. Thus, one can choose to update the prior in the most minimalist way possible. This feature is part of the overall
objectivity that is incorporated into the design of relative entropy and in later versions is stated as the guiding principle [
11,
12,
13].
Like relative entropy, we desire a method for ranking joint distributions with respect to their correlations. Whatever value our desired quantifier assigns to a particular distribution ρ, we expect that if we change ρ through some generic transformation to a new distribution ρ′, then our quantifier also changes, and that this change reflects the change in the correlations, i.e., if ρ changes in a way that increases the correlations, then the quantifier should also increase. Thus, our quantifier should be an increasing functional of the correlations, i.e., it should provide a ranking of ρ's.
The type of correlation functional one arrives at depends on a choice of the splits within the proposition space, and thus the functional we seek depends on that choice. For example, if one has a proposition space consisting of N variables, then one must specify which correlations the functional should quantify. Do we wish to quantify how one particular variable is correlated with the remaining variables? Or do we want to study the correlations between all of the variables? In our design derivation, each of these questions represents an extremal case of the family of quantifiers, the former being a bi-partite correlation (or mutual information) functional and the latter being a total correlation functional.
In the main design derivation we will focus on the case of total correlation, which is designed to quantify the correlations between every variable subspace in a set of variables. We suggest a set of design criteria (DC) for the purpose of designing such a tool. These DC are guided by the Principle of Constant Correlations (PCC), which states that “the amount of correlations in ρ should not change unless required by the transformation.” This implies that our design derivation requires us to study equivalence classes of ρ within statistical manifolds Δ under the various transformations of distributions that are typically performed in inference tasks. We will find, according to our design criteria, that the global quantifier of correlations we desire in this special case is equivalent to the total correlation [
6].
Once one arrives at the TC as the solution to the design problem in this article, one can then derive special cases such as the
mutual information [
5] or, as we will call them, any
n-partite information (NPI), which measures the correlations shared between generic
n-partitions of the proposition space. The NPI and the mutual information (or bi-partite information) can be derived using the same principles as the TC except with one modification, as we will discuss in
Section 5.
The special case of NPI when n = 2 is the bipartite (or mutual) information, which quantifies the amount of correlations present between two subsets of some proposition space
. Mutual information (MI) as a measure of correlation has a long history, beginning with Shannon’s seminal work on communication theory [
14] in which he first defined it. While Shannon provided arguments for the functional form of his entropy [
14], he did not provide a derivation of MI, and to date there has been no principled approach to the design of MI or of the total correlation TC. Recently, however, there has been an interest in characterizing entropy through a category theoretic approach (see the works of Baez et al. [
15]). The approach by Baez et al. shows that a particular class of functors from the category
FinStat, whose objects are finite sets equipped with probability distributions, consists of scalar multiples of the entropy [
15]. The papers by Baudot et al. [
16,
17,
18] also take a category theoretical approach; however, their results are more focused on the topological properties of information theoretic quantities. Both Baez et al. and Baudot et al. discuss various information theoretic measures such as the relative entropy, mutual information, total correlation, and others.
The idea of designing a tool for the purpose of inference and information theory is not new. Beginning in [
2], Cox showed that probabilities are the functions that are designed to quantify “reasonable expectation” [
19], a notion which Jaynes [
20] and Caticha [
10] have since refined into “degrees of rational belief”. Inspired by the method of maximum entropy [
20,
21,
22], there have been many improvements on the derivation of entropy as a tool designed for the purpose of updating probability distributions in the decades since Shannon [
14]. Most notable are those by Shore and Johnson [
8], Skilling [
9], Caticha [
11], and Vanslette [
12,
13]. The entropy functionals in [
11,
12,
13] are designed to follow the Principle of Minimal Updating (PMU), which states, for the purpose of enforcing objectivity, that “a probability distribution should only be updated to the extent required by the new information.” In these articles, information is defined operationally
as that which induces the updating of the probability distributions.
An important consequence of deriving the various NPI as tools for ranking is their immediate application to the notion of
statistical sufficiency. Sufficiency is a concept that dates back to Fisher, and some would argue Laplace [
23], both of whom were interested in finding statistics that
contained all relevant information about a sample. Such statistics are called
sufficient; however, this notion is only a binary label, so it does not quantify an
amount of sufficiency. Using the result of our design derivation, we can propose a new definition of sufficiency in terms of a normalized NPI. Such a quantity gives a sense of how close a set of functions is to being a set of sufficient statistics. This topic will be discussed in
Section 6.
In
Section 2 we will lay out some mathematical preliminaries and discuss the general transformations in statistical manifolds we are interested in. Then in
Section 3, we will state and discuss the design criteria used to derive the functional form of TC and the NPI in general. In
Section 4 we will complete the proof of the results from
Section 3. In
Section 5 we discuss the
n-partite (NPI) special cases of TC of which the bipartite case is the mutual information, which is discussed in
Section 5.2. In
Section 6 we will discuss sufficiency and its relation to the Neyman-Pearson lemma [
24]. It should be noted that throughout this article we will be using a probabilistic framework in which
x denotes propositions of a probability distribution rather than a statistical framework in which x denotes random numbers.
2. Mathematical Preliminaries
The arena of any inference task consists of two ingredients, the first of which is the subject matter, or what is often called
the universe of discourse. This refers to the actual propositions that one is interested in making inferences about. Propositions tend to come in two classes, either
discrete or
continuous. Discrete proposition spaces will be denoted by calligraphic uppercase Latin letters,
, and the individual propositions will be lowercase Latin letters
indexed by some variable
, where
is the number of distinct propositions in
. In this paper we will mostly work in the context of
continuous propositions whose spaces will be denoted by bold faced uppercase Latin letters,
, and whose elements will simply be lowercase Latin letters with no indices,
. Continuous proposition spaces have a much richer structure than discrete spaces (due to the existence of various differentiable structures, the ability to integrate, etc.) and help to generalize concepts such as
relative entropy and
information geometry [
10,
25,
26] (Common examples of discrete proposition spaces are the results of a coin flip or a toss of a die, while an example of a continuous proposition space is the position of a particle [
27].).
The second ingredient that one needs to define for general inference tasks is the space of models, or the space of probability distributions which one wishes to assign to the underlying proposition space. These spaces can often be given the structure of a manifold, which in the literature is called a
statistical manifold [
10]. A statistical manifold
, is a manifold in which each point
is an entire probability distribution, i.e.,
is a space of maps from subsets of
to the interval
,
. The notation
denotes the
power set of
, which is the set of all subsets of
, and has cardinality equal to
.
In the simplest cases, when the underlying propositions are discrete, the manifold is finite dimensional. A common example that is used in the literature is the three-sided die, whose distribution is determined by three probability values
(p₁, p₂, p₃). Due to positivity, pᵢ ≥ 0, and the normalization constraint, p₁ + p₂ + p₃ = 1, the point (p₁, p₂, p₃) lives in the 2-simplex. Likewise, a generic discrete statistical manifold with n possible states is an (n − 1)-simplex. In the continuum limit, which is often the case explored in physics, the statistical manifold becomes infinite dimensional and is defined as (Throughout the rest of the paper, we use the Greek ρ to represent a generic distribution in Δ, and we use the Latin p to refer to an individual density.),
When the statistical manifold is parameterized by the densities
, the zeroes always lie on the boundary of the simplex. In this representation the statistical manifolds have a trivial topology; they are all simply connected. Without loss of generality, we assume that the statistical manifolds we are interested in can be represented as (
1), so that Δ is simply connected and does not contain any holes. The space Δ in this representation is also smooth.
The symbol
ρ defines what we call a
state of knowledge about the underlying propositions
. It is, in essence, the quantification of our
degrees of belief about each of the possible propositions
[
2]. The
correlations present in any distribution
necessarily depend on the conditional relationships between various propositions. For instance, consider the binary case of just two proposition spaces
X and Y, so that the joint distribution factors,
ρ(x, y) = ρ(x) ρ(y|x) = ρ(y) ρ(x|y).
The correlations present in ρ(x, y) will necessarily depend on the form of ρ(y|x) and ρ(x|y), since the conditional relationships tell us how one variable is statistically dependent on the other. As we will see, the correlations defined in Equation (
2) are quantified by the
mutual information. For situations of many variables however, the global correlations are defined by the
total correlation, which we will design first. All other measures which break up the joint space into conditional distributions (including (
2)) are special cases of the total correlation.
2.1. Some Classes of Inferential Transformations
There are four main types of transformations we will consider that one can enact on a state of knowledge
ρ. They are: coordinate transformations, entropic updating (This of course includes Bayes’ rule as a special case [
28,
29]), marginalization, and
products. This set of transformations is not necessarily exhaustive, but is sufficient for our discussion in this paper. We will indicate whether each of these types of transformations can presumably cause changes to the amount of global correlations by evaluating the response of the statistical manifold under these transformations. Our inability to describe
how much the amount of correlation changes under these transformations motivates the design of such an objective global quantifier.
The types of transformations we will explore can be identified either with maps from a particular statistical manifold to itself (type I), to a subset of the original manifold (type II), or from one statistical manifold to another (types III and IV).
2.1.1. Type I: Coordinate Transformations
Type
I transformations are coordinate transformations. A coordinate transformation
, is a special type of transformation of the proposition space
that respects certain properties. It is essentially a continuous version of a reparameterization (A reparameterization is an isomorphism between discrete proposition spaces,
which identifies for each proposition
, a unique proposition
so that the map
g is a bijection.). For one, each proposition
must be identified with one and only one proposition
and vice versa. This means that coordinate transformations must be bijections on proposition space. The reason for this is simply by design, i.e., we would like to study the transformations that leave the proposition space invariant. A general transformation of type
I on
which takes
to
, is met with the following transformation of the densities,
As we already mentioned, the coordinate transforming function
must be a bijection in order for (
3) to hold, i.e., the map
is such that
and
. While the densities
and
are not necessarily equal, the probabilities defined in (
3) must be (according to the rules of probability theory, see the
Appendix A). This indicates that
ρ remains in the same location in the statistical manifold. That is, the global state of knowledge has not changed—what has changed is the way in which the local information in ρ has been expressed, which must be invertible in general.
While one could impose that the transformations
f be diffeomorphisms (i.e., smooth maps between
and
), it is not necessary that we restrict
f in this way. Without loss of generality, we only assume that the bijections
are continuous. For discussions involving diffeomorphism invariance and statistical manifolds see the works of Amari [
25], Ay et al. [
30] and Bauer et al. [
31].
For a coordinate transformation (
3) involving two variables,
and
, we also have that type I transformations give,
A few general properties of these type
I transformations are as follows: First, the density
is expressed in terms of the density
,
where
is the determinant of the Jacobian [
10] that defines the transformation,
For a finite number of variables
, the general type
I transformations
are written,
and the Jacobian becomes,
One can also express the density
in terms of the original density
by using the inverse transform,
In general, since coordinate transformations preserve the probabilities associated to a joint proposition space, they also preserve several structures derived from them. One of these is the Fisher-Rao (information) metric [
25,
31,
32], which was proved by Čencov [
26] to be the unique metric on statistical manifolds that represents the fact that the points
of Δ are probability distributions and not structureless [
10] (For a summary of various derivations of the information metric, see [
10] Section 7.4).
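As a minimal numerical sketch of a type I transformation (our own toy example, not taken from the paper; it assumes numpy, and the particular choices f(x) = exp(x) and a standard normal density are ours), the following checks that probabilities of corresponding regions are preserved once the Jacobian factor is included in the transformed density.

```python
import numpy as np

# Toy illustration of a type I (coordinate) transformation x -> x' = exp(x).
# If p(x) is a standard normal, the transformed density picks up the Jacobian
# factor of the inverse map:  p'(x') = p(log x') / x'  (a log-normal density).

def p(x):                              # original density on X
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

def p_prime(xp):                       # transformed density on X' = exp(X)
    return p(np.log(xp)) / xp          # Jacobian factor |d(log x')/dx'| = 1/x'

def integrate(f, a, b, n=200_000):     # simple Riemann sum on a uniform grid
    x = np.linspace(a, b, n)
    return float(np.sum(f(x)) * (x[1] - x[0]))

# The probability of a region is preserved: P(0 < x < 1) = P(1 < x' < e).
print(integrate(p, 0.0, 1.0))          # ~ 0.3413
print(integrate(p_prime, 1.0, np.e))   # ~ 0.3413
```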
2.1.2. Split Invariant Coordinate Transformations
Consider a class of coordinate transformations that result in a diagonal Jacobian matrix, i.e.,
These transformations act within each of the variable spaces independently, and hence they are guaranteed to preserve the definition of the split between any
n-partitions of the propositions, and because they are coordinate transformations, they are invertible and do not change our state of knowledge,
ρ. We call such special types of transformations (
10)
split invariant coordinate transformations and denote them as type
Ia. From (
10), it is obvious that the marginal distributions of
ρ are preserved under split invariant coordinate transformations,
If one allows generic coordinate transformations of the joint space, then the marginal distributions may depend on variables outside of their original split. Thus, if one redefines the split after a coordinate transformation to new variables
, the original problem statement changes as to what variables we are considering correlations
between, and thus Equation (
11) no longer holds. This is apparent in the case of two variables
, where
, since,
which depends on
y. In the situation where
x and
y are independent, redefining the split after the coordinate transformation (
12) breaks the original independence since the distribution that originally factors,
, would be made to have conditional dependence in the new coordinates, i.e., if
and
, then,
So, even though the above transformation satisfies (
3), this type of transformation may change the correlations in
ρ by allowing for the potential redefinition of the split
. Hence, when designing our functional, we identify split invariant coordinate transformations as those which preserve correlations. These restricted coordinate transformations help isolate a single functional form for our global correlation quantifier.
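The following sketch (our own Gaussian example, not from the paper; it assumes numpy) illustrates the point above: a joint coordinate transformation that mixes the variables leaves the state of knowledge unchanged, but redefining the split in the new coordinates makes the new variables appear correlated even though the original split was independent.

```python
import numpy as np

# x, y independent standard normals; transform (x, y) -> (u, v) = (x + y, y).
# This is a valid coordinate transformation (same state of knowledge), but it
# is NOT split invariant: redefining the split as {u} vs {v} introduces
# correlations.  For Gaussians, corr(u, v) = 1/sqrt(2) and the bipartite
# (mutual) information of the new split is  -0.5 * ln(1 - r^2) = 0.5 * ln(2).

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
y = rng.standard_normal(1_000_000)

u, v = x + y, y
r = np.corrcoef(u, v)[0, 1]
mi_new_split = -0.5 * np.log(1 - r**2)

print(f"corr(u, v) ~ {r:.3f}   (exact: {1/np.sqrt(2):.3f})")
print(f"MI of redefined split ~ {mi_new_split:.3f} nats "
      f"(exact: {0.5*np.log(2):.3f})")
print("MI of the original split (x vs y): 0 by construction")
```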
2.1.3. Type II: Entropic Updating
Type
II transformations are those induced by updating [
10],
in which one maximizes the relative entropy,
subject to constraints and relative to the prior. Constraints often come in the form of expectation values [
10,
20,
21,
22],
A special case of these transformations is Bayes’ rule [
28,
29],
In (
14) and throughout the rest of the paper we will use log base
e (natural log) for all logarithms, although the results are perfectly well defined for any base (the quantities
and
will simply differ by an overall scale factor when using different bases). Maximizing (
14) with respect to constraints such as (
15) induces a jump in the statistical manifold. Type
II transformations, while well defined, are not necessarily continuous, since in general one can map nearby points to disjoint subsets in
Δ. Type II transformations will also cause ρ to change in general as it jumps within the statistical manifold. This means, because different ρ’s may have different correlations, that type
II transformations can either increase, decrease, or leave the correlations invariant.
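As a minimal sketch of a type II transformation (our own toy example; it assumes numpy and scipy, and the uniform prior over die faces with a mean constraint is our choice, not the paper's), the following updates a prior by maximizing the relative entropy (14) subject to an expectation-value constraint of the form (15).

```python
import numpy as np
from scipy.optimize import brentq

# Type II (entropic updating) toy example: start from a uniform prior q over
# die faces {1,...,6} and update subject to the constraint <face> = 4.5.
# The maximum-relative-entropy posterior has the exponential-family form
# p_i ∝ q_i * exp(lam * i), with the multiplier lam fixed by the constraint.

faces = np.arange(1, 7)
q = np.full(6, 1 / 6)            # prior
target_mean = 4.5                # new information (an expectation constraint)

def posterior(lam):
    w = q * np.exp(lam * faces)
    return w / w.sum()

def mean_error(lam):
    return posterior(lam) @ faces - target_mean

lam = brentq(mean_error, -5.0, 5.0)   # solve for the Lagrange multiplier
p = posterior(lam)

print("lambda =", round(lam, 4))
print("updated distribution:", np.round(p, 4))
print("check mean:", round(p @ faces, 4))
```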
2.1.4. Type III: Marginalization
Type
III transformations are induced by marginalization,
which is effectively a quotienting of the statistical manifold,
, i.e., for any point
, we equivocate all values of
. Since the distribution
changes under type
III transformations, the amount of correlations can change.
2.1.5. Type IV: Products
Type
IV transformations are created by products,
which are a kind of inverse transformation of type
III, i.e., the set of propositions
becomes the product
. There are many different situations that can arise from this type, a most trivial one being an embedding,
which can be useful in many applications. The function
δ in the above equation is the
Dirac delta function [
33], which has the following properties,
We will denote such a transformation as type
IVa. Another trivial example of type
IV is,
which we will call type
IVb. Like type
II, generic transformations of type
IV can potentially create correlations, since again we are changing the underlying distribution.
2.2. Remarks on Inferential Transformations
There are many practical applications in inference which make use of the above transformations by combining them in a particular order. For example, in machine learning and dimensionality reduction, the task is often to find a low-dimensional representation of some proposition space
, which is done by combining types
I,
III and
IVa in a particular order. Neural networks are a prime example of this sequence of transformations [
34]. Another example of
IV,
I,
III transformations is the
convolution of probability distributions, which takes two proposition spaces and combines them into a new one [
5].
In
Appendix C we discuss how our resulting design functionals behave under the aforementioned transformations.
3. Designing a Global Correlation Quantifier
In this section we seek to achieve our design goal for the special case of the total correlation,
Design Goal: Given a space of N variables and a statistical manifold Δ, we seek to design a functional which ranks distributions according to their total amount of correlations.
Unlike deriving a functional, designing a functional is done through the process of eliminative induction. Derivations are simply a means of showing consistency with a proposed solution, whereas design is much deeper. In designing a functional, the solution is not assumed but rather achieved by specifying design criteria that restrict the functional form in a way that leads to a unique or optimal solution. One can then interpret the solution in terms of the original design goal. Thus, by looking at the “nail”, we design a “hammer”, and conclude that hammers are designed to knock in and remove nails. We will show that there are several paths to the solution of our design criteria, the proof of which is in
Section 4.
Our design goal requires that
our quantifier be scalar valued such that we can rank the distributions ρ according to their correlations. Considering a continuous space
of
N variables, the functional form of
is the functional,
which depends on each of the possible probability values for every
(In Watanabe’s paper [
6], the notation for the
total correlation between a set of variables
is written as
, where
is the Shannon entropy of the subspace
. For a proof of Watanabe’s theorem see
Appendix B).
Given the types of transformations that may be enacted on ρ, we state the main guiding principle we will use to meet our design goal,
Principle of Constant Correlations (PCC): The amount of correlations in ρ should not change unless required by the transformation.
While simple, the PCC is incredibly constraining. By stating when one should not change the correlations, it is operationally unique (i.e., that you do not do it), rather than stating how one is required to change them, of which there are infinitely many choices. The PCC therefore imposes an element of objectivity into our quantifier. If we are able to complete our design goal, then we will be able to uniquely quantify how transformations of type I–IV affect the amount of correlations in ρ.
The discussion of type
Ia transformations indicates that split invariant coordinate transformations do not change the amount of correlations. This is because we want to not only maintain the relationship among the joint distribution (
3), but also the relationships among the marginal spaces,
Only then are the relationships between the n-partitions guaranteed to remain fixed, and hence the distribution remains in the same location in the statistical manifold. When a coordinate transformation of this type is made, because it does not change ρ, we are not explicitly required to change the amount of correlations, so by the PCC we impose that it does not change.
The PCC together with the design goal implies that,
Corollary 1 (Split Coordinate Invariance). The coordinate systems within a particular split are no more informative about the amount of correlations than any other coordinate system for a given ρ.
This expression is somewhat analogous to the statement that “coordinates carry no information”, which is usually stated as a design criterion for relative entropy [
8,
9,
11] (This appears as axiom two in Shore and Johnson’s derivation of relative entropy [
8], which is stated on page 27 as “II.
Invariance: The choice of coordinate system should not matter.” In Skilling’s approach [
9], which was mainly concerned with image analysis, axiom two on page 177 is justified with the statement “We expect the same answer when we solve the same problem in two different coordinate systems, in that the reconstructed images in the two systems should be related by the coordinate transformation.” Finally, in Caticha’s approach [
11], the axiom of coordinate invariance is simply stated on page 4 as “
Criterion 2: Coordinate invariance.
The system of coordinates carries no information.”).
To specify the functional form of
further, we will appeal to special cases in which it is apparent that the PCC should be imposed [
9]. The first involves local, subdomain, transformations of
ρ. If a subdomain of
is transformed, then one may be required to change its amount of correlations by some specified amount. Through the PCC, however, there is no explicit requirement to change the amount of correlations outside of this domain; hence we impose that those correlations outside are not changed. The second special case involves transformations of an independent subsystem. If a transformation is made on an independent subsystem, then again by the PCC, because there is no explicit reason to change the amount of correlations in the other subsystem, we impose that they are not changed. We denote these two types of transformation independences as our two design criteria (DC).
Surprisingly, the PCC and the DC are enough to find a general form for our quantifier (up to an irrelevant scale constant). As we previously stated, the first design criterion concerns local changes in the probability distribution ρ.
Design Criterion 1 (Locality). Local transformations of ρ contribute locally to the total amount of correlations.
The term
locality has been invoked to mean many different things in different fields (e.g., physics, statistics, etc.). In this paper, as well as in [
8,
9,
11,
12,
13], the term
local refers to transformations which are constrained to act only within a particular subdomain
, i.e., the transformations of the probabilities are
local to
and do not affect probabilities outside of this domain. Essentially, if new information does not require us to change the correlations in a particular subdomain
, then we do not change the probabilities over that subdomain. While simple, this criterion is incredibly constraining and leads to the functional form (22),
where
F is some undetermined function of the probabilities and possibly the coordinates. We have used
a condensed notation to denote the measure for brevity. To constrain
F further, we first use the corollary of split coordinate invariance (1) among the subspaces
and then apply special cases of particular coordinate transformations. This leads to the following functional form,
which demonstrates that the integrand is independent of the actual coordinates themselves. Like coordinate invariance, the axiom DC1 also appears in the design derivations of relative entropy [
8,
9,
11,
12,
13] (In Shore and Johnson’s approach to relative entropy [
8], axiom four is analogous to our locality criteria, which states on page 27 “IV.
Subset Independence: It should not matter whether one treats an independent subset of system states in terms of a separate conditional density or in terms of the full system density.” In Skilling’s approach [
9] locality appears as axiom one which, like Shore and Johnson’s axioms, is called
Subset Independence and is justified with the following statement on page 175, “Information about one domain should not affect the reconstruction in a different domain, provided there is no constraint directly linking the domains.” In Caticha [
11] the axiom is also called
Locality and is written on page four as “
Criterion 1: Locality.
Local information has local effects.” Finally, in Vanslette’s work [
12,
13], the subset independence criteria is stated on page three as follows, “Subdomain Independence: When information is received about one set of propositions, it should not effect or change the state of knowledge (probability distribution) of the other propositions (else information was also received about them too).”).
This leaves the function F to be determined, which can be done by imposing an additional design criterion.
Design Criterion 2 (Subsystem Independence). Transformations of ρ in one independent subsystem can only change the amount of correlations in that subsystem.
The consequence of DC2 concerns independence among subspaces of
. Given two subsystems
which are independent, the joint distribution factors,
We will see that this leads to the global correlations being additive over each
subsystem,
Like locality (DC1), the design criteria concerning subsystem independence appears in all four approaches to relative entropy [
8,
9,
11,
12,
13] (In Shore and Johnson’s approach [
8], axiom three concerns subsystem independence and is stated on page 27 as “III.
System Independence: It should not matter whether one accounts for independent information about independent systems separately in terms of different densities or together in terms of a joint density.” In Skilling’s approach [
9], the axiom concerning subsystem independence is given by axiom three on page 179 and provides the following comment on page 180 about its consequences “This is the crucial axiom, which reduces S to the entropic form. The basic point is that when we seek an uncorrelated image from marginal data in two (or more) dimensions, we need to multiply the marginal distributions. On the other hand, the variational equation tells us to add constraints through their Lagrange multipliers. Hence the gradient
must be the logarithm.” In Caticha’s design derivation [
11], axiom three concerns subsystem independence and is written on page 5 as “
Criterion 3: Independence.
When systems are known to be independent it should not matter whether they are treated separately or jointly.” Finally, in Vanslette [
12,
13] on page 3 we have “Subsystem Independence: When two systems are a priori believed to be independent and we only receive information about one, then the state of knowledge of the other system remains unchanged.”); however, due to the difference in the design goal here, we end up imposing DC2 closer to that of the work of [
12,
13] as we do not explicitly have the Lagrange multiplier structure in our design space.
Imposing DC2 leads to the final functional form of
our quantifier,
with the product taken over the split dependent marginals. This functional is what is typically referred to as the
total correlation (The concept of total correlation TC was first introduced in Watanabe [
6] as a generalization of Shannon’s definition of mutual information. There are many practical applications of TC in the literature [
35,
36,
37,
38].) and is the unique result obtained from imposing the PCC and the corresponding design criteria.
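A minimal discrete sketch of this result (our own illustration, not from the paper; it assumes numpy) computes the total correlation of a discrete joint distribution as the relative entropy between the joint and the product of its one-variable marginals, and checks that it vanishes for an independent joint.

```python
import numpy as np

# Discrete analogue of the total correlation: the relative entropy between a
# joint distribution and the product of its one-variable marginals.

def total_correlation(joint):
    """joint: array of shape (n1, ..., nN) with entries summing to 1."""
    axes = range(joint.ndim)
    prod = np.ones_like(joint)
    for ax in axes:
        # marginal over all other axes, reshaped so it broadcasts to the joint
        marg = joint.sum(axis=tuple(a for a in axes if a != ax))
        shape = [1] * joint.ndim
        shape[ax] = joint.shape[ax]
        prod = prod * marg.reshape(shape)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / prod[mask])))

rng = np.random.default_rng(1)
p = rng.random((2, 3, 4)); p /= p.sum()          # a generic correlated joint
p_ind = np.einsum('i,j,k->ijk',                  # an independent joint
                  p.sum((1, 2)), p.sum((0, 2)), p.sum((0, 1)))

print("TC of a generic joint      :", total_correlation(p))       # > 0
print("TC of the product marginal :", total_correlation(p_ind))   # ~ 0
```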
As was mentioned throughout, these results are usually implemented as design criteria for relative entropy as well. Shore and Johnson’s approach [
8] presents four axioms, of which III and IV are
subsystem and
subset independence.
Subset independence in their framework corresponds to Equation (
24) and to the Locality axiom of Caticha [
11]. It also appears as an axiom in the approaches by Skilling [
9] and Vanslette [
12,
13]. Subsystem independence is given by axiom three in Caticha’s work [
11], axiom two in Vanslette’s [
12,
13] and axiom three in Skilling’s [
9]. While coordinate invariance was invoked in the approaches by Skilling, Shore and Johnson and Caticha, it was later found to be unnecessary in the work by Vanslette [
12,
13], who only required two axioms. Likewise, we find that it is an obvious consequence of the PCC and does not need to be stated as a separate axiom in our derivation of the total correlation.
The work by Csiszár [
39] provides a nice summary of the various axioms used by many authors (including Aczél [
40], Shore and Johnson [
8] and Jaynes [
21]) in their definitions of information theoretic measures (A list is given on page 3 of [
39] which includes the following for conditions on an entropy function
; (1) Positivity (
), (2) Expansibility (“expansion” of
P by a new component equal to 0 does not change
, i.e., embedding in a space in which the probabilities of the new propositions are zero), (3) Symmetry (
is invariant under permutation of the probabilities), (4) Continuity (
is a continuous function of
P), (5) Additivity (
), (6) Subadditivity (
), (7) Strong additivity (
), (8) Recursivity (
) and (9) Sum property (
for some function
g).). One could associate the design criteria in this work to some of the common axioms enumerated in [
39], although some of them will appear as consequences of imposing a specific design criterion, rather than as an ansatz. For example, the
strong additivity condition (see
Appendix C.1.3 and
Appendix C.1.4) is the result of imposing DC1 and DC2. Likewise, the condition of
positivity (i.e., non-negativity of the functional) and
convexity occur as a consequence of the design goal, split coordinate invariance (SCI) and both of the design criteria.
Continuity of the functional with respect to ρ is imposed through the design goal, and
symmetry is a consequence of DC1. In summary,
Design Goal→
continuity,
DC1→
symmetry, (
DC1 + DC2)→
strong additivity, (
Design Goal + SCI + DC1 + DC2)→
positivity + convexity. As was shown by Shannon [
14] and others [
39,
40], various combinations of these axioms, as well as the ones mentioned in footnote 11, are enough to characterize entropic measures.
One could argue that we could have merely imposed these axioms at the beginning to achieve the functional
, rather than arriving at it through the PCC and the corresponding design criteria. The point of this article, however, is to design the correlation functionals by using principles of inference, rather than imposing conditions on the functional directly (This point was also discussed in the conclusion section of Shore and Johnson [
8], see page 33.). In this way, the resulting functionals are consequences of employing the inference framework, rather than being postulated arbitrarily.
One will recognize that the functional form of (
28) and the corresponding
n-partite informations (
88) have the form of a relative entropy. Indeed, if one identifies the product marginal
as a
prior distribution as in (
14), then it may be possible to find constraints (
15) which update the product marginal to the desired joint distribution
ρ. One can then interpret the constraints as the generators of the correlations. We leave the exploration of this topic to a future publication.
5. The n-Partite Special Cases
In the previous sections of the article, we designed an expression that quantifies the global correlations present within an entire probability distribution and found this quantity to be identical to the total correlation (TC). Now we would like to discuss partial cases of the above in which one does not consider the information shared by the entire set of variables, but only the information shared across particular subsets of the variables. These special cases of TC measure the n-partite correlations present in a given distribution ρ. We call such functionals an n-partite information, or NPI.
Given a set of
N variables in proposition space,
, an
n-partite subset of
consists of
subspaces
which have the following collectively exhaustive and mutually exclusive properties,
The special case of (
83) for any
n-partite splitting will be called the
n-partite information and will be denoted by
with
semi-colons separating the partitions. The largest number
n that one can form for any variable set
is simply the number of variables present in it, and for this largest set the
n-partite information coincides with the total correlation,
Each of the
n-partite informations can be derived in a manner similar to the total correlation, except where the density
in step (
52) is replaced with the appropriate independent density associated to the
n-partite system, i.e.,
Thus, the split invariant coordinate transformation (
10) becomes one in which each of the partitions in variable space gives an overall block diagonal Jacobian, (In the simplest case, for
N dimensions and
two partitions, the Jacobian matrix is block diagonal in the partitions
, which we use to define the split invariant coordinate transformations in the
bipartite (or
mutual)
information case.)
We then derive what we call the
n-partite information (NPI),
The combinatorial number of possible partitions of the spaces for
n splits is given by the combinatorics of Stirling numbers of the second kind [41]. A Stirling number of the second kind, often denoted S(N, n), gives the number of ways to partition a set of N elements into n nonempty subsets. The definition in terms of binomial coefficients is given by,
S(N, n) = (1/n!) Σ_{j=0}^{n} (−1)^j C(n, j) (n − j)^N.
Thus, the number of unique n-partite informations one can form from a set of N variables is equal to S(N, n).
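A short sketch of this counting (our own illustration, not from the paper; it assumes only Python's standard math module) evaluates the binomial-coefficient formula above.

```python
from math import comb, factorial

# Stirling numbers of the second kind via the binomial-coefficient formula:
# S(N, n) = (1/n!) * sum_{j=0}^{n} (-1)^j * C(n, j) * (n - j)^N

def stirling2(N, n):
    return sum((-1) ** j * comb(n, j) * (n - j) ** N
               for j in range(n + 1)) // factorial(n)

for N in range(2, 6):
    print(N, [stirling2(N, n) for n in range(2, N + 1)])
# e.g., S(4, 2) = 7 distinct bipartite informations for N = 4 variables,
# while S(N, N) = 1 corresponds to the single total correlation.
```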
Using (
A28) from the appendix, for any
n-partite information
, where
, we have the chain rule,
where
is the mutual information between the subspace
and the subspace
.
5.1. Remarks on the Upper-Bound of TC
The TC provides an upper bound for any choice of
n-partition information, i.e., any
n-partite information with n ≤ N necessarily satisfies,
This can be shown by using the decomposition of the TC into continuous Shannon entropies which was discussed in [
6],
where the continuous Shannon entropy (While it is true that the continuous Shannon entropy is not coordinate invariant, the particular combinations used in this paper are, due to the TC and
n-partite information being relative entropies themselves.)
is,
Likewise for any
n-partition we have the decomposition,
Since we in general have the inequality [
5] for entropy,
then we also have that for any
k-th partition of a set of
N variables, the
exhaustive internal partitions (i.e.,
) of
satisfy,
Using (
96) in (
94), we then have that for any
n-partite information,
Thus, the total correlation is always greater than or equal to the correlations between any
n-partite splitting of the proposition space. Upper bounds for the discrete case were discussed in [
42].
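A quick numerical check of this bound (our own illustration, not from the paper; it assumes numpy) compares the total correlation of a random three-variable discrete joint distribution with one of its bipartite informations.

```python
import numpy as np

# Numerical check that TC >= any n-partite information, here for three
# variables and the bipartite split {x1} vs {x2, x3}.

def npi(joint, groups):
    """Relative entropy between a joint and the product of group marginals.
    groups: tuple of axis-tuples, e.g. ((0,), (1, 2)) or ((0,), (1,), (2,))."""
    all_axes = set(range(joint.ndim))
    prod = np.ones_like(joint)
    for g in groups:
        marg = joint.sum(axis=tuple(all_axes - set(g)), keepdims=True)
        prod = prod * marg
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / prod[mask])))

rng = np.random.default_rng(2)
p = rng.random((3, 3, 3)); p /= p.sum()

tc = npi(p, ((0,), (1,), (2,)))     # total correlation
mi = npi(p, ((0,), (1, 2)))         # bipartite information I[x1; (x2, x3)]
print(f"TC = {tc:.4f} >= MI = {mi:.4f}:", tc >= mi)
```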
5.2. The Bipartite (Mutual) Information
Perhaps the most studied special case of the NPI is the mutual information (MI), which is the smallest possible
n-partition one can form. As was discussed in the introduction, it is useful in inference tasks and was the first quantity to really be defined and exploited [
14] out of the general class of
n-partite informations.
To analyze the mutual information, consider first relabeling the total space as
to match the common notation in MI literature. The
bipartite information considers only two subspaces, X and Y, rather than all of them. These two subspaces define a bipartite split in the proposition space such that
and
. This results in turning the product marginal into,
where
and
. Finally, we arrive at the functional that we will label by its split as,
which is the
mutual information. Since the marginal space is split into two distinct subspaces, the mutual information only quantifies the correlations
between the two subspaces and not between all the variables as is the case with the
total correlation for a given split. Whenever the total space
is two-dimensional, the total correlation and the mutual information coincide.
One can derive the mutual information by using the same steps as in the total correlation derivation above, except replacing the independence condition in (
48) with the bipartite marginal in (
98). The goal is the same as in
Section 3, except that the MI
ranks the distributions ρ according to the correlations between two subspaces of propositions, rather than within the entire proposition space.
5.3. The Discrete Total Correlation
One may derive a discrete total correlation and discrete NPI by starting from Equation (
36),
and then following the same arguments without taking the continuous limit after DC1 was imposed.
The inferential transformations explored in
Section 2.1 are somewhat different for discrete distributions. Coordinate transformations are replaced by general
reparameterizations (An example of such a discrete reparameterization (or discrete coordinate transformation) is intuitively used in coin flipping experiments – the outcome of coin flips may be parameterized with (−1, 1) or equally with (0, 1) to represent tails versus heads outcomes, respectively.), in which one defines a bijection between sets,
As with general coordinate transformations (
3), if
the map is a bijection, then we equate the probabilities,
Since the index i is simply a label for a proposition, we enforce that the probabilities associated to it are independent of the choice of label. One can define discrete split coordinate invariant transformations analogous to the continuous case. Using discrete split invariant coordinate transformations, the index i is removed from above, which is analogous to the removal of the x coordinate dependence in the continuous case. The functional equation for the log is found and solved analogously by imposing DC2. The discrete TC is then found and the discrete NPI may be obtained by the same arguments.
The other transformations in
Section 2.1 remain the same, except for replacing integrals by sums. In the above subsections, the continuous relative entropy is replaced by the discrete version,
The discrete MI is extremely useful for dealing with problems in communication theory, such as noisy-channel communication and Rate-Distortion theory [
5]. It is also reasonable to consider situations where one has combinations of discrete and continuous variables. One example is the binary category case [
34].
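As a small worked example of the discrete bipartite information in a communication setting (our own illustration, not from the paper; it assumes numpy, and the binary symmetric channel with crossover probability eps is our choice), the following computes MI(x; y) and compares it with the closed form log 2 − H(eps) in nats, the capacity-achieving value for a uniform input.

```python
import numpy as np

# Discrete bipartite (mutual) information for a binary symmetric channel:
# input x is a fair bit, output y flips x with crossover probability eps.

def mutual_information(joint):
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (px * py)[mask])))

for eps in (0.0, 0.1, 0.5):
    joint = 0.5 * np.array([[1 - eps, eps],
                            [eps, 1 - eps]])      # p(x, y) = p(x) p(y|x)
    h = 0.0 if eps in (0.0, 1.0) else -eps*np.log(eps) - (1-eps)*np.log(1-eps)
    print(f"eps={eps}: MI = {mutual_information(joint):.4f}, "
          f"log2 - H(eps) = {np.log(2) - h:.4f}")
```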
7. Discussion of Alternative Measures of Correlation
The design derivation in this paper puts the various NPI functionals on the same foundational footing as the relative entropy. This raises the question as to whether other similar information theoretic quantities can be designed along similar lines. Some quantities that come to mind are
α-mutual information [
45],
multivariate-mutual information [
46,
47],
directed information [
48],
transfer entropy [
49] and
causation entropy [
50,
51,
52].
The
α-mutual information [
45] belongs to the family of functionals that fall under the name
Rényi entropies [
53] and their close cousin
Tsallis entropy [
54,
55]. Tsallis proposed his entropy functional as an attempted generalization for applications in statistical mechanics; however, the probability distributions that it produces can be generated from the standard MaxEnt procedure [
10] and do not require a new thermodynamics (For some discussion on this topic see [
10] page 114.). Likewise, Rényi’s family of entropies attempts to generalize the relative entropy for generic inference tasks, which inadvertently relaxes some of the design criteria concerning independence. Essentially, Rényi introduces a set of parameterized entropies
, with parameter
, which leads to the weakening of the independent subsystem additivity criteria. Imposing that these functionals then obey subsystem independence immediately constrains
or
, and reduces them back to the standard relative entropy, i.e.,
and
. Without a strict understanding of what it means for subsystems to be independent, one cannot conduct reasonable science. Thus, such “generalized” measures of correlation (such as the
α-mutual information [
45]) which abandon subsystem independence cannot be trusted.
Defining multivariate-mutual information (MMI) is an attempt to generalize the standard MI to a case where several sets of propositions are compared with each other; however, this is different from the total correlation which we designed in this paper. For example, given three sets of propositions
and
, the
multivariate-mutual information (not to be confused with the
multi-information [
56] which is another name for total correlation) is,
One difficulty with this expression is that it can be negative, as was shown by Hu [
47]. Thus, defining a minimum MMI is not possible in general, which suggests that a design derivation of MMI requires a different interpretation. Despite these difficulties, there have been several recent successes in the study of MMI including Bell [
57] and Baudot et al. [
16,
17], who studied MMI in the context of algebraic topology.
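A short numerical sketch of this negativity (our own illustration, not from the paper; it assumes numpy and uses the common convention I(x; y; z) = I(x; y) − I(x; y|z), noting that sign conventions for MMI vary in the literature) uses the classic XOR example.

```python
import numpy as np

# MMI can be negative: take x, y independent fair bits and z = x XOR y.
# Then I(x; y) = 0 while I(x; y | z) = log 2, so I(x; y; z) = -log 2 < 0.

def mi(joint2d):
    px = joint2d.sum(axis=1, keepdims=True)
    py = joint2d.sum(axis=0, keepdims=True)
    m = joint2d > 0
    return float(np.sum(joint2d[m] * np.log(joint2d[m] / (px * py)[m])))

p = np.zeros((2, 2, 2))                 # joint p(x, y, z) with z = x xor y
for x in (0, 1):
    for y in (0, 1):
        p[x, y, x ^ y] = 0.25

I_xy = mi(p.sum(axis=2))                                    # I(x; y)
I_xy_given_z = sum(p[:, :, z].sum() * mi(p[:, :, z] / p[:, :, z].sum())
                   for z in (0, 1))                         # I(x; y | z)
print("I(x;y) =", I_xy, " I(x;y|z) =", I_xy_given_z)
print("MMI =", I_xy - I_xy_given_z, " (-log 2 =", -np.log(2), ")")
```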
Another extension of mutual information is transfer entropy, which was first introduced by Schreiber [
49] and is a special case of
directed information [
48]. Transfer entropy is a conditional mutual information which attempts to quantify the amount of “information” that flows between two time-dependent random processes. Given a set of propositions which are dynamical, such that
x = x(t) and y = y(t), so that at time t the propositions take the form xₜ and yₜ, the transfer entropy (TE) between x and y at time t is defined as the conditional mutual information,
The notation
refers to all times
before
. Thus, the TE is meant to quantify the influence of a variable
on predicting the state
when one already knows the history of
, i.e., it quantifies the amount of independent correlations provided by
. Given that TE is a conditional mutual information, it does not require a design derivation independent of the MI. It can be justified on the basis of the discussion around (
A32). Likewise, the more general
directed information is also a conditional MI and hence can be justified in the same way.
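A minimal sketch of transfer entropy as a conditional mutual information (our own toy example, not from the paper; it assumes numpy and truncates the histories to a single time step) uses a pair of binary processes in which y copies x with a one-step delay.

```python
import numpy as np

# Toy transfer entropy: x_t are i.i.d. fair bits and y_t = x_{t-1}.
# With one-step histories, TE_{x->y} = I(y_t ; x_{t-1} | y_{t-1}) = log 2,
# while TE_{y->x} = I(x_t ; y_{t-1} | x_{t-1}) = 0.

def conditional_mi(p_abc):
    """I(A; B | C) for a joint array p(a, b, c)."""
    p_c = p_abc.sum(axis=(0, 1), keepdims=True)
    p_ac = p_abc.sum(axis=1, keepdims=True)
    p_bc = p_abc.sum(axis=0, keepdims=True)
    m = p_abc > 0
    return float(np.sum(p_abc[m] * np.log((p_abc * p_c)[m] / (p_ac * p_bc)[m])))

# (A, B, C) = (y_t, x_{t-1}, y_{t-1}) = (a, a, b) for independent fair bits a, b
p = np.zeros((2, 2, 2))
for a in (0, 1):
    for b in (0, 1):
        p[a, a, b] = 0.25
print("TE x->y =", conditional_mi(p), " (log 2 =", np.log(2), ")")

# (A, B, C) = (x_t, y_{t-1}, x_{t-1}) are mutually independent fair bits
q = np.full((2, 2, 2), 1 / 8)
print("TE y->x =", conditional_mi(q))
```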
Finally, the definition of
causation entropy [
50,
51,
52] can also be expressed as a conditional mutual information. Causation entropy (CE) attempts to quantify time-dependent correlations between nodes in a connected graph and hence generalizes the notion of transfer entropy. Given a set of nodes
and
, the causation entropy between two subsets conditioned on a third is given by,
The above definition reduces to the transfer entropy whenever the conditioning set consists only of the past of the target variable. As was shown by Sun et al. [
58], the causation entropy (CE) allows one to more appropriately quantify the causal relationships within connected graphs, unlike the transfer entropy which is somewhat limited. Since CE is a conditional mutual information, it does not require an independent design derivation. As with transfer entropy and directed information, the interpretation of CE can be justified on the basis of (
A32).