**1. Introduction**

In the past, extensive work has been devoted to information measures which generalize the Shannon entropy [1], such as the one-parameter Rényi entropy [2], the Tsallis entropy [3], the Landsberg–Vedral entropy [4], the Gaussian entropy [5], and the two-parameter Sharma–Mittal entropy [5,6], which reduces to the former ones for special choices of the parameters. The Sharma–Mittal entropy can axiomatically be founded as the unique *q*-additive measure [7,8] which satisfies the generalized Shannon–Khinchin axioms [9,10], and it has widely been explored in different research fields, from statistics [11] and thermodynamics [12,13] to quantum mechanics [14,15], machine learning [16,17] and cosmology [18,19]. The Sharma–Mittal entropy has also been recognized in the field of information theory, where the measures of conditional Sharma–Mittal entropy [20], Sharma–Mittal divergences [21] and Sharma–Mittal entropy rate [22] have been established and analyzed.

**Citation:** Ilić, V.M.; Djordjević, I.B. On the *α*-*q*-Mutual Information and the *α*-*q*-Capacities. *Entropy* **2021**, *23*, 702. https://doi.org/10.3390/e23060702

Academic Editors: Petr Jizba and Jan Korbel

Received: 12 February 2021 Accepted: 26 May 2021 Published: 1 June 2021


**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

Considerable research has also been done in the field of communication theory in order to analyze information transmission in the presence of noise when, instead of Shannon's entropy, the information is quantified with (instances of) the Sharma–Mittal entropy; in general, the information transfer is then quantified by an appropriately defined measure of mutual information, while the maximal information transfer is considered as a generalized channel capacity. Thus, after Rényi's proposal for the additive generalization of Shannon entropy [2], several different definitions for the Rényi information transfer were proposed by Sibson [23], Arimoto [24], Augustin [25], Csiszar [26], Lapidoth and Pfister [27] and Tomamichel and Hayashi [28]. These measures have been explored thoroughly, and their operational characterization in coding theory, hypothesis testing, cryptography and quantum information theory has been established, which qualifies them as reasonable measures of Rényi information transfer [29]. Similar attempts have also been made in the case of non-additive entropies. Thus, starting from the work of Daroczy [30], who introduced a measure for generalized information transfer related to the Tsallis entropy, several attempts followed for the measures which correspond to non-additive particular instances of the Sharma–Mittal entropy, so the definitions for the Rényi information transfer were considered in [24,31], for the Tsallis information transfer in [32] and for the Landsberg–Vedral information transfer in [4,33].

In this paper, we provide a general treatment of the Sharma–Mittal entropy transfer and a detailed analysis of the existing measures, showing that all of the definitions related to non-additive entropies fail to satisfy at least one of the ineluctable properties common to the Shannon case, which we state as axioms: the information transfer has to be non-negative, less than the input and output uncertainty, equal to the input uncertainty in the case of perfect transmission, and equal to zero in the case of a totally destructive channel. Breaking some of these axioms implies unexpected and counterintuitive conclusions about the channels, such as achieving super-capacitance or sub-capacitance [4], which could be treated as nonphysical behavior. As an alternative, we propose the *α*-*q*-mutual information as a measure of Sharma–Mittal information transfer, whose maximum defines the *α*-*q*-capacity. The *α*-*q*-mutual information generalizes the *α*-mutual information by Arimoto [24]: it is defined as the *q*-difference between the input Sharma–Mittal entropy and an appropriately defined conditional Sharma–Mittal entropy given the output, while the *α*-*q*-capacity represents a generalization of Arimoto's *α*-capacity, which is recovered in the case of *q* = 1. In addition, several other instances can be obtained by specifying the values of the parameters *α* and *q*, which includes the information transfer measures for the Tsallis, the Landsberg–Vedral and the Shannon entropy, as well as the case of the Gaussian entropy, which was not considered before in the context of information transmission.

The paper is organized as follows. The basic properties and special instances of the Sharma–Mittal entropy are listed in Section 2. Section 3 reviews the basics of communication theory, introduces the basic communication channels and establishes the set of axioms which information transfer measures should satisfy. The information transfer measures which were defined by Arimoto are introduced in Section 4, and the alternative definitions of Rényi information transfer measures are discussed in Section 5. Finally, the *α*-*q*-mutual information and the *α*-*q*-capacities are proposed and their properties are analyzed in Section 6, while the previously proposed measures of Sharma–Mittal entropy transfer are discussed in Section 7.

#### **2. Sharma–Mittal Entropy**

Let the sets of positive and nonnegative real numbers be denoted with R<sup>+</sup> and R<sup>+</sup><sub>0</sub>, respectively, and let the mapping *η<sub>q</sub>* : R → R be defined in

$$\eta\_q(x) = \begin{cases} x, & \text{for} \quad q = 1\\ \frac{2^{(1-q)x} - 1}{(1-q)\ln 2}, & \text{for} \quad q \neq 1 \end{cases} \tag{1}$$

so that its inverse is given in

$$\eta\_q^{-1}(x) = \begin{cases} x, & \text{for} \quad q = 1\\ \frac{1}{1-q} \log((1-q)x\ln 2 + 1), & \text{for} \quad q \neq 1. \end{cases} \tag{2}$$

The mapping *η<sub>q</sub>* and its inverse are increasing continuous (hence invertible) functions such that *η<sub>q</sub>*(0) = 0. The *q*-logarithm is defined in

$$\operatorname{Log}\_q(x) = \eta\_q(\log x) = \begin{cases} \log x, & \text{for} \quad q = 1 \\ \frac{x^{1-q} - 1}{(1-q)\ln 2}, & \text{for} \quad q \neq 1, \end{cases} \tag{3}$$

and its inverse, the *q*-exponential, is defined in

$$\operatorname{Exp}\_q(y) = \begin{cases} 2^y, & \text{for} \quad q = 1 \\ \left( 1 + (1 - q) y \ln 2 \right)^{\frac{1}{1-q}}, & \text{for} \quad q \neq 1, \end{cases} \tag{4}$$

for 1 + (1 − *q*)*y* ln 2 > 0. Using *ηq*, we can define the pseudo-addition operation ⊕*<sup>q</sup>* [7,8]

$$x \oplus\_q y = \eta\_q \left( \eta\_q^{-1}(x) + \eta\_q^{-1}(y) \right) = x + y + (1 - q)\,x y \ln 2; \quad x, y \in \mathbb{R}, \tag{5}$$

and its inverse operation, the pseudo-subtraction,

$$x \ominus\_q y = \eta\_q \left( \eta\_q^{-1}(x) - \eta\_q^{-1}(y) \right) = \frac{x - y}{1 + (1 - q)y \ln 2}; \quad x, y \in \mathbb{R}. \tag{6}$$

The operation ⊕<sub>*q*</sub> can be rewritten in terms of the generalized logarithm by setting *x* = log *u* and *y* = log *v*, so that

$$\operatorname{Log}\_q(u \cdot v) = \operatorname{Log}\_q(u) \oplus\_q \operatorname{Log}\_q(v); \quad u, v \in \mathbb{R}\_+.\tag{7}$$
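To make the deformed arithmetic concrete, the following minimal Python sketch (ours, not from the paper; the function names are illustrative) implements *η<sub>q</sub>*, its inverse, the *q*-logarithm and the pseudo-operations, and can be used to verify identities such as (7) numerically:

```python
import math

def eta_q(x, q):
    # eta_q, Eq. (1): identity for q = 1, otherwise (2^((1-q)x) - 1)/((1-q) ln 2)
    if q == 1:
        return x
    return (2 ** ((1 - q) * x) - 1) / ((1 - q) * math.log(2))

def eta_q_inv(x, q):
    # inverse mapping, Eq. (2)
    if q == 1:
        return x
    return math.log2((1 - q) * x * math.log(2) + 1) / (1 - q)

def log_q(x, q):
    # q-logarithm, Eq. (3): Log_q(x) = eta_q(log2 x)
    return eta_q(math.log2(x), q)

def q_add(x, y, q):
    # pseudo-addition, computed by transporting ordinary addition through eta_q
    return eta_q(eta_q_inv(x, q) + eta_q_inv(y, q), q)

def q_sub(x, y, q):
    # pseudo-subtraction in its closed form, Eq. (6)
    return (x - y) / (1 + (1 - q) * y * math.log(2))
```

For *q* = 1 all of the deformed operations reduce to their ordinary counterparts, mirroring the case analysis in (1)–(4).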

Let the set of all *n*-dimensional distributions be denoted with

$$\Delta\_n \equiv \left\{ (p\_1, \dots, p\_n) \, \Big|\, p\_i \ge 0, \sum\_{i=1}^n p\_i = 1 \right\}; \quad n > 1. \tag{8}$$

Let the function *H<sub>n</sub>* : Δ*<sub>n</sub>* → R<sup>+</sup><sub>0</sub> satisfy the following generalized Shannon–Khinchin axioms, for all *n* ∈ N, *n* > 1.

GSK1 *H<sub>n</sub>* is continuous in Δ*<sub>n</sub>*;

GSK2 *H<sub>n</sub>* takes its maximal value for the uniform distribution;

GSK3 *H<sub>n</sub>* is expandable: *H*<sub>*n*+1</sub>(*p*<sub>1</sub>, ... , *p<sub>n</sub>*, 0) = *H<sub>n</sub>*(*p*<sub>1</sub>, ... , *p<sub>n</sub>*);

GSK4 *H<sub>nm</sub>* satisfies the generalized additivity rule

$$H\_{nm}(PQ) = H\_n(P) \oplus\_q H\_m(Q|P), \quad \text{where} \quad H\_m(Q|P) = f^{-1}\left(\sum\_{k=1}^n p\_k^{(\alpha)} f\left(H\_m(Q\_{|k})\right)\right), \tag{9}$$

where *f* is an invertible continuous function and *P*<sup>(*α*)</sup> = (*p*<sub>1</sub><sup>(*α*)</sup>, ... , *p<sub>n</sub>*<sup>(*α*)</sup>) ∈ Δ*<sub>n</sub>* is the *α*-escort distribution of a distribution *P* ∈ Δ*<sub>n</sub>*, defined in

$$p\_k^{(\alpha)} = \frac{p\_k^{\alpha}}{\sum\_{i=1}^n p\_i^{\alpha}}, \quad k = 1, \dots, n, \quad \alpha > 0. \tag{10}$$

GSK5 $H\_2\left(\frac{1}{2}, \frac{1}{2}\right) = \operatorname{Log}\_q 2$.

As shown in [9], the unique function *H<sub>n</sub>* which satisfies [GSK1]–[GSK5] is the Sharma–Mittal entropy [6].

In the following paragraphs we will assume that *X* and *Y* are discrete jointly distributed random variables taking values from the sample spaces {*x*<sub>1</sub>, ... , *x<sub>n</sub>*} and {*y*<sub>1</sub>, ... , *y<sub>m</sub>*}, distributed in accordance with *P<sub>X</sub>* ∈ Δ*<sub>n</sub>* and *P<sub>Y</sub>* ∈ Δ*<sub>m</sub>*, respectively. In addition, the joint distribution of *X* and *Y* will be denoted by *P*<sub>*X*,*Y*</sub> ∈ Δ*<sub>nm</sub>*, and the conditional distribution of *X* given *Y* = *y* will be denoted by *P*<sub>*X*|*Y*=*y*</sub>(*x*) = *P*<sub>*X*,*Y*</sub>(*x*, *y*)/*P<sub>Y</sub>*(*y*) ∈ Δ*<sub>n</sub>*, provided that *P<sub>Y</sub>*(*y*) > 0. We will identify the entropy of a random variable *X* with the entropy of its distribution *P<sub>X</sub>*, and the Sharma–Mittal entropy will be denoted with *H*<sub>*α*,*q*</sub>(*X*) ≡ *H<sub>n</sub>*(*P<sub>X</sub>*).

Thus, for a random variable *X*, the Sharma–Mittal entropy can be expressed as

$$H\_{\alpha,q}(X) = \frac{1}{(1-q)\ln 2} \left( \left( \sum\_{x} P\_X(x)^{\alpha} \right)^{\frac{1-q}{1-\alpha}} - 1 \right),\tag{11}$$

and it can equivalently be expressed as the *η<sup>q</sup>* transformation of Rényi entropy as in

$$H\_{\alpha,q}(X) \equiv \eta\_q(R\_{\alpha}(X)).\tag{12}$$

The Sharma–Mittal entropy is defined in this way for *α*, *q* ∈ R<sup>+</sup><sub>0</sub> \ {1}, it is a continuous function of the parameters, and the sums go over the support of *P<sub>X</sub>*; the remaining cases are obtained as limits. Thus, in the case of *q* → 1, *α* ≠ 1, the Sharma–Mittal entropy reduces to the Rényi entropy of order *α* [2]

$$R\_{\alpha}(X) \equiv H\_{\alpha,1}(X) = \frac{1}{1-\alpha} \log \left( \sum\_{x} P\_X(x)^{\alpha} \right),\tag{13}$$

which further reduces to the Shannon entropy in the limit *α* → 1 [34]

$$S(X) \equiv H\_{1,1}(X) = -\sum\_{x} P\_X(x) \log P\_X(x),\tag{14}$$

while in the case of *α* = 1, *q* ≠ 1 it reduces to the Gaussian entropy [5]

$$G\_q(X) \equiv H\_{1,q}(X) = \frac{1}{(1-q)\ln 2} \left( \prod\_{x} P\_X(x)^{(q-1)P\_X(x)} - 1 \right). \tag{15}$$

In addition, the Tsallis entropy [3] is obtained for *α* = *q* ≠ 1,

$$T\_q(X) \equiv H\_{q,q}(X) = \frac{1}{(1-q)\ln 2} \left(\sum\_{x} P\_X(x)^{q} - 1\right),\tag{16}$$

while in the case of *q* = 2 − *α* it reduces to the Landsberg–Vedral entropy [4]

$$L\_{\alpha}(X) \equiv H\_{\alpha,2-\alpha}(X) = \frac{1}{(\alpha-1)\ln 2} \left(\frac{1}{\sum\_{x} P\_X(x)^{\alpha}} - 1\right). \tag{17}$$
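The reductions above are easy to cross-check numerically. The sketch below (an illustration under the parameter restrictions *α*, *q* ≠ 1; the names are ours) evaluates (11) and its limits:

```python
import math

def sharma_mittal(p, alpha, q):
    # Eq. (11), for alpha != 1 and q != 1; the sum goes over the support of p
    s = sum(pi ** alpha for pi in p if pi > 0)
    return (s ** ((1 - q) / (1 - alpha)) - 1) / ((1 - q) * math.log(2))

def renyi(p, alpha):
    # Eq. (13): the q -> 1 limit of Eq. (11)
    return math.log2(sum(pi ** alpha for pi in p if pi > 0)) / (1 - alpha)

def shannon(p):
    # Eq. (14): the alpha -> 1, q -> 1 limit
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)
```

The Tsallis (16) and Landsberg–Vedral (17) instances follow from (11) exactly, since the exponent (1 − *q*)/(1 − *α*) equals 1 and −1, respectively.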

#### **3. Sharma–Mittal Information Transfer Axioms**

One of the main goals of information and communication theory is the characterization and analysis of the information transfer between a sender *X* and a receiver *Y*, which communicate through a channel. The sender and receiver are described by the probability distributions *P<sub>X</sub>* and *P<sub>Y</sub>*, while the communication channel with the input *X* and the output *Y* is described by the transition matrix *P*<sub>*Y*|*X*</sub>:

$$P\_{Y|X}^{(i,j)} \equiv P\_{Y|X}(y\_j|x\_i). \tag{18}$$

We assume that maximum likelihood detection is performed at the receiver, which is defined by the mapping *d* : {*y*1,..., *ym*}→{*x*1,..., *xn*} as follows:

$$d(y\_j) = \mathbf{x}\_i \quad \Leftrightarrow \quad P\_{Y|X}(y\_j|\mathbf{x}\_i) > P\_{Y|X}(y\_j|\mathbf{x}\_k); \quad \text{for all } k \neq i,\tag{19}$$

assuming that the inequality in (19) is uniquely satisfied. Thus, if the input symbol *x<sub>i</sub>* is sent and the output symbol *y<sub>j</sub>* is received, *x<sub>i</sub>* will be detected if *x<sub>i</sub>* = *d*(*y<sub>j</sub>*), and a detection error will be made otherwise. Accordingly, we define the error function *φ* : {*x*<sub>1</sub>, ..., *x<sub>n</sub>*} × {*y*<sub>1</sub>, ..., *y<sub>m</sub>*} → {0, 1} as in

$$\phi(x\_i, y\_j) = \begin{cases} 0, & \text{if } x\_i = d(y\_j) \\ 1, & \text{otherwise,} \end{cases} \tag{20}$$

the detection error probability if a symbol *x<sub>i</sub>* is sent,

$$P\_{\rm err}(x\_i) = \sum\_{y\_j} P\_{Y|X}(y\_j|x\_i)\,\phi(x\_i, y\_j); \quad \text{for all} \quad x\_i, \tag{21}$$

as well as the average detection error

$$\bar{P}\_{\rm err} = \sum\_{x\_i} P\_X(x\_i) P\_{\rm err}(x\_i) = \sum\_{x\_i, y\_j} P\_{X,Y}(x\_i, y\_j)\,\phi(x\_i, y\_j). \tag{22}$$
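A maximum likelihood detector and the error probabilities (19)–(22) can be sketched as follows (an illustrative implementation, ours; *φ* is treated as the error indicator, equal to 1 exactly when detection fails):

```python
def ml_detect(PYX):
    # Eq. (19): d(y_j) = arg max_i P_{Y|X}(y_j | x_i), assumed unique
    m = len(PYX[0])
    return [max(range(len(PYX)), key=lambda i: PYX[i][j]) for j in range(m)]

def avg_error(PX, PYX):
    # Eqs. (20)-(22): average P_X(x_i) P_{Y|X}(y_j|x_i) over the pairs
    # (x_i, y_j) on which the detector decides incorrectly (phi = 1)
    d = ml_detect(PYX)
    return sum(PX[i] * PYX[i][j]
               for i in range(len(PX))
               for j in range(len(PYX[0]))
               if d[j] != i)
```

For a noiseless channel the sum is empty and the average error is zero, while a symmetric channel with crossover probability *p* and uniform input yields exactly *p*.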

**Totally destructive channel**: A channel is said to be totally destructive if

$$P\_{Y|X}^{(i,j)} = P\_{Y|X}(y\_j|x\_i) = P\_Y(y\_j) = \frac{1}{m}; \quad \text{for all} \quad x\_i, \tag{23}$$

i.e., if the sender *X* and receiver *Y* are described by independent random variables,

$$X \perp\!\!\!\perp Y \quad \Leftrightarrow \quad P\_{X,Y}(x,y) = P\_X(x)P\_Y(y), \tag{24}$$

where the relationship of independence is denoted by ⊥⊥. In this case, the inequality in (19) cannot be satisfied, so *φ*(*x<sub>i</sub>*, *y<sub>j</sub>*) = 1 for all *x<sub>i</sub>*, *y<sub>j</sub>*, and the probability of error is *P*<sub>err</sub>(*x<sub>i</sub>*) = 1 for all *x<sub>i</sub>*, as well as the average probability of error *P̄*<sub>err</sub> = 1, which means that correct maximum likelihood detection is not possible.

**Perfect communication channel**: A channel is said to be perfect if for every *xi*,

$$P\_{Y|X}(y\_j|x\_i) > 0, \quad \text{for at least one } y\_j \tag{25}$$

and for every *yj*

$$P\_{Y|X}(y\_j|x\_i) > 0, \quad \text{for exactly one } x\_i. \tag{26}$$

Note that in this case *P*<sub>*Y*|*X*</sub>(*y<sub>j</sub>*|*x<sub>i</sub>*) can still take a zero value for some *y<sub>j</sub>*, and that *φ*(*x<sub>i</sub>*, *y<sub>j</sub>*) = 0 whenever *P*<sub>*Y*|*X*</sub>(*y<sub>j</sub>*|*x<sub>i</sub>*) > 0. Thus, the error probability is equal to zero, *P*<sub>err</sub>(*x<sub>i</sub>*) = 0 for all *x<sub>i</sub>*, as well as the average probability of error *P̄*<sub>err</sub> = 0, which means that perfect detection is possible by means of a maximum likelihood detector.

**Noisy channel with non-overlapping outputs**: A simple example of a perfect transmission channel is the noisy channel with non-overlapping outputs (NOC), which is schematically described in Figure 1. It is a 2-input, *m* = 2*k*-output channel (*k* ∈ N) defined by the transition matrix:

$$P\_{Y|X} = \begin{bmatrix} P\_{Y|X}(\cdot|\mathbf{x}\_1) \\ P\_{Y|X}(\cdot|\mathbf{x}\_2) \end{bmatrix} = \begin{bmatrix} \frac{1}{k} & \dots & \frac{1}{k} & 0 & \dots & 0 \\ 0 & \dots & 0 & \frac{1}{k} & \dots & \frac{1}{k} \end{bmatrix} \tag{27}$$

(in this and in the following matrices, the symbol "··· " stands for the *k*-time repetition). In the case of *k* = 1 and *m* = 2*k* = 2, the channel reduces to the noiseless channel. Although the channel is noisy, the input can always be recovered from the output (if *y<sub>j</sub>* is received and *j* ≤ *k*, the input symbol *x*<sub>1</sub> was sent; otherwise, *x*<sub>2</sub> was sent). Thus, it is expected that the information which is passed through the channel is equal to the information that can be generated by the input. Note that for a channel input distributed in accordance with

$$P\_X = \left[ \begin{array}{c} P\_X(\mathbf{x}\_1) \\ P\_X(\mathbf{x}\_2) \end{array} \right] = \left[ \begin{array}{c} a \\ 1 - a \end{array} \right]; \quad 0 \le a \le 1,\tag{28}$$

the joint probability distribution *PX*,*<sup>Y</sup>* can be expressed as in:

$$P\_{X,Y} = \begin{bmatrix} \frac{a}{k} & \dots & \frac{a}{k} & 0 & \dots & 0\\ 0 & \dots & 0 & \frac{1-a}{k} & \dots & \frac{1-a}{k} \end{bmatrix} \tag{29}$$

and the output distribution *PY*, which can be obtained by the summations over columns, is

$$P\_Y = \begin{bmatrix} P\_Y(y\_1), \dots, P\_Y(y\_m) \end{bmatrix}^T = \begin{bmatrix} \frac{a}{k}, \dots, \frac{a}{k}, \frac{1-a}{k}, \dots, \frac{1-a}{k} \end{bmatrix}^T. \tag{30}$$

**Binary symmetric channels**: The binary symmetric channel (BSC) is a two-input, two-output channel described by the transition matrix

$$P\_{Y|X} = \begin{bmatrix} P\_{Y|X}(\cdot|\mathbf{x}\_1)^T \\ P\_{Y|X}(\cdot|\mathbf{x}\_2)^T \end{bmatrix} = \begin{bmatrix} 1-p & p \\ p & 1-p \end{bmatrix} \tag{31}$$

which is schematically described in Figure 2. Note that for *p* = 1/2 the BSC reduces to a totally destructive channel, while in the case of *p* = 0 it reduces to a perfect channel.

**Figure 1.** Noisy channel with non-overlapping outputs.

**Figure 2.** Binary symmetric channel.
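The channel constructions above can be reproduced in a few lines. The following sketch (ours, using exact rational arithmetic) builds the NOC transition matrix (27) and recovers the joint distribution (29) and the output distribution (30) by column sums:

```python
from fractions import Fraction as F

def noc_channel(k):
    # Eq. (27): 2-input, m = 2k-output channel with non-overlapping outputs
    return [[F(1, k)] * k + [F(0)] * k,
            [F(0)] * k + [F(1, k)] * k]

def joint_and_output(PX, PYX):
    # Eq. (29): P_{X,Y}(x_i, y_j) = P_X(x_i) P_{Y|X}(y_j | x_i);
    # Eq. (30): P_Y as the column sums of P_{X,Y}
    n, m = len(PYX), len(PYX[0])
    PXY = [[PX[i] * PYX[i][j] for j in range(m)] for i in range(n)]
    PY = [sum(PXY[i][j] for i in range(n)) for j in range(m)]
    return PXY, PY
```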

#### *Sharma–Mittal Information Transfer Axioms*

In this paper, we search for information-theoretic measures of the information transfer between a sender *X* and a receiver *Y* which communicate through a channel, if the information is measured with the Sharma–Mittal entropy. Thus, we are interested in an information transfer measure *I*<sub>*α*,*q*</sub>(*X*, *Y*), which is called the *α*-*q*-mutual information, and its maximum,

$$C\_{\alpha,q}(P\_{Y|X}) = \max\_{P\_X} I\_{\alpha,q}(X, Y), \tag{32}$$

which is called the *α*-*q*-capacity, and which we require to satisfy the following set of axioms. (*A*1) The channel cannot convey negative information, i.e.,

$$C\_{\alpha,q}(P\_{Y|X}) \ge I\_{\alpha,q}(X, Y) \ge 0. \tag{33}$$

(*A*2) The information transfer is zero in the case of a totally destructive channel, i.e.,

$$P\_{Y|X}(y|x) = \frac{1}{m}, \text{ for all } x, y \quad \Rightarrow \quad I\_{\alpha,q}(X, Y) = C\_{\alpha,q}(P\_{Y|X}) = 0,\tag{34}$$

which is consistent with the conclusion that the average probability of error is one, *P*¯ *err* = 1, in the case of a totally destructive channel.

(*A*3) In the case of perfect transmission, the information transfer is equal to the input information, i.e.,

$$X = Y \quad \Rightarrow \quad I\_{\alpha,q}(X,Y) = H\_{\alpha,q}(X), \quad C\_{\alpha,q}(P\_{Y|X}) = \operatorname{Log}\_q n,\tag{35}$$

which is consistent with the conclusion that the average probability of error is zero, *P*¯ *err* = 0, in the case of a perfect transmission channel, so that all the information from the input is conveyed.

(*A*4) The channel cannot transfer more information than it is possible to be sent, i.e.,

$$I\_{\alpha,q}(X,Y) \le C\_{\alpha,q}(P\_{Y|X}) \le \operatorname{Log}\_q n, \tag{36}$$

which means that a channel cannot add additional information.

(*A*5) The channel cannot transfer more information than it is possible to be received, i.e.,

$$I\_{\alpha,q}(X,Y) \le C\_{\alpha,q}(P\_{Y|X}) \le \operatorname{Log}\_q m, \tag{37}$$

which means that a channel cannot add additional information.

(*A*6) Consistency with the Shannon case:

$$\lim\_{q \to 1,\, \alpha \to 1} I\_{\alpha,q}(X, Y) = I(X, Y), \quad \text{and} \quad \lim\_{q \to 1,\, \alpha \to 1} C\_{\alpha,q}(P\_{Y|X}) = C(P\_{Y|X}). \tag{38}$$

Thus, the axioms (*A*2) and (*A*3) ensure that the information measures are consistent with the maximum likelihood detection (19)–(21). On the other hand, the axioms (*A*1), (*A*4) and (*A*5) prevent a situation in which a physical system conveys information in spite of going through a completely destructive channel, or in which a negative information transfer is observed, indicating that the channel adds or removes information by itself, which could be treated as nonphysical behavior without an intuitive explanation. Finally, the property (*A*6) ensures that the information transfer measures can be considered as generalizations of the corresponding Shannon measures. For these reasons, we assume that the satisfaction of the properties (*A*1)–(*A*5) is mandatory for any reasonable definition of Sharma–Mittal information transfer measures.

#### **4. The** *α***-Mutual Information and the** *α***-Capacity**

One of the first proposals for the Rényi mutual information goes back to Arimoto [24], who considered the following definition of mutual information:

$$I\_{\alpha}(X,Y) = \frac{\alpha}{\alpha-1} \log \left( \sum\_{y} \left( \sum\_{x} P\_X^{(\alpha)}(x)\, P\_{Y|X}^{\alpha}(y|x) \right)^{\frac{1}{\alpha}} \right),\tag{39}$$

where the escort distribution *P<sub>X</sub>*<sup>(*α*)</sup> is defined as in (10). Arimoto also invented an iterative algorithm for the computation of the *α*-capacity [35], which is defined as the maximum of the *α*-mutual information:

$$\mathbb{C}\_{\mathfrak{a}}(P\_{Y|X}) = \max\_{P\_X} I\_{\mathfrak{a}}(X, Y). \tag{40}$$

Notably, Arimoto's mutual information can equivalently be represented using the conditional Rényi entropy

$$R\_{\alpha}(X|Y) = \frac{\alpha}{1-\alpha} \log\_2 \sum\_{y} P\_Y(y) \left( \sum\_{x} P\_{X|Y=y}(x)^{\alpha} \right)^{\frac{1}{\alpha}},\tag{41}$$

as in

$$I\_{\alpha}(X,Y) \equiv R\_{\alpha}(X) - R\_{\alpha}(X|Y),\tag{42}$$

which can be interpreted as the input uncertainty reduction after the output symbols are received; in the case of *α* → 1, the definition reduces to the Shannon case. In addition, this measure is directly related to the famous Gallager exponent

$$E\_0(\rho, P\_X) = -\log\left(\sum\_{y} \left(\sum\_{x} P\_X(x)\, P\_{Y|X}(y|x)^{\frac{1}{1+\rho}}\right)^{1+\rho}\right),\tag{43}$$

which has been widely used to establish the upper bound of error probability in channel coded communication systems [36] via the relationship [29]

$$I\_{\alpha}(X,Y) = \frac{\alpha}{1-\alpha}\, E\_0\left(\frac{1}{\alpha} - 1, P\_X^{(\alpha)}\right). \tag{44}$$

In addition, in the case of *α* → 1, it reduces to

$$I\_1(X,Y) = \lim\_{a \to 1} I\_a(X,Y) = I(X,Y), \tag{45}$$

where

$$I(X,Y) = \sum\_{x,y} P\_{X,Y}(x,y) \log \frac{P\_{X,Y}(x,y)}{P\_X(x)P\_Y(y)} \tag{46}$$

stands for Shannon's mutual information [37].
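The measures (39)–(44) are straightforward to evaluate numerically. The sketch below (our illustration; note that with the *α*/(*α* − 1) prefactor, *I<sub>α</sub>* coincides with *R<sub>α</sub>*(*X*) for a perfect channel) can be used to check the Gallager-exponent relationship (44):

```python
import math

def escort(PX, alpha):
    # Eq. (10): alpha-escort distribution
    s = sum(p ** alpha for p in PX)
    return [p ** alpha / s for p in PX]

def arimoto_mi(PX, PYX, alpha):
    # Eq. (39): Arimoto's alpha-mutual information (base-2 logarithm)
    Pa = escort(PX, alpha)
    inner = sum(sum(Pa[i] * PYX[i][j] ** alpha for i in range(len(PX))) ** (1 / alpha)
                for j in range(len(PYX[0])))
    return alpha / (alpha - 1) * math.log2(inner)

def gallager_e0(rho, PX, PYX):
    # Eq. (43): Gallager exponent (base-2 logarithm)
    return -math.log2(sum(
        sum(PX[i] * PYX[i][j] ** (1 / (1 + rho)) for i in range(len(PX))) ** (1 + rho)
        for j in range(len(PYX[0]))))
```

Substituting ρ = 1/*α* − 1 and the escort of *P<sub>X</sub>* into the exponent reproduces (44) term by term.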

The *α*-mutual information *I<sub>α</sub>*(*X*, *Y*) and the *α*-capacity *C<sub>α</sub>*(*P*<sub>*Y*|*X*</sub>) satisfy the axioms (*A*1)–(*A*6) for *q* = 1 and *α* > 0, as stated by the following theorem, which further justifies their usage as measures of (maximal) information transfer.

**Theorem 1.** *The α-mutual information I<sub>α</sub> and the α-capacity C<sub>α</sub> satisfy the following set of properties:*

*(A*1*) The channel cannot convey negative information, i.e.,*

$$\mathbb{C}\_{\mathfrak{a}}(P\_{Y|X}) \ge I\_{\mathfrak{a}}(X,Y) \ge 0. \tag{47}$$

*(A*2*) The (maximal) information transfer is zero in the case of a totally destructive channel, i.e.,*

$$P\_{Y|X}(y|x) = \frac{1}{m}, \text{ for all } x, y \quad \Rightarrow \quad I\_{\alpha}(X, Y) = C\_{\alpha}(P\_{Y|X}) = 0. \tag{48}$$

*(A*3*) In the case of perfect transmission, the (maximal) information transfer is equal to the (maximal) input information, i.e.,*

$$X = Y \quad \Rightarrow \quad I\_{\mathfrak{a}}(X, Y) = R\_{\mathfrak{a}}(X), \quad \mathbb{C}\_{\mathfrak{a}}(P\_{Y|X}) = \log n. \tag{49}$$

*(A*4*) The channel cannot transfer more information than it is possible to be sent, i.e.,*

$$I\_a(X,Y) \le \mathbb{C}\_a(P\_{Y|X}) \le \log n;\tag{50}$$

*(A*5*) The channel cannot transfer more information than it is possible to be received, i.e.,*

$$I\_{\mathfrak{a}}(X,Y) \le \mathbb{C}\_{\mathfrak{a}}(P\_{Y|X}) \le \log m. \tag{51}$$

*(A*6*) Consistency with the Shannon case:*

$$\lim\_{a \to 1} I\_{\mathfrak{a}}(X, Y) = I(X, Y), \quad \text{and} \quad \lim\_{a \to 1} \mathbb{C}\_{\mathfrak{a}}(P\_{Y|X}) = \mathbb{C}(P\_{Y|X}) \tag{52}$$

**Proof.** As shown in [38], *R<sub>α</sub>*(*X*|*Y*) ≤ *R<sub>α</sub>*(*X*), and the nonnegativity property (*A*1) follows from the definition of Arimoto's mutual information (42). In addition, if *X* ⊥⊥ *Y*, then *P*<sub>*Y*|*X*</sub>(*y*|*x*) = *P<sub>Y</sub>*(*y*), so that the definition (39) implies the property (*A*2). Furthermore, in the case of a perfect transmission channel, the mutual information (39) can be represented as

$$I\_{\alpha}(X,Y) = \frac{\alpha}{\alpha-1} \log \frac{\sum\_{y} \left(\sum\_{x} P\_X(x)^{\alpha} P\_{Y|X}^{\alpha}(y|x)\right)^{\frac{1}{\alpha}}}{\left(\sum\_{x} P\_X(x)^{\alpha}\right)^{\frac{1}{\alpha}}} = \frac{\alpha}{\alpha-1} \log \frac{\sum\_{y} \left(P\_X(d(y))^{\alpha} P\_{Y|X}^{\alpha}(y \mid d(y))\right)^{\frac{1}{\alpha}}}{\left(\sum\_{x} P\_X(x)^{\alpha}\right)^{\frac{1}{\alpha}}},\tag{53}$$

and since

$$\sum\_{y} \left( P\_X(d(y))^{\alpha} P\_{Y|X}^{\alpha}(y \mid d(y)) \right)^{\frac{1}{\alpha}} = \sum\_{y} P\_X(d(y)) P\_{Y|X}(y \mid d(y)) = \sum\_{x} P\_X(x) \sum\_{y:d(y) = x} P\_{Y|X}(y|x) = \sum\_{x} P\_X(x) = 1,\tag{54}$$

we obtain *I<sub>α</sub>*(*X*, *Y*) = *R<sub>α</sub>*(*X*), which proves the property (*A*3). Moreover, as shown in [38], Arimoto's conditional entropy is nonnegative and satisfies the weak chain rule *R<sub>α</sub>*(*X*|*Y*) ≥ *R<sub>α</sub>*(*X*) − log *m*, so that the properties (*A*4) and (*A*5) follow from the definition of Arimoto's mutual information (42). Finally, the property (*A*6) follows directly from Equation (45) and can be verified using L'Hôpital's rule, which completes the proof of the theorem.

#### **5. Alternative Definitions of the** *α***-Mutual Information and the** *α***-Channel Capacity**

Since Rényi's proposal, there have been several lines of research to find an appropriate definition and characterization of information transfer measures related to Rényi entropy, which are established by the substitution of the Rényi divergence measure

$$D\_{\mathfrak{a}}(P||Q) = \frac{1}{\mathfrak{a} - 1} \log \left( \sum\_{\mathfrak{x}} P(\mathfrak{x})^{\mathfrak{a}} Q(\mathfrak{x})^{1 - \mathfrak{a}} \right),\tag{55}$$

instead of the Kullback–Leibler one,

$$D(P||Q) = D\_1(P||Q) = \sum\_{\mathbf{x}} P(\mathbf{x}) \log \frac{P(\mathbf{x})}{Q(\mathbf{x})},\tag{56}$$

in some of the various definitions which are equivalent in the case of Shannon information measures (46) [29]:

$$\begin{split} I(X,Y) &= \min\_{Q\_Y} \mathbb{E}\left[D\left(P\_{Y|X} \| Q\_Y\right)\right] = \min\_{Q\_X} \min\_{Q\_Y} D\left(P\_{X,Y} \| Q\_X Q\_Y\right) \\ &= D\left(P\_{X,Y} \| P\_X P\_Y\right) = S(X) - S(X|Y), \end{split} \tag{57}$$

where *S*(*X*|*Y*) stands for the Shannon conditional entropy,

$$S(X|Y) = -\sum\_{x, y} P\_{X,Y}(x, y) \log P\_{X|Y}(x|y).\tag{58}$$

All of these measures are consistent with the Shannon case in view of the property (*A*6), but their direct usage as measures of Rényi information transfer leads to a breaking of some of the properties (*A*1)–(*A*5), which justifies the usage of Arimoto's measures from the previous section as the appropriate ones in the context of this research. In the following subsections, we review the alternative definitions.
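The divergences (55) and (56) that underlie these definitions can be sketched as follows (illustrative code, ours; the *α* → 1 limit recovers the Kullback–Leibler divergence):

```python
import math

def renyi_div(P, Q, alpha):
    # Eq. (55), for alpha != 1; terms with P(x) = 0 contribute zero
    s = sum(p ** alpha * q ** (1 - alpha) for p, q in zip(P, Q) if p > 0)
    return math.log2(s) / (alpha - 1)

def kl_div(P, Q):
    # Eq. (56): the Kullback-Leibler divergence
    return sum(p * math.log2(p / q) for p, q in zip(P, Q) if p > 0)
```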

#### *5.1. Information Transfer Measures by Sibson*

Alternative approaches based on the Rényi divergence were proposed by Sibson [23], and considered later by several authors in the context of quantum secure communications [39–44]; Sibson introduced

$$J\_{\alpha}^1(X;Y) = \min\_{Q\_Y} D\_{\alpha} \left( P\_{Y|X} P\_X \,\|\, Q\_Y P\_X \right), \tag{59}$$

which can be represented as in [26]

$$J\_{\alpha}^1(X,Y) = \frac{\alpha}{\alpha - 1} \log \left( \sum\_{y} \left( \sum\_{x} P\_X(x)\, P\_{Y|X}^{\alpha}(y|x) \right)^{\frac{1}{\alpha}} \right) \tag{60}$$

and, in the discrete setting, can be related to the Gallager exponent as in [29]:

$$J\_{\alpha}^{1}(X,Y) = \frac{\alpha}{1-\alpha} E\_{0}\left(\frac{1}{\alpha} - 1, P\_{X}\right),\tag{61}$$

which differs from Arimoto's relationship (44), since in this case the ordinary input distribution, rather than the escort one, participates in the error exponent. However, in the case of a perfect channel for which *X* = *Y*, the conditional distribution satisfies *P*<sub>*Y*|*X*</sub>(*y*|*x*) = 1 for *x* = *y* and zero otherwise, so Sibson's measure (60) reduces to *R*<sub>1/*α*</sub>(*X*), thus breaking the axiom (*A*3). This disadvantage can be overcome by the reparametrization *α* ↔ 1/*α*, so that *J*<sup>1</sup><sub>1/*α*</sub>(*X*, *Y*) is used as a measure of Rényi information transfer, and the properties of the resulting measure can be considered in a manner similar to the case of Arimoto.

#### *5.2. Information Transfer Measures by Augustin and Csiszar*

An alternative definition of Rényi mutual information was also presented by Augustin [25], and later Csiszar [26], who defined

$$J\_{\alpha}^2(X;Y) = \min\_{Q\_Y} \mathbb{E}\left[D\_{\alpha}\left(P\_{Y|X} \| Q\_Y\right)\right].\tag{62}$$

However, in the case of perfect transmission, for which *X* = *Y*, the measure reduces to Shannon entropy

$$J\_{\alpha}^2(X;Y) = S(X),\tag{63}$$

which breaks the axiom (*A*3).

#### *5.3. Information Transfer Measures by Lapidoth, Pfister, Tomamichel and Hayashi*

An obstacle similar to that of the Augustin–Csiszar measure can be observed for the mutual information considered by Lapidoth and Pfister [27] and Tomamichel and Hayashi [28], who proposed

$$J\_{\alpha}^3(X;Y) = \min\_{Q\_X} \min\_{Q\_Y} D\_{\alpha}(P\_{X,Y} \| Q\_X Q\_Y). \tag{64}$$

As shown in [27] (Lemma 11), if *X* = *Y*, then

$$J\_{\alpha}^3(X;Y) = \begin{cases} \frac{\alpha}{1-\alpha}\, R\_{\infty}(X), & \text{if } \alpha \in \left[0, \frac{1}{2}\right],\\ R\_{\frac{\alpha}{2\alpha-1}}(X), & \text{if } \alpha > \frac{1}{2}, \end{cases} \tag{65}$$

where *R*<sub>∞</sub>(*X*) = lim<sub>*α*→∞</sub> *R<sub>α</sub>*(*X*),

so the axiom (*A*3) is broken in this case, as well.

**Remark 1.** *Despite the difference between the definitions of information transfer, in the discrete setting, the alternative definitions discussed above reach the same maximum over the set of input probability distributions, PX, [26,29,45].*

*5.4. Information Transfer Measures by Chapeau-Blondeau, Delahaies, Rousseau, Tridenski, Zamir, Ingber and Harremoes*

Chapeau-Blondeau, Delahaies and Rousseau [31], and independently Tridenski, Zamir and Ingber [46] and Harremoes [47], defined the Rényi mutual information directly using the Rényi divergence (55), as

$$J\_{\alpha}^4(X,Y) = D\_{\alpha}(P\_{X,Y} \| P\_X P\_Y) \tag{66}$$

for *α* > 0 and *α* ≠ 1, while in the case of *α* → 1 it reduces to the Shannon mutual information. However, the original definition can correspond only to a Rényi entropy of order 2 − *α*, since in the case of *X* = *Y* it reduces to *J*<sup>4</sup><sub>*α*</sub>(*X*, *Y*) = *R*<sub>2−*α*</sub>(*X*) (see also [47]), which can be overcome by the reparametrization *α* ↔ 2 − *α*, similarly to the case of Sibson's measure. This measure has been discussed in the past with various operational characterizations, and could also be considered as a measure of information transfer, although the satisfaction of all of the axioms (*A*1)–(*A*6) is not self-evident for general channels.

#### *5.5. Information Transfer Measures by Jizba, Kleinert and Shefaat*

Finally, we will mention the definition by Jizba, Kleinert and Shefaat [48],

$$J\_{\alpha}^5(X,Y) \equiv R\_{\alpha}(X) - \hat{R}\_{\alpha}(X|Y),\tag{67}$$

which is defined in the same manner as in Arimoto's case (42), but with another choice of conditional Rényi entropy

$$\hat{R}_\alpha(X|Y) = \frac{1}{1-\alpha} \log \sum_{y} P_Y^{(\alpha)}(y)\, 2^{(1-\alpha)R_\alpha(X|Y=y)}, \tag{68}$$

which arises from the generalized Shannon–Khinchin axiom [GSK4] if the pseudo-additivity in Equation (9) is restricted to ordinary addition, in which case the GSK axioms uniquely determine the Rényi entropy [49]. However, despite its wide applicability in the modeling of causality and financial time series, this mutual information can take negative values, which breaks the axiom (*A*1), which is assumed to be mandatory in this paper. For further discussion of the physical meaning of negative mutual information in the domain of financial time series analysis, the reader is referred to [48].

#### **6. The** *α***-***q* **Mutual Information and the** *α***-***q***-Capacity**

In the past, several attempts have been made to define an appropriate channel capacity measure which corresponds to instances of the Sharma–Mittal entropy class. All of them follow a similar recipe, by which the channel capacity is defined as in (32), as a maximum of an appropriately defined mutual information *Iα*,*q*. However, all of them consider only special cases of the Sharma–Mittal entropy, and all of them fail to satisfy at least one of the properties (*A*1)–(*A*6) which an information transfer has to satisfy, as will be discussed in Section 7.

In this section, we propose general measures of the *α*-*q* mutual information and the *α*-*q*-capacity by the requirement that the axioms (*A*1)–(*A*6) are satisfied, which qualifies them as appropriate measures of information transfer, without nonphysical properties. The special instances of the *α*-*q* (maximal) information transfer measures are also discussed, and analytic expressions for a binary symmetric channel are provided.

#### *6.1. The α-q Information Transfer Measures and Their Instances*

The *α*-*q*-mutual information is defined, using the *q*-subtraction defined in (6), as follows:

$$I_{\alpha,q}(X,Y) = H_{\alpha,q}(X) \ominus_q H_{\alpha,q}(X|Y),\tag{69}$$

where we introduced the conditional Sharma–Mittal entropy *Hα*,*q*(*X*|*Y*) as in

$$H_{\alpha,q}(X|Y) = \eta_q(R_\alpha(X|Y)) = \frac{1}{(1-q)\ln 2} \left( \left( \sum_{y} P_Y(y) \left( \sum_{x} P_{X|Y=y}(x)^{\alpha} \right)^{\frac{1}{\alpha}} \right)^{\frac{\alpha(1-q)}{1-\alpha}} - 1 \right), \tag{70}$$

where *Rα*(*X*|*Y*) stands for Arimoto's definition of the conditional Rényi entropy (41). The expression (69) can also be obtained if the mapping *ηq* is applied to both sides of the equality (42), by which Arimoto's mutual information is defined, so we may establish the relationship

$$I\_{a,q}(X,Y) = \eta\_q(I\_a(X,Y)) = \eta\_q\left(\frac{a}{1-a}\log\left(\sum\_{y}\left(\sum\_{x}P\_X^{(a)}(x)P\_{Y|X}^a(y|x)\right)^{\frac{1}{a}}\right)\right),\tag{71}$$

which can be represented using the Gallager error exponent (43) as in

$$I\_{\mathfrak{a},\mathfrak{q}}(X,Y) = \eta\_{\mathfrak{q}}\left(\frac{\mathfrak{a}}{1-\mathfrak{a}}E\_0\left(\frac{1}{\mathfrak{a}}-1, P\_X^{(a)}\right)\right) = \frac{1}{(1-\mathfrak{q})\ln 2} \left(2^{\frac{a(1-\mathfrak{q})}{1-\mathfrak{a}}E\_0\left(\frac{1}{\mathfrak{a}}-1, P\_X^{(a)}\right)} - 1\right). \tag{72}$$

Arimoto's *α*-*q*-capacity is now defined in

$$C_{\alpha,q} = \max_{P_X} I_{\alpha,q}(X,Y), \tag{73}$$

and, using the fact that *ηq* is increasing, it can be related to the corresponding *α*-capacity as in

$$C_{\alpha,q} = \max_{P_X} I_{\alpha,q}(X,Y) = \max_{P_X} \eta_q(I_\alpha(X,Y)) = \eta_q\left(\max_{P_X} I_\alpha(X,Y)\right) = \eta_q\left(C_\alpha(P_{Y|X})\right). \tag{74}$$
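As a numerical illustration of the commutation of the maximization with the increasing map *ηq* in (74), the following sketch (function names are ours, not from the paper) computes Arimoto's *α*-mutual information of a BSC over a grid of input distributions and checks that maximizing *ηq*(*Iα*) agrees with applying *ηq* to the maximum of *Iα*:

```python
import math

def renyi(p, a):
    """Rényi entropy (base 2) of a distribution p, for order a != 1."""
    return math.log2(sum(x**a for x in p if x > 0)) / (1 - a)

def arimoto_cond(joint, a):
    """Arimoto's conditional Rényi entropy R_a(X|Y) from a joint matrix joint[x][y]."""
    inner = sum(sum(joint[x][y]**a for x in range(len(joint)))**(1 / a)
                for y in range(len(joint[0])))
    return (a / (1 - a)) * math.log2(inner)

def eta(x, q):
    """Monotone map eta_q carrying Rényi-type values to Sharma–Mittal-type values."""
    return x if q == 1 else (2**((1 - q) * x) - 1) / ((1 - q) * math.log(2))

def arimoto_mi(px, W, a):
    """Arimoto's alpha-mutual information I_a(X,Y) = R_a(X) - R_a(X|Y)."""
    joint = [[px[x] * W[x][y] for y in range(len(W[0]))] for x in range(len(W))]
    return renyi(px, a) - arimoto_cond(joint, a)

# BSC with crossover probability p; sweep input distributions on a grid
p, a, q = 0.1, 2.0, 0.5
W = [[1 - p, p], [p, 1 - p]]
grid = [i / 1000 for i in range(1, 1000)]
c_alpha = max(arimoto_mi([t, 1 - t], W, a) for t in grid)       # alpha-capacity
c_alpha_q = max(eta(arimoto_mi([t, 1 - t], W, a), q) for t in grid)
# eta_q is increasing, so the two maximizations pick the same input distribution
print(abs(c_alpha_q - eta(c_alpha, q)) < 1e-9)
```

The grid maximum is attained at the uniform input, in agreement with the result of Cai and Verdú quoted in Section 6.2.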

Using the expressions (45) and (71), in the case of *α* = 1, the *α*-*q* mutual information reduces to

$$\begin{split} I_{1,q} &= \frac{1}{(1-q)\ln 2} \left( \prod_{x,y} 2^{(1-q)\,P_{X,Y}(x,y) \log \frac{P_{X,Y}(x,y)}{P_X(x)P_Y(y)}} - 1 \right) \\ &= \frac{1}{(1-q)\ln 2} \left( \left(\prod_{x,y} \left( \frac{P_{X,Y}(x,y)}{P_X(x)P_Y(y)} \right)^{P_{X,Y}(x,y)}\right)^{1-q} - 1 \right). \end{split} \tag{75}$$

The *α*-*q*-capacity is given in

$$C_{1,q} = \max_{P_X} \frac{1}{(1-q)\ln 2} \left( \left(\prod_{x,y} \left( \frac{P_{X,Y}(x,y)}{P_X(x)P_Y(y)} \right)^{P_{X,Y}(x,y)}\right)^{1-q} - 1 \right) \tag{76}$$

and these measures can serve as (maximal) information transfer measures corresponding to the Gaussian entropy, which has not been considered before in the context of information transmission. Naturally, if, in addition, *q* → 1, the measures reduce to the Shannon mutual information and the Shannon capacity [37].

Additional special cases of the *α*-*q* (maximal) information transfer include the *α*-mutual information (42) and the *α*-capacity (40), which are obtained for *q* = 1; the measures which correspond to Tsallis entropy can be obtained for *q* = *α* and the ones which correspond to Landsberg–Vedral entropy for *q* = 2 − *α*. These special instances are listed in Table 1.

As discussed in Section 7, previously considered information measures cover only particular special cases and break at least one of the axioms (*A*1)–(*A*5), which leads to unexpected and counterintuitive conclusions about the channels, such as negative information transfer and achieving super-capacitance or sub-capacitance [4], which could be treated as nonphysical behavior. On the other hand, apart from their generality, the *α*-*q* information transfer measures proposed in this paper overcome these disadvantages, which qualifies them as appropriate measures, as stated in the following theorem.

**Theorem 2.** *The α-q information transfer measures Iα*,*<sup>q</sup> and Cα*,*<sup>q</sup> satisfy the set of the axioms (A*1*)–(A*6*).*

**Proof.** The proof is a straightforward application of the mapping *ηq* to the equations expressing the *α*-mutual information properties (*A*1)–(*A*5), while (*A*6) follows from the above discussion.

**Remark 2.** *Note that the symmetry Iα*,*q*(*X*,*Y*) = *Iα*,*q*(*Y*, *X*) *does not hold in general, neither in the case of the α-q mutual information nor in the case of the α mutual information [50,51], and if the mutual information is defined so that the symmetry is preserved, some of the axioms (A*1*)–(A*6*) might be broken. In addition, an alternative definition of the mutual information, Iα*,*q*(*Y*, *X*) = *Hα*,*q*(*Y*) − *Hα*,*q*(*Y*|*X*)*, which uses the ordinary subtraction operator instead of the* ⊖*q operation, can also be introduced, but in this case the property (A*5*) might not hold in general, as discussed in Section 7.*

#### *6.2. The α-q-Capacity of Binary Symmetric Channels*

As shown by Cai and Verdú [45], the *α*-mutual information of Arimoto's type *I<sup>α</sup>* is maximized for the uniform distribution *PX* = (1/2, 1/2), and Arimoto's *α*-capacity has the value

$$C_\alpha(BSC) = 1 - r_\alpha(p),\tag{77}$$

where the binary entropy function *rα* is defined as

$$r\_a(p) = R\_a(p, 1-p) = \frac{1}{1-a} \log(p^a + (1-p)^a),\tag{78}$$

for *α* > 0, *α* ≠ 1, while in the limit of *α* → 1, the expression (77) reduces to the well-known result for the Shannon capacity (see Fano [52])

$$\mathcal{C}\_1(B\mathcal{SC}) = \lim\_{a \to 1} \mathcal{C}\_a(B\mathcal{SC}) = 1 + p \log p + (1 - p) \log(1 - p). \tag{79}$$

The analytic expressions for the *α*-*q*-capacities of binary symmetric channels can be obtained from the expressions (74) and (77), so that

$$C_{\alpha,q}(BSC) = \eta_q(C_\alpha(BSC)) = \frac{1}{(1-q)\ln 2} \left( 2^{1-q} \left(p^\alpha + (1-p)^\alpha\right)^{-\frac{1-q}{1-\alpha}} - 1 \right);\tag{80}$$

in the case of *q* = 1, it reduces to the *α*-capacity (77), which corresponds to the Rényi entropy, while in the case of *α* = 1, it reduces to the capacity which corresponds to the Gaussian entropy,

$$C_{1,q}(BSC) = \frac{1}{(1-q)\ln 2} \left(\left(2p^p(1-p)^{1-p}\right)^{1-q} - 1\right). \tag{81}$$

The analytic expressions for BSC *α*-*q* capacities for other instances can straightforwardly be obtained by specifying the values of the parameters, whose instances are listed in Table 1, while the plots of the BSC *α*-*q*-capacities, which correspond to the Gaussian and the Tsallis entropies, are shown in Figures 3 and 4.
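As a quick sanity check of the closed form (80) (with the exponent −(1 − *q*)/(1 − *α*) following from *ηq* applied to (77); helper names below are ours), the following sketch verifies its limits: *q* → 1 recovers Arimoto's *α*-capacity (77), and *α*, *q* → 1 recovers the Shannon capacity (79):

```python
import math

def r_alpha(p, a):
    """Binary Rényi entropy function (78), a != 1."""
    return math.log2(p**a + (1 - p)**a) / (1 - a)

def bsc_capacity(p, a, q):
    """Closed-form alpha-q capacity of a BSC, following (80)."""
    if q == 1:
        return 1 - r_alpha(p, a)
    return (2**(1 - q) * (p**a + (1 - p)**a)**(-(1 - q) / (1 - a)) - 1) \
           / ((1 - q) * math.log(2))

p = 0.11
# q -> 1 recovers Arimoto's alpha-capacity (77)
assert abs(bsc_capacity(p, 2.0, 1 + 1e-9) - (1 - r_alpha(p, 2.0))) < 1e-5
# alpha, q -> 1 recovers the Shannon capacity (79)
shannon = 1 + p * math.log2(p) + (1 - p) * math.log2(1 - p)
assert abs(bsc_capacity(p, 1 + 1e-7, 1 + 1e-7) - shannon) < 1e-5
print("limits of (80) check out")
```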

The *α*-*q*-capacity (80) can equivalently be expressed in

$$C_{\alpha,q}(BSC) = \mathrm{Log}_q 2 \ominus_q h_{\alpha,q}(p),\tag{82}$$

where the Sharma–Mittal binary entropy function is defined in

$$h_{\alpha,q}(p) = H_{\alpha,q}(p, 1-p) = \frac{1}{(1-q)\ln 2} \left( \left(p^\alpha + (1-p)^\alpha\right)^{\frac{1-q}{1-\alpha}} - 1 \right),\tag{83}$$

which reduces to the Rényi binary entropy function, in the case of *q* = 1,

$$h\_{a,1}(p) = \lim\_{q \to 1} h\_{a,q}(p) = R\_a(p, 1 - p) = \frac{1}{1 - a} \log(p^a + (1 - p)^a),\tag{84}$$

to the Tsallis binary entropy function, in the case of *α* = *q*,

$$h_{q,q}(p) = T_q(p, 1-p) = \frac{1}{(1-q)\ln 2} \left(p^{q} + (1-p)^{q} - 1\right),\tag{85}$$

to the Gaussian binary entropy function, in the case of *α* = 1,

$$h_{1,q}(p) = \lim_{\alpha \to 1} h_{\alpha,q}(p) = G_q(p, 1-p) = \frac{1}{(1-q)\ln 2} \left( p^{-(1-q)p}(1-p)^{-(1-q)(1-p)} - 1 \right),\tag{86}$$

and to the Shannon binary entropy function, in the case of *α* = *q* = 1,

$$h_{1,1}(p) = \lim_{q,\alpha \to 1} h_{\alpha,q}(p) = S(p, 1-p) = -p \log p - (1-p)\log(1-p). \tag{87}$$

The expression (82) can be interpreted similarly as in the Shannon case. Thus, a BSC with input *X* and output *Y* can be modeled with the input–output relation *Y* = *X* ⊕ *Z*, where ⊕ stands for the modulo-2 sum and *Z* is the channel noise taking values from {1, 0}, distributed in accordance with (*p*, 1 − *p*). If we measure the information which is lost per bit during transmission with the Sharma–Mittal entropy *Hα*,*q*(*Z*) = *hα*,*q*(*p*), then *Cα*,*q* stands for the useful information left over per bit of information received.
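The decomposition (82) can be checked numerically. In the sketch below (helper names are ours), the *q*-subtraction is taken as *x* ⊖*q* *y* = (*x* − *y*)/(1 + (1 − *q*)(ln 2) *y*), which is our reading of (6) under the base-2 normalization used throughout, chosen so that *ηq*(*a* − *b*) = *ηq*(*a*) ⊖*q* *ηq*(*b*):

```python
import math
LN2 = math.log(2)

def eta(x, q):
    """Monotone map eta_q carrying Rényi-type values to Sharma–Mittal-type values."""
    return x if q == 1 else (2**((1 - q) * x) - 1) / ((1 - q) * LN2)

def q_sub(x, y, q):
    """q-deformed subtraction matching eta_q: eta_q(a - b) = q_sub(eta_q(a), eta_q(b))."""
    return (x - y) / (1 + (1 - q) * LN2 * y)

def r_alpha(p, a):
    """Binary Rényi entropy function (78)."""
    return math.log2(p**a + (1 - p)**a) / (1 - a)

p, a, q = 0.2, 3.0, 0.4
log_q_2 = eta(1.0, q)             # Log_q 2 = eta_q(log2 2)
h_aq = eta(r_alpha(p, a), q)      # Sharma–Mittal binary entropy function
c_aq = eta(1 - r_alpha(p, a), q)  # alpha-q capacity of the BSC, Eq. (80)
assert abs(q_sub(log_q_2, h_aq, q) - c_aq) < 1e-10
print("C = Log_q 2 (-)_q h holds")
```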


**Table 1.** Instances of the *α*-*q*-mutual information for different values of the parameters and corresponding expressions for the BSC *α*-*q*-capacities.

**Figure 3.** The *α*-*q*-capacity of BSC for the Gaussian entropy (the case of *α* = 1) as a function of *q* for various values of the channel parameter *p*, from 0.5 (totally destructive channel) to 0 (perfect transmission). All of the curves lie between 0 and Log*q*2, which is the maximum value of the Gaussian entropy.

**Figure 4.** The *α*-*q*-capacity of BSC for the Tsallis entropy (the case of *α* = *q*) as a function of *q* for various values of the channel parameter *p*, from 0.5 (totally destructive channel) to 0 (perfect transmission). All of the curves lie between 0 and Log*q*2, which is the maximum value of the Tsallis entropy.

#### **7. An Overview of the Previous Approaches to Sharma–Mittal Information Transfer Measures**

In this section, we review the previous attempts at a definition of Sharma–Mittal information transfer measures, which are defined from the basic requirement of consistency with the Shannon measure as given by the axiom (*A*6). However, as we show in the following paragraphs, all of them break at least one of the axioms (*A*1)–(*A*5), which are satisfied in the case of the *α*-*q* (maximal) information transfer measures (69) and (73), in accordance with the discussion in Section 6.

#### *7.1. Daróczy's Capacity*

The first considerations of generalized channel capacities and generalized mutual information for the *q*-entropy go back to Daróczy [30], who introduced the conditional Tsallis entropy

$$\bar{T}_q(Y|X) = \sum_{x} P_X(x)^{q}\, T_q(Y|X=x), \tag{88}$$

where the row entropies are defined as in

$$T_q(Y|X=x) = \frac{1}{(1-q)\ln 2} \left(\sum_{y} P_{Y|X}(y|x)^q - 1\right) \tag{89}$$

and the mutual information is defined as in

$$J_q^5(X,Y) = T_q(Y) - \bar{T}_q(Y|X). \tag{90}$$

However, in the case of a totally destructive channel, *X* ⊥⊥ *Y*, we have *PY*|*X*(*y*|*x*) = *PY*(*y*) and *Tq*(*Y*|*X* = *x*) = *Tq*(*Y*), so that

$$\bar{T}_q(Y|X) = T_q(Y) \sum_{x} P_X(x)^{q}, \tag{91}$$

so that

$$J_q^5(X,Y) = T_q(Y) \left(1 - \sum_{x} P_X(x)^{q}\right) = \left(1 - \sum_{x} P_X(x)^{q}\right) \mathrm{Log}_q m.\tag{92}$$

This expression is zero for an input probability distribution *PX* = (1, 0, ... , 0) and its permutations, but, in general, it is negative for *q* < 1, positive for *q* > 1 and zero only for *q* = 1, so the axiom (*A*2) is broken (see Figure 5). As a result, the channel capacity, which is defined in accordance with (32), is zero for *q* ≤ 1 and positive for *q* > 1, as illustrated in Figure 6 by the example of a BSC, for which Daróczy's channel capacity can be computed as in [30,53]

$$C_q^5(BSC) = \frac{1 - 2^{1-q}}{q - 1} - \frac{2^{-q}}{q - 1} \left[1 - (1 - p)^q - p^q\right]. \tag{93}$$

In the same figure, we plotted the graph for the *α*-*q* channel capacities proposed in this paper, and all of them remain zero in the case of a totally destructive BSC, as expected.
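The sign behavior of (92) is easy to reproduce. The sketch below (helper names are ours) evaluates Daróczy's mutual information (90) for a totally destructive BSC:

```python
import math
LN2 = math.log(2)

def tsallis(p, q):
    """Tsallis entropy with the base-2 normalization of (89)."""
    return (sum(x**q for x in p if x > 0) - 1) / ((1 - q) * LN2)

def daroczy_mi(px, W, q):
    """Daróczy's mutual information (90): T_q(Y) minus the P_X(x)^q-weighted row entropies."""
    py = [sum(px[x] * W[x][y] for x in range(len(px))) for y in range(len(W[0]))]
    cond = sum(px[x]**q * tsallis(W[x], q) for x in range(len(px)))
    return tsallis(py, q) - cond

# totally destructive BSC: the output is independent of the input
W = [[0.5, 0.5], [0.5, 0.5]]
px = [0.3, 0.7]
print(daroczy_mi(px, W, 0.5))  # negative for q < 1, breaking (A2)
print(daroczy_mi(px, W, 2.0))  # positive for q > 1
```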

#### *7.2. Yamano Capacities*

Similar problems to the ones mentioned above arise in the case of mutual information and corresponding capacity measures considered by Yamano [33], who addressed the information transmission characterized by Landsberg–Vedral entropy *Lq*, given in (17).

Thus, the first proposal is based on the mutual information of the form

$$J_q^6(X,Y) = L_q(X) + L_q(Y) - L_q(X,Y),\tag{94}$$

where the joint entropy is defined in

$$L_q(X,Y) = \frac{1}{q-1} \left( \frac{1}{\sum_{x,y} P_{X,Y}(x,y)^q} - 1 \right). \tag{95}$$

However, in the case of a fully destructive channel, *PY*(*y*) = 1/*m* and *PX*,*Y*(*x*, *y*) = *PX*(*x*)/*m*, so that

$$J_q^6(X,Y) = \frac{1}{q-1} \left( \frac{1}{\sum_{x} P_X(x)^q} - 1 \right) + \frac{1}{q-1} \left( m^{q-1} - 1 \right) - \frac{1}{q-1} \left( \frac{m^{q-1}}{\sum_{x} P_X(x)^q} - 1 \right), \tag{96}$$

which can be simplified to

$$J_q^6(X,Y) = \frac{1 - m^{q-1}}{q - 1} \left( \frac{1}{\sum_{x} P_X(x)^q} - 1 \right). \tag{97}$$

Similarly to the case of Daróczy's capacity, this expression is zero for an input probability distribution *PX* = (1, 0, ... , 0) and its permutations but, in general, it is negative for *q* > 1, positive for *q* < 1 and zero only for *q* = 1, so the axiom (*A*2) is broken (see Figure 5). In Figure 6, we illustrate the Yamano channel capacity as a function of the parameter *q* in the case of a two-input channel with *PX* = [*a*, 1 − *a*]; the channel capacity is zero for *q* ≥ 1 (which is obtained for *PX* = [1, 0]), and

$$C\_q^\delta(BSC) = \frac{1}{q-1} \left( 2^q - 1 - 2^{2q-2} \right),\tag{98}$$

for *q* < 1 (which is obtained for *PX* = [1/2, 1/2]). In the same figure, we plotted the graph for the *α*-*q* channel capacities proposed in this paper and, as before, all of them remain zero in the case of a totally destructive BSC, as expected.
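A short sketch (helper names are ours) reproduces both the simplification (97) and the value (98) at the uniform input:

```python
def yamano_mi_destructive(px, m, q):
    """Eq. (96): Yamano's mutual information for a fully destructive channel
    with m equiprobable outputs, P_Y(y) = 1/m."""
    s = sum(x**q for x in px)
    return ((1 / s - 1) + (m**(q - 1) - 1) - (m**(q - 1) / s - 1)) / (q - 1)

q, m = 0.5, 2
px = [0.3, 0.7]
s = sum(x**q for x in px)
closed = (1 - m**(q - 1)) / (q - 1) * (1 / s - 1)   # simplification (97)
assert abs(yamano_mi_destructive(px, m, q) - closed) < 1e-12
# for q < 1 the maximum is attained at the uniform input, giving (98)
c98 = (2**q - 1 - 2**(2 * q - 2)) / (q - 1)
assert abs(yamano_mi_destructive([0.5, 0.5], m, q) - c98) < 1e-12
print("(97) and (98) reproduced")
```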

Further attempts were made in [33], where the mutual information is defined in an analogous manner to (66), with the generalized divergence measure introduced in [54]. Thus, the alternative measure for mutual information is defined in

$$J_q^7(X,Y) = \frac{1}{(1-q)\ln 2}\, \frac{1}{\sum_{x,y} P_{X,Y}(x,y)^{q}} \left[1 - \sum_{x,y} P_{X,Y}(x,y) \left(\frac{P_X(x)P_Y(y)}{P_{X,Y}(x,y)}\right)^{1-q}\right].\tag{99}$$

However, in the case of the simplest perfect communication channel for which *X* = *Y*, the mutual information reduces to

$$J_q^7(X,Y) = \frac{1}{(1-q)\ln 2}\, \frac{1 - \sum_{x} P_X(x)^{2-q}}{\sum_{x} P_X(x)^q} \neq L_q(X),\tag{100}$$

which breaks the axiom (*A*3).

**Figure 5.** Daróczy's (solid lines) and Yamano's (dashed lines) mutual information in the case of a totally destructive BSC as functions of the input distribution parameter *a*, *PX* = [*a*, 1 − *a*] *<sup>T</sup>*, for different values of *q*, taking negative values for *q* < 1 and *q* > 1, respectively, and breaking the axioms (*A*1) and (*A*2). The *α*-*q*-mutual information is zero for all *q* and satisfies (*A*1) and (*A*2).

**Figure 6.** Daróczy's (solid lines) and Yamano's (dashed lines) capacities in the case of a totally destructive BSC as functions of the parameter *q*. In the regions of *q* < 1 and *q* > 1, respectively, the corresponding negative mutual information is maximized for *PX* = [1, 0] *<sup>T</sup>* (zero capacity), while taking positive values outside these regions and breaking the axiom (*A*2). The *α*-*q*-capacity is zero for all *q* and satisfies (*A*2).

#### *7.3. Landsberg–Vedral Capacities*

To avoid these problems, Landsberg and Vedral [4] proposed the mutual information measure and related channel capacities for the Sharma–Mittal entropy class *Hα*,*q*, particularly considering the choice of *q* = *α*, which corresponds to the Tsallis entropy, *q* = 2 − *α*, which corresponds to the Landsberg–Vedral entropy, and *q* = 1, which corresponds to the Rényi entropy,

$$J_{\alpha,q}^{8}(X,Y) = H_{\alpha,q}(Y) - \tilde{H}_{\alpha,q}(Y|X),\tag{101}$$

where the conditional entropy *H̃α*,*q*(*Y*|*X*) is defined as in

$$\tilde{H}_{\alpha,q}(Y|X) = \sum_{x} P_X(x)\, H_{\alpha,q}(Y|X=x) \tag{102}$$

and

$$H_{\alpha,q}(Y|X=x) = \frac{1}{(1-q)\ln 2} \left( \left( \sum_{y} P_{Y|X}(y|x)^{\alpha} \right)^{\frac{1-q}{1-\alpha}} - 1 \right). \tag{103}$$

Although this definition bears some similarities to the *α*-*q* mutual information proposed in (69), several key differences can be observed. First of all, it characterizes the information transfer as the output uncertainty reduction after the input symbols are known, instead of the input uncertainty reduction after the output symbols are known (42). Secondly, it uses the ordinary − operation instead of the ⊖*q* one. Finally, note that the definition of conditional entropy (102) generally differs from the definition proposed in (70).

The definition (101) resolves the issue with the axiom (*A*2) which appears in the case of Daróczy's capacity, since, in the case of a totally destructive channel (*X* ⊥⊥ *Y*), *PY*|*X*(*y*|*x*) = *PY*(*y*), *Hα*,*q*(*Y*|*X* = *x*) = *Hα*,*q*(*Y*) and *H̃α*,*q*(*Y*|*X*) = *Hα*,*q*(*Y*), so that *J*<sup>8</sup>*α*,*q*(*X*,*Y*) = 0. However, the problems remain with the axiom (*A*5), which can be observed in the case of a noisy channel with non-overlapping outputs if the number of channel inputs is lower than the number of channel outputs, *n* < *m*. Indeed, in the case of a noisy channel with non-overlapping outputs given by the transition matrix (27), both of the row entropies *Hα*,*q*(*Y*|*X* = *x*) have the same value, which is independent of *x*,

$$H_{\alpha,q}(Y|X=x) = \frac{k^{1-q}-1}{(1-q)\ln 2} = \mathrm{Log}_q k, \quad \text{for} \quad x = x_1, x_2, \tag{104}$$

and the maximal value of the Landsberg–Vedral mutual information (101) is obtained only by maximizing *Hα*,*q*(*Y*) over *PX*, which is achieved if *X* is uniformly distributed, since in this case *Y* is uniformly distributed as well (*a* = 1/2 in (28)), so the maximal value of the output entropy is *Hα*,*q*(*Y*) = Log*q*(2*k*) and the mutual information is maximized for

$$C_{\alpha,q}^{8}(NOC) = \mathrm{Log}_q(2k) - \mathrm{Log}_q(k),\tag{105}$$

which is greater than Log*q*(2) for *k* ≥ 2, i.e., for *m* ≥ 4 outputs, so the axiom (*A*5) is broken, which is illustrated in Figure 7.
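The super-capacity in (105) can be confirmed numerically for *q* < 1 (for *q* > 1 the value falls below Log*q*2 instead, so the capacity deviates from Log*q*2 in either case). A quick check, assuming Log*q* *m* = (*m*<sup>1−*q*</sup> − 1)/((1 − *q*) ln 2):

```python
import math

def log_q(m, q):
    """q-deformed logarithm Log_q m = eta_q(log2 m)."""
    return math.log2(m) if q == 1 else (m**(1 - q) - 1) / ((1 - q) * math.log(2))

# n = 2 inputs, m = 2k non-overlapping outputs: Eq. (105)
for q in (0.3, 0.7):
    for k in (2, 3, 8):
        c_lv = log_q(2 * k, q) - log_q(k, q)
        assert c_lv > log_q(2, q)   # super-capacity for q < 1: axiom (A5) broken
print("super-capacity confirmed for q < 1 and k >= 2")
```

Note that Log*q*(2*k*) − Log*q*(*k*) = *k*<sup>1−*q*</sup> Log*q*2, which makes the deviation from Log*q*2 for *k* ≥ 2, *q* ≠ 1 explicit.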

#### *7.4. Chapeau-Blondeau–Delahaies–Rousseau Capacities*

Following a similar approach to the one in Section 5.4, Chapeau-Blondeau, Delahaies and Rousseau considered the definition of the mutual information which corresponds to the Tsallis entropy. Using the Tsallis divergence,

$$D_{q,q}(P \| Q) = \frac{1}{(q-1)\ln 2} \left( \sum_{x} P(x)^q\, Q(x)^{1-q} - 1 \right),\tag{106}$$

it can be written in

$$J_q^{9}(X,Y) = D_{q,q}(P_{X,Y} \| P_X P_Y) = \eta_q\left( D_q(P_{X,Y} \| P_X P_Y) \right) = \frac{1}{(1-q)\ln 2} \left( 1 - \sum_{x,y} P_{X,Y}(x,y)^{q}\, P_X(x)^{1-q} P_Y(y)^{1-q} \right). \tag{107}$$

However, this definition is not directly applicable as a measure of information transfer for the Tsallis entropy with index *q*, since in the case of *X* = *Y* it reduces to *J*<sup>9</sup>*q*(*X*,*Y*) = *T*2−*q*(*X*) and requires the reparametrization *q* ↔ 2 − *q*, similarly to Section 5.4, while the satisfaction of the axioms (*A*4) and (*A*5) is not self-evident.

**Figure 7.** Landsberg–Vedral capacities for the Tsallis (solid lines) and the Landsberg–Vedral (dashed lines) entropies in the case of a (perfect) noisy channel with non-overlapping outputs with *m* outputs as functions of *q*, for different values of *m*. The axiom (*A*5) is broken for all *m* > 2 and satisfied in the case of the corresponding *α*-*q*-capacities, *Cq*,*q* and *Cq*,2−*q*.

#### **8. Conclusions and Future Work**

A general treatment of the Sharma–Mittal entropy transfer was provided together with the analyses of existing information transfer measures for the non-additive Sharma–Mittal information transfer. It was shown that the existing definitions fail to satisfy at least one of the axioms common to the Shannon case, by which the information transfer has to be non-negative, less than the input and output uncertainty, equal to the input uncertainty in the case of perfect transmission and equal to zero in the case of a totally destructive channel. Thus, breaking some of these axioms implies unexpected and counterintuitive conclusions about the channels, such as achieving super-capacitance or sub-capacitance [4], which could be treated as nonphysical behavior. In this paper, alternative measures of the *α*-*q* mutual information and the *α*-*q* channel capacity were proposed so that all of the axioms which are broken in the case of the Sharma–Mittal information transfer measures considered before are satisfied, which could qualify them as physically consistent measures of information transfer.

Taking into account the previous research on non-extensive statistical mechanics [3], where the linear growth of the physical quantities has been recognized as a critical property in non-extensive [55] and non-exponentially growing systems [56], and taking into account the previous research from the field of information theory, where the Sharma–Mittal entropy has been considered an appropriate scaling measure which provides extensive information rates [21], the *α*-*q* mutual information and the *α*-*q* channel capacity seem to be promising measures for the characterization of information transmission in systems where the Shannon entropy rate diverges or disappears in the infinite time limit. In addition, as was shown in this paper, the proposed information transfer measures are compatible with maximum likelihood detection, which indicates their potential for the operational characterization of coding theory and hypothesis testing problems [26].

**Author Contributions:** Conceptualization, V.M.I. and I.B.D.; validation, V.M.I. and I.B.D.; formal analysis, V.M.I.; funding acquisition, I.B.D.; project administration, I.B.D.; writing—original draft preparation, V.M.I.; writing—review and editing, V.M.I. and I.B.D. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was supported in part by NSF under grants 1907918 and 1828132 and by Ministry of Science and Technological Development, Republic of Serbia, Grants Nos. ON 174026 and III 044006. The APC was funded by NSF under grants 1907918 and 1828132.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.
