**Information Theory and Machine Learning**

Editors

**Lizhong Zheng Chao Tian**

MDPI • Basel • Beijing • Wuhan • Barcelona • Belgrade • Manchester • Tokyo • Cluj • Tianjin

*Editors* Lizhong Zheng Massachusetts Institute of Technology USA

Chao Tian Texas A&M University USA

*Editorial Office* MDPI St. Alban-Anlage 66 4052 Basel, Switzerland

This is a reprint of articles from the Special Issue published online in the open access journal *Entropy* (ISSN 1099-4300) (available at: https://www.mdpi.com/journal/entropy/special_issues/inf_learn).

For citation purposes, cite each article independently as indicated on the article page online and as indicated below:

LastName, A.A.; LastName, B.B.; LastName, C.C. Article Title. *Journal Name* **Year**, *Volume Number*, Page Range.

**ISBN 978-3-0365-5307-8 (Hbk) ISBN 978-3-0365-5308-5 (PDF)**

© 2022 by the authors. Articles in this book are Open Access and distributed under the Creative Commons Attribution (CC BY) license, which allows users to download, copy and build upon published articles, as long as the author and publisher are properly credited, which ensures maximum dissemination and a wider impact of our publications.

The book as a whole is distributed by MDPI under the terms and conditions of the Creative Commons license CC BY-NC-ND.


## **Preface to "Information Theory and Machine Learning"**

The recent successes of machine learning, especially regarding systems based on deep neural networks, have encouraged further research activities and raised a new set of challenges in understanding and designing complex machine learning algorithms. New applications require learning algorithms to be distributed, have transferable learning results, use computation resources efficiently, converge quickly in online settings, have performance guarantees, satisfy fairness or privacy constraints, and incorporate domain knowledge on model structures. A new wave of developments in statistical learning theory and information theory has set out to address these challenges. This Special Issue, "Machine Learning and Information Theory", aims to collect recent results in this direction, reflecting a diverse spectrum of visions and efforts to extend conventional theories and develop analysis tools for these complex machine learning systems. We would like to thank all contributing authors and the MDPI team for their support.

> **Lizhong Zheng and Chao Tian** *Editors*

## *Article* **Improved Information-Theoretic Generalization Bounds for Distributed, Federated, and Iterative Learning †**

**Leighton Pate Barnes 1,\*, Alex Dytso <sup>2</sup> and Harold Vincent Poor <sup>1</sup>**


**Abstract:** We consider information-theoretic bounds on the expected generalization error for statistical learning problems in a network setting. In this setting, there are *K* nodes, each with its own independent dataset, and the models from the *K* nodes have to be aggregated into a final centralized model. We consider both simple averaging of the models as well as more complicated multi-round algorithms. We give upper bounds on the expected generalization error for a variety of problems, such as those with Bregman divergence or Lipschitz continuous losses, that demonstrate an improved dependence of 1/*K* on the number of nodes. These "per node" bounds are in terms of the mutual information between the training dataset and the trained weights at each node and are therefore useful in describing the generalization properties inherent to having communication or privacy constraints at each node.

**Keywords:** generalization error; information-theoretic bounds; distributed and federated learning

#### **1. Introduction**

A key feature of machine learning systems is their ability to generalize to new and unknown data. Such a system is trained on a particular set of data but must then perform well even on new data points that have not previously been considered. This ability, termed generalization, can be formulated in the language of statistical learning theory by considering the generalization error of an algorithm (i.e., the difference between the population risk of a model trained on a particular dataset and the empirical risk for the same model and dataset). We say that a model generalizes well if it has a small generalization error, and because models are often trained by minimizing empirical risk or some regularized version of it, a small generalization error also implies a small population risk, which is the average loss over new samples taken randomly from the population. It is therefore of interest to find an upper bound on the generalization error and understand which quantities control it so that we can quantify the generalization properties of a machine learning system and offer guarantees about its performance.

In recent years, it has been shown that information-theoretic measures such as mutual information can be used to derive generalization error bounds under assumptions on the tail of the distribution of the loss function [1–4]. In particular, when the loss function is sub-Gaussian, the expected generalization error can scale at most with the square root of the mutual information between the training dataset and the model weights [2]. Such bounds offer an intuitive explanation for generalization and overfitting: if an algorithm uses only limited information from its training data, then this will bound the expected generalization error and prevent overfitting. Conversely, if an algorithm uses all of the information from its training set, in the sense that the model is a deterministic function of the training set, then this mutual information can be infinite, and there is the possibility of overfitting.

**Citation:** Barnes, L.P.; Dytso, A.; Poor, H.V. Improved Information-Theoretic Generalization Bounds for Distributed, Federated, and Iterative Learning. *Entropy* **2022**, *24*, 1178. https://doi.org/10.3390/e24091178

Academic Editors: Chao Tian and Lizhong Zheng

Received: 25 July 2022 Accepted: 20 August 2022 Published: 24 August 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Another modern focus of machine learning systems has been that of distributed and federated learning [5–7]. In these systems, data are generated and processed in a distributed network of machines. The main differences between the distributed and centralized settings are the information constraints imposed by the network. There has been considerable interest in understanding the impact of both communication constraints [8,9] and privacy constraints [10–13] on the performance of machine learning systems, as well as designing protocols that efficiently train the systems under these constraints.

Since both communication and local differential privacy constraints can be thought of as special cases of mutual information constraints, they should pair naturally with some form of information theoretic generalization bounding in order to induce control over the generalization error of the distributed machine learning system. The information constraints inherent to the network can themselves give rise to tighter bounds on generalization error and thus provide better guarantees against overfitting. Along these lines, in a recent work [14], a subset of the present authors introduced the framework of using information theoretic quantities for bounding both the expected generalization error and a measure of privacy leakage in distributed and federated learning systems. The generalization bounds in this work, however, are essentially the same as those obtained by thinking of the entire system, from the data at each node in the network to the final aggregated model, as a single, centralized algorithm. Any improved generalization guarantees from these bounds would remain implicit in the mutual information terms involved.

In this work, we develop improved bounds on the expected generalization error for distributed and federated learning systems. Instead of leaving the differences between these systems and their centralized counterparts implicit in the mutual information terms, we bring analysis of the structure of the systems directly to the bounds. By working with the contribution from each node separately, we are able to derive upper bounds on the expected generalization error that scale with the number of nodes $K$ as $O(1/K)$ instead of $O(1/\sqrt{K})$. This improvement is shown to be tight for certain examples, such as learning the mean of a Gaussian distribution with quadratic loss. We develop bounds that apply to distributed systems in which the submodels from $K$ different nodes are averaged together, as well as bounds that apply to more complicated multi-round stochastic gradient descent (SGD) algorithms, such as in federated learning. For linear models with Bregman divergence losses, these "per node" bounds are in terms of the mutual information between the training dataset and the trained weights at each node and are therefore useful in describing the generalization properties inherent to having communication or privacy constraints at each node. For arbitrary nonlinear models that have Lipschitz continuous losses, the improved dependence of $O(1/K)$ can still be recovered but without a description in terms of mutual information. We demonstrate the improvements given by our bounds over the existing information theoretic generalization bounds via simulation of a distributed linear regression example. A preliminary conference version of this paper was presented in [15]. The present paper completes the work by including all of the missing proof details as well as providing new bounds for noisy SGD in Corollary 4.

#### *Technical Preliminaries*

Suppose we have independent and identically distributed (i.i.d.) data $Z\_i \sim \pi$ for $i = 1, \ldots, n$, and let $S = (Z\_1, \ldots, Z\_n)$. Suppose further that $W = \mathcal{A}(S)$ is the output of a potentially stochastic algorithm. Let $\ell(W, Z)$ be a real-valued loss function and define

$$L(w) = \mathbb{E}\_{\pi}[\ell(w, Z)]$$

to be the population risk for weights (or model) *w*. We similarly define

$$L\_s(w) = \frac{1}{n} \sum\_{i=1}^n \ell(w, z\_i).$$

to be the empirical risk on dataset *s* for model *w*. The generalization error for dataset *s* is then

$$\Delta\_{\mathcal{A}}(s) = L(\mathcal{A}(s)) - L\_s(\mathcal{A}(s))$$

In addition, the expected generalization error is

$$\mathbb{E}\_{\mathcal{S}\sim\pi^{n}}[\Delta\_{\mathcal{A}}(\mathcal{S})] = \mathbb{E}\_{\mathcal{S}\sim\pi^{n}}[L(\mathcal{A}(\mathcal{S})) - L\_{\mathcal{S}}(\mathcal{A}(\mathcal{S}))] \tag{1}$$

where the expectation is also over any randomness in the algorithm. Below, we present some standard results for the expected generalization error that will be needed:

**Theorem 1** (Leave-One-Out Expansion; Lemma 11 in [16])**.** *Let* $S^{(i)} = (Z\_1, \ldots, Z'\_i, \ldots, Z\_n)$ *be a version of* $S$ *with* $Z\_i$ *replaced by an i.i.d. copy* $Z'\_i$*. Denote* $S' = (Z'\_1, \ldots, Z'\_n)$*. Then, we have*

$$\mathbb{E}\_{S\sim\pi^n}[\Delta\_{\mathcal{A}}(S)] = \frac{1}{n} \sum\_{i=1}^n \mathbb{E}\_{S,S'} [\ell(\mathcal{A}(S), Z'\_i) - \ell(\mathcal{A}(S^{(i)}), Z'\_i)] \,.$$

**Proof.** Observe that

$$\mathbb{E}\_{S\sim\pi^{n}}[L(\mathcal{A}(S))] = \mathbb{E}\_{S,S'}[\ell(\mathcal{A}(S),Z'\_i)] \tag{2}$$

for each *i* and that

$$\begin{split} \mathbb{E}\_{S \sim \pi^n} [L\_S(\mathcal{A}(S))] &= \frac{1}{n} \sum\_{i=1}^n \mathbb{E}\_{S \sim \pi^n} [\ell(\mathcal{A}(S), Z\_i)] \\ &= \frac{1}{n} \sum\_{i=1}^n \mathbb{E}\_{S, S' \sim \pi^n} [\ell(\mathcal{A}(S^{(i)}), Z'\_i)] \,. \end{split} \tag{3}$$

Putting Equations (2) and (3) together with (1) yields the result.

In many of the results in this paper, we will use one of the two following assumptions:

**Assumption 1.** *The loss function* $\ell(\tilde{W}, \tilde{Z})$ *satisfies*

$$\log \mathbb{E}\left[\exp\left(\lambda\left(\ell(\tilde{W},\tilde{Z}) - \mathbb{E}[\ell(\tilde{W},\tilde{Z})]\right)\right)\right] \le \psi(-\lambda)$$

*for* $\lambda \in (b, 0]$*,* $\psi(0) = \psi'(0) = 0$*, where* $\tilde{W}$ *and* $\tilde{Z}$ *are taken independently from the marginals for* $W$ *and* $Z$*, respectively.*

The next assumption is a special case of the previous one with $\psi(\lambda) = \frac{R^2\lambda^2}{2}$:

**Assumption 2.** *The loss function* $\ell(\tilde{W}, \tilde{Z})$ *is sub-Gaussian with parameter* $R^2$ *in the sense that*

$$\log \mathbb{E}\left[\exp\left(\lambda\left(\ell(\tilde{W},\tilde{Z}) - \mathbb{E}[\ell(\tilde{W},\tilde{Z})]\right)\right)\right] \le \frac{R^2 \lambda^2}{2} \,.$$

**Theorem 2** (Theorem 2 in [3])**.** *Under Assumption 1, we have*

$$\mathbb{E}\_{S \sim \pi^n} [\Delta\_{\mathcal{A}}(S)] \le \frac{1}{n} \sum\_{i=1}^n \psi^{\*-1} (I(W; Z\_i)) \,,$$

*where* $\psi^{\*-1}(y) = \inf\_{\lambda\in[0,b)} \frac{y+\psi(\lambda)}{\lambda}$*.*
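To make the inverse $\psi^{\*-1}$ concrete, the following minimal Python sketch (an illustration, not part of the original paper) evaluates the infimum in its definition on a grid for the sub-Gaussian choice $\psi(\lambda) = R^2\lambda^2/2$ of Assumption 2 and compares it with the closed form $\sqrt{2R^2 y}$ obtained by setting the derivative in $\lambda$ to zero. The function names, the grid search, the truncation point `lam_max`, and the choice of an unbounded range for $\lambda$ are assumptions made for illustration.

```python
import math


def psi_subgaussian(lam, R):
    """psi(lambda) = R^2 * lambda^2 / 2, the sub-Gaussian case of Assumption 2."""
    return (R ** 2) * (lam ** 2) / 2.0


def psi_star_inv_numeric(y, R, lam_max=10.0, steps=100_000):
    """Approximate inf over lambda in (0, lam_max] of (y + psi(lambda)) / lambda.

    lam_max and steps are illustrative grid parameters; lam_max must be large
    enough to contain the minimizer sqrt(2 * y) / R.
    """
    best = math.inf
    for j in range(1, steps + 1):
        lam = lam_max * j / steps
        best = min(best, (y + psi_subgaussian(lam, R)) / lam)
    return best


def psi_star_inv_closed_form(y, R):
    """Closed form sqrt(2 * R^2 * y) for the sub-Gaussian case."""
    return math.sqrt(2.0 * R ** 2 * y)


if __name__ == "__main__":
    y, R = 0.3, 1.5
    print(psi_star_inv_numeric(y, R), psi_star_inv_closed_form(y, R))
```

Under Assumption 2, this is why the bound in Theorem 2 takes the familiar square-root form in the mutual information.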

For a continuously differentiable and strictly convex function $F : \mathbb{R}^m \to \mathbb{R}$, we define the associated Bregman divergence [17,18] between two points $p, q \in \mathbb{R}^m$ to be

$$D\_F(p,q) = F(p) - F(q) - \langle \nabla F(q), p - q \rangle \,,$$

where $\langle \cdot, \cdot \rangle$ denotes the usual inner product.
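As a small illustration of this definition, the sketch below computes a Bregman divergence from a user-supplied $F$ and its gradient, and instantiates it with the squared Euclidean norm, for which the divergence reduces to the squared distance between the two points. The helper names (`bregman_divergence`, `sq_norm`, `grad_sq_norm`) are hypothetical and chosen only for this example.

```python
def bregman_divergence(F, grad_F, p, q):
    """D_F(p, q) = F(p) - F(q) - <grad F(q), p - q> for vectors given as lists."""
    inner = sum(g * (pi - qi) for g, pi, qi in zip(grad_F(q), p, q))
    return F(p) - F(q) - inner


def sq_norm(p):
    """F(p) = ||p||_2^2, a strictly convex and continuously differentiable choice."""
    return sum(x * x for x in p)


def grad_sq_norm(p):
    """Gradient of ||p||_2^2, namely 2p."""
    return [2.0 * x for x in p]


if __name__ == "__main__":
    p, q = [1.0, 2.0, 3.0], [0.5, -1.0, 2.0]
    d_f = bregman_divergence(sq_norm, grad_sq_norm, p, q)
    squared_distance = sum((a - b) ** 2 for a, b in zip(p, q))
    print(d_f, squared_distance)  # the two values agree for this choice of F
```

Other strictly convex choices of `F` can be passed in the same way, provided a matching gradient function is supplied.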

#### **2. Distributed Learning and Model Aggregation**

Now suppose that there are $K$ nodes each having $n$ samples. Each node $k = 1, \ldots, K$ has a dataset $S\_k = (Z\_{1,k}, \ldots, Z\_{n,k})$, with $Z\_{i,k}$ taken i.i.d. from $\pi$. We use $S = (S\_1, \ldots, S\_K)$ to denote the entire dataset of size $nK$. Each node locally trains a model $W\_k = \mathcal{A}\_k(S\_k)$ with algorithm $\mathcal{A}\_k$. After each node locally trains its model, the models $W\_k$ are then combined to form the final model $\hat{W}$ using an aggregation algorithm $\hat{W} = \hat{\mathcal{A}}(W\_1, \ldots, W\_K)$ (see Figure 1). In this section, we will assume that $W\_k \in \mathbb{R}^d$ and that the aggregation is performed by simple averaging (i.e., $\hat{W} = \frac{1}{K} \sum\_{k=1}^K W\_k$). Define $\mathcal{A}$ to be the total algorithm from the data $S$ to the final weights $\hat{W}$ such that $\hat{W} = \mathcal{A}(S)$. In this section, if we say that Assumption 1 or 2 holds, we mean that it holds for each algorithm $\mathcal{A}\_k$. As in Theorem 1, we use $S^{(i,k)}$ to denote the entire dataset $S$ with sample $Z\_{i,k}$ replaced by an independent copy $Z'\_{i,k}$, and similarly, we use $S\_k^{(i)}$ to refer to the sub-dataset at node $k$, with sample $Z\_{i,k}$ replaced by an independent copy $Z'\_{i,k}$.

**Figure 1.** The distributed learning setting with model aggregation.

**Theorem 3.** *Suppose that* $\ell(\cdot, z)$ *is a convex function of* $w \in \mathbb{R}^d$ *for each* $z$ *and that* $\mathcal{A}\_k$ *represents the empirical risk minimization algorithm on local dataset* $S\_k$ *in the sense that*

$$W\_k = \mathcal{A}\_k(S\_k) = \operatorname\*{argmin}\_{w} \sum\_{i=1}^n \ell(w, Z\_{i,k}) \,.$$

*Then, we have*

$$\Delta\_{\mathcal{A}}(s) \le \frac{1}{K} \sum\_{k=1}^{K} \Delta\_{\mathcal{A}\_k}(s\_k) \,.$$

**Proof.**

$$\begin{align} \Delta\_{\mathcal{A}}(s) &= \mathbb{E}\_{Z \sim \pi} [\ell(\mathcal{A}(s), Z)] - \frac{1}{nK} \sum\_{i,k} \ell(\mathcal{A}(s), z\_{i,k}) \notag \\ &= \mathbb{E}\_{Z \sim \pi} \Big[ \ell \Big( \frac{1}{K} \sum\_{k=1}^{K} w\_k, Z \Big) \Big] - \frac{1}{nK} \sum\_{i,k} \ell(\mathcal{A}(s), z\_{i,k}) \notag \\ &\leq \frac{1}{K} \sum\_{k=1}^{K} \mathbb{E}\_{Z \sim \pi} [\ell(w\_k, Z)] - \frac{1}{nK} \sum\_{i,k} \ell(\mathcal{A}(s), z\_{i,k}) \tag{4} \\ &\leq \frac{1}{K} \sum\_{k=1}^{K} \mathbb{E}\_{Z \sim \pi} [\ell(w\_k, Z)] - \frac{1}{K} \sum\_{k=1}^{K} \min\_{w} \frac{1}{n} \sum\_{i=1}^{n} \ell(w, z\_{i,k}) \tag{5} \\ &= \frac{1}{K} \sum\_{k=1}^{K} \Delta\_{\mathcal{A}\_k}(s\_k) \,. \notag \end{align}$$

In the above display, Equation (4) follows by the convexity of $\ell$ via Jensen's inequality, and Equation (5) follows by minimizing the empirical risk over each node's local dataset, which exactly corresponds to what each node's local algorithm $\mathcal{A}\_k$ does.

While Theorem 3 seems to be a nice characterization of the generalization bounds for the aggregate model (in that the aggregate generalization error cannot be any larger than the average of the generalization errors over the nodes), it does not offer any improvement in the expected generalization error that one might expect when given $nK$ total samples instead of just $n$ samples. A naive application of the generalization bounds from Theorem 2, followed by the data processing inequality $I(\hat{W}; Z\_{i,k}) \le I(W\_k; Z\_{i,k})$, runs into the same problem.

#### *2.1. Improved Bounds*

In this subsection, we demonstrate bounds on the expected generalization error that remedy the above shortcomings. In particular, we would like to demonstrate the following two properties:


At a high level, we will improve on the bound from Theorem 3 by taking into account the fact that a small change in $S\_k$ will only change $\hat{W}$ by a fraction $\frac{1}{K}$ of the amount that it will change $W\_k$. In the case where *W* is a linear or location model, and the loss is a Bregman divergence, we can obtain an upper bound on the expected generalization error that satisfies properties (1) and (2) as follows:

**Theorem 4** (Linear or Location Models with Bregman Loss)**.** *Suppose the loss takes the form of one of the following:*


*In addition, assume that Assumption 1 holds. Then, we have*

$$\mathbb{E}\_{S\sim\pi^{nK}}[\Delta\_{\mathcal{A}}(S)] = \frac{1}{K^2} \sum\_{k=1}^{K} \mathbb{E}\_{S\_k \sim \pi^n} [\Delta\_{\mathcal{A}\_k}(S\_k)]$$

*and*

$$\begin{aligned} \mathbb{E}\_{S \sim \pi^{nK}}[\Delta\_{\mathcal{A}}(S)] &\leq \frac{1}{nK^2} \sum\_{i,k} \psi^{\*-1}(I(W\_k; Z\_{i,k})) \\ &\leq \frac{1}{K^2} \sum\_{k=1}^K \psi^{\*-1}\left(\frac{I(W\_k; S\_k)}{n}\right) . \end{aligned}$$

**Proof.** Here, we restrict our attention to case (ii), but the two cases have nearly identical proofs. Using Theorem 1, we have

$$\begin{align} \mathbb{E}\_{S\sim\pi^{nK}}[\Delta\_{\mathcal{A}}(S)] &= \frac{1}{nK} \sum\_{i,k} \mathbb{E}\_{S,S'}\Big[\ell(\mathcal{A}(S), Z'\_{i,k}) - \ell(\mathcal{A}(S^{(i,k)}), Z'\_{i,k})\Big] \notag \\ &= \frac{1}{nK} \sum\_{i,k} \mathbb{E}\_{S,S'}\Big[F(\mathcal{A}(S)) - F(Z'\_{i,k}) - \langle \nabla F(Z'\_{i,k}), \mathcal{A}(S) - Z'\_{i,k} \rangle \notag \\ &\qquad\qquad\qquad - F(\mathcal{A}(S^{(i,k)})) + F(Z'\_{i,k}) + \langle \nabla F(Z'\_{i,k}), \mathcal{A}(S^{(i,k)}) - Z'\_{i,k} \rangle\Big] \notag \\ &= \frac{1}{nK} \sum\_{i,k} \mathbb{E}\_{S,S'}\Big[\langle \nabla F(Z'\_{i,k}), \mathcal{A}(S^{(i,k)}) - \mathcal{A}(S) \rangle\Big] \tag{6} \\ &= \frac{1}{nK} \sum\_{i,k} \mathbb{E}\_{S,S'}\Big[\Big\langle \nabla F(Z'\_{i,k}), \frac{1}{K} W\_k^{(i)} + \frac{1}{K} \sum\_{j \neq k} W\_j - \frac{1}{K} \sum\_{j} W\_j \Big\rangle\Big] \notag \\ &= \frac{1}{nK^2} \sum\_{i,k} \mathbb{E}\_{S,S'}\Big[\langle \nabla F(Z'\_{i,k}), W\_k^{(i)} - W\_k \rangle\Big] \,. \tag{7} \end{align}$$

In Equation (7), we use $W\_k^{(i)}$ to denote $\mathcal{A}\_k(S\_k^{(i)})$. Equation (6) follows from the linearity of the inner product and from canceling the higher order terms $F(\mathcal{A}(S))$ and $F(\mathcal{A}(S^{(i,k)}))$, which have the same expected values. The key step in Equation (7) then follows by noting that $\mathcal{A}(S^{(i,k)})$ only differs from $\mathcal{A}(S)$ in the submodel coming from node $k$, which is multiplied by a factor of $\frac{1}{K}$ when averaging all of the submodels. By backing out of Equation (6) and re-adding the appropriate canceled terms, we get

$$\mathbb{E}\_{S\sim\pi^{nK}}[\Delta\_{\mathcal{A}}(S)] = \frac{1}{K^2} \sum\_{k=1}^{K} \mathbb{E}\_{S\_k \sim \pi^n} [\Delta\_{\mathcal{A}\_k}(S\_k)] \,.$$

By applying Theorem 2, this yields

$$\mathbb{E}\_{S\sim\pi^{nK}}[\Delta\_{\mathcal{A}}(S)] \leq \frac{1}{nK^{2}} \sum\_{i,k} \psi^{\*-1}(I(W\_{k}; Z\_{i,k})) \,.$$

Then, by noting that $\psi^{\*-1}$ is non-decreasing and concave, we have

$$\frac{1}{nK^2} \sum\_{i,k} \psi^{\*-1}(I(W\_k; Z\_{i,k})) \le \frac{1}{K^2} \sum\_{k=1}^K \psi^{\*-1} \left( \sum\_{i=1}^n \frac{I(W\_k; Z\_{i,k})}{n} \right) .$$

Using the property that conditioning decreases entropy yields

$$\sum\_{i=1}^{n} I(W\_k; Z\_{i,k}) \le I(W\_k; S\_k) \,,$$

and we have

$$\frac{1}{K^2} \sum\_{k=1}^K \psi^{\*-1} \left( \sum\_{i=1}^n \frac{I(W\_k; Z\_{i,k})}{n} \right) \le \frac{1}{K^2} \sum\_{k=1}^K \psi^{\*-1} \left( \frac{I(W\_k; S\_k)}{n} \right)$$

as desired.

The result in Theorem 4 is general enough to apply to many problems of interest. For example, if $F(p) = \lVert p \rVert\_2^2$, then the Bregman divergence $D\_F$ gives the ubiquitous squared $\ell\_2$ loss (i.e., $D\_F(p, q) = \lVert p - q \rVert\_2^2$). For a comprehensive list of realizable loss functions, the interested reader is referred to [19]. Using $F$ above, Theorem 4 can be applied to ordinary least squares regression, which we will examine in greater detail in Section 4. Other regression models such as logistic regression have loss functions that cannot be described with a Bregman divergence without the inclusion of additional nonlinearity. However, the result in Theorem 4 is agnostic to the algorithm that each node uses to fit its individual model. In this way, each node could fit a logistic model to its data, and the total aggregate model would then be an average over these logistic models. Theorem 4 would still control the expected generalization error for the aggregate model with the extra $\frac{1}{K}$ factor. However, critically, the upper bound would only be for the generalization error that is with respect to a loss of the form $D\_F(w^T x, y)$, such as quadratic loss.

In order to show that the dependence on the number of nodes $K$ from Theorem 4 is tight for certain problems, consider the following example from [3]. Suppose that $Z \sim \pi = \mathcal{N}(\mu, \sigma^2 I\_d)$ and $\ell(w, z) = \lVert w - z \rVert\_2^2$ so that we are trying to learn the mean $\mu$ of a Gaussian distribution. An obvious algorithm for each node to use is simple averaging of its dataset:

$$w\_k = \mathcal{A}\_k(s\_k) = \frac{1}{n} \sum\_{i=1}^n z\_{i,k} \,.$$

For this algorithm, it can be shown that

$$I(\hat{W}; Z\_{i,k}) = \frac{d}{2} \log \frac{nK}{nK - 1}$$

and

$$
\psi^{\*-1}(y) = 2\sqrt{d\left(1 + \frac{1}{nK}\right)^2 \sigma^4 y}
$$

See Section IV.A. in [3] for further details. If we apply the existing information theoretic bounds from Theorem 2 in an end-to-end way, such as in the approach from [14], we would get

$$\begin{split} \mathbb{E}\_{S\sim\pi^{nK}}[\Delta\_{\mathcal{A}}(S)] &\leq \sigma^{2} d\sqrt{2\left(1+\frac{1}{nK}\right)^{2}\log\frac{nK}{nK-1}}\\ &= O\left(\frac{1}{\sqrt{nK}}\right). \end{split}$$

However, for this choice of algorithm at each node, the true expected generalization error can be computed to be

$$\mathbb{E}\_{S\sim\pi^{nK}}[\Delta\_{\mathcal{A}}(S)] = \frac{2\sigma^2 d}{nK} \,.$$
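The closed-form value above can be checked by simulation. The following is a minimal Monte Carlo sketch, not taken from the paper: each node computes its local sample mean, the local means are averaged, and the difference between the population risk and the pooled empirical risk under the quadratic loss is averaged over many trials. The function names, the use of a common scalar `mu` for every coordinate of the mean, the number of trials, and the seed are assumptions made for illustration.

```python
import random


def simulate_expected_gen_error(n, K, d, sigma, mu=0.0, trials=20000, seed=0):
    """Monte Carlo estimate of the expected generalization error in the Gaussian mean example."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        samples = []      # all n*K data points, pooled across nodes
        node_models = []  # local sample means w_k
        for _k in range(K):
            s_k = [[rng.gauss(mu, sigma) for _ in range(d)] for _ in range(n)]
            samples.extend(s_k)
            node_models.append([sum(z[j] for z in s_k) / n for j in range(d)])
        # Aggregate by simple averaging of the K local models.
        w_hat = [sum(w_k[j] for w_k in node_models) / K for j in range(d)]
        # Population risk E||w - Z||^2 = ||w - mu||^2 + d * sigma^2 (closed form).
        population_risk = sum((w_hat[j] - mu) ** 2 for j in range(d)) + d * sigma ** 2
        # Empirical risk over the pooled dataset of size n*K.
        empirical_risk = sum(
            sum((w_hat[j] - z[j]) ** 2 for j in range(d)) for z in samples
        ) / (n * K)
        total += population_risk - empirical_risk
    return total / trials


def exact_expected_gen_error(n, K, d, sigma):
    """The closed form 2 * sigma^2 * d / (n * K) stated in the text."""
    return 2.0 * sigma ** 2 * d / (n * K)


if __name__ == "__main__":
    print(simulate_expected_gen_error(5, 4, 2, 1.0), exact_expected_gen_error(5, 4, 2, 1.0))
```

With enough trials, the estimate should be close to the closed form, up to Monte Carlo error.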

By applying our new bound from Theorem 4, we get

$$\begin{aligned} \mathbb{E}\_{S\sim\pi^{nK}}[\Delta\_{\mathcal{A}}(S)] &\leq \frac{\sigma^2 d}{K} \sqrt{2\left(1 + \frac{1}{n}\right)^2 \log\frac{n}{n-1}} \\ &= O\left(\frac{1}{K\sqrt{n}}\right) , \end{aligned}$$

which shows the correct dependence on $K$ and improves upon the $O(1/\sqrt{K})$ result from prior information theoretic methods.
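For a sense of scale, the three expressions from this example can be evaluated side by side. The sketch below simply codes the formulas stated above (the true expected generalization error, the end-to-end bound, and the per-node bound from Theorem 4); the function names are hypothetical, and the per-node bound requires $n \geq 2$ so that the logarithm is finite.

```python
import math


def true_gen_error(n, K, d, sigma):
    """Exact expected generalization error 2 * sigma^2 * d / (n * K)."""
    return 2.0 * sigma ** 2 * d / (n * K)


def end_to_end_bound(n, K, d, sigma):
    """Bound obtained by treating the whole system as one centralized algorithm."""
    N = n * K
    return sigma ** 2 * d * math.sqrt(2.0 * (1.0 + 1.0 / N) ** 2 * math.log(N / (N - 1)))


def per_node_bound(n, K, d, sigma):
    """Per-node bound from Theorem 4; assumes n >= 2."""
    return (sigma ** 2 * d / K) * math.sqrt(
        2.0 * (1.0 + 1.0 / n) ** 2 * math.log(n / (n - 1))
    )


if __name__ == "__main__":
    n, d, sigma = 10, 3, 1.0
    for K in (1, 2, 4, 8, 16):
        print(K, true_gen_error(n, K, d, sigma),
              per_node_bound(n, K, d, sigma),
              end_to_end_bound(n, K, d, sigma))
```

Doubling `K` halves the per-node bound, matching the $1/K$ decay of the true error, whereas the end-to-end bound shrinks only by about a factor of $\sqrt{2}$.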

#### *2.2. General Models and Losses*

In this section, we briefly describe some results that hold for more general classes of models and loss functions, such as deep neural networks and other nonlinear models:

**Theorem 5** (Lipschitz Continuous Loss)**.** *Suppose that* $\ell(w, z)$ *is Lipschitz continuous as a function of* $w$ *in the sense that*

$$|\ell(w, z) - \ell(w', z)| \le C \lVert w - w' \rVert\_2$$

*for any* $z$ *and that* $\mathbb{E}[\lVert W\_k - \mathbb{E}[W\_k] \rVert\_2] \le \sigma\_0$ *for each* $k$*. Then, we have*

$$\mathbb{E}\_{S\sim\pi^{nK}}[\Delta\_{\mathcal{A}}(S)] \leq \frac{2C\sigma\_{0}}{K} \,.$$

**Proof.** Starting with Theorem 1, we have

$$\begin{align} \mathbb{E}\_{S\sim\pi^{nK}}[\Delta\_{\mathcal{A}}(S)] &= \frac{1}{nK} \sum\_{i,k} \mathbb{E}\_{S,S'}\Big[\ell(\mathcal{A}(S), Z'\_{i,k}) - \ell(\mathcal{A}(S^{(i,k)}), Z'\_{i,k})\Big] \notag \\ &\le \frac{1}{nK} \sum\_{i,k} \mathbb{E}\_{S,S'} \Big[ C \big\lVert \mathcal{A}(S) - \mathcal{A}(S^{(i,k)}) \big\rVert\_2 \Big] \tag{8} \\ &= \frac{1}{nK^{2}} \sum\_{i,k} \mathbb{E}\_{S,S'} \Big[ C \big\lVert W\_k - W\_k^{(i)} \big\rVert\_2 \Big] \notag \\ &\le \frac{C}{nK^{2}} \sum\_{i,k} \Big( \mathbb{E}\_{S,S'} \Big[ \big\lVert W\_k - \mathbb{E}[W\_k] \big\rVert\_2 \Big] + \mathbb{E}\_{S,S'} \Big[ \big\lVert W\_k^{(i)} - \mathbb{E}[W\_k] \big\rVert\_2 \Big] \Big) \tag{9} \\ &\le \frac{2C\sigma\_0}{K} \,, \tag{10} \end{align}$$

where Equation (8) follows from Lipschitz continuity, Equation (9) uses the triangle inequality, and Equation (10) follows from the assumed bound $\mathbb{E}[\lVert W\_k - \mathbb{E}[W\_k] \rVert\_2] \le \sigma\_0$.

The bound in Theorem 5 is not in terms of the information theoretic quantities $I(W\_k; S\_k)$, but it does show that the $O(1/K)$ upper bound can be shown for much more general loss functions and arbitrary nonlinear models.

#### *2.3. Privacy and Communication Constraints*

Both communication and local differential privacy constraints can be thought of as special cases of mutual information constraints. Motivated by this observation, Theorem 4 immediately implies corollaries for these types of systems:

**Corollary 1** (Privacy Constraints)**.** *Suppose each node's algorithm* $\mathcal{A}\_k$ *is an* $\varepsilon$*-local differentially private mechanism in the sense that* $\frac{p(w\_k|s\_k)}{p(w\_k|s'\_k)} \le e^{\varepsilon}$ *for each* $w\_k, s\_k, s'\_k$*. Then, for losses of the form in Theorem 4, and under Assumption 2, we have*

$$\mathbb{E}\_{S\sim\pi^{nK}}[\Delta\_{\mathcal{A}}(S)] \leq \frac{1}{K} \sqrt{\frac{2R^2 \min\{\varepsilon, (e - 1)\varepsilon^2\}}{n}} \,.$$

**Proof.** Note that

$$\begin{split} I(W\_{k};S\_{k}) &= \sum\_{w\_{k},s\_{k}} p(w\_{k},s\_{k}) \log \frac{p(w\_{k}|s\_{k})}{\sum\_{s'\_{k}} p(w\_{k}|s'\_{k}) p(s'\_{k})} \\ &\leq \sum\_{w\_{k},s\_{k}} p(w\_{k},s\_{k}) \log \frac{p(w\_{k}|s\_{k})}{\inf\_{s'\_{k}} p(w\_{k}|s'\_{k})} \\ &\leq \sum\_{w\_{k},s\_{k}} p(w\_{k},s\_{k})\,\varepsilon = \varepsilon \,. \end{split}$$

Similarly, it is true that

$$\begin{split} I(W\_{k};S\_{k}) &= \mathrm{KL}(P\_{W\_{k}S\_{k}} \Vert P\_{S\_{k}}P\_{W\_{k}}) \\ &\leq \mathrm{KL}(P\_{W\_{k}S\_{k}} \Vert P\_{S\_{k}}P\_{W\_{k}}) + \mathrm{KL}(P\_{S\_{k}}P\_{W\_{k}} \Vert P\_{W\_{k}S\_{k}}) \\ &= \sum\_{w\_{k},s\_{k}}p(w\_{k})p(s\_{k})\left(\frac{p(w\_{k}|s\_{k})}{p(w\_{k})}-1\right)\log\frac{p(w\_{k}|s\_{k})}{p(w\_{k})} \\ &\leq \sum\_{w\_{k},s\_{k}}p(w\_{k})p(s\_{k})(e^{\varepsilon}-1)\varepsilon \leq (e-1)\varepsilon^{2} \,, \end{split}$$

where the last inequality is only true for $\varepsilon \le 1$. Putting these two displays together gives $I(W\_k; S\_k) \le \min\{\varepsilon, (e - 1)\varepsilon^2\}$, and the result follows from Theorem 4.

**Corollary 2** (Communication Constraints)**.** *Suppose each node can only transmit* $B$ *bits of information to the model aggregator, meaning that each* $W\_k$ *can only take* $2^B$ *distinct possible values. Then, for losses of the form in Theorem 4, and under Assumption 2, we have*

$$\mathbb{E}\_{S\sim\pi^{nK}}[\Delta\_{\mathcal{A}}(S)] \le \frac{1}{K} \sqrt{\frac{2(\log 2)R^2B}{n}}\ .$$

**Proof.** The corollary follows immediately from Theorem 4 and

$$I(W\_k; S\_k) \le H(W\_k) \le (\log 2)B \,.$$
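The two corollaries reduce to simple closed-form expressions, which the following sketch evaluates. It is an illustrative calculator only: the function names and argument order are assumptions, `R` is the sub-Gaussian parameter from Assumption 2, and the natural logarithm is used so that $B$ bits correspond to at most $(\log 2)B$ nats of mutual information.

```python
import math


def privacy_bound(eps, R, n, K):
    """Corollary 1: (1/K) * sqrt(2 R^2 min{eps, (e - 1) eps^2} / n)."""
    mutual_info_cap = min(eps, (math.e - 1.0) * eps ** 2)
    return math.sqrt(2.0 * R ** 2 * mutual_info_cap / n) / K


def communication_bound(B, R, n, K):
    """Corollary 2: (1/K) * sqrt(2 (log 2) R^2 B / n) for B bits per node."""
    return math.sqrt(2.0 * math.log(2.0) * R ** 2 * B / n) / K


if __name__ == "__main__":
    # Example values chosen arbitrarily for illustration.
    print(privacy_bound(eps=0.1, R=1.0, n=100, K=10))
    print(communication_bound(B=8, R=1.0, n=100, K=10))
```

Both bounds shrink as $1/K$ in the number of nodes and as $1/\sqrt{n}$ in the per-node sample size, and both tighten as the privacy parameter $\varepsilon$ or the bit budget $B$ decreases.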

#### **3. Iterative Algorithms**

We now turn to considering more complicated multi-round and iterative algorithms. In this setting, after $T$ rounds, there is a sequence of weights $W^{(T)} = (W^1, \ldots, W^T)$, and the final model $\hat{W}^T = f\_T(W^{(T)})$ is a function of that sequence, where $f\_T$ gives a linear combination of the $T$ vectors $W^1, \ldots, W^T$. The function $f\_T$ could represent, for example, averaging over the $T$ iterates, choosing the last iterate $W^T$, or some weighted average over the iterates. For each round $t$, each node $k$ produces an updated model $W\_k^t$ based on its local dataset $S\_k$ and the previous timestep's global model $W^{t-1}$. The global model is then updated via an average over all $K$ updated submodels:

$$W^t = \frac{1}{K} \sum\_{k=1}^{K} W\_k^t \,.$$

The particular example that we will consider is that of distributed SGD, where each node constructs its updated model $W\_k^t$ by taking one or more gradient steps starting from $W^{t-1}$ with respect to random minibatches of its local data. Our model is general enough to account for multiple local gradient steps, as are used in so-called federated learning [5–7], as well as noisy versions of SGD, such as in [20,21]. If only one local gradient step is taken for each iteration, then the update rule for this example could be written as

$$W\_k^t = W^{t-1} - \eta\_t \nabla\_w \ell(W^{t-1}, Z\_{t,k}) + \xi\_t \tag{11}$$

where $Z\_{t,k}$ is a data point (or minibatch) sampled from $S\_k$ on timestep $t$, $\eta\_t$ is the learning rate, and $\xi\_t$ is some potential added noise. We assume that the data points $Z\_{t,k}$ are sampled without replacement so that the samples are distinct across different values of $t$. We will also assume, for notational simplicity, that $\hat{W}^T = W^T$, although the more general result follows in a straightforward manner.
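The update rule in Equation (11), followed by averaging across nodes, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the authors' implementation: the quadratic loss $\ell(w, z) = \lVert w - z \rVert\_2^2$ is chosen only so that the gradient has a simple closed form, the step sizes $\eta\_t = c/t$ and noise levels $\sigma\_t = \sqrt{\eta\_t}$ follow the noisy SGD setting considered later in this section, one data point is used per node and timestep, and all function names are hypothetical.

```python
import math
import random


def grad_quadratic(w, z):
    """Gradient in w of the illustrative loss ||w - z||_2^2."""
    return [2.0 * (wj - zj) for wj, zj in zip(w, z)]


def distributed_sgd_round(w_prev, points, eta, noise_std, rng):
    """One round: every node applies Equation (11), then the K local models are averaged."""
    d = len(w_prev)
    local_models = []
    for z in points:  # points[k] plays the role of Z_{t,k}
        g = grad_quadratic(w_prev, z)
        xi = [rng.gauss(0.0, noise_std) for _ in range(d)]  # added Gaussian noise
        local_models.append([w_prev[j] - eta * g[j] + xi[j] for j in range(d)])
    K = len(local_models)
    return [sum(m[j] for m in local_models) / K for j in range(d)]


def run_distributed_sgd(datasets, w0, c, T, noisy=True, seed=0):
    """Run T rounds; datasets[k] is node k's list of samples and must have length >= T."""
    rng = random.Random(seed)
    w = list(w0)
    for t in range(1, T + 1):
        eta = c / t
        noise_std = math.sqrt(eta) if noisy else 0.0
        # Using the t-th stored sample at each node gives distinct samples across
        # timesteps, which mimics sampling without replacement for i.i.d. data.
        points = [s_k[t - 1] for s_k in datasets]
        w = distributed_sgd_round(w, points, eta, noise_std, rng)
    return w


if __name__ == "__main__":
    data_rng = random.Random(1)
    datasets = [[[data_rng.gauss(3.0, 1.0)] for _ in range(50)] for _ in range(4)]
    print(run_distributed_sgd(datasets, w0=[0.0], c=0.5, T=50))
```

Multiple local gradient steps per round, as in federated learning, would replace the single update inside `distributed_sgd_round` with a short inner loop at each node.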

For this type of iterative algorithm, we will consider the following timestep-averaged empirical risk quantity:

$$\frac{1}{KT} \sum\_{t=1}^{T} \sum\_{k=1}^{K} \ell(W^t, Z\_{t,k}) \text{ .}$$

and the corresponding generalization error, expressed as

$$\Delta\_{\mathrm{sgd}}(S) = \frac{1}{T} \sum\_{t=1}^{T} \left( \mathbb{E}\_{Z \sim \pi} [\ell(\mathcal{W}^t, Z)] - \frac{1}{K} \sum\_{k=1}^{K} \ell(\mathcal{W}^t, Z\_{t,k}) \right). \tag{12}$$

Note that Equation (12) is slightly different from the end-to-end generalization error that we would get from considering the final model $W^T$ and the whole dataset $S$. It is instead an average over the generalization errors we would get from each model, stopping at iteration $t$. We do this so that when we apply the leave-one-out expansion from Theorem 1, we do not have to account for the dependence of $W\_k^t$ on past samples $Z\_{t',k'}$ for $t' < t$ and $k' \neq k$. Since we expect the generalization error to decrease as we use more samples, this quantity should result in a more conservative upper bound and be a reasonable surrogate object to study. The next bound follows as a corollary to Theorem 4:

**Corollary 3.** *For losses of the form in Theorem 4, and under Assumption 2 (for each $W\_k^t$), we have*

$$\mathbb{E}\left[\Delta\_{\mathrm{sgd}}(S)\right] \le \frac{1}{T} \sum\_{t=1}^{T} \frac{1}{K^2} \sum\_{k=1}^{K} \sqrt{2R^2 I(\mathcal{W}\_k^t; Z\_{t,k})} .$$

In the particular example described in Equation (11), where Gaussian noise $\xi\_t \sim \mathcal{N}(0, \sigma\_t^2 I\_d)$ is added to each iterate, Corollary 3 yields the following. As in [20], we assume that the updates are magnitude-bounded (i.e., $\sup\_{w,z} \|\nabla\_w \ell(w, z)\|\_2 \le L$), that the stepsizes satisfy $\eta\_t = c/t$ for a constant $c > 0$, and that $\sigma\_t = \sqrt{\eta\_t}$:

**Corollary 4.** *Under the assumptions above, we have*

$$\mathbb{E}\left[\Delta\_{\mathrm{sgd}}(S)\right] \leq \frac{2RL}{K} \sqrt{\frac{c}{T}} .$$

**Proof.** The mutual information terms in Corollary 3 satisfy

$$I(\mathcal{W}\_k^t; Z\_{t,k}) \le I(\mathcal{W}\_k^t, \mathcal{W}^{t-1}; Z\_{t,k}) \tag{13}$$

$$=I(\mathcal{W}\_k^t; Z\_{t,k}|\mathcal{W}^{t-1}) + I(\mathcal{W}^{t-1}; Z\_{t,k})\tag{14}$$

$$=I(\mathcal{W}\_k^t; Z\_{t,k}|\mathcal{W}^{t-1})\tag{15}$$

$$\leq \frac{d}{2} \log\left(1 + \frac{\eta\_t^2 L^2}{d \sigma\_t^2}\right) \tag{16}$$

$$\le \frac{\eta\_t^2 L^2}{2\sigma\_t^2} = \frac{cL^2}{2t} \,. \tag{17}$$

Equation (13) follows from the data-processing inequality, Equation (14) is the chain rule for mutual information, and Equation (15) follows from the independence of $Z\_{t,k}$ and $W^{t-1}$. Equation (16) is due to the capacity of the additive white Gaussian noise channel, and Equation (17) uses the inequality $\log(1 + x) \le x$. Thus, we have

$$\mathbb{E}\left[\Delta\_{\mathrm{sgd}}(S)\right] \le \frac{1}{TK} \sum\_{t=1}^{T} RL\sqrt{\frac{c}{t}} \le \frac{2RL}{K}\sqrt{\frac{c}{T}} ,$$

where the last step uses $\sum\_{t=1}^{T} t^{-1/2} \le 2\sqrt{T}$.
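The chain of inequalities in the proof can be checked numerically. The sketch below evaluates the right-hand side of Corollary 3 with each mutual information term replaced by the Gaussian-channel bound (16), and compares it with the closed form of Corollary 4. The function names and parameter values are hypothetical, logarithms are natural, and $R$ is simply treated as a given positive constant.

```python
import math

def corollary3_rhs(R, L, c, K, T, d):
    """(1/T) * sum_t (1/K) * sqrt(2 R^2 I_t), with I_t taken from bound (16)."""
    total = 0.0
    for t in range(1, T + 1):
        eta = c / t          # stepsize eta_t = c / t
        sigma2 = eta         # noise variance sigma_t^2 = eta_t
        info = 0.5 * d * math.log1p(eta ** 2 * L ** 2 / (d * sigma2))
        total += math.sqrt(2 * R ** 2 * info) / K
    return total / T

def corollary4_rhs(R, L, c, K, T):
    """Closed-form bound 2RL/K * sqrt(c/T)."""
    return 2 * R * L / K * math.sqrt(c / T)

# Hypothetical parameter values, chosen only for illustration.
for K in (1, 5, 20):
    for T in (1, 10, 1000):
        tight = corollary3_rhs(R=1.0, L=2.0, c=0.5, K=K, T=T, d=8)
        loose = corollary4_rhs(R=1.0, L=2.0, c=0.5, K=K, T=T)
        print(K, T, tight <= loose)
```

Because every node shares the same bound on its mutual information term, the inner sum over $k$ contributes a factor of $K$, which is why only a single factor $1/K$ appears in the code.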

#### **4. Simulations**

We simulated a distributed linear regression example in order to demonstrate the improvement in our bounds over the existing information-theoretic bounds. To accomplish this, we generated $n = 10$ synthetic datapoints at each of $K$ different nodes for various values of $K$. Each datapoint consisted of a pair $(x, y)$, where $y = x w\_0 + n$ with $x, n \sim \mathcal{N}(0, 1)$, and $w\_0 \sim \mathcal{N}(0, 1)$ was the randomly generated true weight that was common to all datapoints. Each node constructed an estimate $\hat{w}\_k$ of $w\_0$ using the well-known normal equations, which minimize the quadratic loss (i.e., $\hat{w}\_k = \arg\min\_w \sum\_{i=1}^{n} (w x\_{i,k} - y\_{i,k})^2$). The aggregate model was then the average $\hat{w} = \frac{1}{K} \sum\_{k=1}^{K} \hat{w}\_k$. In order to estimate the old and new information-theoretic generalization bounds (i.e., the bounds from Theorems 2 and 4, respectively), this procedure was repeated $M = 10^6$ times, and the datapoint and model values were binned in order to estimate the mutual information quantities. The value of $M$ was increased until the mutual information estimates were no longer particularly sensitive to the number and widths of the bins. In order to estimate the true generalization error, the expectations for both the population risk and the dataset were estimated by Monte Carlo experimentation, with $10^4$ trials each. The results can be seen in Figure 2, where it is evident that the new information-theoretic bound is much closer to the true expected generalization error and decays with an improved rate as a function of $K$.

**Figure 2.** Information-theoretic upper bounds and expected generalization error for a simulated linear regression example in linear (**top**) and log (**bottom**) scales.
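The data-generating part of this experiment is simple enough to sketch. The code below estimates only the expected generalization gap of the averaged least-squares model and does not attempt the binned mutual information estimates. The function name, the trial count, and the use of the closed-form population risk $(w - w\_0)^2 + 1$ in place of a second Monte Carlo loop are assumptions of this sketch, not a reproduction of the authors' code.

```python
import numpy as np

def expected_generalization_gap(K, n=10, trials=20000, seed=0):
    """Monte Carlo estimate of E[population risk - empirical risk] for the averaged model."""
    rng = np.random.default_rng(seed)
    w0 = rng.standard_normal((trials, 1, 1))           # true weight, one draw per trial
    x = rng.standard_normal((trials, K, n))            # inputs at each node
    y = x * w0 + rng.standard_normal((trials, K, n))   # outputs with unit-variance noise
    # Per-node solution of the normal equations for a scalar weight without intercept.
    w_k = (x * y).sum(axis=2) / (x ** 2).sum(axis=2)   # shape (trials, K)
    w_avg = w_k.mean(axis=1)                           # aggregate model, shape (trials,)
    # Empirical risk of the aggregate model over all n*K training points.
    emp = ((w_avg[:, None, None] * x - y) ** 2).mean(axis=(1, 2))
    # Population risk for a fresh pair (x, y): E[(w x - y)^2] = (w - w0)^2 + 1.
    pop = (w_avg - w0[:, 0, 0]) ** 2 + 1.0
    return float((pop - emp).mean())

for K in (1, 2, 5, 10):
    print(K, expected_generalization_gap(K))
```

Using the closed-form population risk removes one source of Monte Carlo noise; replacing it with a sampled estimate would follow the description in the text more literally.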

**Author Contributions:** Conceptualization, L.P.B., A.D. and H.V.P.; Formal analysis, L.P.B., A.D. and H.V.P.; Investigation, L.P.B., A.D. and H.V.P.; Methodology, L.P.B., A.D. and H.V.P.; Supervision, H.V.P.; Writing—original draft, L.P.B.; Writing—review & editing, L.P.B., A.D. and H.V.P. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by National Science Foundation grant number CCF-1908308.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **A Pattern Dictionary Method for Anomaly Detection**

**Elyas Sabeti <sup>1</sup>, Sehong Oh <sup>2</sup>, Peter X. K. Song <sup>3</sup> and Alfred O. Hero <sup>2,\*</sup>**


**Abstract:** In this paper, we propose a compression-based anomaly detection method for time series and sequence data using a pattern dictionary. The proposed method is capable of learning complex patterns in a training data sequence, using these learned patterns to detect potentially anomalous patterns in a test data sequence. The proposed pattern dictionary method uses a measure of complexity of the test sequence as an anomaly score that can be used to perform stand-alone anomaly detection. We also show that when combined with a universal source coder, the proposed pattern dictionary yields a powerful atypicality detector that is equally applicable to anomaly detection. The pattern dictionary-based atypicality detector uses an anomaly score defined as the difference between the complexity of the test sequence data encoded by the trained pattern dictionary (typical) encoder and the universal (atypical) encoder, respectively. We consider two complexity measures: the number of parsed phrases in the sequence, and the length of the encoded sequence (codelength). Specializing to a particular type of universal encoder, the Tree-Structured Lempel–Ziv (LZ78), we obtain a novel non-asymptotic upper bound, in terms of the Lambert W function, on the number of distinct phrases resulting from the LZ78 parser. This non-asymptotic bound determines the range of anomaly score. As a concrete application, we illustrate the pattern dictionary framework for constructing a baseline of health against which anomalous deviations can be detected.

**Citation:** Sabeti, E.; Oh, S.; Song, P.X.K.; Hero, A.O. A Pattern Dictionary Method for Anomaly Detection. *Entropy* **2022**, *24*, 1095. https://doi.org/10.3390/e24081095

Academic Editors: Chao Tian and Lizhong Zheng

Received: 14 July 2022 Accepted: 7 August 2022 Published: 9 August 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

**Keywords:** pattern dictionary; atypicality; Lempel–Ziv algorithm; lossless compression; anomaly detection

#### **1. Introduction**

Anomaly detection and outlier detection are used for detecting data samples that are inconsistent with normal data samples. Early methods did not take the sequential structure of the data into consideration [1]. However, many real world applications involve data collected as a sequence or time series. In such data, anomalous samples are better characterized as subsequences of time series. Anomaly detection is a challenging task due to the uncertain nature of anomalies. Anomaly detection in time series and sequence data is particularly difficult since both length and occurrence frequency of potentially anomalous subsequences are unknown. Additionally, algorithmic computational complexity can be a challenge, especially for streaming data with large alphabet sizes.

In this paper, we propose a universal nonparametric model-free anomaly detection method for time series and sequence data based on a pattern dictionary (PD). Given training and test data sequences, a pattern dictionary is created from the sets of all the patterns in the training data. This dictionary is then used to sequentially parse and compress (in a lossless manner) the test data sequence. Subsequently, we interpret the number of parsed phrases or the codelength of the test data as anomaly scores. The smaller the number of parsed phrases or the shorter the compressed codelength of the test data, the more similar the training and test data patterns are. This sequential parsing and lossless compression procedure leads to detection of anomalous test sequences and their potential anomalous patterns (subsequences).

The proposed pattern dictionary method has the following properties: (i) it is nonparametric since it does not rely on a family of parametric distributions; (ii) it is universal in the sense that the detection criterion does not require any prior modeling of the anomalies or nominal data; (iii) it is non-Bayesian as the detection criterion is model-free; and (iv) as it depends on data compression, data discretization is required prior to building the dictionary. While the proposed pattern dictionary can be used as a stand-alone anomaly detection method (Pattern Dictionary for Detection (PDD)), we show how it can be utilized in the atypicality framework [2,3] for more general data discovery problems. This results in a method we call PDA (Pattern Dictionary based Atypicality), in which the proposed pattern dictionary is contrasted against a universal source coder which is the Tree-Structured Lempel–Ziv (LZ78) [4,5]. We use the LZ78 as the universal encoder since its compression procedure is similar to our proposed pattern dictionary, and it is (asymptotically) optimal [4,5].

The main contributions of this paper are as follows. First, we propose the pattern dictionary method for anomaly detection and characterize its properties. We show in Theorem 1 that using a multi-level dictionary that separates the patterns by their depth results in a shorter average indexing codelength in comparison to a uni-level dictionary that uses a uniform indexing approach. Second, we develop novel non-asymptotic lower and upper bounds of the LZ78 parser in Theorem 2 and further analyze its non-asymptotic properties. We demonstrate that the non-asymptotic upper bound on the number of distinct phrases resulting from the LZ78 parsing of an $|\mathcal{X}|$-ary sequence of length $l$ can be explicitly expressed by the Lambert W function [6]. To the best of our knowledge, such characterization has not previously appeared in the literature. Then, we show in Lemma 1 that the achieved non-asymptotic upper bound on the number of distinct phrases resulting from the LZ78 parsing converges to the optimal upper bound $l / \log l$ of the LZ78 parser as $l \to \infty$. Third, we show how the pattern dictionary and LZ78 can be used together in an atypicality detection framework. We demonstrate that the achieved non-asymptotic lower and upper bounds on both LZ78 and pattern dictionary determine the range of the anomaly score. Consequently, we show how these bounds can be used to analyze the effect of dictionary depth on the anomaly score. Furthermore, the bounds are used to set the anomaly detection threshold. Finally, we compare our proposed methods with the competing methods, including nearest neighbors-based similarity [7], threshold sequence time-delay embedding [8–11], and compression-based dissimilarity measure [12–16], that are designed for anomaly detection in sequence data and time series.
We conclude our paper with an experiment that details how the proposed framework can be used to construct a baseline of health against which anomalous deviations are detected.

The paper is organized as follows. In Section 2, we briefly review the relevant literature in anomaly detection (readers who are familiar with anomaly detection can skip this section). Section 3 introduces the detection framework and the notation used in this paper. Section 4 presents our proposed pattern dictionary method and its properties. In Section 5, we show how the proposed pattern dictionary can be used in an atypicality framework alongside LZ78, and we analyze the non-asymptotic properties of the LZ78 parser. Section 6 presents experiments that illustrate the proposed pattern dictionary anomaly detection procedure. Finally, Section 7 concludes our paper.

#### **2. Related Works**

Anomaly detection has a vast literature. Anomaly detection procedures can be categorized into parametric and nonparametric methods. Parametric methods rely on a family of parametric distributions to model the normal data. The slippage problem [17], change detection [18–21], concept drift detection [19–22], minimax quickest change detection (MQCD) [23–25], and transient detection [26–29] are examples of parametric anomaly detection problems. The main difference between our proposed pattern dictionary method and the aforementioned techniques is that our method is a model-free nonparametric method. The main drawback of the parametric anomaly detection procedure is that it is difficult to accurately specify the parametric distribution for the data under investigation.

Nonparametric anomaly detection approaches do not assume any explicit parameterized model for the data distributions. An example is an adaptive nonparametric anomaly detection approach called geometric entropy minimization (GEM) [30,31] that is based on the minimal covering properties of *K*-point entropic graphs constructed on *N* training samples from a nominal probability distribution. The main difference between GEM-based methods and our proposed pattern dictionary is that the former techniques are designed to detect outliers and cannot easily incorporate the temporal information regarding anomalies in a data stream. Another nonparametric detection method is sequential nonparametric testing, which considers data as an online stream and addresses the growing data storage problem by sequentially testing every new data sample [32,33]. A key difference between sequential nonparametric testing and our proposed pattern dictionary method is that our method is based on coding theory instead of statistical decision theory.

Information theory and universal source coding have been used previously in anomaly detection [34–45]. The detection criteria in these approaches are based on comparing metrics such as complexity or similarity distances that depend on entropy rate. An issue with these approaches is that there are many completely dissimilar sources with the same entropy rate, reducing outlier sensitivity. Another related problem is universal outlier detection [46,47]. In these works, different levels of knowledge about nominal and outlier distributions and number of outliers are incorporated. Unlike these methods, our proposed pattern dictionary approach does not require any prior knowledge about outliers and anomalies. In [48], a measure of empirical informational divergence between two individual sequences generated from two finite-order, finite-alphabet, stationary Markov processes is introduced and used for a simple universal classification. While the parsing procedure used in [48] is similar to the pattern dictionary used in this paper, there are important differences. The empirical measure proposed in [48] is a stand-alone score function that is designed for two-class classification, while our measure is a direct byproduct of the LZ78 encoding algorithm designed for single-class classification, i.e., anomaly detection. In addition, the theoretical convergence of the empirical measure to the relative entropy between the class conditioned distributions, shown in [48], is only guaranteed when the sequences satisfy the finite-order Markov property, a condition that may be difficult to satisfy in practice. In [2,3], an information theoretic data discovery framework called *atypicality* has been introduced in which the detection criterion is based on a descriptive codelength comparison of an optimum encoder or a training-based fixed source coder (namely, a data-dependent source coder introduced in [2]) with a universal source coder.
In this paper, we show how our proposed pattern dictionary method can be used as a training-based fixed source coder in an atypicality framework.

Anomaly and outlier detection for time series has also been extensively studied [49]. Various time series modeling techniques such as regression [50], auto regression [51], auto regression moving average [52], auto regressive integrated moving average [53], support vector regression [54], and Kalman filters [55] have been used to detect anomalous observations by comparing the estimated residuals to a threshold. Many of these methods depend on a statistical assumption on the residuals, e.g., an assumption of Gaussian distribution, while the pattern dictionary method is model-free.

The proposed pattern dictionary method is closely related to the anomaly detection methods that are designed for sequence data. Many of these methods are focused on specific applications. For instance, detection of mutations in DNA sequences [7,56], detection of cyberattacks in computer networks [57], and detection of irregular behaviors in online banking [58] are all application-specific examples of anomaly detection for discrete sequences. In recent years, multiple sequence data anomaly detection methods have been developed specifically for graphs [59], dynamic networks [60], and social networks [61]. Chandola et al. [34] summarized many anomaly detection methods for discrete sequences and identified three general approaches to this problem. These anomaly detection formulations are unique in the way that anomalies are defined, but similar in their reliance on comparison between a test (sub)sequence and normal sequences in the training data. For example, kernel-based techniques such as nearest neighbor-based similarity (NNS) [7] are designed to detect anomalous sequences that are dissimilar to the training data. As another example, threshold sequence time-delay embedding (t-STIDE) [8–11] is established to detect anomalous sequences that contain subsequences with anomalous occurrence frequencies. The compression-based dissimilarity measure (CDM) is proposed for discord detection [12–16] to detect anomalous subsequences within a long sequence. Chandola et al. [34] also showed how various techniques developed for one problem formulation can be changed and applied to other problem formulations. While our pattern dictionary method shares similarity with NNS, CDM, and t-STIDE, our proposed method is generally applicable to any of the categories of anomaly detection identified in [34]. Furthermore, our detection criterion does not depend on the specific type of anomaly.
Note that while CDM is also a compression-based method, its anomaly score is based on a dissimilarity measure that might fail to detect atypical subsequences [2]. For instance, using the CDM method, a binary i.i.d. uniform training sequence is equally dissimilar to another binary i.i.d. uniform test sequence or to a test sequence drawn from some other distribution. In Section 6, the detection performance of our proposed pattern dictionary method is compared with NNS, CDM, t-STIDE, and the Ziv–Merhav method of [48].

It is worth mentioning that since the proposed pattern dictionary method is based on lossless source coding, it requires discretization of time series prior to deployment. In fact, many anomaly detection approaches require discretization of continuous data prior to applying inference techniques [62–65]. Note that discretization is also a requirement in other problem settings such as continuous optimization in genetic algorithms [66], image pattern recognition [67], and nonparametric histogram matching over codebooks in computer vision [68].

#### **3. Framework and Notation**

In the anomaly detection literature for sequence data and time series, the following three general formulations are considered [34]: (i) an entire test sequence is anomalous if it is notably different from normal training sequences; (ii) a subsequence within a long test sequence is anomalous if it is notably different from other subsequences in the same test sequence or the subsequences in a given training sequence; and (iii) a given test subsequence or pattern is anomalous if its occurrence frequency in a test sequence is notably different from its occurrence frequency in a normal training sequence. In this paper, we consider a unified formulation in which we determine if a (sub)sequence is anomalous with respect to a training sequence (or training sequence database) if any of the aforementioned three conditions are met. In other words, given a training sequence or a training sequence database, a test sequence is anomalous if it is significantly different from training sequences, or it contains a subsequence that is significantly different from subsequences in the training sequence, or it contains a subsequence whose occurrence frequency is significantly different from its occurrence frequency in the training data.

#### *Notation*

We use $x$ to denote a sequence and $x\_n^m$ to denote a subsequence of $x$: $x\_n^m = \{x\_i, i = n, n + 1, \ldots, m\}$, and $x^l$ represents a sequence of length $l$, i.e., $\{x\_n, n = 1, \ldots, l\}$. $\mathcal{X}$ denotes a finite set, and $\mathcal{D}$ represents a dictionary of subsequences. Throughout this paper:


#### **4. Pattern Dictionary: Design and Properties**

Consider a long sequence, called the training data, $\{x\_n, n = 1, \ldots, L\}$ of length $L$ drawn from a finite alphabet $\mathcal{X}$. The goal is to *learn* the patterns (subsequences) of this sequence by creating a dictionary that contains all distinct patterns of maximum length (depth) $D\_{\max} \ll L$ that are embedded in the sequence. We call this dictionary a *pattern dictionary* $\mathcal{D}$ with the maximum depth $D\_{\max}$ and the set of observed patterns $\mathcal{S}\_{\mathcal{D}}(x\_1^L)$.

**Example 1.** *Suppose $D\_{\max} = 2$, the alphabet is $\mathcal{X} = \{A, B, C, D\}$, and the training sequence is $x = ABACADABBACCADDABABACADAB$. The set of patterns with depth $d \le D\_{\max}$ in this sequence is $\mathcal{S}\_{\mathcal{D}}(x) = \{A, B, C, D, AB, BA, AC, CA, AD, DA, BB, CC, DD\}$.*
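The construction in Example 1 can be reproduced with a short sketch that collects every distinct substring of length at most $D\_{\max}$. The function name `pattern_levels` and the choice to group patterns by length in a Python dictionary are illustrative assumptions. Grouping by length is only a convenience here.

```python
def pattern_levels(sequence, d_max):
    """Map each depth d in 1..d_max to the set of distinct length-d substrings."""
    levels = {}
    for d in range(1, d_max + 1):
        levels[d] = {sequence[i:i + d] for i in range(len(sequence) - d + 1)}
    return levels

training = "ABACADABBACCADDABABACADAB"
levels = pattern_levels(training, d_max=2)
all_patterns = set().union(*levels.values())
print(sorted(all_patterns, key=lambda p: (len(p), p)))
```

Python sets give constant-time membership tests on average, which is the operation a parser needs most often when matching test data against the dictionary.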

Since the pattern dictionary is going to be used as a training-based fixed source coder (a data-dependent source coder as defined in [2]), an efficient structure for the pattern representation that minimizes the indexing codelength is of interest. The simplest approach is to consider all the patterns of length $1 \le d \le D\_{\max}$ in one set $\mathcal{S}\_{\mathcal{D}}$ and use a uniform indexing approach. This approach is called a *uni-level dictionary*. Another approach is to separate all the patterns by their depth (pattern length) and arrange them in $D\_{\max}$ sets $\mathcal{S}\_{\mathcal{D}}^{(1)}, \mathcal{S}\_{\mathcal{D}}^{(2)}, \ldots, \mathcal{S}\_{\mathcal{D}}^{(D\_{\max})}$, and define $\mathcal{S}\_{\mathcal{D}} = \bigcup\_{d=1}^{D\_{\max}} \mathcal{S}\_{\mathcal{D}}^{(d)}$, which we call a *multi-level dictionary*. In the following sections, we show that the latter results in a shorter average indexing codelength. It is worth mentioning that since a multi-level dictionary results in a depth-dependent indexing codelength, the average over the depth is considered. A relevant question is whether the average of indexing codelength over all the patterns independent of depth should be used as an alternative. Since such pattern dictionaries are used to sequentially parse test data, patterns at smaller depth are more likely to be matched, even if they are anomalous. Thus, the average of indexing codelength over depth can better differentiate depth-dependent anomalies.

#### *4.1. A Special Case*

Suppose all the possible patterns of depth $d \le D\_{\max}$ exist in the training sequence $\{x\_n, n = 1, \ldots, L\}$. That is, the cardinality of $\mathcal{S}\_{\mathcal{D}}^{(d)}$ is $|\mathcal{S}\_{\mathcal{D}}^{(d)}| = |\mathcal{X}|^d$ for $1 \le d \le D\_{\max}$. Then, the total number of patterns is

$$\begin{aligned} \left| \mathcal{S}\_{\mathcal{D}} \left( \mathbf{x}\_1^L \right) \right| &= \sum\_{d=1}^{D\_{\max}} \left| \mathcal{S}\_{\mathcal{D}}^{(\mathbf{d})} \left( \mathbf{x}\_1^L \right) \right| \\ &= \sum\_{d=1}^{D\_{\max}} |\mathcal{X}|^d \\ &= \frac{|\mathcal{X}| \left( |\mathcal{X}|^{D\_{\max}} - 1 \right)}{|\mathcal{X}| - 1} . \end{aligned}$$

Hence, a uni-level dictionary results in a uniform indexing codelength of

$$\begin{aligned} L^{uni} &= \log\left(\frac{|\mathcal{X}|\left(|\mathcal{X}|^{D\_{\max}} - 1\right)}{|\mathcal{X}| - 1}\right) \\ &\approx D\_{\max} \log(|\mathcal{X}|). \end{aligned}$$

On the other hand, a multi-level dictionary requires a two-stage description of the index. The first stage is the index of the depth $d$ (using $\log D\_{\max}$ bits), and the second stage is the index of the pattern among all the patterns with the same depth (using $d \log(|\mathcal{X}|)$ bits). This two-stage description of the index leads to a non-uniform indexing codelength: the minimum indexing codelength, occurring for the patterns of depth $d = 1$, equals $L^{multi}\_{\min} = \log D\_{\max} + \log(|\mathcal{X}|)$ bits, while the maximum indexing codelength, occurring for the patterns of depth $d = D\_{\max}$, equals $L^{multi}\_{\max} = \log D\_{\max} + D\_{\max} \log(|\mathcal{X}|)$ bits. Thus, the average indexing codelength of a multi-level dictionary is given by

$$\begin{aligned} L^{multi} &= \frac{1}{D\_{\max}} \sum\_{d=1}^{D\_{\max}} \left( \log D\_{\max} + d \log(|\mathcal{X}|) \right) \\ &= \log D\_{\max} + \frac{\log(|\mathcal{X}|)}{D\_{\max}} \sum\_{d=1}^{D\_{\max}} d \\ &\approx \log D\_{\max} + \frac{1}{2} D\_{\max} \log(|\mathcal{X}|). \end{aligned}$$

Figures 1 and 2 graphically compare the indexing codelength between a uni-level dictionary and a multi-level dictionary for a fixed alphabet size and a fixed *Dmax*, respectively. As seen, the multi-level dictionary yields a shorter average indexing codelength than the uni-level dictionary.

**Figure 1.** Comparison of indexing codelength between a uni-level dictionary and a multi-level dictionary (fixed alphabet size $|\mathcal{X}| = 100$).

**Figure 2.** Comparison of indexing codelength between a uni-level dictionary and a multi-level dictionary (fixed *Dmax* = 20).
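The two codelength expressions of the special case are easy to evaluate directly. The sketch below uses base-2 logarithms because the text measures codelengths in bits. The function names are hypothetical, and the multi-level expression is the exact average $\log D\_{\max} + \frac{D\_{\max}+1}{2}\log|\mathcal{X}|$ rather than its approximation.

```python
import math

def uni_level_codelength(alphabet_size, d_max):
    """log2 of the total number of patterns when every pattern of depth <= d_max occurs."""
    total = sum(alphabet_size ** d for d in range(1, d_max + 1))
    return math.log2(total)

def multi_level_avg_codelength(alphabet_size, d_max):
    """Depth index (log2 d_max bits) plus the average of d * log2|X| over d = 1..d_max."""
    return math.log2(d_max) + (d_max + 1) / 2 * math.log2(alphabet_size)

# The settings named in the two figure captions: |X| = 100 and D_max = 20.
print(uni_level_codelength(100, 20), multi_level_avg_codelength(100, 20))
```

Sweeping `d_max` with the alphabet size fixed, or the alphabet size with `d_max` fixed, gives curves of the kind compared in the two figures.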

#### *4.2. The General Case*

Given the training sequence $\{x\_n, n = 1, \ldots, L\}$, suppose there are $a\_d = |\mathcal{S}\_{\mathcal{D}}^{(d)}| \le |\mathcal{X}|^d$ patterns of depth $d \le D\_{\max}$ ($a\_1$ patterns of depth one, $a\_2$ patterns of depth two, etc.). The following Theorem 1 shows that the average indexing codelength using a multi-level dictionary never exceeds the indexing codelength of a uni-level dictionary.

**Theorem 1.** *Assume there are $a\_d = |\mathcal{S}\_{\mathcal{D}}^{(d)}| \le |\mathcal{X}|^d$ embedded patterns of depth $1 \le d \le D\_{\max}$ in a training sequence of length $L \gg D\_{\max}$. Let $L^{uni}$ and $L^{multi}$ be the indexing codelength of a uni-level dictionary and the average indexing codelength of a multi-level dictionary, respectively. Then,*

*(1)* $L^{multi} \le L^{uni}$*; and*

*(2)*

$$\log\left(1+\frac{\left(\sqrt{a\_{D\_{\max}}}-\sqrt{a\_1}\right)^2}{D\_{\max}\, a\_{D\_{\max}}}\right) \le L^{uni} - L^{multi} \le \log\left(1+\frac{w a\_1 + (1-w) a\_{D\_{\max}} - a\_1^{w} a\_{D\_{\max}}^{1-w}}{a\_1}\right),$$

*where*

$$w = \frac{\ln\left[\left(\frac{a\_{D\_{\max}}}{a\_{D\_{\max}}-a\_1}\right)\ln\frac{a\_{D\_{\max}}}{a\_1}\right]}{\ln\frac{a\_{D\_{\max}}}{a\_1}}.$$

**Proof.** Since $L \gg D\_{\max}$, clearly $0 < a\_1 \le a\_2 \le \cdots \le a\_{D\_{\max}}$. Using a uni-level dictionary, the indexing codelength is

$$\begin{aligned} L^{uni} &= \log \left( \sum\_{d=1}^{D\_{\max}} a\_d \right) \\ &= \log D\_{\max} + \log A\_{D\_{\max}}, \end{aligned}$$

where $A\_{D\_{\max}} \triangleq (a\_1 + a\_2 + \cdots + a\_{D\_{\max}})/D\_{\max}$ is the arithmetic mean of $a\_1, a\_2, \ldots, a\_{D\_{\max}}$. Using a multi-level dictionary, the average indexing codelength is

$$\begin{split} L^{multi} &= \frac{1}{D\_{\text{max}}} \sum\_{d=1}^{D\_{\text{max}}} (\log D\_{\text{max}} + \log a\_d) \\ &= \log D\_{\text{max}} + \log G\_{D\_{\text{max}}} \end{split} $$

where $G\_{D\_{\max}} \triangleq \left(\prod\_{d=1}^{D\_{\max}} a\_d\right)^{1/D\_{\max}}$ is the geometric mean of $a\_1, a\_2, \ldots, a\_{D\_{\max}}$. Hence, the comparison between $L^{uni}$ and $L^{multi}$ comes down to comparing the arithmetic mean and the geometric mean of $a\_1, a\_2, \ldots, a\_{D\_{\max}}$. By the inequality of arithmetic and geometric means, $A\_{D\_{\max}} \ge G\_{D\_{\max}}$, which establishes the first part of the theorem. For the second part of the theorem, we use lower and upper bounds on $A\_{D\_{\max}} - G\_{D\_{\max}}$ derived in [69]:

$$\frac{\left(\sqrt{a\_{D\_{\max}}} - \sqrt{a\_1}\right)^2}{D\_{\max}} \le A\_{D\_{\max}} - G\_{D\_{\max}} \le w a\_1 + (1 - w) a\_{D\_{\max}} - a\_1^{w} a\_{D\_{\max}}^{1 - w},$$

where $w = \ln\left[\left(a\_{D\_{\max}}/(a\_{D\_{\max}} - a\_1)\right)\ln\left(a\_{D\_{\max}}/a\_1\right)\right] / \ln\left(a\_{D\_{\max}}/a\_1\right)$. Since $a\_1 \le G\_{D\_{\max}} \le a\_{D\_{\max}}$ and $L^{uni} - L^{multi} = \log\left(A\_{D\_{\max}}/G\_{D\_{\max}}\right) = \log\left(1 + (A\_{D\_{\max}} - G\_{D\_{\max}})/G\_{D\_{\max}}\right)$, dividing the bounds above by $G\_{D\_{\max}}$ completes the proof.
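The quantities in this proof can be evaluated on a concrete training sequence. The sketch below counts the distinct patterns at depths 1 to 3 of the sequence used in the examples of this section, then computes both codelengths together with the bounds obtained by dividing the inequality from [69] by $a\_{D\_{\max}}$ (lower side) and $a\_1$ (upper side), as the proof does. The function name is hypothetical, base-2 logarithms are an assumption, and the code requires $a\_1 < a\_{D\_{\max}}$ so that $w$ is defined.

```python
import math

def theorem1_quantities(counts):
    """counts[d-1] = a_d, nondecreasing with a_1 < a_Dmax. Returns values in bits."""
    d_max = len(counts)
    a_1, a_max = counts[0], counts[-1]
    arith = sum(counts) / d_max                                  # arithmetic mean A
    geom = math.exp(sum(math.log(a) for a in counts) / d_max)    # geometric mean G
    l_uni = math.log2(d_max) + math.log2(arith)
    l_multi = math.log2(d_max) + math.log2(geom)
    # Lower bound on A - G, divided by a_Dmax >= G.
    lower = math.log2(1 + (math.sqrt(a_max) - math.sqrt(a_1)) ** 2 / (d_max * a_max))
    # Upper bound on A - G, divided by a_1 <= G.
    log_ratio = math.log(a_max / a_1)
    w = math.log(a_max / (a_max - a_1) * log_ratio) / log_ratio
    gap_bound = w * a_1 + (1 - w) * a_max - a_1 ** w * a_max ** (1 - w)
    upper = math.log2(1 + gap_bound / a_1)
    return l_uni, l_multi, lower, upper

seq = "ABACADABBACCADDABABACADAB"
counts = [len({seq[i:i + d] for i in range(len(seq) - d + 1)}) for d in (1, 2, 3)]
l_uni, l_multi, lower, upper = theorem1_quantities(counts)
print(counts, l_uni - l_multi, lower, upper)
```

On this short sequence the condition $L \gg D\_{\max}$ holds only loosely, but the pattern counts are still nondecreasing in the depth, which is all the computation needs.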

Theorem 1 shows that a multi-level dictionary gives a shorter average indexing codelength than a uni-level dictionary. Here, $\log D\_{\max} + \log a\_d$ is the indexing codelength for patterns of depth $d$, where $a\_d$ is the total number of observed patterns of depth $d$. In order to reduce the indexing codelength even further, the patterns of the same length in each set $\mathcal{S}\_{\mathcal{D}}^{(d)}$ can be ordered according to their relative frequency (empirical probability) in the training sequence. This allows Huffman or Shannon–Fano–Elias source coding [4] to be used to assign prefix codes to patterns in each set $\mathcal{S}\_{\mathcal{D}}^{(d)}$ separately. In this case, for any pattern $x\_1^d \in \mathcal{S}\_{\mathcal{D}}^{(d)}$, the indexing codelength becomes

$$L^{multi}\left(\mathbf{x}\_1^d\right) = \log D\_{\text{max}} + L\_D^{(d)}\left(\mathbf{x}\_1^d\right),\tag{1}$$

where $L\_D^{(d)}\left(x\_1^d\right)$ is the codelength assigned to the pattern $x\_1^d$ based on its empirical probability using a Huffman or Shannon–Fano–Elias encoder. If such encoders are used, the codelength (1) is optimal ([4] Theorem 5.8.1). Since the whole purpose of creating a pattern dictionary is to learn the patterns in the training data, assigning shorter codelengths to the more frequent patterns and longer codelengths to the less frequent patterns in any pattern set $\mathcal{S}\_{\mathcal{D}}^{(d)}$ will improve the efficiency of the coded representation.
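As a rough illustration of Equation (1), the following sketch computes Huffman codeword lengths for the patterns of one depth and adds the $\log D\_{\max}$ bits that identify the depth. The function names and the use of the single-symbol counts of the Example 2 training sequence are choices made for this example; the value `d_max=4` is picked only so that the logarithm is an integer.

```python
import heapq
import math
from collections import Counter
from itertools import count

def huffman_codelengths(freqs):
    """Map each pattern to its Huffman codeword length in bits."""
    if len(freqs) == 1:
        return {next(iter(freqs)): 1}
    tie = count()  # tie-breaker so the heap never has to compare the pattern lists
    heap = [(f, next(tie), [p]) for p, f in freqs.items()]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    while len(heap) > 1:
        f1, _, group1 = heapq.heappop(heap)
        f2, _, group2 = heapq.heappop(heap)
        for p in group1 + group2:
            lengths[p] += 1  # each merge adds one bit to every pattern below it
        heapq.heappush(heap, (f1 + f2, next(tie), group1 + group2))
    return lengths

def indexing_codelength(pattern, freqs_by_depth, d_max):
    """Eq. (1): log2(D_max) bits for the depth plus the Huffman length within that depth."""
    lengths = huffman_codelengths(freqs_by_depth[len(pattern)])
    return math.log2(d_max) + lengths[pattern]

symbol_counts = Counter("ABACADABBACCADDABABACADAB")  # depth-1 patterns only
print(huffman_codelengths(symbol_counts))
print(indexing_codelength("A", {1: symbol_counts}, d_max=4))
```

Frequent patterns receive short codewords, so a test sequence made of common training patterns compresses well, while rare patterns cost more bits.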

**Example 2.** *Suppose the alphabet is* $\mathcal{X} = \{A, B, C, D\}$ *and the training sequence is* $x$ = *ABACADABBACCADDABABACADAB. Table 1 shows the dictionary with* $D\_{\max} = 3$ *created by the patterns inside the training sequence, and the codelength assigned to each pattern using Huffman coding.*

**Table 1.** Filling (training) the dictionary (of maximum depth *Dmax* = 3) with the patterns in the training sequence *ABACADABBACCADDABABACADAB*.
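A minimal sketch of how such a dictionary might be filled is given below. It assumes that every contiguous substring of length 1 to $D\_{\max}$ in the training sequence counts as an observed pattern; that sliding-window convention and the function name `build_pattern_dictionary` are assumptions of this example, not a statement of the exact construction behind Table 1.

```python
from collections import Counter

def build_pattern_dictionary(train, d_max):
    """Return {depth d: Counter of every length-d substring of the training sequence}."""
    return {
        d: Counter(train[i:i + d] for i in range(len(train) - d + 1))
        for d in range(1, d_max + 1)
    }

train = "ABACADABBACCADDABABACADAB"
dictionary = build_pattern_dictionary(train, d_max=3)
for depth, patterns in dictionary.items():
    print(depth, len(patterns), patterns.most_common(3))
```

The per-depth counters can then be passed to a prefix coder so that each depth set gets its own codebook.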


#### *4.3. Pattern Dictionary for Detection (PDD)*

Suppose we want to sequentially compress a test sequence $x\_1^l = \{x\_n, n = 1, \ldots, l\}$ using a trained pattern dictionary $\mathcal{D}$ with maximum depth $D\_{\max} < l$. The encoder parses the test sequence $x\_1^l$ into $c$ phrases, $x\_{v\_1}^{v\_2-1}, x\_{v\_2}^{v\_3-1}, \ldots, x\_{v\_c}^{l}$, where $v\_i$ is the index of the start of the $i$th phrase, and each phrase $x\_{v\_i}^{v\_{i+1}-1}$ is a pattern in the pattern dictionary $\mathcal{D}$. Let $\mathcal{S}\_{\mathcal{D}}\left(x\_1^l\right) = \left\{x\_{v\_1}^{v\_2-1}, x\_{v\_2}^{v\_3-1}, \ldots, x\_{v\_c}^{l}\right\}$ denote the set of the parsed phrases using pattern dictionary $\mathcal{D}$. The parsing process begins with setting $v\_1 = 1$ and finding the largest $v\_2$ with $v\_2 - v\_1 \le D\_{\max}$ and $v\_2 \le l$ such that $x\_{v\_1}^{v\_2-1} \in \mathcal{D}$ but $x\_{v\_1}^{v\_2} \notin \mathcal{D}$. This results in the first phrase $x\_1^{v\_2-1}$. Similarly, the same procedure is performed in order to find the largest $v\_3$ with $v\_3 - v\_2 \le D\_{\max}$ and $v\_3 \le l$ such that $x\_{v\_2}^{v\_3-1} \in \mathcal{D}$ but $x\_{v\_2}^{v\_3} \notin \mathcal{D}$. This type of cross-parsing was first introduced in [48] in order to estimate an empirical relative entropy between two individual sequences that are independent realizations of two finite-order, finite-alphabet and stationary Markov processes. Here, we do not impose such an assumption on the sources generating the sequences.

Algorithm 1 summarizes the procedure of the proposed pattern dictionary (PD) parser. After parsing the whole test sequence $x\_1^l$ into $c$ phrases, $x\_{v\_1}^{v\_2-1}, x\_{v\_2}^{v\_3-1}, \ldots, x\_{v\_c}^{l}$, the codelength will be

$$L\left(\mathbf{x}\_1^l\right) = \sum\_{i=1}^c L\_{\mathcal{D}}\left(\mathbf{x}\_{v\_i}^{v\_{i+1}-1}\right) + c \log D\_{\max}.\tag{2}$$

**Algorithm 1** Pattern Dictionary (PD) Parser

**Require:** Pattern dictionary $\mathcal{D}$, test sequence $x\_1^l$

1: Set $c = 1$, $v\_c = 1$, $d = 1$

2: **while** $v\_c + d - 1 < l$ **do**

3: &emsp;**if** $x\_{v\_c}^{v\_c+d-1} \in \mathcal{S}\_{\mathcal{D}}^{(d)}$ **then**

4: &emsp;&emsp;**if** $d + 1 \le D\_{\max}$ **then**

5: &emsp;&emsp;&emsp;$d = d + 1$

6: &emsp;&emsp;**else**

7: &emsp;&emsp;&emsp;$v\_{c+1} = v\_c + d$

8: &emsp;&emsp;&emsp;$c = c + 1$

9: &emsp;&emsp;&emsp;$d = 1$

10: &emsp;**else**

11: &emsp;&emsp;$v\_{c+1} = v\_c + d - 1$

12: &emsp;&emsp;$c = c + 1$

13: &emsp;&emsp;$d = 1$

**return** $x\_{v\_1}^{v\_2-1}, x\_{v\_2}^{v\_3-1}, \ldots, x\_{v\_c}^{l}$
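The greedy longest-match idea behind Algorithm 1 can be sketched in Python as follows. This is not a line-by-line transcription: indices are 0-based, the dictionary is a plain set of strings, and, as an assumption of this sketch, a symbol that is absent from the dictionary is still emitted as a one-symbol phrase so that the parser always advances.

```python
def pd_parse(seq, dictionary, d_max):
    """Greedily split seq into the longest phrases (length <= d_max) found in dictionary."""
    phrases = []
    start = 0
    while start < len(seq):
        depth = 1
        # Grow the phrase one symbol at a time while the longer candidate is still a known pattern.
        while (depth < d_max
               and start + depth < len(seq)
               and seq[start:start + depth + 1] in dictionary):
            depth += 1
        phrases.append(seq[start:start + depth])
        start += depth
    return phrases

toy_dictionary = {"A", "B", "AB", "ABA"}  # hypothetical trained patterns
print(pd_parse("ABAABB", toy_dictionary, d_max=3))
```

Growing the phrase symbol by symbol relies on the dictionary containing every prefix of every stored pattern, which holds when all substrings up to the maximum depth are stored.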

For detection purposes, on a test sequence $x\_1^l$, either the number of parsed phrases or the codelength can be used as an anomaly score with respect to the trained pattern dictionary $\mathcal{D}$. In other words, for any test sequence $x\_1^l$ and a given pattern dictionary, if the number of parsed phrases $\left|\mathcal{S}\_{\mathcal{D}}\left(x\_1^l\right)\right|$ or the codelength $L\left(x\_1^l\right)$ in Equation (2) is greater than a certain threshold, then $x\_1^l$ is declared to be anomalous. While the proposed pattern dictionary technique can be used as a stand-alone anomaly detection technique, below we show how it can be used for atypicality detection [2,3] as a training-based fixed source coder (data-dependent encoder).
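The thresholding rule on the codelength of Equation (2) might look like the sketch below. The per-pattern bit lengths, the phrase list, and the threshold are all hypothetical inputs; in practice they would come from the trained dictionary, the parser, and a calibration step.

```python
import math

def pd_codelength(phrases, pattern_bits, d_max):
    """Eq. (2): sum of per-phrase codelengths plus log2(D_max) bits per phrase."""
    return sum(pattern_bits[p] for p in phrases) + len(phrases) * math.log2(d_max)

def is_anomalous(phrases, pattern_bits, d_max, threshold):
    """Flag the sequence when its codelength exceeds the chosen threshold."""
    return pd_codelength(phrases, pattern_bits, d_max) > threshold

phrases = ["ABA", "AB", "B"]                  # hypothetical parse of a test sequence
pattern_bits = {"ABA": 1, "AB": 2, "B": 1}    # hypothetical within-depth codeword lengths
print(pd_codelength(phrases, pattern_bits, d_max=4))
print(is_anomalous(phrases, pattern_bits, d_max=4, threshold=8.0))
```

Using `len(phrases)` alone as the score gives the phrase-count variant mentioned in the text.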

#### **5. Pattern Dictionary-Based Atypicality (PDA)**

In [2,3], an *atypicality framework* was introduced as a data discovery and anomaly detection framework that is based on a central definition: "a sequence (or subsequence) is atypical if it can be described (coded) with fewer bits in itself rather than using the (optimum) code for typical sequences". In this framework, detection is based on the comparison of a lossless descriptive codelength between an optimum encoder (if the typical model is known) or a training-based fixed source coder (if the typical model is unknown, but training data are available) and a universal source coder in order to detect atypical subsequences in the data [2,3]. In this section, we apply our proposed pattern dictionary as a training-based fixed source coder (typical encoder) in an atypicality framework. We call this the pattern dictionary-based atypicality (PDA) method.

The pattern dictionary-based source coder can be considered as a generalization of the Context Tree [70–72] based fixed source coder that was used in [2] for discrete data. The universal source coder (atypical encoder) used here is the Tree-Structured Lempel–Ziv (LZ78) [4,5]. The primary reason for choosing LZ78 as the universal encoder is that its sequential parsing procedure is similar to the proposed pattern dictionary described in Section 4, and it is (asymptotically) optimal [4,5]. One might ask why we even need to compare the descriptive codelengths of a training-based (or optimum) encoder with a universal encoder for data discovery purposes when, as alluded to at the end of the previous section, a training-based fixed source coder can be a stand-alone anomaly detector. The necessity of such concurrent comparison is articulated in [2]. In fact, such a codelength comparison enables the atypicality framework to go beyond the detection of anomalies and outliers, extending to the detection of *rare* parts of data that might have a data structure of interest to the practitioner.

We give an example to provide further intuition for why anomaly detection can benefit from our framework, which compares the outputs of a typical encoder and an atypical encoder. Consider an i.i.d. binary sequence of length $L$ with $P(X = 1) = p$ in which there is embedded an anomalous subsequence of length $l \ll L$ with $P(X = 1) = \tilde{p} \neq p$ that we would like to detect. If $p = \frac{1}{2}$ and $\tilde{p} = 1$, the typical encoder cannot catch the anomaly while the atypical encoder can. On the other hand, if $p = \frac{1}{3}$ and $\tilde{p} = \frac{2}{3}$, the typical encoder identifies the anomaly while an atypical encoder fails to do so (since the entropy for $p = \frac{1}{3}$ and $\tilde{p} = \frac{2}{3}$ is the same). Note that in both cases, our framework would catch the anomaly since it uses the difference between the descriptive codelengths of these two encoders.
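The two cases in this example can be made concrete with idealized per-symbol codelengths. The sketch below assumes that the typical encoder spends the cross-entropy between the anomalous source and the typical model, and that the atypical (universal) encoder approaches the entropy of whatever source it sees; both are idealizations, and `p_tilde` is simply the name used here for the anomalous probability.

```python
import math

def entropy(p):
    """Binary entropy in bits per symbol."""
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

def cross_entropy(p_data, p_model):
    """Bits per symbol when Bernoulli(p_data) data are coded with the ideal code for Bernoulli(p_model)."""
    total = 0.0
    for q_data, q_model in ((p_data, p_model), (1 - p_data, 1 - p_model)):
        if q_data > 0:
            total -= q_data * math.log2(q_model)
    return total

for p, p_tilde in ((1 / 2, 1.0), (1 / 3, 2 / 3)):
    typical_shift = cross_entropy(p_tilde, p) - entropy(p)  # change seen by the typical encoder
    atypical_shift = entropy(p_tilde) - entropy(p)          # change seen by the universal encoder
    print(f"p={p:.3f}, p_tilde={p_tilde:.3f}: typical {typical_shift:+.3f}, atypical {atypical_shift:+.3f}")
```

In the first case only the universal codelength changes, in the second only the typical one, so the difference of the two codelengths moves in both cases.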

Recall that in Section 4, we supposed that a test sequence $x\_1^l$ has been parsed using a trained pattern dictionary $\mathcal{D}$ with maximum depth $D\_{\max} < l$. This parsing results in $\left|\mathcal{S}\_{\mathcal{D}}\left(x\_1^l\right)\right|$ parsed phrases. Using Equation (2), the typical codelength of the sequence $x\_1^l$ is given by

$$L\_T\left(\mathbf{x}\_1^l\right) = \sum\_{y \in \mathcal{S}\_{\mathcal{D}}\left(\mathbf{x}\_1^l\right)} L\_{\mathcal{D}}\left(y\right) + \left| \mathcal{S}\_{\mathcal{D}}\left(\mathbf{x}\_1^l\right) \right| \log D\_{\max}.$$

For the atypical encoder, the LZ78 algorithm results in a distinct parsing of the test sequence $x\_1^l$. Let $\mathcal{S}\_{LZ}\left(x\_1^l\right)$ denote the set of parsed phrases in the LZ78 parsing of $x\_1^l$. As such, the resulting atypical codelength is [4,5]

$$L\_A\left(\mathbf{x}\_1^l\right) = \left|\mathcal{S}\_{LZ}\left(\mathbf{x}\_1^l\right)\right| \left[\log \left|\mathcal{S}\_{LZ}\left(\mathbf{x}\_1^l\right)\right| + 1\right].$$
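For intuition, an LZ78 parser and the resulting codelength $c(\log c + 1)$ can be sketched as below. Base-2 logarithms are assumed, and a trailing fragment that repeats an earlier phrase is simply kept as a final phrase, which is a convention chosen for this sketch.

```python
import math

def lz78_parse(seq):
    """Split seq into LZ78 phrases: each phrase is the shortest prefix not seen before."""
    phrases, seen, current = [], set(), ""
    for symbol in seq:
        current += symbol
        if current not in seen:
            seen.add(current)
            phrases.append(current)
            current = ""
    if current:  # leftover symbols that repeat an earlier phrase
        phrases.append(current)
    return phrases

def atypical_codelength(seq):
    """L_A = c * (log2(c) + 1), where c is the number of LZ78 phrases."""
    c = len(lz78_parse(seq))
    return c * (math.log2(c) + 1)

print(lz78_parse("1011010100010"))
print(round(atypical_codelength("1011010100010"), 3))
```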

Since $L\left(x\_1^l\right)$ using both LZ78 and the pattern dictionary depends on the number of parsed phrases, we investigate the possible range and properties of $\left|\mathcal{S}\_{\mathcal{D}}\left(x\_1^l\right)\right| - \left|\mathcal{S}\_{LZ}\left(x\_1^l\right)\right|$. While the LZ78 encoder is a well-known compression method which is asymptotically optimal [4,5], its non-asymptotic behavior is not well understood. In the next section, we establish a novel non-asymptotic property of an LZ78 parser, and then compare it with the pattern dictionary parser.

#### *5.1. Lempel–Ziv Parser*

We start this section with a theorem that establishes the non-asymptotic lower and upper bounds on the number of distinct phrases in a sequence parsed by LZ78.

**Theorem 2.** *The number of distinct phrases* $c(l)$ *resulting from LZ78 parsing of an* $|\mathcal{X}|$*-ary sequence* $x\_1^l = \{x\_n, n = 1, \ldots, l\}$ *satisfies*

$$\frac{1}{2} \left( \sqrt{8l+1} - 1 \right) \le c(l) \le \frac{l \ln|\mathcal{X}|}{\mathcal{W} \left( \frac{\beta}{\alpha} |\mathcal{X}|^{-\frac{\alpha+1}{\alpha}} \ln|\mathcal{X}| \right)},$$

*where* $\alpha = |\mathcal{X}| - 1$*,* $\beta = (|\mathcal{X}| - 1)^2 l - |\mathcal{X}|$*, and* $\mathcal{W}(\cdot)$ *is the Lambert W function [6].*

**Proof.** First, we establish the upper bound. Note that the number of parsed distinct phrases $c(l)$ is maximized when all the phrases are as short as possible. Define $M \triangleq |\mathcal{X}|$ and let $l\_k$ be the sum of the lengths of all distinct strings of length less than or equal to $k$. Then,

$$l\_k = \sum\_{j=1}^k j M^j = \frac{1}{\left(M - 1\right)^2} \left[ \left\{ (M - 1)k - 1 \right\} M^{k+1} + M \right].$$

Since $l = l\_k$ occurs when all the phrases are of length $\le k$,

$$c(l\_k) \le \sum\_{j=1}^k M^j = \frac{M\left(M^k - 1\right)}{M - 1} < \frac{M^{k+1}}{M - 1} \le \frac{l\_k}{k - \frac{1}{M - 1}}.$$

If $l\_k \le l < l\_{k+1}$, we write $l = l\_k + \triangle$, where

$$\begin{aligned} \triangle &< l\_{k+1} - l\_k = (Mk + M - 1 - k) \frac{M^{k+1}}{M - 1} \\ &= (k + 1) M^{k+1} .\end{aligned}$$

We conclude that the parsing ends up with $c(l\_k)$ phrases of length $\le k$ and $\frac{l - l\_k}{k+1}$ phrases of length $k + 1$. Therefore,

$$c(l) \le c(l\_k) + \frac{l - l\_k}{k + 1} \le \frac{l\_k}{k - \frac{1}{M - 1}} + \frac{\triangle}{k + 1}$$

$$\le \frac{l\_k + \triangle}{k - \frac{1}{M - 1}} = \frac{l}{k - \frac{1}{M - 1}}.\tag{3}$$

We now bound the size of $k$ for a given sequence of length $l$ by setting $l = l\_k$. Define $\alpha \triangleq M - 1$ and $\beta \triangleq (M - 1)^2 l - M$. Then,

$$\begin{split} &\frac{1}{\left(M-1\right)^{2}}\left[\left((M-1)k-1\right)M^{k+1}+M\right]=l \\ &\Longleftrightarrow \left((M-1)k-1\right)M^{k+1}=\left(M-1\right)^{2}l-M \\ &\Longleftrightarrow \left(\alpha k-1\right)M^{k+1}=\beta \\ &\Longleftrightarrow \hat{k}M^{\left(\hat{k}+1\right)/\alpha+1}=\beta \\ &\Longleftrightarrow \hat{k}\frac{\ln M}{\alpha}\exp\left(\hat{k}\frac{\ln M}{\alpha}\right)=\frac{\beta}{\alpha}M^{-1-1/\alpha}\ln M, \end{split}$$

where $\hat{k} = \alpha k - 1$. The last equation can be solved using the Lambert W function [6]. Since all the involved numbers are real and, for $M > 1$ and $l \ge 2$, we have $\frac{\beta}{\alpha} M^{-1-1/\alpha} \ln M \ge 0 > -\frac{1}{e}$, it follows that

$$
\hat{k}\frac{\ln M}{\alpha} = \mathcal{W}\left(\frac{\beta}{\alpha}M^{-1-1/\alpha}\ln M\right)
$$

$$
\Longleftrightarrow k = \frac{\alpha \mathcal{W}\left(\frac{\beta}{\alpha}M^{-1-1/\alpha}\ln M\right) + \ln M}{\alpha \ln M}
$$

where $\mathcal{W}(\cdot)$ is the Lambert W function. Using Equation (3), we write

$$c(l) \le \frac{l}{k - \frac{1}{\alpha}} = \frac{l \ln M}{\mathsf{W} \left(\frac{\beta}{\alpha} M^{-1 - 1/\alpha} \ln M\right)}.$$

To prove the lower bound, note that the number of parsed distinct phrases $c(l)$ is minimized when the sequence of length $l$ consists of only one symbol that repeats. Let $\tilde{l}\_k$ be the sum of the lengths of all such distinct strings of length less than or equal to $k$. Then,

$$
\tilde{l}\_k = \sum\_{j=1}^k j = \frac{k(k+1)}{2}.
$$

Thus, for a sequence of length $l$, enforcing $l = \frac{k(k+1)}{2}$ and solving the quadratic $k^2 + k - 2l = 0$ for its positive root gives $k = \frac{1}{2}\left(\sqrt{8l+1} - 1\right)$, which is the lower bound.
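The two bounds of Theorem 2 are easy to evaluate numerically. The sketch below computes the principal branch of the Lambert W function with a plain Newton iteration so that it needs only the standard library; the helper names and the choice of starting point are assumptions of this example, and a library routine could be used instead.

```python
import math

def lambert_w(x, tol=1e-12):
    """Principal branch W(x) for x >= 0, solving w * exp(w) = x by Newton's method."""
    if x < 0:
        raise ValueError("this sketch only handles x >= 0")
    w = math.log1p(x)  # starts at or above the root, so the iteration decreases monotonically
    for _ in range(100):
        ew = math.exp(w)
        step = (w * ew - x) / (ew * (w + 1))
        w -= step
        if abs(step) < tol * (1 + abs(w)):
            break
    return w

def lz78_phrase_bounds(l, alphabet_size):
    """Lower and upper bounds of Theorem 2 on the number of distinct LZ78 phrases."""
    M = alphabet_size
    alpha = M - 1
    beta = (M - 1) ** 2 * l - M
    lower = (math.sqrt(8 * l + 1) - 1) / 2
    w = lambert_w(beta / alpha * M ** (-1 - 1 / alpha) * math.log(M))
    upper = math.inf if w == 0 else l * math.log(M) / w
    return lower, upper

print(lz78_phrase_bounds(13, 2))    # a binary sequence of length 13
print(lz78_phrase_bounds(1000, 4))  # a quaternary sequence of length 1000
```

For the length-13 binary string 1011010100010, whose LZ78 parse has 7 phrases, the count falls between the two values printed on the first line.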

Figure 3 illustrates the lower and upper bounds established in Theorem 2 against the sequence length for various alphabet sizes. Note that the lower bound on the number of distinct phrases is independent of the alphabet size.

While numerical experiments are not a substitute for the mathematical proof of Theorem 2 provided above, the reader may find it useful to understand the theorem in terms of a simple example. In Figures 4–6, we compare the theoretical bound with numerical results of simulation for binary i.i.d. sequences. In these experiments, for each value of *P*(*X* = 1), a thousand binary sequences are generated; then, the number of distinct phrases resulting from LZ78 parsing of each sequence is calculated, and hence, the average, minimum, and maximum of these counts are found and represented by error bars.

**Figure 3.** Plot of the lower and upper bounds of Theorem 2 on the number of distinct phrases resulting from LZ78-parsing of an |X |-ary sequence of length *l*.

**Figure 4.** Simulation results compared to the lower and upper bounds of Theorem 2 on the number of distinct phrases resulting from LZ78-parsing of binary sequences of length *l* generated by sources with three different source probabilities *P*(*X* = 1). For every *P*(*X* = 1), one thousand binary sequences of length *l* are generated. Error bars represent the maximum, minimum, and average number of distinct phrases.

Next, we verify the convergence of the non-asymptotic upper bound achieved in Theorem 2 to the asymptotic upper bound of the LZ78 parser. Using a lower bound on the Lambert W function, $\ln x - \ln(\ln x) \le \mathcal{W}(x)$ [73], we write

$$\begin{aligned} \mathcal{W}\left(\frac{\beta}{\alpha}\frac{\ln M}{M^{1+1/\alpha}}\right) &= \mathcal{W}\left(\left((M-1)l - \frac{M}{M-1}\right)\frac{\ln M}{M^{\frac{M}{M-1}}}\right) \\ &\approx \mathcal{W}(c\_M l \ln M) \\ &\geq \ln \frac{c\_M l \ln M}{\ln(c\_M l \ln M)} \\ &= \ln \frac{c\_M l}{\log(c\_M l \ln M)} \end{aligned}$$

where the logarithm is base $M = |\mathcal{X}|$ and $c\_M = \frac{M-1}{M^{M/(M-1)}}$. Hence, we can further simplify the asymptotic upper bound of $c(l)$ as follows:

$$\begin{split} c(l) &\leq \frac{l \ln M}{\mathcal{W} \left(\frac{\beta}{\alpha} M^{-1-1/\alpha} \ln M\right)}\\ &\leq \frac{l \ln M}{\ln \frac{c\_M l}{\log \left(c\_M l \ln M\right)}}\\ &= \frac{l}{\log \frac{c\_M l}{\log \left(c\_M l \ln M\right)}}\\ &= \frac{l}{\log l + \log c\_M - \log \log \left(c\_M l \ln M\right)}\\ &= \frac{l}{\left(1 - \frac{\log \log l + \tilde{c}\_M}{\log l}\right) \log l} \end{split}$$

where $\tilde{c}\_M = \log c\_M - \log \log(c\_M \ln M)$. Therefore, as $l \to \infty$, we have $c(l) \le \frac{l}{\log l}$. This is consistent with the binary case $M = 2$ proved in ([4] Lemma 13.5.3) or [5]. The following lemma extends the result of ([4] Lemma 13.5.3) to the $|\mathcal{X}|$-ary case.

**Lemma 1.** *The number of distinct phrases* $c(l)$ *resulting from LZ78-parsing of an* $|\mathcal{X}|$*-ary sequence* $x\_1^l = \{x\_n, n = 1, \ldots, l\}$ *satisfies*

$$c(l) \le \frac{l}{(1 - \epsilon\_l)\log l},$$

*where the logarithm is base* $|\mathcal{X}|$ *and* $\epsilon\_l = \min\left\{1, \frac{\log\log l - \log(|\mathcal{X}| - 1) + \frac{3|\mathcal{X}| - 2}{|\mathcal{X}| - 1}}{\log l}\right\} \to 0$ *as* $l \to \infty$*.*

**Proof.** The proof is similar to the proof in ([4] Lemma 13.5.3) or ([74] Theorem 2). Let $M \triangleq |\mathcal{X}|$. In Theorem 2, we defined $l\_k$ as the sum of the lengths of all distinct strings of length less than or equal to $k$, and we showed that for any given $l$ such that $l\_k \le l < l\_{k+1}$, we have $c(l) \le c(l\_k) + \frac{l - l\_k}{k+1} \le \frac{l}{k - \frac{1}{M-1}}$. Next, we bound the size of $k$. We have $l \ge l\_k \ge M^k$ or, equivalently, $k \le \log l$, where the logarithm is base $M$. Additionally,

$$\begin{aligned} l \le l\_{k+1} &= \left(k + 1 - \frac{1}{M - 1}\right) \frac{M^{k+2}}{M - 1} + \frac{M}{\left(M - 1\right)^2} \\ &= \left(\frac{k}{M - 1} + \frac{M - 2}{\left(M - 1\right)^2}\right) M^{k+2} + \frac{M}{\left(M - 1\right)^2} \\ &\le \frac{k + 2}{M - 1} M^{k + 2} \le \frac{\log l + 2}{M - 1} M^{k + 2}, \end{aligned}$$

therefore, $k + 2 \ge \log\frac{(M-1)l}{\log l + 2}$. Equivalently, for $l \ge M^2$,

$$\begin{aligned} k - \frac{1}{M - 1} &\geq \log l - \log(\log l + 2) + \log(M - 1) - 2 - \frac{1}{M - 1} \\ &= \left( 1 - \frac{\log(\log l + 2) - \log(M - 1) + \frac{2M - 1}{M - 1}}{\log l} \right) \log l \\ &\geq \left( 1 - \frac{\log(2 \log l) - \log(M - 1) + \frac{2M - 1}{M - 1}}{\log l} \right) \log l \\ &= \left( 1 - \frac{\log \log l - \log(M - 1) + \frac{3M - 2}{M - 1}}{\log l} \right) \log l \\ &= (1 - \epsilon\_l) \log l, \end{aligned}$$

where $\epsilon\_l = \min\left\{1, \frac{\log\log l - \log(M-1) + \frac{3M-2}{M-1}}{\log l}\right\}$.
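To see how slowly the correction term in Lemma 1 decays, the bound can be evaluated for a few lengths. The following sketch uses base-$M$ logarithms as in the lemma; the function name and the convention of returning infinity when the correction term saturates at 1 are choices made for this example.

```python
import math

def lemma1_bound(l, alphabet_size):
    """Return (eps_l, l / ((1 - eps_l) * log_M l)) for l >= M**2."""
    M = alphabet_size
    log_l = math.log(l, M)
    eps = min(1.0, (math.log(log_l, M) - math.log(M - 1, M) + (3 * M - 2) / (M - 1)) / log_l)
    bound = math.inf if eps >= 1.0 else l / ((1 - eps) * log_l)
    return eps, bound

for length in (10**3, 10**6, 10**12):
    eps, bound = lemma1_bound(length, 2)
    print(f"l={length:.0e}: eps_l={eps:.3f}, bound={bound:.3e}")
```

For short sequences the correction term saturates and the bound is vacuous, which is one reason the non-asymptotic bound of Theorem 2 is useful.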

Next, we analyze the properties of the number of distinct phrases $c(l)$ resulting from LZ78-parsing of an $|\mathcal{X}|$-ary sequence $x\_1^l = \{x\_n, n = 1, \ldots, l\}$ when $l$ is fixed. The error bar representation in Figure 4 shows the variation of $c(l)$ when $l$ is fixed. A possible explanation for such variations is that the statistical distribution of the pseudorandomly generated data is different from the theoretical distribution of the generating source. To examine this possibility, we enforce the exact matching of the source probability mass function and the empirical probability mass function of the generated data. Figure 5 represents the number of distinct phrases $c(l)$ resulting from LZ78-parsing of a binary sequence of fixed length where the characteristics of the generating source and the generated data match. As seen, there is still some variation around the average value of $c(l)$. We can specify a distribution-dependent bound on $c(l)$ when both $l$ and the distribution of the source are fixed.

In ([75] Theorem 1), for sequences generated from a memoryless source, *c*(*l*) is assumed to be a random variable with the following mean and variance:

$$\begin{aligned} \operatorname{E}(c(l)) &\sim \frac{hl}{\log l},\\ \operatorname{Var}(c(l)) &\sim \frac{\left(h\_2 - h^2\right)l}{\log^2 l}, \end{aligned} \tag{4}$$

where $h = -\sum\_{a \in \mathcal{X}} p\_a \log p\_a$ is the entropy rate, and $h\_2 = \sum\_{a \in \mathcal{X}} p\_a \log^2 p\_a$, with $p\_a$ being the probability of symbol $a \in \mathcal{X}$. Note that the approximations (4) are asymptotic as $l \to \infty$. Below, we obtain a finite sample characterization of $c(l)$.
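The asymptotic expressions in (4) can be evaluated directly for a given memoryless source. The sketch below uses base-2 logarithms throughout; both ratios are unchanged by the choice of base as long as the same base is used for $h$, $h\_2$, and $\log l$. The function name is hypothetical.

```python
import math

def asymptotic_phrase_stats(l, probs):
    """Asymptotic mean and variance of the LZ78 phrase count for a memoryless source, per (4)."""
    h = -sum(p * math.log2(p) for p in probs if p > 0)
    h2 = sum(p * math.log2(p) ** 2 for p in probs if p > 0)
    mean = h * l / math.log2(l)
    variance = (h2 - h ** 2) * l / math.log2(l) ** 2
    return mean, variance

print(asymptotic_phrase_stats(1024, [0.5, 0.5]))
print(asymptotic_phrase_stats(1024, [0.9, 0.1]))
```

For a uniform source the leading variance term is zero, so the variation observed in simulations at finite length is not captured by these first-order expressions.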

**Figure 5.** Similar to Figure 4, the number of distinct phrases resulting from LZ78-parsing of binary sequences of fixed length *l* = 1000 varies over the source probability parameter *P*(*X* = 1). For every *P*(*X* = 1), one thousand binary sequences of length *l* are generated. Error bars represent the maximum, minimum, and average number of distinct phrases.

Consider an $|\mathcal{X}|$-ary sequence $x\_1^l = \{x\_n, n = 1, \ldots, l\}$ with fixed length $l$ generated from a source with the probability mass function $p(x)$. Here, the notations $x\_1^l$ and $x^l$ are used interchangeably. Let $c(l, p)$ denote the number of distinct phrases resulting from LZ78-parsing of the sequence $x\_1^l$ of length $l$ when the generating probability mass function is $p(x)$. In order to find a distribution-dependent bound on the number of distinct phrases in LZ78-based parsing of $x\_1^l$, we note that since the generating distribution is not necessarily uniform, not all the strings $x^n$ for $n < l \ll \infty$ necessarily appear as parsed phrases. For instance, consider the binary case with $P(X = 1) = 0.9$. Then, it is very unlikely to have a string with multiple consecutive zeros in any parsing of a realization of the finite sequence $x^l$. As such, using the Asymptotic Equipartition Property (AEP) ([4] Chapter 3) or the Non-asymptotic Equipartition Property (NEP) [76], we define the *typical set* $\mathcal{A}\_{\epsilon}^{(n)}$ with respect to $p(x)$ as the set of subsequences $x^n \in \mathcal{X}^n$ of $x\_1^l$ with the property

$$2^{-n(h+\epsilon)} \le p(x^n) \le 2^{-n(h-\epsilon)},$$

where *h* is the entropy. Then, we have

$$1 = \sum\_{x^n \in \mathcal{X}^n} p(x^n) \ge \sum\_{x^n \in \mathcal{A}\_{\epsilon}^{(n)}} p(x^n) \ge \left| \mathcal{A}\_{\epsilon}^{(n)} \right| 2^{-n(h+\epsilon)},$$

therefore, $\left|\mathcal{A}\_{\epsilon}^{(n)}\right| \le 2^{n(h+\epsilon)}$. Let $l\_k$ be the sum of the lengths of all the distinct strings $x^n$ in the set $\mathcal{A}\_{\epsilon}^{(n)}$ of length less than or equal to $k$. We write

$$\begin{aligned} l\_k &= \sum\_{n=1}^k n \left| \mathcal{A}\_{\epsilon}^{(n)} \right| \\ &\le \sum\_{n=1}^k n 2^{n(h+\epsilon)} \\ &= \frac{1}{(m-1)^2} \left[ ((m-1)k-1)m^{k+1} + m \right], \end{aligned}$$

where $m \triangleq 2^{h+\epsilon}$. Therefore, $l = \frac{1}{(m-1)^2}\left[((m-1)k-1)m^{k+1} + m\right]$ can be solved for $k$, which leads to an upper bound for $c(l, p)$ as follows:

$$\begin{aligned} k &= \frac{\alpha\mathcal{W}\left(\frac{\beta}{\alpha}m^{-1-1/\alpha}\ln m\right) + \ln m}{\alpha\ln m}, \\ c(l, p) &\leq \sum\_{n=1}^{k} \left| \mathcal{A}\_{\epsilon}^{(n)} \right| \le \frac{m\left(m^{k} - 1\right)}{m - 1} \\ &= \frac{2^{k(h+\epsilon)} - 1}{1 - 2^{-h-\epsilon}}, \end{aligned}$$

where $\alpha = m - 1$ and $\beta = (m - 1)^2 l - m$. Therefore, the upper bound on $c(l, p)$ depends on the distribution only through the entropy. Figure 6 depicts the upper bound on $c(l, p)$ for $\epsilon = 0.1$.
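The distribution-dependent bound can be evaluated for a binary source as sketched below. Instead of the closed-form Lambert W solution, this example solves the length equation for $k$ by bisection, which is a swapped-in numerical technique that targets the same equation; the function names are hypothetical and $\epsilon = 0.1$ is used as in Figure 6.

```python
import math

def binary_entropy(p):
    """Entropy of a Bernoulli(p) source in bits."""
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

def total_length(k, m):
    """Closed form of sum_{n=1}^{k} n * m**n, extended to real k."""
    return (((m - 1) * k - 1) * m ** (k + 1) + m) / (m - 1) ** 2

def phrase_count_bound(l, p, eps=0.1):
    """Solve total_length(k, m) = l for k by bisection, then return (k, bound on c(l, p))."""
    m = 2 ** (binary_entropy(p) + eps)
    lo, hi = 0.0, 1.0
    while total_length(hi, m) < l:  # total_length is increasing in k for k >= 0
        hi *= 2
    for _ in range(200):
        mid = (lo + hi) / 2
        if total_length(mid, m) < l:
            lo = mid
        else:
            hi = mid
    k = (lo + hi) / 2
    return k, m * (m ** k - 1) / (m - 1)

for p in (0.5, 0.7, 0.9):
    k, bound = phrase_count_bound(100, p)
    print(f"P(X=1)={p}: k={k:.3f}, bound={bound:.2f}")
```

Lower-entropy sources give a smaller effective alphabet size `m` and therefore a smaller bound on the number of distinct phrases.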

**Figure 6.** Simulation of the probability-dependent upper bound *c*(*l*, *p*) for binary sequences of fixed length *l* = 100 with various probability parameters *P*(*X* = 1). For every *P*(*X* = 1), one thousand binary sequences of length *l* are generated. Error bars represent the maximum, minimum, and average number of distinct phrases.

#### *5.2. Pattern Dictionary Parser versus LZ78 Parser*

Given an $|\mathcal{X}|$-ary sequence $x\_1^l = \{x\_n, n = 1, \ldots, l\}$, let $c\_T(l)$ be the number of parsed phrases of $x\_1^l$ when the typical encoder (pattern dictionary with $D\_{\max}$) is used, and $c\_A(l)$ be the number of parsed phrases of $x\_1^l$ when the atypical encoder (LZ78) is used. Clearly, $\frac{l}{D\_{\max}} \le c\_T(l) \le l$, where the lower bound is achieved when $\mathcal{S}\_{\mathcal{D}}\left(x\_1^l\right) = \left\{x\_{v\_1}^{v\_2-1}, x\_{v\_2}^{v\_3-1}, \ldots, x\_{v\_c}^{l}\right\}$ and each $x\_{v\_i}^{v\_{i+1}-1} \in \mathcal{S}\_{\mathcal{D}}^{(D\_{\max})}$, namely each phrase $x\_{v\_i}^{v\_{i+1}-1}$ is of length $D\_{\max}$ and exists in the dictionary. The upper bound is achieved when $\mathcal{S}\_{\mathcal{D}}\left(x\_1^l\right) = \{x\_1, x\_2, \ldots, x\_l\}$, where each $x\_n \in \mathcal{S}\_{\mathcal{D}}^{(1)}$. Using the result of Theorem 2 and a lower bound on the Lambert W function, $\ln x - \ln(\ln x) \le \mathcal{W}(x)$ [73], we have

$$\frac{l}{D\_{\max}} \left( 1 - \frac{D\_{\max}}{\log \frac{l}{\log(l \ln |\mathcal{X}|)}} \right) \le c\_T(l) - c\_A(l) \le l \left( 1 - \frac{\sqrt{8l + 1} - 1}{2l} \right). \tag{5}$$

The above bounds have asymptotic and non-asymptotic implications. The asymptotic analysis of the bounds in (5) suggests that as $l \to \infty$, for a dictionary with fixed $D\_{\max}$, we have $\frac{l}{D\_{\max}} \le c\_T(l) - c\_A(l) \le l$. This inequality implies the asymptotic dominance of the parser using a typical encoder. This is to be expected due to the asymptotic optimality of LZ78. However, the above inequality also implies a more interesting result: if $D\_{\max} > \log \frac{l}{\log(l \ln|\mathcal{X}|)}$ as $l \to \infty$, then $c\_T(l)$ can be smaller than $c\_A(l)$. The non-asymptotic behavior of the bounds in (5) is more relevant to the anomaly detection problem. These bounds suggest that for a fixed $l$ and $|\mathcal{X}|$, increasing $D\_{\max}$ has a vanishing effect on the possible range of the anomaly score. Additionally, the achieved bounds on $c\_T(l) - c\_A(l)$ provide the range of values of the anomaly score. This facilitates the search for a data-dependent threshold for anomaly detection, as the search can be restricted to this range.
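The range in (5) can be tabulated for a fixed sequence length to see how it reacts to the dictionary depth. This sketch uses base-$|\mathcal{X}|$ logarithms for the lower bound, as in the surrounding derivation; the function name and the example values are assumptions.

```python
import math

def score_range(l, alphabet_size, d_max):
    """Lower and upper bounds of (5) on c_T(l) - c_A(l)."""
    M = alphabet_size
    inner = l / math.log(l * math.log(M), M)
    lower = (l / d_max) * (1 - d_max / math.log(inner, M))
    upper = l - (math.sqrt(8 * l + 1) - 1) / 2
    return lower, upper

for d_max in (5, 10, 20, 40):
    lower, upper = score_range(1000, 2, d_max)
    print(f"D_max={d_max}: {lower:.1f} <= c_T - c_A <= {upper:.1f}")
```

The upper end does not depend on the depth, and the lower end changes less and less as the depth grows, which matches the vanishing effect noted above.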

#### *5.3. Atypicality Criterion for Detection of Anomalous Subsequences*

Consider the problem of finding the atypical (anomalous) subsequences of a long sequence with respect to a trained pattern dictionary $\mathcal{D}$. Suppose we are looking for an infrequent anomalous subsequence $x\_n^{n+l-1} = \{x\_{n'}, n' = n, \ldots, n + l - 1\}$ embedded in a test sequence $\{x\_n, n = 1, \ldots, L\}$ from the finite alphabet $\mathcal{X}$. Using Equation (2), the typical codelength of the subsequence $x\_n^{n+l-1}$ is

$$L\_T\left(\mathbf{x}\_n^{n+l-1}\right) = \sum\_{y \in \mathcal{S}\_{\mathcal{D}}\left(\mathbf{x}\_n^{n+l-1}\right)} L\_{\mathcal{D}}(y) + \left| \mathcal{S}\_{\mathcal{D}}\left(\mathbf{x}\_n^{n+l-1}\right) \right| \log D\_{\max},$$

while using LZ78, the atypical codelength of the subsequence $x\_n^{n+l-1}$ is

$$\begin{aligned} L\_A \left( \mathbf{x}\_n^{n+l-1} \right) &= \left| \mathcal{S}\_{LZ} \left( \mathbf{x}\_n^{n+l-1} \right) \right| \left[ \log \left| \mathcal{S}\_{LZ} \left( \mathbf{x}\_n^{n+l-1} \right) \right| + 1 \right] \\ &+ \log^\*(l) + \tau, \end{aligned}$$

where $\log^\*(l) + \tau$ is an additive penalty for not knowing in advance the start and end points of the anomalous sequence [2,3], and $\log^\*(l) = \log l + \log \log l + \cdots$, where the sum continues as long as the argument to the outer log is positive. Let $L\_A' = L\_A - \tau$. We propose the following atypicality criterion for detection of an anomalous subsequence:

$$\triangle L(n) = \max\_{l} \left\{ L\_T \left( \mathbf{x}\_n^{n+l-1} \right) - L\_A^{'} \left( \mathbf{x}\_n^{n+l-1} \right) \right\} > \tau,\tag{6}$$

where *τ* can be treated as an anomaly detection threshold. In practice, *τ* can be set to ensure a false positive constraint, e.g., using bootstrap estimation of the quantiles in the training data.
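The pieces of criterion (6) can be assembled as in the sketch below. It assumes base-2 logarithms and takes, for each candidate length $l$, a precomputed typical codelength and LZ78 phrase count as inputs; those dictionaries and the function names are hypothetical stand-ins for the outputs of the two encoders.

```python
import math

def log_star(l):
    """log*(l) = log l + log log l + ..., keeping only the positive terms (base 2)."""
    total, term = 0.0, math.log2(l)
    while term > 0:
        total += term
        term = math.log2(term)
    return total

def lz_codelength_without_tau(c, l):
    """L_A' = c * (log2(c) + 1) + log*(l) for c LZ78 phrases in a window of length l."""
    return c * (math.log2(c) + 1) + log_star(l)

def atypicality_gap(typical_bits, lz_phrase_counts):
    """Left-hand side of (6): max over candidate lengths l of L_T - L_A'."""
    return max(
        typical_bits[l] - lz_codelength_without_tau(lz_phrase_counts[l], l)
        for l in typical_bits
    )

typical_bits = {16: 60.0, 32: 95.0}   # hypothetical L_T for two window lengths
lz_phrase_counts = {16: 4, 32: 8}     # hypothetical LZ78 phrase counts
gap = atypicality_gap(typical_bits, lz_phrase_counts)
print(gap, gap > 0.0)                 # declared atypical when the gap exceeds tau (here tau = 0)
```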

#### **6. Experiment**

In this section, we illustrate the proposed pattern dictionary anomaly detection on a synthetic time series, known as Mackey–Glass [77], as well as on a real-world time series of physiological signals. In both experiments, first, the real-valued samples are discretized using a uniform quantizer [78], and then, anomaly detection methods are applied.
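A uniform quantizer of the kind used for this discretization step could be sketched as follows. The number of levels, the use of the sample range as the quantizer range, and the function name are assumptions of this example; the exact quantizer settings of the experiments are not specified here.

```python
def uniform_quantize(samples, n_levels, lo=None, hi=None):
    """Map real-valued samples to integer symbols 0..n_levels-1 using equal-width bins."""
    lo = min(samples) if lo is None else lo
    hi = max(samples) if hi is None else hi
    step = (hi - lo) / n_levels
    if step == 0:
        return [0] * len(samples)
    # Values outside [lo, hi] are clipped to the first or last bin.
    return [min(max(int((x - lo) / step), 0), n_levels - 1) for x in samples]

print(uniform_quantize([0.0, 0.5, 1.0], n_levels=4))
```

Passing the training-data range as `lo` and `hi` when quantizing test data keeps the symbol alphabet consistent between training and testing.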

#### *6.1. Anomaly Detection in Mackey–Glass Time Series*

In this section, we illustrate the proposed anomaly detection method for the case of a chaotic Mackey–Glass (MG) time series that has an anomalous segment grafted into the middle of the sequence. MG time series are generated from a nonlinear time delay differential equation. The MG model was originally introduced to represent the appearance of complex dynamics in physiological control systems [77]. The nonlinear differential equation is of the form $\frac{dx(t)}{dt} = -a x(t) + \frac{b\,x(t-\delta)}{1 + x^{10}(t-\delta)}$, $t \ge 0$, where $a$, $b$ and $\delta$ are constants. For the training data, we generated 3000 samples of the MG time series with $a = 0.2$, $b = 0.1$, and $\delta = 17$. For the test data, we normalized and embedded 500 samples of the MG time series with $a = 0.4$, $b = 0.2$, and $\delta = 17$ inside 1000 samples of an MG time series generated from the same source as the training data, resulting in a test sequence of length 1500. Figure 7 shows a realization of the training data and the test data.

**Figure 7.** Mackey–Glass time series: the training data (**top**) and an example of the test data (**bottom**) in which samples in [501, 1000] are anomalous (shown in red).

The anomaly detection performance of our proposed pattern dictionary is evaluated. To illustrate the effect of the model parameter, i.e., the maximum depth $D\_{\max}$, on the detection and compression performance of the pattern dictionary, we run two experiments. First, we use a 30-fold cross-validation on the training data (resulting in 30 sequences of length 100) and calculate the number of distinct parsed phrases against $D\_{\max}$. Second, we train a pattern dictionary with various $D\_{\max}$ using the training data and then evaluate the sensitivity of the detector of the anomalous subsequences in the test data using Equation (6) with $\tau = 0$. In this experiment, the detection sensitivity (true positive rate) is defined as the ratio of the number of samples correctly identified as anomalous to the total number of anomalous samples. Figure 8 illustrates the result of both experiments. As seen, after some point, increasing $D\_{\max}$ has a diminishing effect on both the detection sensitivity and the number of distinct parsed phrases. Note that this behavior is to be expected, as it was suggested by the bounds in (5).

Next, we compare the anomaly detection performance of our proposed pattern dictionary methods, PDD and PDA, with the nearest neighbors-based similarity (NNS) technique [7], the compression-based dissimilarity measure (CDM) method [12–14], the Ziv–Merhav (ZM) method [48], and the threshold Sequence Time-Delay Embedding (t-STIDE) technique [8–11]. In this experiment, a window of length 100 is slid over the test data and each method measures the *anomaly score* (as described below) of the current subsequence with respect to the training data. An anomaly is detected when the score exceeds a threshold, which is determined to ensure a specified false positive rate. As performance measures, we compute the AUC (area under the curve) of the ROC (receiver operating characteristic) and Precision-Recall curves. The implementation details of each method are given below.
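The evaluation protocol just described (slide a window, score each subsequence, threshold at a target false positive rate, summarize with the ROC AUC) can be sketched generically as follows. The function names are hypothetical, `score_fn` stands for any of the anomaly scores described below, and the way a window is labeled as anomalous is left to the caller because the text does not specify it.

```python
import numpy as np

def sliding_scores(test, score_fn, window=100):
    """Apply score_fn to every length-`window` subsequence of `test`."""
    return np.array([score_fn(test[i:i + window])
                     for i in range(len(test) - window + 1)])

def threshold_for_fpr(normal_scores, fpr):
    """Threshold exceeded by roughly a fraction `fpr` of the normal-window scores."""
    return float(np.quantile(normal_scores, 1.0 - fpr))

def roc_auc(scores, labels):
    """ROC AUC as P(anomalous score > normal score), counting ties as 1/2."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```

The pairwise form of the AUC is quadratic in the number of windows, which is acceptable for a test sequence of length 1500 but would be replaced by a rank-based computation for long sequences.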

**Figure 8.** The effect of maximum dictionary depth *Dmax* on parsing and detection sensitivity (true positive rate) of the Mackey–Glass time series presented in Figure 7.

#### Pattern Dictionary for Detection (PDD)

First, the training data are used to create a pattern dictionary with *Dmax* = 40, as described in Section 4. Then, for each subsequence *x*<sup>100</sup> (the sliding window of length 100) of the test data, the anomaly score is computed as the codelength *L*(*x*<sup>100</sup>) of Equation (2) described in Section 4.3.

#### Pattern Dictionary Based Atypicality (PDA)

Similar to PDD, the training data are first used to create a pattern dictionary with *Dmax* = 40, as described in Section 4. Then, for each subsequence *x*<sup>100</sup> of the test data, the anomaly score is the atypicality measure described in Section 5, i.e., *L<sub>T</sub>*(*x*<sup>100</sup>) − *L<sub>A</sub>*(*x*<sup>100</sup>), the difference between the compression codelengths of the test subsequence under the typical encoder (pattern dictionary) and the atypical encoder (LZ78).
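Since the atypical encoder is LZ78, it may help to recall how LZ78 parses a sequence into phrases. The following is a minimal sketch of the standard incremental parsing only; the function name is illustrative, and the pattern-dictionary (typical) codelength of Section 4 is not reproduced here.

```python
def lz78_phrases(seq):
    """Incrementally parse `seq` into LZ78 phrases.

    Each phrase is the shortest prefix of the remaining input that has not
    been seen as a phrase before. A trailing phrase may repeat an earlier one
    if the input ends in the middle of a match.
    """
    seen = set()
    phrases = []
    current = ()
    for symbol in seq:
        current += (symbol,)
        if current not in seen:
            seen.add(current)
            phrases.append(current)
            current = ()
    if current:
        phrases.append(current)   # incomplete last phrase
    return phrases

phrases = lz78_phrases("abababab")
```

For the input `"abababab"`, the parsing is a, b, ab, aba, followed by the incomplete trailing phrase b.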

#### Ziv–Merhav Method (ZM) [48]

In this method, a cross-parsing procedure is used in which for each subsequence *x*<sup>100</sup> of the test data, the anomaly score is computed as the number of the distinct phrases of *x*<sup>100</sup> with respect to the training data.
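A minimal sketch of such a cross-parsing score is given below, assuming that the quantized sequences are represented as Python strings (one character per symbol) so that substring search can be used directly. It follows the on-the-fly variant with a naive search; the function name is hypothetical and no attempt is made at efficiency.

```python
def zm_cross_parse_count(x, train):
    """Number of phrases obtained by parsing string `x` with respect to string `train`.

    Each phrase is the longest prefix of the unparsed part of `x` that occurs
    somewhere in `train`; a symbol never seen in `train` forms a phrase by itself.
    """
    count = 0
    i = 0
    while i < len(x):
        length = 1
        # Extend the candidate phrase while it still occurs in the training sequence.
        while i + length <= len(x) and x[i:i + length] in train:
            length += 1
        matched = length - 1
        i += max(matched, 1)
        count += 1
    return count
```

A subsequence that resembles the training data is covered by a few long phrases, while an anomalous one is broken into many short phrases and therefore receives a higher score.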

#### Nearest Neighbors-Based Similarity (NNS) [7]

In this method, a list 𝒮 of all the subsequences of length 100 (the length of the sliding window) of the training data is created. Then, for each subsequence *x*<sup>100</sup> of the test data, the distance between *x*<sup>100</sup> and all the subsequences in the list 𝒮 is calculated. Finally, the anomaly score of *x*<sup>100</sup> is its distance to the nearest neighbor in the list 𝒮.
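The NNS score can be sketched with NumPy as follows. The Euclidean distance is an assumption, since the distance used is not stated here, and the function name is illustrative.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def nns_score(x, train, window=100):
    """Distance from `x` to its nearest length-`window` subsequence of `train`."""
    candidates = sliding_window_view(np.asarray(train, dtype=float), window)
    dists = np.linalg.norm(candidates - np.asarray(x, dtype=float), axis=1)
    return float(dists.min())
```

`sliding_window_view` returns a view of all training subsequences without copying them, so the list 𝒮 is never materialized explicitly.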

#### Compression-Based Dissimilarity Measure (CDM) [12–14]

In this method, given the training data *xtrain*, for each subsequence *x*<sup>100</sup> of the test data the anomaly score is

$$\mathrm{CDM}(x\_{train}, x^{100}) = \frac{\mathcal{L}\left(\mathcal{C}\left(x\_{train}, x^{100}\right)\right)}{\mathcal{L}\left(x\_{train}\right) + \mathcal{L}\left(x^{100}\right)},$$

where 𝒞(*y*, *x*) represents the concatenation of sequences *y* and *x*, and ℒ(*x*) is the size of the compressed version of the sequence *x* using any standard compression algorithm. The CDM anomaly score is close to 1 if the two sequences are not related, and smaller than 1 if the sequences are related.
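As an illustration, the CDM score can be computed with any off-the-shelf compressor. The sketch below uses `zlib` from the Python standard library, which is an arbitrary choice and not necessarily the compressor used in the experiments; the byte-string inputs and the toy sequences are likewise assumptions.

```python
import random
import zlib

def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed representation of `data`."""
    return len(zlib.compress(data, 9))

def cdm(train: bytes, x: bytes) -> float:
    """Compression-based dissimilarity between the training data and a window."""
    return compressed_size(train + x) / (compressed_size(train) + compressed_size(x))

# Toy data: a periodic training sequence, a window with the same pattern,
# and a window of pseudo-random bytes.
rng = random.Random(0)
train_bytes = b"0123456789" * 300
related = b"0123456789" * 10
unrelated = bytes(rng.randrange(256) for _ in range(100))

score_related = cdm(train_bytes, related)
score_unrelated = cdm(train_bytes, unrelated)
```

In this toy example the window that shares the pattern of the training data compresses well when concatenated with it, so its score is expected to be lower than that of the unrelated window.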

#### Threshold Sequence Time-Delay Embedding (t-STIDE) [8–11]

In this method, given *l* < 100, for each sub-subsequence *x<sup>l</sup>* of the subsequence *x*<sup>100</sup> of the test data, the likelihood score of *x<sup>l</sup>* is the normalized frequency of its occurrence in the training data, and the anomaly score of *x*<sup>100</sup> is one minus the average likelihood score of all its sub-subsequences of length *l*. In this experiment, various values of *l* are tested and the best performance is reported.
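The score described above can be sketched as follows. Normalizing the counts by the number of length-*l* windows in the training data is an assumption about what "normalized frequency" means, and the function name is illustrative.

```python
from collections import Counter

def tstide_score(x, train, l):
    """One minus the average normalized training frequency of the length-l windows of x."""
    n_train = len(train) - l + 1
    counts = Counter(tuple(train[i:i + l]) for i in range(n_train))
    windows = [tuple(x[i:i + l]) for i in range(len(x) - l + 1)]
    likelihoods = [counts[w] / n_train for w in windows]
    return 1.0 - sum(likelihoods) / len(likelihoods)
```

In practice the table of counts would be built once per value of *l* and reused for all test windows, rather than recomputed on every call as in this sketch.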

We compare the detection performance of the aforementioned methods by generating 200 test data sequences with different anomaly segments (the anomalous MG segments have different initializations in each test dataset). The comparison results are reported in Table 2. As seen, our proposed PDD and PDA methods outperform the rest, with ZM and CDM coming in third place. The effect of the alphabet size of the quantized data (the resolution parameter of the uniform quantizer [78]) on anomaly detection performance is summarized in Table 3. Table 3 shows that our proposed PDD and PDA methods outperform the other methods in all three cases of data resolution.

**Table 2.** Comparison of anomaly detection methods (*μ* ± *σ* representation is used where *μ* is the mean and *σ* is the standard deviation). The proposed PDA method attains overall best performance (bold entries of table).


Since the parsing procedures of our proposed PD-based methods and the ZM method [48] are similar, it is of interest to compare the running times of these two methods. While the cross-parsing procedure of the ZM method was introduced as an on-the-fly process [48], we can also consider another implementation, similar to our proposed PD, that creates a codebook of all the subsequences of the training data prior to the parsing procedure. As such, in order to compare the running time of the dictionary/codebook creation and parsing procedure of our PD-based methods with the aforementioned two implementations of the ZM method, we use the same MG training data of length 3000, one test dataset of length 1500 over which a sliding window of length 100 is slid for anomaly score calculation, and the PD-based method with *Dmax* = 40. Note that since a sliding window of length 100 over the test data is considered, for the codebook-based implementation of ZM, only the subsequences of the training data up to length 100 are extracted, which makes its codebook creation process significantly faster. Table 4 summarizes the running time comparison. As can be seen, our PD-based method is faster in both dictionary/codebook creation and the parsing process.


**Table 3.** Comparison of anomaly detection methods for different cases of data resolutions: high resolution corresponds to an alphabet size of 90, medium resolution corresponds to an alphabet size of 45, and low resolution corresponds to an alphabet size of 10. In this table, *μ* ± *σ* representation is used where *μ* is the mean and *σ* is the standard deviation. The proposed PDA method achieves overall best performance (bold entries of table).

**Table 4.** Comparison of running time (in seconds) of the PD-based method and two implementations of the ZM method for different cases of data resolutions: high resolution corresponds to an alphabet size of 90, medium resolution corresponds to an alphabet size of 45, and low resolution corresponds to an alphabet size of 10. This experiment is performed on a Hansung laptop with 2.60 GHz CPU, 500 GB of SSD, and 16 GB of RAM using MATLAB R2021a. The proposed PD-based method has the fastest run time overall (bold entries in table).


#### *6.2. Infection Detection Using Physiological Signals*

Finally, we apply the proposed pattern dictionary method to detect unusual patterns in physiological signals of two human subjects after exposure to a pathogen, where only one of these subjects became symptomatically ill. The time series data were collected in a human viral challenge study that was performed in 2018 at the University of Virginia under a DARPA grant. Consenting volunteers were recruited into this study following an IRB-approved protocol, and the data were processed and analyzed at Duke University and the University of Michigan. The challenge study design and data collection protocols are described in [79]. Volunteers' skin temperature and heart rate were recorded by a wearable device (Empatica E4) over three consecutive days before and five consecutive days after exposure to a strain of human Rhinovirus (RV) pathogen. During this period, the wearable time series were continuously recorded while biospecimens (viral load) were collected daily. The infection status can be clinically detected by biospecimen samples, but in practice, the collection process of these types of biosamples can be invasive and costly. As such, here, we apply the proposed anomaly detection framework to the measured two-dimensional heart rate and temperature time series to detect unusual patterns after exposure with respect to the normal (healthy) baseline patterns.

In the preprocessing phase, we followed the wearable data preprocessing procedure described in [80]. Specifically, we first downsample the time series to one sample per minute by averaging. Then, we apply an outlier detection procedure to remove technical noise, e.g., sensor contact loss. After preprocessing, the two-dimensional space of temperature and heart rate time series is discretized using a two-dimensional uniform quantizer [78] with a step size of 5 for heart rate and 0.5 for temperature, resulting in one-dimensional discrete sequence data. The first three days of data are used as the training data, and the PDA method with maximum depth *Dmax* = 30 is used to learn the patterns in the training data. In order to detect anomalous patterns in the test data (the last five days), we used the result of Section 5.3 and the atypicality criterion of Equation (6), which requires choosing the threshold *τ*. While this threshold can be chosen freely, we selected it using cross-validation on the training data. Leave-one-out cross-validation over the training data generates an empirical null distribution of the PDA anomaly score function *L<sub>T</sub>* − *L<sub>A</sub>*. The threshold *τ* was chosen as the upper 99% quantile of this distribution. Figure 9 illustrates the result of anomaly detection on one subject who became infected, as measured by the viral shedding shown in Figure 9c. All the anomalous patterns occur when the subject was shedding the virus. Figure 10 depicts the result of anomaly detection on one subject who had a mild infection with a low level of viral shedding, as shown in Figure 10c. Note that in this case, no anomalous patterns were detected.
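Two steps of this pipeline lend themselves to a short sketch: the two-dimensional uniform quantization into a one-dimensional symbol sequence, and the choice of *τ* as a quantile of the empirical null distribution. The code below is illustrative only; the function names, the floor-based cell assignment, and the mapping of cells to integer symbols in order of first appearance are assumptions, not the procedure of [78] or [80].

```python
import numpy as np

def quantize_2d(heart_rate, temperature, hr_step=5.0, temp_step=0.5, codebook=None):
    """Map (heart rate, temperature) pairs to one symbol per 2-D quantization cell."""
    hr_cell = np.floor(np.asarray(heart_rate, dtype=float) / hr_step).astype(int)
    temp_cell = np.floor(np.asarray(temperature, dtype=float) / temp_step).astype(int)
    codebook = {} if codebook is None else codebook
    # Cells receive integer symbols in order of first appearance.
    symbols = [codebook.setdefault(cell, len(codebook))
               for cell in zip(hr_cell.tolist(), temp_cell.tolist())]
    return np.array(symbols), codebook

def atypicality_threshold(null_scores, level=0.99):
    """Threshold tau as the `level` quantile of the empirical null scores."""
    return float(np.quantile(null_scores, level))

symbols, codebook = quantize_2d([60, 62, 66], [33.0, 33.2, 33.6])
```

Passing the codebook learned on the training days back into `quantize_2d` for the test days keeps the symbol assignment consistent between the two.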

**Figure 9.** Anomaly detection using the proposed PDA method for a subject based on heart rate and temperature data collected from a wearable wrist sensor. Anomalies are shown in red in (**a**,**b**). (**c**) shows the subject's infection level.

**Figure 10.** Anomaly detection using the proposed PDA method for a subject who had a mild infection with low level of viral shedding based on heart rate and temperature data collected from a wearable wrist sensor. Note that no anomaly has been detected: (**a**) heart rate, (**b**) temperature, and (**c**) infection level.

#### **7. Conclusions**

In this paper, we have developed a universal nonparametric model-free anomaly detection method for time series and sequence data using a pattern dictionary. We proved that using a multi-level dictionary that separates the patterns by their depth results in a shorter average indexing codelength in comparison to a uni-level dictionary that uses a uniform indexing approach. We illustrated that the proposed pattern dictionary method can be used as a stand-alone anomaly detector, or integrated with Tree-Structured Lempel–Ziv (LZ78) and incorporated into an atypicality framework. We developed novel non-asymptotic lower and upper bounds of the LZ78 parser and demonstrated that the non-asymptotic upper bound on the number of distinct phrases resulting from LZ78-parsing of an |𝒳|-ary sequence can be explicitly derived in terms of the Lambert W function, an important theoretical result that is not trivial. We showed that the achieved non-asymptotic bounds on LZ78 and the pattern dictionary determine the range of the anomaly score and the anomaly detection threshold. We also presented an empirical study in which the pattern dictionary approach is used to detect anomalies in physiological time series. In future work, we will investigate the generalization of the context tree weighting methods to the general discrete case using the pattern dictionary, since the pattern dictionary handles sparsity well and is computationally less expensive when the alphabet size is large.

**Author Contributions:** Data curation, E.S. and S.O.; Formal analysis, E.S.; Funding acquisition, A.O.H.; Methodology, E.S.; Project administration, A.O.H.; Software, E.S. and S.O.; Supervision, A.O.H.; Validation, E.S., P.X.K.S. and A.O.H.; Visualization, E.S. and S.O.; Writing—original draft, E.S.; Writing—review & editing, P.X.K.S. and A.O.H. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported in part by Michigan Institute for Data Science, and grants from the Army Research Office, grant W911NF-15-0479, the Defense Advanced Research Projects Agency, grant N66001-17-2-401, and the Department of Energy/National Nuclear Security Administration, grant DE-NA0003921.

**Institutional Review Board Statement:** University of Michigan ethical review and approval were waived since only de-identified data artifacts from a previously approved human challenge study were made available to the co-authors for the analysis reported in Section 6.2. Information concerning provenance of the data, i.e., the human rhinovirus (RV) challenge study experiment and its IRB-approved protocol, was reported in [79].

**Informed Consent Statement:** The de-identified data used for our analysis in Section 6.2 came from a human rhinovirus (RV) challenge study in which informed consent was obtained from all subjects involved in the study. See [79] for details.

**Data Availability Statement:** The experimental data used in Section 6.2 will be made available upon request.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **A Maximal Correlation Framework for Fair Machine Learning**

**Joshua Lee 1,†,‡, Yuheng Bu 1,\*,‡, Prasanna Sattigeri 2, Rameswar Panda 2, Gregory W. Wornell 1, Leonid Karlinsky <sup>2</sup> and Rogerio Schmidt Feris <sup>2</sup>**


**Abstract:** As machine learning algorithms grow in popularity and diversify to many industries, ethical and legal concerns regarding their fairness have become increasingly relevant. We explore the problem of algorithmic fairness, taking an information–theoretic view. The maximal correlation framework is introduced for expressing fairness constraints and is shown to be capable of being used to derive regularizers that enforce independence and separation-based fairness criteria, which admit optimization algorithms for both discrete and continuous variables that are more computationally efficient than existing algorithms. We show that these algorithms provide smooth performance– fairness tradeoff curves and perform competitively with state-of-the-art methods on both discrete datasets (COMPAS, Adult) and continuous datasets (Communities and Crimes).

**Keywords:** fairness; HGR maximal correlation; independence criterion; separation criterion

**Citation:** Lee, J.; Bu, Y.; Sattigeri, P.; Panda, R.; Wornell, G.W.; Karlinsky, L.; Schmidt Feris, R. A Maximal Correlation Framework for Fair Machine Learning. *Entropy* **2022**, *24*, 461. https://doi.org/10.3390/e24040461

Academic Editor: Friedhelm Schwenker

Received: 15 February 2022 Accepted: 24 March 2022 Published: 26 March 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

#### **1. Introduction**

The use of machine learning in many industries has raised many ethical and legal concerns, especially that of fairness and bias in predictions, e.g., [1,2]. As systems are trusted to aid or make decisions regarding loan applications, criminal sentencing, and even health care, it is vital that unfair biases do not influence them.

However, mitigating these biases is complicated by ever-changing perspectives on fairness, and a good system for enforcing fairness must be adaptable to new settings. In particular, there are often competing notions of fairness. Two popular notions are independence and separation (a third condition, sufficiency, is beyond the scope of this paper), as discussed in [3]. Independence ensures that predictions are independent of membership in a protected class, so that one achieves equal favorable outcome rates across all groups, and it arises in applications such as affirmative action [4]. Separation is designed to achieve equal type I/II error rates across all groups by enforcing independence between predictions and membership in a protected class conditional on the class label. This criterion is used to measure fairness in recidivism predictions and bank loan applications. A significant body of work, including [3,5–7], has gone into explaining that independence and separation are inherently incompatible for non-trivial cases, and their applicability needs to be determined by the application and the stakeholders. This motivates us to construct a framework that is flexible enough to handle different fairness criteria and to do it with different modalities of data (discrete vs. continuous data, for example).

This bias mitigation must also be balanced against the system's usefulness, and often, one must tune the tradeoff between fairness (as measured in the particular context) and performance according to the current situation, which can be a difficult process if the tradeoff curve is not smooth. Generating the frontier of possible values can be computationally infeasible or impossible if the algorithm does not have a regularization parameter to adjust (see [8,9]), thus making it difficult to achieve this balance, which makes the fast generation of fair classifiers even more important.

Different contexts also require different points of intervention during the learning process to ensure fairness. *Pre-processing* approaches ([8,10–14]) modify the data to eliminate bias, whereas *post-processing* approaches ([15–18]) modify learned features/predictions from existing models to be more fair. We focus on the *in-processing* approach [9,19–21], where the fairness criteria are directly incorporated in the training objective to produce fairer learned features. Motivated by few-shot applications where only a pre-trained network and few samples labeled with the sensitive attribute are available, we also seek a method that is applicable in a post-processing manner when we have access to only a small number of samples labeled with the sensitive attribute that we wish to be fair about, which would arise in settings where collecting this information can be very difficult.

In this paper, we frame the ideas of independence and separation in a way that allows a relevant regularizer or penalty term to be derived in addition to a measure of fairness, which is useful in enforcing fairness while also tractable, admitting an optimization algorithm (e.g., if used as an objective for a neural net trained using gradient descent, it must be differentiable), and easily computed. Existing approaches can struggle with efficiency, can fail to provide good control over the performance–fairness tradeoff, and/or can only deal with either discrete or continuous data.

We make the following contributions in this paper:


#### **2. Background**

#### *2.1. Fairness Objectives in Machine Learning*

Consider the standard supervised learning scenario where we predict the value of a target variable *Y* ∈ 𝒴 using a set of decision or predictive variables *X* ∈ 𝒳 with training samples {(*x*<sub>1</sub>, *y*<sub>1</sub>), ..., (*x*<sub>*n*</sub>, *y*<sub>*n*</sub>)}. For example, *X* may be information about an individual's credit history, and *Y* is whether the individual will pay back a certain loan. In general, we wish to find features *f*(*x*), which are predictive of *Y*, so that we can construct a good predictor *ŷ* = *T*(*f*(*x*)) of *y* under some loss criteria *L*(*ŷ*, *y*).

Now, suppose we have some sensitive attributes *D* ∈ 𝒟 we wish to be "fair" about (e.g., race, gender), and training samples {(*x*<sub>1</sub>, *y*<sub>1</sub>, *d*<sub>1</sub>), ..., (*x*<sub>*n*</sub>, *y*<sub>*n*</sub>, *d*<sub>*n*</sub>)}. For example, in the criminal justice system, predictions about the chance of recidivism of a convicted criminal (*Y*) given factors such as the nature of the crime and the number of prior arrests (*X*) should not be determined by race (*D*). This is a known issue with the COMPAS recidivism score, which, despite not using race as an input to make decisions, still leads to systematic bias toward members of certain races in the output score as in [22,23].

The two most popular criteria for fairness are independence and separation. Independence states that for a feature to be fair, it must satisfy the independence property *Ŷ* ⊥ *D* or *f*(*x*) ⊥ *D*. The intuition is simple: if the prediction/feature is independent of the sensitive attribute, then no information about the sensitive attribute is used to predict *Y*. This criterion has been studied under the lens of *demographic parity* and *disparate impact* in [3], and it admits a class of fairness measures based on the degree of dependence between *f*(*X*) and *D*. For example, independence is satisfied if and only if the mutual information *I*(*f*(*X*); *D*) is zero. When *D* is binary, another popular class of measures used by the US Equal Employment Opportunity Commission [4] is the disparate impact, which is defined as

$$\mathrm{D}\big(\mathbb{P}(Y|D=1); \mathbb{P}(Y|D=0)\big) = \frac{\mathbb{P}(\hat{Y}=1|D=0)}{\mathbb{P}(\hat{Y}=1|D=1)}.$$

Separation requires the conditional independence property (*Ŷ* ⊥ *D*)|*Y* or (*f*(*X*) ⊥ *D*)|*Y*. This criterion allows for a violation of demographic parity to the extent that it is justified by the target variable. In the general case, this criterion suggests a fairness measure based on the conditional dependence between *Ŷ* and *D* conditioned on *Y*. In the case where *D* is binary, we obtain the *equalized opportunities* (EO) measures in [3], which are given by the differences in error rates for the two groups (e.g., the difference between the false positive rates for *D* = 0, 1). For a more complete discussion of the advantages and disadvantages of these two criteria, please refer to [3].
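For binary predictions and a binary sensitive attribute, the two empirical measures mentioned above (the disparate impact ratio and the difference in false positive rates) can be estimated from samples as in the following sketch. The function names and the toy data are illustrative only.

```python
import numpy as np

def disparate_impact(y_hat, d):
    """Empirical ratio P(Y_hat = 1 | D = 0) / P(Y_hat = 1 | D = 1)."""
    y_hat, d = np.asarray(y_hat), np.asarray(d)
    return y_hat[d == 0].mean() / y_hat[d == 1].mean()

def false_positive_rate_gap(y_hat, y, d):
    """Absolute difference between the false positive rates of the two groups."""
    y_hat, y, d = (np.asarray(v) for v in (y_hat, y, d))
    fpr = [y_hat[(d == group) & (y == 0)].mean() for group in (0, 1)]
    return abs(fpr[0] - fpr[1])

# Toy example: eight individuals, four in each group.
y_hat = [1, 0, 1, 1, 0, 0, 1, 0]
y     = [1, 0, 0, 1, 0, 0, 1, 1]
d     = [1, 1, 1, 1, 0, 0, 0, 0]
di = disparate_impact(y_hat, d)
gap = false_positive_rate_gap(y_hat, y, d)
```

A disparate impact of 1 corresponds to equal favorable outcome rates (independence), and a zero gap is one of the conditions required by separation.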

#### *2.2. Maximal Correlation*

Since these fairness criteria are expressed as enforcing independencies with respect to joint distributions, we look for constraints that reduce the dependency between variables. In particular, the right formulation of correlation between learned features and sensitive attributes can provide a framework for measuring and optimizing for fairness. One effective measure applicable to both continuous and discrete data is the Hirschfeld–Gebelein–Renyi (HGR) maximal correlation, which is a measure of nonlinear correlation that originated in [24] and is further developed in [25,26]. The HGR maximal correlation between two random variables is equal to zero if and only if the two variables are independent, and it increases in value the more correlated they are (i.e., the more biased/unfair).

**Definition 1.** *For two jointly distributed random variables* *X* ∈ 𝒳 *and* *Y* ∈ 𝒴, *given* 1 ≤ *k* ≤ *K* − 1 *with* *K* = min{|𝒳|, |𝒴|}, *the HGR maximal correlation problem is*

$$(\mathbf{f}^\*, \mathbf{g}^\*) \triangleq \underset{\mathbf{f} \colon \mathcal{X} \to \mathbb{R}^k, \mathbf{g} \colon \mathcal{Y} \to \mathbb{R}^k}{\text{arg}\max} \, \mathbb{E}\left[\mathbf{f}^T(X)\,\mathbf{g}(Y)\right],\tag{1}$$

*with constraints*

$$\mathbb{E}[\mathbf{f}(X)] = \mathbb{E}[\mathbf{g}(Y)] = \mathbf{0}, \quad \mathbb{E}\left[\mathbf{f}(X)\mathbf{f}^T(X)\right] = \mathbb{E}\left[\mathbf{g}(Y)\mathbf{g}^T(Y)\right] = \mathbf{I},\tag{2}$$

*and expectations taken over* *P*<sub>*X*,*Y*</sub>. *We refer to* **f**<sup>∗</sup> *and* **g**<sup>∗</sup> *as maximal correlation functions, with* **f**<sup>∗</sup> = (*f*<sub>1</sub><sup>∗</sup>, ..., *f*<sub>*k*</sub><sup>∗</sup>)<sup>T</sup> *and* **g**<sup>∗</sup> = (*g*<sub>1</sub><sup>∗</sup>, ..., *g*<sub>*k*</sub><sup>∗</sup>)<sup>T</sup>, *and the associated maximal correlations are*

$$\sigma(f\_i^\*, g\_i^\*) \triangleq \mathbb{E}[f\_i^\*(X)\, g\_i^\*(Y)], \text{ for } i = 1, \ldots, k,\tag{3}$$

*and the HGR maximal correlation is*

$$\text{HGR}\_k(X,Y) \triangleq \mathbb{E}\left[\mathbf{f}^{\*T}(X)\,\mathbf{g}^\*(Y)\right] = \sum\_{i=1}^k \sigma\left(f\_i^\*, g\_i^\*\right). \tag{4}$$

Note that the original definition of HGR maximal correlation is the special case of our definition when *k* = 1 (see [27]). This generalization of maximal correlation analysis enables us to produce more than one feature mapping by solving the maximal correlation problem, and these feature mappings can be used in other applications, including ensemble learning, multi-task learning, and transfer learning [28,29].

#### *2.3. Related Work*

Independence and separation have been studied in many works. Most existing approaches fail to provide an efficient solution in both discrete and continuous settings. Ref. [11] develops an optimizer using the absolute difference in odds |ℙ(*Ŷ* = 1|*D* = 1) − ℙ(*Ŷ* = 1|*D* = 0)| as a regularizer, which requires discrete *Y* and *D* and was only applied to Naïve Bayes and Logistic Regression to enforce the independence criterion. In [16], a post-processing method is provided using a probabilistic combination of classifiers to achieve the desired ROC curves, which only applies when *D* is discrete. Alternatively, Ref. [8] proposes pre-processing the data to enforce fairness before learning, based on randomized mappings of the data subject to a fairness constraint defined by

$$J = \max\left(\left|\frac{\mathbb{P}(\hat{Y}=1|D=1)}{\mathbb{P}(\hat{Y}=1|D=0)} - 1\right|, \left|\frac{\mathbb{P}(\hat{Y}=1|D=0)}{\mathbb{P}(\hat{Y}=1|D=1)} - 1\right|\right).$$

Again, this method is only designed for independence with discrete *Y* and *D*, and it requires processing the entire dataset, which is computationally complex. Ref. [30] proposes the use of a robust log-loss predictor for fairness, but in practice, it requires that *Y* be discrete.

Other methods can also be limited in their ability to handle all dependencies between variables. Ref. [31] uses a covariance-based constraint to enforce fairness, so it likely would not do well on other metrics. Furthermore, it is strictly a linear penalty rather than our non-linear formulation and penalizes the predictions of the system rather than the features learned. This limits the relationships between variables it can capture. An adversarial method is proposed in [20] to enforce independence or separation, but it requires the training of an adversary to predict the sensitive attribute, which can introduce issues of convergence and bias.

Recently, Ref. [9] proposes the use of the HGR maximal correlation as a regularizer for either the independence or the separation constraint. In contrast to our approach, which deals with the maximal correlation directly, they use a *χ*<sup>2</sup> divergence computed over a mesh grid to upper bound the HGR maximal correlation during the optimization of the classifier (either a linear regressor or a Deep Neural Net (DNN)). This method applies to cases where *X* is continuous and *Y* and *D* are either continuous or discrete variables, but it scales poorly with the bandwidth and dimensionality of *D*, and it treats the discrete case in the same way as the continuous case, resulting in slow performance on discrete datasets.

There are other works that use either an HGR-based or mutual information-based formulation of fairness but do not generalize to more than one setting. Refs. [32,33] use correlation-based regularizers but can only be used in the independence case. Furthermore, Ref. [33] only works with discrete targets, and only uses a single mode of the HGR maximal correlation (as opposed to multiple modes, which our method makes use of) for regularization, which limits the information it can encapsulate, and it is also not designed for continuous sensitive attributes. Ref. [34] also develops a method that can only be used for independence, and it requires training an additional network in order to evaluate a bound for the mutual information, which can be used as a fairness penalty, thus increasing the complexity and required runtime. Finally, Ref. [35] approximates the mutual information with a variational formulation, but it does not include a formulation for continuous labels.

#### **3. Maximal Correlation for Fairness**

Equipped with the HGR maximal correlation as a measure of dependence, we explore its use as a fairness penalty. Depending on the data modality (discrete/continuous) and the fairness criteria (independence/separation), the resulting fair learning algorithm takes different specifically tailored forms. In this section, we demonstrate how to derive these regularizers and algorithms to ensure the aforementioned fairness objectives for both discrete and continuous cases.

#### *3.1. Maximal Correlation for Discrete Learning*

In this subsection, the decision variable *X*, target variable *Y*, and sensitive attribute *D* are discrete random variables defined on alphabets 𝒳, 𝒴, and 𝒟, respectively.

We first describe how to solve the discrete maximal correlation problem using a divergence transfer matrix (DTM)-based approach. As shown later, it is more convenient to work with the equivalent DTM representation than with the joint distribution *P*<sub>*X*,*Y*</sub>.

**Definition 2.** *The divergence transfer matrix (DTM)* **B**<sub>*Y*,*X*</sub> ∈ ℝ<sup>|𝒴|×|𝒳|</sup> *associated with joint distribution* *P*<sub>*X*,*Y*</sub> *is given by*

$$\mathbf{B}\_{Y,X}(y,x) \triangleq \frac{P\_{X,Y}(x,y)}{\sqrt{P\_X(x)}\sqrt{P\_Y(y)}}.\tag{5}$$

The following useful result expresses that the maximal correlation problem can be solved by simply computing the singular value decomposition (SVD) of the DTM **B** in the discrete case.

**Theorem 1** ([27])**.** *Assume that the SVD of DTM* **B**<sub>*Y*,*X*</sub> *takes the form*

$$\mathbf{B}\_{Y,X} = \sum\_{i=0}^{K-1} \sigma\_i \psi\_i^Y (\psi\_i^X)^T,\tag{6}$$

*with singular values* *σ*<sub>0</sub> ≥ *σ*<sub>1</sub> ≥ ··· ≥ *σ*<sub>*K*−1</sub>, *singular vectors* *ψ*<sub>*i*</sub><sup>*Y*</sup>, *ψ*<sub>*i*</sub><sup>*X*</sup>, *and* *K* = min{|𝒳|, |𝒴|}. *Then, we have*

$$\sigma\_0 = 1, \quad \psi\_0^X(x) = \sqrt{P\_X(x)}, \quad \psi\_0^Y(y) = \sqrt{P\_Y(y)}, \tag{7}$$

*and the maximal correlation functions are related to the singular vectors in the SVD:*

$$f\_i^\*(x) = \frac{\psi\_i^X(x)}{\sqrt{P\_X(x)}}, \quad g\_i^\*(y) = \frac{\psi\_i^Y(y)}{\sqrt{P\_Y(y)}},\tag{8}$$

*with associated maximal correlations* *σ*(*f*<sub>*i*</sub><sup>∗</sup>, *g*<sub>*i*</sub><sup>∗</sup>) = *σ*<sub>*i*</sub>, *for* *i* = 1, ···, *K* − 1. *Thus, the conditional distribution* *P*<sub>*Y*|*X*</sub> *has the following decomposition:*

$$P\_{Y|X}(y|\mathbf{x}) = P\_Y(y) \left[ 1 + \sum\_{i=1}^{K-1} \sigma\_i f\_i^\*(\mathbf{x}) g\_i^\*(y) \right]. \tag{9}$$
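The statements of Theorem 1 can be checked numerically. The following sketch is not the authors' implementation; the joint pmf is a hypothetical example. It computes the SVD of the DTM, forms the maximal correlation functions as in (8), and rebuilds the conditional distribution from (9).

```python
import numpy as np

# Hypothetical joint pmf, rows indexed by x and columns by y.
P_xy = np.array([[0.20, 0.10],
                 [0.05, 0.25],
                 [0.15, 0.25]])
P_x, P_y = P_xy.sum(axis=1), P_xy.sum(axis=0)
B = (P_xy / np.sqrt(np.outer(P_x, P_y))).T   # DTM B_{Y,X}, shape |Y| x |X|

U, sigma, Vt = np.linalg.svd(B, full_matrices=False)
# Columns of U are psi^Y_i and rows of Vt are psi^X_i.
f_star = Vt / np.sqrt(P_x)     # f_star[i, x] = psi^X_i(x) / sqrt(P_X(x))
g_star = U.T / np.sqrt(P_y)    # g_star[i, y] = psi^Y_i(y) / sqrt(P_Y(y))

# Rebuild P_{Y|X}(y|x) from (9), using the modes i = 1, ..., K-1.
modes = np.einsum("i,ix,iy->xy", sigma[1:], f_star[1:], g_star[1:])
P_y_given_x = P_y[None, :] * (1.0 + modes)

# Correlation of the first nontrivial pair, E[f_1(X) g_1(Y)].
corr_1 = np.sum(P_xy * np.outer(f_star[1], g_star[1]))
```

The sign ambiguity of the SVD does not affect the result, since each pair of singular vectors changes sign together and only products of the form $f\_i^\*(x) g\_i^\*(y)$ appear.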

As this theorem shows, the singular values $\sigma\_i$ of the matrix $\mathbf{B}\_{Y,X}$ essentially characterize the dependence between the two discrete random variables (since each maximal correlation equals the corresponding singular value of the DTM, we slightly abuse notation and use $\sigma$ to denote both), and the singular vectors $\Phi^X = [\psi\_1^X, \cdots, \psi\_k^X]$ and $\Phi^Y = [\psi\_1^Y, \cdots, \psi\_k^Y]$ are equivalent to the maximal correlation functions **f** and **g**.

Since our goal is to construct feature mappings **f**(*x*) under fairness constraints, our algorithms in the discrete case are built on the following variational characterization of an SVD, which does not involve **g**(*y*):

**Lemma 1** ([36])**.** *For any* $k \leq K-1$ *and* $\Phi\_X \in \mathbb{R}^{|\mathcal{X}| \times (k+1)}$*,*

$$\max\_{\Phi\_X^{\mathsf{T}} \Phi\_X = \mathbf{I}} \|\mathbf{B}\Phi\_X\|\_{\mathrm{F}}^2 = \sum\_{i=0}^{k} \sigma\_i^2, \tag{10}$$

*where* $\|\mathbf{A}\|\_{\mathrm{F}} \triangleq \sqrt{\mathrm{tr}(\mathbf{A}^{\mathsf{T}}\mathbf{A})}$ *denotes the Frobenius norm.*

#### 3.1.1. Independence

To ensure sufficient independence, we must construct feature mappings $\mathbf{f}\colon \mathcal{X} \to \mathbb{R}^k$ so that the maximal correlations between **f**(*X*) and *Y* are large, while those between **f**(*X*) and *D* are small. Motivated by Lemma 1 and Theorem 1, we propose the following DTM-based approach to construct **f**:

$$\max\_{\Phi \in \mathbb{R}^{|\mathcal{X}| \times (k+1)}: \Phi^{\mathsf{T}} \Phi = \mathbf{I}} \| \mathbf{B}\_{Y,X} \Phi \|\_{\mathrm{F}}^2 - \lambda \| \mathbf{B}\_{D,X} \Phi \|\_{\mathrm{F}}^2, \tag{11}$$

where $\mathbf{B}\_{Y,X}$ and $\mathbf{B}\_{D,X}$ denote the DTMs of the distributions $P\_{Y,X}$ and $P\_{D,X}$, respectively, and *λ* is the regularization coefficient that controls the penalty on the maximal correlations between **f**(*X*) and *D*. Let $\Phi^\* = [\phi\_0^\*, \phi\_1^\*, \cdots, \phi\_k^\*]$ denote the solution of the optimization problem (11). As shown in Theorem 1, $\mathbf{B}\_{Y,X}$ and $\mathbf{B}\_{D,X}$ share the right singular vector $\sqrt{P\_X(x)}$, so we can let $\phi\_0^\* = \sqrt{P\_X(x)}$. The feature mappings for independence are then obtained by normalizing the other column vectors of $\Phi^\*$:

$$f\_i(x) = \phi\_i^\*(x) / \sqrt{P\_X(x)}, \quad i = 1, \cdots, k. \tag{12}$$

We have the following remarks:

(1) The optimization problem in (11) can be written as $\max \mathrm{tr}\big(\Phi^{\mathsf{T}} (\mathbf{B}\_{Y,X}^{\mathsf{T}} \mathbf{B}\_{Y,X} - \lambda \mathbf{B}\_{D,X}^{\mathsf{T}} \mathbf{B}\_{D,X}) \Phi\big)$, and it can be solved exactly by computing the eigendecomposition of $\mathbf{B}\_{Y,X}^{\mathsf{T}} \mathbf{B}\_{Y,X} - \lambda \mathbf{B}\_{D,X}^{\mathsf{T}} \mathbf{B}\_{D,X}$.

(2) Lemma 1 states that the squared Frobenius norm $\|\mathbf{B}\_{Y,X} \Phi\|\_{\mathrm{F}}^2$ corresponds to the sum of the squared singular values. In fact, the following lemma shows that $\|\mathbf{B}\_{Y,X} \Phi\|\_{\mathrm{F}}^2$ can be further related to the mutual information *I*(*X*;*Y*) when the dependence between *X* and *Y* is weak.

**Lemma 2** ([27])**.** *Let* $X \in \mathcal{X}$ *and* $Y \in \mathcal{Y}$ *be* $\epsilon$*-dependent random variables; i.e., the* $\chi^2$*-divergence is bounded as* $D\_{\chi^2}(P\_{X,Y} \| P\_X P\_Y) \leq \epsilon^2$*. Then,*

$$I(X;Y) = \frac{1}{2} \sum\_{i=1}^{K-1} \sigma\_i^2 + o(\epsilon^2). \tag{13}$$

(3) As suggested by Lemma 2, the optimization problem in (11) can also be interpreted as maximizing the mutual information between **f**(*X*) and *Y* while penalizing the mutual information *I*(**f**(*X*); *D*).
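Remark (1) can be sketched in a few lines of NumPy. This is an illustrative reading of (11) and (12), not the authors' code: the names `dtm` and `fair_features` and the random joint pmf are assumptions. The sketch fixes the first column of $\Phi$ to $\sqrt{P\_X}$, as described above, and obtains the remaining columns as the top eigenvectors of the matrix from remark (1) restricted to the orthogonal complement of that vector.

```python
import numpy as np

def dtm(P_ax):
    """DTM B_{A,X} (shape |A| x |X|) from a joint pmf indexed as P_ax[a, x]."""
    P_a, P_x = P_ax.sum(axis=1), P_ax.sum(axis=0)
    return P_ax / np.sqrt(np.outer(P_a, P_x))

def fair_features(P_yx, P_dx, lam, k):
    """Sketch of (11)-(12): return f with shape (k, |X|), one row per feature f_i."""
    P_x = P_yx.sum(axis=0)
    B_y, B_d = dtm(P_yx), dtm(P_dx)
    M = B_y.T @ B_y - lam * (B_d.T @ B_d)        # symmetric, |X| x |X|
    phi0 = np.sqrt(P_x)                          # fixed first column of Phi
    # Orthonormal basis whose first vector is +/- phi0; the rest span its complement.
    Q, _ = np.linalg.qr(np.column_stack([phi0, np.eye(len(P_x))]))
    Q_perp = Q[:, 1:]
    eigvals, eigvecs = np.linalg.eigh(Q_perp.T @ M @ Q_perp)
    top = np.argsort(eigvals)[::-1][:k]          # indices of the k largest eigenvalues
    phi = Q_perp @ eigvecs[:, top]               # columns phi_1, ..., phi_k
    return (phi / phi0[:, None]).T               # f_i(x) = phi_i(x) / sqrt(P_X(x))

# Hypothetical joint pmf P[x, y, d] with |X| = 4, |Y| = 2, |D| = 2.
rng = np.random.default_rng(0)
P = rng.random((4, 2, 2))
P /= P.sum()
P_yx = P.sum(axis=2).T                           # P_yx[y, x]
P_dx = P.sum(axis=1).T                           # P_dx[d, x]
f = fair_features(P_yx, P_dx, lam=0.5, k=2)
```

By construction, the returned features have zero mean and identity covariance under $P\_X$, which matches the normalization of maximal correlation functions; the sketch requires $k \leq |\mathcal{X}| - 1$.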

Once we solve (11) and obtain the feature mappings **f**(*x*), we can obtain the corresponding maximal correlation function **g**(*y*) for the target variable *Y* via one step of the alternating conditional expectations algorithm by [37]:

$$g\_i(y) \propto \mathbb{E}\_{P\_{X|Y}(\cdot|y)}[f\_i(X)], \; i = 1, \ldots, k. \tag{14}$$

In turn, **g**(*y*) is obtained by further normalizing the conditional expectations of **f**(*X*), so that the condition $\mathbb{E}[\mathbf{g}(Y)\mathbf{g}^{\mathsf{T}}(Y)] = \mathbf{I}$ is satisfied. Finally, the predictions $\hat{Y}$ can be made following the maximum a posteriori (MAP) rule, where the posterior distribution $P\_{Y|X}(y|x)$ is approximated by plugging the learned feature mappings **f**(*x*) and **g**(*y*) into (9), i.e.,

$$\hat{Y} = \underset{y \in \mathcal{Y}}{\text{arg}\max} \, P\_Y(y) \left[ 1 + \sum\_{i=1}^k \sigma\_i f\_i(x) g\_i(y) \right]. \tag{15}$$
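The two steps in (14) and (15) can be sketched as follows. The function names and the joint pmf are hypothetical. As a simplifying assumption, each $g\_i$ is scaled separately to unit second moment instead of being jointly whitened, which is sufficient when the conditional expectations are already uncorrelated. In the example, the true maximal correlation function $f\_1^\*$ is used with $K = 2$, so the score in (15) coincides with the exact posterior.

```python
import numpy as np

def ace_step(f, P_xy):
    """One ACE step: g_i(y) proportional to E[f_i(X) | Y = y], scaled to unit second moment."""
    P_y = P_xy.sum(axis=0)
    P_x_given_y = P_xy / P_y                 # column y holds P_{X|Y}(. | y)
    cond = f @ P_x_given_y                   # cond[i, y] = E[f_i(X) | Y = y]
    sigma = np.sqrt((cond ** 2) @ P_y)       # sigma_i = sqrt(E[cond_i(Y)^2])
    return cond / sigma[:, None], sigma

def map_predict(f, g, sigma, P_y):
    """MAP rule (15): argmax over y of P_Y(y) * (1 + sum_i sigma_i f_i(x) g_i(y))."""
    scores = P_y[None, :] * (1.0 + np.einsum("i,ix,iy->xy", sigma, f, g))
    return scores.argmax(axis=1)             # one predicted label per value of x

# Hypothetical joint pmf P_xy[x, y]; f is taken from the SVD of its DTM.
P_xy = np.array([[0.20, 0.10],
                 [0.05, 0.25],
                 [0.15, 0.25]])
P_x, P_y = P_xy.sum(axis=1), P_xy.sum(axis=0)
B = (P_xy / np.sqrt(np.outer(P_x, P_y))).T
_, _, Vt = np.linalg.svd(B, full_matrices=False)
f = Vt[1:] / np.sqrt(P_x)                    # shape (1, |X|)
g, sigma = ace_step(f, P_xy)
y_hat = map_predict(f, g, sigma, P_y)
```

With fair features obtained from (11), the same two functions apply unchanged; the resulting scores are then only an approximation of the posterior.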

#### 3.1.2. Separation

For the separation criterion, we want to ensure sufficient conditional independence $(\mathbf{f}(X) \perp D) \,|\, Y$. Here, we cannot simply replace $\mathbf{B}\_{D,X}$ in (11) with a conditional DTM, as it involves three random variables and thus cannot be usefully expressed as a matrix. Since maximal correlation is related to mutual information, as shown in Lemma 2, we consider the following formulation:

$$\begin{aligned} &\max\_{\mathbf{f}} \; I(\mathbf{f}(X); Y) - \lambda I(\mathbf{f}(X); D, Y) \\ &= \max\_{\mathbf{f}} \; I(\mathbf{f}(X); Y) - \lambda \left( I(\mathbf{f}(X); Y) + I(\mathbf{f}(X); D | Y) \right) \\ &= \max\_{\mathbf{f}} \; (1 - \lambda) I(\mathbf{f}(X); Y) - \lambda I(\mathbf{f}(X); D | Y), \end{aligned} \tag{16}$$

where the first equality follows from the chain rule of mutual information and $\lambda \in (0, 1)$. Thus, we can control the conditional mutual information $I(\mathbf{f}(X); D|Y)$ by adding the joint mutual information $I(\mathbf{f}(X); D, Y)$ as a regularizer in the training process.
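The chain rule used in the first equality of (16) can be verified numerically on a small example. The sketch below uses a hypothetical random joint pmf over a discrete feature *F* = **f**(*X*), *D*, and *Y*; the helper `mutual_information` is an assumption introduced for this illustration.

```python
import numpy as np

def mutual_information(P_ab):
    """I(A; B) in nats from a joint pmf P_ab[a, b] with strictly positive entries."""
    P_a = P_ab.sum(axis=1, keepdims=True)
    P_b = P_ab.sum(axis=0, keepdims=True)
    return float(np.sum(P_ab * np.log(P_ab / (P_a * P_b))))

rng = np.random.default_rng(0)
P_fdy = rng.random((3, 2, 2))        # hypothetical joint pmf P[f, d, y]
P_fdy /= P_fdy.sum()

I_f_dy = mutual_information(P_fdy.reshape(3, 4))   # I(F; D, Y), with (d, y) merged
I_f_y = mutual_information(P_fdy.sum(axis=1))      # I(F; Y)
# I(F; D | Y) = sum over y of P_Y(y) * I(F; D | Y = y).
P_y = P_fdy.sum(axis=(0, 1))
I_f_d_given_y = sum(
    P_y[y] * mutual_information(P_fdy[:, :, y] / P_y[y]) for y in range(2)
)
```

Up to floating-point error, `I_f_dy` equals `I_f_y + I_f_d_given_y`, which is the identity that lets the joint term act as a surrogate for the conditional one.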

Note that Lemma 1 and Lemma 2 imply that mutual information can be approximated using the DTM, as done in (11) for the independence case. Accordingly, we approximate (16) using the following optimization problem to ensure the separation criterion for discrete data:

$$\max\_{\Phi \in \mathbb{R}^{|\mathcal{X}| \times (k+1)} : \Phi^{\mathsf{T}} \Phi = \mathbf{I}} \|\mathbf{B}\_{Y,X} \Phi\|\_{\mathrm{F}}^2 - \lambda \|\mathbf{B}\_{D \otimes Y, X} \Phi\|\_{\mathrm{F}}^2, \tag{17}$$

where *D* ⊗ *Y* is the Cartesian product of *D* and *Y*, and $\mathbf{B}\_{D \otimes Y, X}$ denotes the DTM of the distribution $P\_{D \otimes Y, X}$. Once we obtain the solution $\Phi^\*$, we can follow the same steps as in the independence case to get **f**(*x*) and **g**(*y*) and make predictions for the test samples.
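One way to form the DTM of the product variable in (17) is to treat every pair (*d*, *y*) as a single symbol, which amounts to reshaping the joint pmf. The following sketch assumes a hypothetical random pmf and a helper `dtm` that are not part of the original work.

```python
import numpy as np

def dtm(P_ax):
    """DTM B_{A,X} (shape |A| x |X|) from a joint pmf indexed as P_ax[a, x]."""
    P_a, P_x = P_ax.sum(axis=1), P_ax.sum(axis=0)
    return P_ax / np.sqrt(np.outer(P_a, P_x))

rng = np.random.default_rng(1)
P_dyx = rng.random((2, 2, 4))            # hypothetical joint pmf P[d, y, x]
P_dyx /= P_dyx.sum()

# Treat each pair (d, y) as one symbol of a variable with |D| * |Y| values.
B_dy_x = dtm(P_dyx.reshape(2 * 2, 4))    # shape (|D||Y|) x |X|
B_y_x = dtm(P_dyx.sum(axis=0))           # shape |Y| x |X|

lam = 0.5                                # taken inside (0, 1), as required for (16)
M = B_y_x.T @ B_y_x - lam * (B_dy_x.T @ B_dy_x)
```

The symmetric matrix `M` plays the same role as the matrix in remark (1) of the independence case, so the same eigendecomposition step applies.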

#### *3.2. Maximal Correlation for Continuous Learning*

When *X*, *Y*, and *D* are all continuous and real-valued, computing the HGR maximal correlation becomes much more difficult, since the space of functions over real numbers is not tractable. Thus, we turn to approximations and begin by limiting our scope of learning algorithms to those that train models (e.g., neural nets) via gradient descent (or SGD) using samples, which encompasses most of the commonly used methods. Then, it follows that any approximation of the HGR maximal correlation used must be differentiable to calculate the gradient. Thus, we restrict the space of maximal correlation functions to be the family of functions that can be learned by neural nets, allowing us to compute the gradient while still providing a rich set of functions to search over.

#### 3.2.1. Independence

To ensure sufficient independence, we want to minimize the loss function $L(\hat{Y}, Y)$ together with the maximal correlation between **f**(*X*) and *D*. Our optimization problem (for a given *λ*) then becomes:

$$\min\_{\substack{\mathbf{f}\colon\mathcal{X}\to\mathbb{R}^{m}\\T\colon\mathbb{R}^{m}\to\mathcal{Y}}} L(T(\mathbf{f}(X)), Y) + \lambda\, \mathrm{HGR}\_{k}(\mathbf{f}(X), D),\tag{18}$$

where $\mathrm{HGR}\_k(\mathbf{f}(X), D) = \max\_{\mathbf{g}, \mathbf{h}} \mathbb{E}[\mathbf{g}^{\mathsf{T}}(\mathbf{f}(X))\, \mathbf{h}(D)]$, with $\mathbb{E}[\mathbf{g}(\mathbf{f}(X))] = \mathbb{E}[\mathbf{h}(D)] = \mathbf{0}$ and $\mathbb{E}[\mathbf{g}(\mathbf{f}(X))\mathbf{g}^{\mathsf{T}}(\mathbf{f}(X))] = \mathbb{E}[\mathbf{h}(D)\mathbf{h}^{\mathsf{T}}(D)] = \mathbf{I}$. Here, *m* is the dimension of the features **f**(*X*), *k* is the number of maximal correlation functions, and $\mathbf{g}\colon \mathbb{R}^m \to \mathbb{R}^k$, $\mathbf{h}\colon \mathcal{D} \to \mathbb{R}^k$ are the maximal correlation functions relating **f**(*X*) with *D*. Given the difficulty of enforcing the orthogonality constraint, we use a variational characterization of the HGR maximal correlation called Soft-HGR, proposed in [29], which relaxes this constraint:

$$\mathrm{HGR}\_{\mathrm{soft}}(X,Y) \triangleq \max\_{\substack{\mathbb{E}\left[\mathbf{g}(X)\right] = \mathbf{0} \\ \mathbb{E}\left[\mathbf{h}(Y)\right] = \mathbf{0}}} \mathbb{E}\left[\mathbf{g}^{\mathrm{T}}(X)\,\mathbf{h}(Y)\right] - \frac{1}{2} \,\mathrm{tr}\left(\mathrm{cov}\left[\mathbf{g}(X)\right] \mathrm{cov}\left[\mathbf{h}(Y)\right]\right), \tag{19}$$

where cov[*X*] is the covariance matrix of *X*. [29] shows that this Soft-HGR formulation can be viewed as a low-rank approximation of the original HGR maximal correlation problem in the discrete case. Then, our learning objective becomes:

$$\min\_{\substack{\mathbf{f}\colon\mathcal{X}\to\mathbb{R}^{m}\\T\colon\mathbb{R}^{m}\to\mathcal{Y}}} \; \max\_{\substack{\mathbf{g}\colon\mathbb{R}^{m}\to\mathbb{R}^{k},\ \mathbf{h}\colon\mathcal{D}\to\mathbb{R}^{k}\\\mathbb{E}[\mathbf{g}(\mathbf{f}(X))]=\mathbb{E}[\mathbf{h}(D)]=\mathbf{0}}} \mathcal{C},\tag{20}$$

where

$$\mathcal{C} = L(T(\mathbf{f}(X)), Y) + \lambda \mathbb{E} \left[ \mathbf{g}^{\mathrm{T}}(\mathbf{f}(X)) \, \mathbf{h}(D) \right] - \frac{\lambda}{2} \, \mathrm{tr} \left( \mathrm{cov} [\mathbf{g}(\mathbf{f}(X))] \, \mathrm{cov} [\mathbf{h}(D)] \right).$$

We solve this optimization by alternating between optimizing **f**, *T* and optimizing **g**, **h**. In practice, we implement this by alternating between one step of gradient descent for **f** and *T* and five steps of gradient descent on **g** and **h** to allow the maximal correlation functions to adapt to the changing of features **f**.
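For fixed **g** and **h**, the quantity inside the maximization in (19) can be estimated from a batch of samples. The sketch below is an assumed NumPy estimator (the function name `soft_hgr_objective` is hypothetical); in an actual training loop the same expression would be written in an automatic-differentiation framework so that the alternating gradient steps described above can be taken.

```python
import numpy as np

def soft_hgr_objective(G, H):
    """Empirical Soft-HGR objective for feature matrices G, H of shape (n, k)."""
    n = G.shape[0]
    Gc = G - G.mean(axis=0)              # enforce the zero-mean constraints
    Hc = H - H.mean(axis=0)
    inner = np.sum(Gc * Hc) / (n - 1)    # estimate of E[g(X)^T h(Y)]
    cov_g = Gc.T @ Gc / (n - 1)          # estimate of cov[g(X)]
    cov_h = Hc.T @ Hc / (n - 1)          # estimate of cov[h(Y)]
    return float(inner - 0.5 * np.trace(cov_g @ cov_h))

# Hypothetical check: identical standardized scalar features give 1 - 1/2.
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
x = (x - x.mean()) / x.std(ddof=1)
value_same = soft_hgr_objective(x[:, None], x[:, None])
value_zero = soft_hgr_objective(x[:, None], np.zeros((1000, 1)))
```

In the objective of (20), this estimate is multiplied by *λ* and added to the prediction loss; the inner steps on **g**, **h** increase it, while the outer steps on **f**, *T* decrease the total.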

#### 3.2.2. Separation

For separation, we use a similar argument as in the discrete case to ensure the conditional independence. Specifically, we solve the following optimization problem:

$$\min\_{\substack{\mathbf{f}\colon\mathcal{X}\to\mathbb{R}^{m}\\T\colon\mathbb{R}^{m}\to\mathcal{Y}}} L(T(\mathbf{f}(X)), Y)+\lambda\left(\mathrm{HGR}\_{\mathrm{soft}}(\mathbf{f}(X), D\otimes Y)-\mathrm{HGR}\_{\mathrm{soft}}(\mathbf{f}(X), Y)\right).\tag{21}$$

Note that for the first Soft-HGR term, we use **g**, **h** to denote the maximal correlation functions, and **g**′, **h**′ to denote the functions for the second term. Similar to the discrete case, the difference term allows us to approximate the conditional mutual information using two unconditional terms. Once again, we solve this optimization by alternating between optimizing **f**, *T* and optimizing **g**, **h**, **g**′, **h**′.

#### 3.2.3. Few-Shot Learning

In the continuous case, our learning objective can also be applied a posteriori in a few-shot setting with a classifier that has already been trained in a fairness-unaware manner on a large number of samples without the sensitive attribute label. In this case, we can formulate our objective as before and use the few samples containing the sensitive attribute to further train the network and force it to learn fairer features that are still predictive of the desired labels.

#### **4. Experimental Results**

In order to illustrate the effectiveness of our algorithms, we run experiments using the proposed algorithms on discrete (Adult and COMPAS) and continuous (Communities and Crimes) datasets.

#### *4.1. Discrete Case*

We test the proposed DTM-based approach on ProPublica's COMPAS recidivism dataset (https://github.com/propublica/compas-analysis (accessed on 14 February 2022)) and the UCI Adult dataset (https://archive.ics.uci.edu/ml/datasets/adult (accessed on 14 February 2022)), which were chosen as they contain categorical features and are used in prior works. More experiments for the discrete case can be found in Appendix A.

For the COMPAS dataset, the goal is to predict whether the individual recidivated (re-offended) (*Y*) using the severity of charge, number of prior crimes, and age category as the decision variables (*X*). As discussed in [8], COMPAS scores are biased against African-Americans, so race is set to be the sensitive attribute (*D*) and filtered to contain only Caucasian and African-American individuals. As for the Adult dataset, the goal is to predict the binary indicator (*Y*) of whether the income of the individual is more than 50K or not based on the following decision variables (*X*): age (quantized to decades) and education (in years), and the sensitive attribute (*D*) is the gender of the individual.

For both datasets, we randomly split the data into 80%/20% training/test samples. We first construct an estimate $\hat{\mathbf{B}}$ of the DTM from the empirical distribution of the training set; then, we solve the proposed optimization problems (11) and (17) using $\hat{\mathbf{B}}$ to obtain fair feature mappings $\hat{\mathbf{f}}(x)$, $\hat{\mathbf{g}}(y)$. The predictions $\hat{Y}$ for the test samples $X'$ are given by plugging the learned feature mappings $\hat{\mathbf{f}}(x')$, $\hat{\mathbf{g}}(y)$ into the MAP rule (15), where $P\_Y$ is estimated by the empirical distribution $\hat{P}\_Y$ of the training set.
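The estimation step can be sketched as a joint histogram followed by the normalization in (5). The function `empirical_dtm`, the integer coding of the samples, and the toy data are assumptions for illustration only; how unobserved symbols should be handled is not specified in the text, and the sketch simply sets the corresponding entries to zero.

```python
import numpy as np

def empirical_dtm(a, x, n_a, n_x):
    """Estimate B_{A,X} from paired integer-coded samples a[i] in [0, n_a), x[i] in [0, n_x)."""
    a, x = np.asarray(a), np.asarray(x)
    counts = np.zeros((n_a, n_x))
    np.add.at(counts, (a, x), 1.0)       # joint histogram of (a, x)
    P_ax = counts / counts.sum()
    P_a, P_x = P_ax.sum(axis=1), P_ax.sum(axis=0)
    denom = np.sqrt(np.outer(P_a, P_x))
    # Symbols never observed in training would give 0/0; leave those entries at zero.
    return np.divide(P_ax, denom, out=np.zeros_like(P_ax), where=denom > 0)

# Hypothetical training samples: labels y and a single categorical feature x.
y_train = [0, 0, 1, 1, 1]
x_train = [0, 1, 1, 2, 2]
B_hat = empirical_dtm(y_train, x_train, n_a=2, n_x=4)  # symbol x = 3 never occurs
```

When *X* consists of several categorical variables, one option is to encode each distinct combination as a single integer before calling the function.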

For the independence case, we compare the tradeoff between the performance and the discrimination achieved by our method with that of the optimized pre-processing methods proposed in [8]. Note that we adopt the same settings as the experiments in [8] for a fair comparison, and the reported results for their method are taken from their work. We plot the area under the ROC curve (AUC) of $\hat{P}\_{Y|X}(y|x')$ compared to the true test labels $Y'$ against the following standard discrimination measure derived from legal proceedings [4]:

$$J = \max\_{d, d' \in \mathcal{D}} \left| \mathbb{P}\_{\hat{Y}|D}(1|d) / \mathbb{P}\_{\hat{Y}|D}(1|d') - 1 \right|. \tag{22}$$
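The measure in (22) can be computed from binary predictions and group labels as in the sketch below. The function name and the toy data are hypothetical, and the sketch assumes that every group receives at least one positive prediction, since otherwise a ratio is undefined.

```python
import numpy as np

def discrimination(y_hat, d):
    """J in (22): max over ordered pairs (d, d') of |P(Yhat=1|d) / P(Yhat=1|d') - 1|."""
    y_hat, d = np.asarray(y_hat), np.asarray(d)
    rates = np.array([np.mean(y_hat[d == v] == 1) for v in np.unique(d)])
    ratios = rates[:, None] / rates[None, :]   # ratios[a, b] = rate_a / rate_b
    return float(np.max(np.abs(ratios - 1.0)))

# Hypothetical predictions: positive rate 0.5 in group 0 and 0.25 in group 1.
y_hat = [1, 1, 0, 0, 1, 0, 0, 0]
d = [0, 0, 0, 0, 1, 1, 1, 1]
J = discrimination(y_hat, d)
```

In this example the two ratios are 2 and 0.5, so the maximum deviation from 1 is attained by the ratio 2.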

Figures 1 and 2 (Top) show the results. For both datasets, it can be seen that simply dropping the sensitive attribute *D* and applying logistic regression (LR) and random forest (RF) algorithms cannot ensure independence between $\hat{Y}$ and *D*. However, the proposed DTM-based algorithm provides a tradeoff between performance and discrimination by varying the value of the regularizer *λ* in the optimization (11); it outperforms the optimized pre-processing methods in [8] on the Adult dataset and achieves similar performance on the COMPAS dataset. More importantly, the DTM-based algorithm provides a smooth tradeoff curve between the performance and discrimination, so that a desired level of fairness can be achieved by setting *λ* in practice. In addition, since our method only requires us to perform an eigendecomposition, it runs significantly faster than the optimized pre-processing method, which needs to solve a much more complex optimization problem. Empirically, we find at least a tenfold speed-up in runtime compared to the existing methods.

**Figure 1.** Regularization results on the COMPAS dataset, with AUC plotted against discrimination measure for independence (**Top**), and accuracy plotted against DEO for separation (**Bottom**), respectively.

**Figure 2.** Regularization results on Adult dataset, with AUC plotted against discrimination measure for independence (**Top**), and accuracy plotted against DEO for separation (**Bottom**), respectively.

For the separation criterion, we compare the balanced accuracy achieved by our algorithm with that of the adversarial debiasing method in [20] (implementation given in [2]) against the difference in equalized opportunities (DEO), which is another standard measure used commonly in the literature:

$$\mathrm{DEO} = \left| \mathbb{P}(\hat{Y} = 1 | D = 1, \ Y = 1) - \mathbb{P}(\hat{Y} = 1 | D = 0, \ Y = 1) \right|. \tag{23}$$
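The DEO in (23) is the absolute gap between the true positive rates of the two groups. A minimal sketch with a hypothetical function name and toy data follows; it assumes binary *D* coded as 0/1 and that each group contains at least one sample with *Y* = 1.

```python
import numpy as np

def deo(y_hat, y, d):
    """Difference in equalized opportunities (23) for binary labels and a binary group d."""
    y_hat, y, d = (np.asarray(v) for v in (y_hat, y, d))
    tpr_1 = np.mean(y_hat[(d == 1) & (y == 1)] == 1)   # P(Yhat=1 | D=1, Y=1)
    tpr_0 = np.mean(y_hat[(d == 0) & (y == 1)] == 1)   # P(Yhat=1 | D=0, Y=1)
    return float(abs(tpr_1 - tpr_0))

# Hypothetical data: true positive rate 1.0 for D = 1 and 0.5 for D = 0.
y = [1, 1, 1, 1, 0, 0]
d = [1, 1, 0, 0, 1, 0]
y_hat = [1, 1, 1, 0, 0, 1]
gap = deo(y_hat, y, d)
```

Samples with *Y* = 0 do not enter the computation, which is what distinguishes this measure from the independence measure in (22).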

The results on the COMPAS and Adult datasets are presented in Figures 1 and 2 (Bottom). Compared to naïve logistic regression, the proposed DTM-based algorithm dramatically decreases the DEO while maintaining similar accuracy on both datasets, and it outperforms the adversarial debiasing method in [20] on the Adult dataset. We note that the accuracy and DEO curve achieved by the proposed algorithm in the separation setting has a smaller range than that in the independence setting. This is because the regularizer *λ* is restricted to $\lambda \in [0, 1)$ in the separation optimization problem (17), but only to *λ* > 0 in the optimization problem (11). More details about the influence of the regularizer *λ* can be found in Appendix A.

#### *4.2. Continuous Case*

In the continuous case, we experiment on the Communities and Crimes (C&C) dataset (http://archive.ics.uci.edu/ml/datasets/communities+and+crime (accessed on 14 February 2022)). The goal is to predict the crime rate *Y* of a community given a set of 121 statistics *X* (distributions of income, age, urban/rural, etc.). The 122nd statistic (percentage of black people in the community) is used as the sensitive variable *D*. All variables in this dataset are real-valued. The dataset was split into 1794 training and 200 test samples. Following [9], we use a neural net with a 50-node hidden layer (which we denote as *f*(*x*)) and train a predictor $\hat{y} = T(f(x))$ with the mean squared error (MSE) loss and the Soft-HGR penalty, varying *λ*. For Soft-HGR, we use two two-layer NNs with scalar outputs as the two maximal correlation functions **g** and **h**, and we train them according to (20) (independence) or (21) (separation). We then compute the test MSE and test "discrimination" in each case.

For independence, our metric was $I(\hat{Y}; D)$, which was approximated using a standard *k*NN-based mutual information estimator [38]. For separation, we computed $I(\hat{Y}; D|Y)$ using the same estimator. We report the results of our experiment as well as that of the *χ*<sup>2</sup> method of [9] with the same architecture. The results of the experiments are presented in Figure 3.

**Figure 3.** Independence (**top**) and Separation (**bottom**) regularization on the C&C dataset, with MSE plotted against $I(\hat{Y}; D|Y)$.

As expected, we see a tradeoff between the MSE and discrimination, creating a frontier of possible values. We also see that the Soft-HGR penalty provides modest gains compared to the *χ*<sup>2</sup> method for both independence and separation.

Moreover, our method runs significantly faster than the *χ*<sup>2</sup> method (on the order of seconds per iteration for our method versus just under a minute per iteration for the comparison method), as the *χ*<sup>2</sup> method requires computation over a mesh grid of a Gaussian KDE, which scales with the product of the number of "bins" (mesh points) and the number of training samples, while our method only scales with the number of samples (*O*(*n*)), since it only requires passing over all the training samples a constant number of times per iteration. For large bandwidths, *d* can become quite large. KDE methods also scale poorly, in an exponential manner, with dimensionality (see [39]); thus, if *d* is high-dimensional, the *χ*<sup>2</sup> method would run much slower than our method, which can take in an arbitrarily sized input and scales linearly with the dimensionality of the input multiplied by the number of samples. Empirically, we find that our method runs around five times faster.

We also run experiments to illustrate how our method's simplicity allows it to adapt to the few-shot, few-epoch regime faster than the *χ*<sup>2</sup> method. We take 10 "few-shot" samples from the training set; then, we train a network to predict *Y* from *X* without any fairness regularizer using the full training set. Then, we run five more iterations of gradient descent on the trained model using the fairness-regularized objective and the 10 few-shot samples, and we compare the separation results between the Soft-HGR and *χ*<sup>2</sup> regularizers. We choose to compare to the *χ*<sup>2</sup> regularizer as it is one of the few methods designed to handle continuous *D*. The results are shown in Figure 4. Once again, we see the tradeoff curve, and we see that our method outperforms the *χ*<sup>2</sup> method and appears to be competitive with the standard case in just a few iterations, while the *χ*<sup>2</sup> method is still far from achieving the original MSE. We also vastly outperform the baseline (before fairness regularization) model in reducing discrimination, at the cost of only a small increase in error. Thus, in situations where, due to ethical/legal issues, only a few samples labeled with the sensitive attribute can be collected, fairness can still be enforced.

**Figure 4.** Independence (**top**) and Separation (**bottom**) regularization on the C&C dataset in the *few-shot* settings, with MSE plotted against $I(\hat{Y}; D|Y)$.

#### **5. Conclusions**

As machine learning algorithms gain more relevance, more focus will be placed upon ensuring their fairness. We have presented a framework using the HGR maximal correlation, which provides effective and computationally efficient methods for enforcing independence and separation constraints, and derived algorithms for fair learning on discrete and continuous data, which provide competitive tradeoff curves. In addition, we have also shown promising results in the few-shot setting and suggested a method for rapidly adapting a classifier to improve fairness. In the future, it would be beneficial to extend this framework to other criteria (e.g., sufficiency) and to determine how to use this framework to enforce fairness in a transfer learning setup coupled with the few-shot setting, so as to fairly adapt a classifier to a new task.

However, this method requires knowledge of the sensitive attribute for all samples during the training time, which can be impractical in some cases. Further extension into developing these regularizers with a limited number of such samples would be very useful.

**Author Contributions:** J.L.: software, continuous algorithm design, writing—original draft. Y.B.: software, discrete algorithm design, writing—original draft. P.S. and R.P.: conceptualization, writing review and editing. G.W.W.: supervision, funding acquisition, and writing—review and editing. L.K. and R.S.F.: writing—review and editing. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported, in part, by the MIT-IBM Watson AI Lab under Agreement No. W1771646, and NSF under Grant No. CCF-1717610.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Data and code can be found in Section 4.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A. Effect of the Regularizer in Discrete Case**

In this section, we provide additional experiment results to demonstrate how the performance of classification and fairness measures change with different values of the regularizer.

We use the same setup described in Section 4.1 and present the results in Figures A1–A4. In Figures A1 and A2, we plot the achieved AUC and discrimination (measured with *J* in (22)) versus the value of *λ* for both the COMPAS and Adult data using the independence criterion. In Figures A3 and A4, we plot the accuracy of the classifier and DEO versus *λ* for both datasets using the separation criterion. As shown by all the figures, the classification performance and the fairness measures all decrease as we increase *λ*, and the proposed DTM-based algorithm is able to provide a smooth tradeoff curve between the performance and fairness measures.

Note that the value of the regularizer *λ* is restricted in the separation optimization problem to $\lambda \in [0, 1)$; therefore, the range of the achieved performance in Figures A3 and A4 is smaller than that in Figures A1 and A2.

**Figure A1.** Results for independence regularization on the discrete COMPAS dataset, AUC results (**Top**) and discrimination measure *J* (**Bottom**) are plotted with respect to different values of *λ*.

**Figure A2.** Results for independence regularization on the discrete Adult dataset, AUC results (**Top**) and discrimination measure *J* (**Bottom**) are plotted with respect to different values of *λ*.

**Figure A3.** Results for separation regularization on the discrete COMPAS dataset, accuracy (**Top**) and DEO (**Bottom**) are plotted with respect to different values of *λ*.

**Figure A4.** Results for separation regularization on the discrete Adult dataset, accuracy (**Top**) and DEO (**Bottom**) are plotted with respect to different values of *λ*.

#### **References**


## *Article* **CTRL: Closed-Loop Transcription to an LDR via Minimaxing Rate Reduction**

**Xili Dai 1,2,†, Shengbang Tong 1,†, Mingyang Li 3,†, Ziyang Wu 4,†, Michael Psenka 1, Kwan Ho Ryan Chan 5, Pengyuan Zhai 6, Yaodong Yu 1, Xiaojun Yuan 2, Heung-Yeung Shum <sup>4</sup> and Yi Ma 1,3,\***


**Abstract:** This work proposes a new computational framework for learning a structured generative model for real-world datasets. In particular, we propose to learn *a Closed-loop Transcription* between a multi-class, multi-dimensional data distribution and a *Linear discriminative representation* (*CTRL*) in the feature space that consists of multiple independent multi-dimensional linear subspaces. Specifically, we argue that the optimal encoding and decoding mappings sought can be formulated as a *two-player minimax game between the encoder and decoder* for the learned representation. A natural utility function for this game is the so-called *rate reduction*, a simple information-theoretic measure for distances between mixtures of subspace-like Gaussians in the feature space. Our formulation draws inspiration from closed-loop error feedback from control systems and avoids expensive evaluation and minimization of approximated distances between arbitrary distributions in either the data space or the feature space. To a large extent, this new formulation unifies the concepts and benefits of Auto-Encoding and GAN and naturally extends them to the settings of learning a *both discriminative and generative* representation for multi-class and multi-dimensional real-world data. Our extensive experiments on many benchmark imagery datasets demonstrate tremendous potential of this new closed-loop formulation: under fair comparison, the visual quality of the learned decoder and the classification performance of the encoder are competitive with, and arguably better than, existing methods based on GAN, VAE, or a combination of both. Unlike existing generative models, the so-learned features of the multiple classes are structured instead of hidden: different classes are explicitly mapped onto corresponding *independent principal subspaces* in the feature space, and diverse visual attributes within each class are modeled by the *independent principal components* within each subspace.

**Keywords:** closed-loop transcription; linear discriminative representation; rate reduction; minimax game

#### **1. Introduction**

**Citation:** Dai, X.; Tong, S.; Li, M.; Wu, Z.; Psenka, M.; Chan, K.H.R.; Zhai, P.; Yu, Y.; Yuan, X.; Shum, H.-Y.; et al. CTRL: Closed-Loop Transcription to an LDR via Minimaxing Rate Reduction. *Entropy* **2022**, *24*, 456. https://doi.org/10.3390/e24040456

Academic Editors: Lizhong Zheng and Chao Tian

Received: 10 February 2022 Accepted: 17 March 2022 Published: 25 March 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

One of the most fundamental tasks in modern data science and machine learning is to learn and model complex distributions (or structures) of real-world data, such as images or texts, from a set of observed samples. By "to learn and model", one typically means that we want to establish a (parametric) mapping between the distribution of the real data, say $\mathbf{x} \in \mathbb{R}^D$, and a more compact random variable, say $\mathbf{z} \in \mathbb{R}^d$:

$$f(\cdot, \theta) : \mathbf{x} \in \mathbb{R}^D \mapsto \mathbf{z} \in \mathbb{R}^d \quad \text{or the inverse} \quad g(\cdot, \eta) : \mathbf{z} \in \mathbb{R}^d \mapsto \mathbf{x} \in \mathbb{R}^D,\tag{1}$$

where *z* has a certain standard structure or distribution (e.g., normal distributions). The so-learned representation or feature *z* would be much easier to use for either generative (e.g., decoding or replaying) or discriminative (e.g., classification) purposes, or both.

**Data embedding versus data transcription.** *Be aware* that the support of the distribution of *x* (and that of *z*) is typically *extremely low-dimensional* compared to that of the ambient space (for instance, the well-known CIFAR-10 dataset consists of RGB images with a resolution of 32 × 32; despite the images being in a space of $\mathbb{R}^{3072}$, our experiments will show that the intrinsic dimension of each class is less than a dozen, even after they are mapped into a feature space of $\mathbb{R}^{128}$); hence, the above mapping(s) may not be uniquely defined based on the support in the space $\mathbb{R}^D$ (or $\mathbb{R}^d$). In addition, the data *x* may contain multiple components (e.g., modes, classes), and the intrinsic dimensions of these components are not necessarily the same. Hence, without loss of generality, we may assume the data *x* to be distributed over a union of low-dimensional nonlinear submanifolds $\cup\_{j=1}^{k} \mathcal{M}\_j \subset \mathbb{R}^D$, where each submanifold $\mathcal{M}\_j$ is of dimension $d\_j \ll D$. Regardless, we hope the learned mappings *f* and *g* are (locally dimension-preserving) *embedding* maps [1] when restricted to each of the components $\mathcal{M}\_j$. In general, the dimension of the feature space *d* needs to be significantly higher than all of these intrinsic dimensions of the data: $d > d\_j$. In fact, it should preferably be higher than the sum of all the intrinsic dimensions: $d \geq d\_1 + \cdots + d\_k$, since we normally expect that the features of different components/classes can be made fully independent or orthogonal in $\mathbb{R}^d$. Hence, without any explicit control of the mapping process, the actual features associated with images of the data under the embedding could still lie on some arbitrary nonlinear low-dimensional submanifolds inside the feature space $\mathbb{R}^d$. The distribution of the learned features remains "latent" or "hidden" in the feature space.

So, for features of the learned mappings (1) to be truly convenient to use for purposes such as data classification and generation, the goal of learning such mappings should be not only to reduce the dimension of the data *x* from *D* to *d* but also to determine explicitly and precisely how the mapped feature *z* = *f*(*x*) is distributed within the feature space $\mathbb{R}^d$, in terms of both its support and density. Moreover, we want to establish an explicit map *g*(·) from this distribution of feature *z* back to the data space such that the distribution of its image $\hat{\mathbf{x}} = g(\mathbf{z})$ (closely) matches that of *x*. To differentiate from finding arbitrary feature embeddings (as most existing methods do), we refer to embeddings of data onto an explicit family of models (structures or distributions) in the feature space as *data transcription*.

**Paper Outline.** This work shows how such transcription can be achieved for real-world visual data with one important family of models: the linear discriminative representation (LDR) introduced by [2]. Before we formally introduce our approach in Section 2, in the remainder of this section we first discuss two existing approaches, namely auto-encoding and GAN, that are closely related to ours. As these approaches are popular and well known to readers, we will mainly point out some of their main conceptual and practical limitations that have motivated this work. Although our objective and framework will be mathematically formulated, the main purpose of this work is to verify the effectiveness of this new approach empirically through extensive experimentation, organized and presented in Section 3 and Appendix A. Our work presents compelling evidence that the closed-loop data transcription problem and our rate-reduction-based formulation deserve serious attention from the information-theoretical and mathematical communities. It also raises many exciting open theoretical problems and hypotheses about learning, representing, and generating distributions or manifolds of high-dimensional real-world data. We discuss some open problems in Section 4 and new directions in Section 5. Source code can be found at https://github.com/Delay-Xili/LDR (accessed on 9 February 2022).

#### *1.1. Learning Generative Models via Auto-Encoding or GAN*

**Auto-Encoding and its variants.** In the machine-learning literature, roughly speaking, there have been two representative approaches to such a distribution-learning task. One is the classic "Auto-Encoding" (AE) approach [3,4], which aims to simultaneously learn an encoding mapping *f* from *x* to *z* and an (inverse) decoding mapping *g* from *z* back to *x*:

$$\mathbf{X} \xrightarrow{\ f(\mathbf{x}, \boldsymbol{\theta})\ } \mathbf{Z} \xrightarrow{\ g(\mathbf{z}, \boldsymbol{\eta})\ } \hat{\mathbf{X}}.\tag{2}$$

Here, we use bold capital letters to indicate a matrix of finite samples *X* = [*x*<sub>1</sub>, ... , *x<sub>n</sub>*] ∈ ℝ<sup>*D*×*n*</sup> of *x* and their mapped features *Z* = [*z*<sub>1</sub>, ... , *z<sub>n</sub>*] ∈ ℝ<sup>*d*×*n*</sup>, respectively. Typically, one wishes for two properties: first, the decoded samples *X̂* are "similar" or close to the original *X*, say in terms of maximum likelihood *p*(*X*); and second, the (empirical) distribution of the mapped samples *Z*, denoted as *p̂*(*z*|*X*), is close to a certain desired prior distribution *p*(*z*), say some much lower-dimensional multivariate Gaussian. (The classical PCA can be viewed as a special case of this task. In fact, the original auto-encoding is precisely cast as *nonlinear* PCA [3], assuming the data lie on only one nonlinear submanifold ℳ.)

However, it is typically very difficult, and often computationally intractable, to maximize the likelihood function *p*(*X*) or to minimize a certain "distance", say the *KL-divergence* *D*<sub>*KL*</sub>(*p̂*, *p*), between *p̂*(*z*|*X*) and *p*(*z*). Except for simple distributions such as Gaussians, the KL-divergence usually does not have a closed form, even for a mixture of Gaussians. The likelihood and the KL-divergence become ill-conditioned when the supports of the distributions are low-dimensional (i.e., degenerate) and non-overlapping (which is almost always the case in practice when dealing with distributions of high-dimensional data in high-dimensional spaces). So in practice, one typically chooses to minimize instead certain approximate bounds or surrogates derived with various simplifying assumptions on the distributions involved, as is the case in variational auto-encoding (VAE) [5,6]. As a result, even after learning, the precise posterior distribution *p̂*(*z*|*X*) remains unclear or hidden inside the feature space.
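To make the ill-conditioning concrete, the following minimal sketch (ours, not part of the original experiments) evaluates the standard closed-form KL-divergence between two zero-mean Gaussians in ℝ<sup>2</sup> whose mass concentrates along orthogonal directions. The helper names `kl_gaussians` and `thin_gaussian` and the chosen widths are illustrative assumptions.

```python
import numpy as np

def kl_gaussians(mu0, cov0, mu1, cov1):
    """Closed-form KL( N(mu0, cov0) || N(mu1, cov1) ) for full-rank covariances."""
    d = mu0.shape[0]
    cov1_inv = np.linalg.inv(cov1)
    diff = mu1 - mu0
    _, logdet0 = np.linalg.slogdet(cov0)
    _, logdet1 = np.linalg.slogdet(cov1)
    return 0.5 * (np.trace(cov1_inv @ cov0) + diff @ cov1_inv @ diff
                  - d + logdet1 - logdet0)

def thin_gaussian(direction, width):
    """2-D covariance: unit variance along `direction`, variance `width` across it."""
    u = direction / np.linalg.norm(direction)
    return width * np.eye(2) + (1.0 - width) * np.outer(u, u)

mu = np.zeros(2)
kl_values = {}
for width in (1e-1, 1e-3, 1e-5):
    p = thin_gaussian(np.array([1.0, 0.0]), width)  # concentrated near the x-axis
    q = thin_gaussian(np.array([0.0, 1.0]), width)  # concentrated near the y-axis
    kl_values[width] = kl_gaussians(mu, p, mu, q)
    print(f"width={width:g}  KL={kl_values[width]:.3f}")
```

In this toy setting the divergence equals (1/*w* + *w* − 2)/2 for width *w*, so it grows without bound as the two supports become degenerate, which is the behavior described above.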

In this work, we will show that if we require the (distribution of the) learned feature *z* to be a mixture of subspace-like Gaussians, a natural closed-form distance can be introduced for such distributions based on rate distortion from information theory. In addition, the optimal solution to the feature representation within this family can be learned directly from the data *without specifying any target p*(*z*) *in advance*; specifying such a target is particularly difficult in practice when the distribution of a mixed dataset is multi-modal and each component may have a different dimension.

**GAN and its variants.** Compared to measuring distribution distance in the (often controlled) feature space *z*, a much more challenging issue with the above auto-encoding approach is how to effectively measure the distance between the decoded samples *X̂* and the original *X* in the data space *x*. For instance, for visual data such as images, their distributions *p*(*X*) or generative models *p*(*X*|*z*) are often not known. Despite extensive studies in the computer vision and image processing literature [7], it remains elusive to find a good measure of similarity for real images that is both efficient to compute and effective in capturing the visual quality and semantic information of the images. Precisely due to such difficulties, it was suggested early on by [8] that one may have to take a discriminative approach to learn the distribution or a generative model for visual data. More recently, *Generative Adversarial Nets (GAN)* [9] offered an ingenious idea to alleviate this difficulty by utilizing a powerful discriminator *d*, usually modeled and learned by a deep network, to discern differences between the generated samples *X̂* and the real ones *X*:

$$\mathbf{Z} \xrightarrow{\ g(\mathbf{z}, \boldsymbol{\eta})\ } \hat{\mathbf{X}}, \quad \mathbf{X} \xrightarrow{\ d(\mathbf{x}, \boldsymbol{\theta})\ } \{\mathbf{0}, \mathbf{1}\}.\tag{3}$$

To a large extent, such a discriminator plays the role of minimizing a certain distributional distance, e.g., the *Jensen–Shannon divergence*, between the data *X* and *X̂*. Compared to the KL-divergence, the JS-divergence is well-defined even if the supports of the two distributions are non-overlapping. (However, the JS-divergence does not have a closed-form expression even between two Gaussians, whereas the KL-divergence does.) Nevertheless, as shown in [10], since the data distributions are low-dimensional, the JS-divergence can be highly ill-conditioned to optimize. (This may explain why many additional heuristics are typically used in many subsequent variants of GAN.) So, instead, one may choose to replace the JS-divergence with the earth mover's distance or the Wasserstein distance. However, both the JS-divergence and the W-distance can only be approximately computed between two general distributions. (For instance, the W-distance requires one to compute the maximal difference between expectations of the two distributions over all 1-Lipschitz functions.) Furthermore, neither the JS-divergence nor the W-distance has a closed-form formula, even for Gaussian distributions. (The (ℓ<sup>1</sup>-norm) W-distance can be bounded by the (ℓ<sup>2</sup>-norm) W<sub>2</sub>-distance, which has a closed form [11]. However, as is well known in high-dimensional geometry, the ℓ<sup>1</sup>-norm and the ℓ<sup>2</sup>-norm deviate significantly in terms of their geometric and statistical properties as the dimension becomes high [12]. The bound can become very loose.) However, from a data representation perspective, *subspace-like Gaussians (e.g., PCA) or a mixture of them are the most desirable family of distributions that we wish our features to become.* This would make all subsequent tasks (generative or discriminative) much easier. In this work, we will show how to achieve this with a different fundamental metric, known as the rate reduction, introduced by [13].

The original GAN aims to directly learn a mapping *g*(·), called a generator, from a standard distribution (say, a low-dimensional Gaussian random field) to the real (visual) data distribution in a high-dimensional space. However, distributions of real-world data can be rather sophisticated and often contain *multiple* classes and *multiple* factors in each class [14]. This makes learning the mapping *g* rather challenging in practice, with difficulties such as *mode-collapse* [15]. As a result, many variants of GAN have been subsequently developed in order to improve the stability and performance in learning multiple modes and disentangling different factors in the data distribution, such as *Conditional GAN* [16–20], *InfoGAN* [21,22], or *Implicit Maximum Likelihood Estimation (IMLE)* [23,24]. In particular, to learn a generator for multi-class data, the prevalent conditional GAN literature requires label information as conditional inputs [16,25–27]. Recently, [28,29] have proposed training a *k*-class GAN by generalizing the two-class cross entropy to a (*k* + 1)-class cross entropy. In this work, *we will introduce a more refined* 2*k-class measure* for the *k* real and *k* generated classes. In addition, to avoid features for each class collapsing to a singleton [30], instead of cross entropy, *we will use the so-called rate-reduction measure that promotes multi-mode and multi-dimension in the learned features* [13]. One may view the rate reduction as a metric distance that has closed-form formulae for a mixture of (subspace-like) Gaussians, whereas neither the JS-divergence nor the W-distance can be computed in closed form (even between two Gaussians).

Another line of research concerns how to stabilize the training of GANs. SN-GAN [31] has shown that spectral normalization on the discriminator is rather effective, which we will adopt in our work, although our formulation is not so sensitive to such choices designed for GAN (see the ablation study in Appendix A.9). PacGAN [32] shows that the training stability can be significantly improved by packing a pair of real and generated images together for the discriminator. Inspired by this work, *we show how to generalize such an idea to discriminating an arbitrary number of pairs of real and decoded samples without concatenating the samples.* Our results in this work even suggest that the larger the batch size discriminated, the better (see the ablation study in Appendix A.10). In addition, ref. [29] has shown that optimizing the latent features leads to state-of-the-art visual quality. Their method is based on the deep compressed sensing GAN [28]. Hence, there are strong reasons to believe that their method essentially utilizes the *compressed sensing* principle [12] to implicitly exploit the low-dimensionality of the feature distribution. Our framework *will explicitly expose and exploit such low-dimensional structures on the learned feature distribution.*

**Combination of AE and GAN.** Although AE (VAE) and GAN originated with somewhat different motivations, they have evolved into popular and effective frameworks for learning and modeling complex distributions of many real-world data such as images. (In fact, in some idealistic settings, it can be shown that AE and GAN are actually equivalent: for instance, in the LQG setting, the authors of [33] have shown that GAN coincides with the classic PCA, which is precisely the solution to auto-encoding in the linear case.) Many recent efforts combine auto-encoding and GAN to produce more powerful generative frameworks for more diverse data sets, such as [15,34–42]. As we will see, in our framework, AE and GAN can be naturally interpreted as two different segments of a closed-loop data transcription process. However, unlike in GAN or AE (VAE), the "origin" or "target" distribution of the feature *z* will no longer be specified *a priori*, and is instead learned from the data *x*. In addition, *this intrinsically low-dimensional distribution of z (with all of its low-dimensional supports) is explicitly modeled as a mixture of orthogonal subspaces (or independent Gaussians) within the feature space* ℝ<sup>*d*</sup>, sometimes known as the principal subspaces.

**Universality of Representations.** Note that GANs (and most VAEs) are typically designed without explicit modeling assumptions on the distribution of the data or of the features. Many even attribute their empirical success in learning distributions of complicated data such as images to this "universal" distribution-learning capability (which assumes that minimizing distances between arbitrary distributions in high-dimensional space can be solved efficiently, an assumption that unfortunately has many caveats and is often impractical). In this work, we will provide empirical evidence that such an "arbitrary distribution learning machine" might not be necessary. (In fact, it may be computationally intractable in general.) A *controlled and deformed* family of low-dimensional linear subspaces (Gaussians) can be more than powerful and expressive enough to model real-world visual data. (In fact, a Gaussian mixture model is already a universal approximator of almost arbitrary densities [43]. Hence, we do not lose any generality at all.) As we will also see, once we can place a proper and precise metric on such models, the associated learning problems can become much better conditioned and more amenable to rigorous analysis and performance guarantees in the future.

#### *1.2. Learning Linear Discriminative Representation via Rate Reduction*

Recently, the authors in [2] proposed a new objective for deep learning that aims to learn a *linear discriminative representation* (LDR) for multi-class data. The basic idea is to map distributions of real data, potentially on *multiple* nonlinear submanifolds ∪<sup>*k*</sup><sub>*j*=1</sub> ℳ<sub>*j*</sub> ⊂ ℝ<sup>*D*</sup>, to a family of canonical models consisting of multiple independent (or orthogonal) linear subspaces, denoted as ∪<sup>*k*</sup><sub>*j*=1</sub> 𝒮<sub>*j*</sub> ⊂ ℝ<sup>*d*</sup>. (In classical statistical settings, such nonlinear structures of the data were also referred to as principal curves or surfaces [44,45]. There has been a long quest to extend PCA to handle potential nonlinear low-dimensional structures in data distributions; see [46] for a thorough survey.) To some extent, this generalizes the classic nonlinear PCA [3] to more general/realistic settings where we simultaneously apply *multiple nonlinear PCAs* to data on multiple nonlinear submanifolds. Equivalently, the problem can also be viewed as a nonlinear extension of the classic *Generalized PCA* (GPCA) [46]. (Conventionally, "generalized PCA" refers to generalizing the setting of PCA to multiple *linear* subspaces. Here, we need to further generalize to multiple *nonlinear* submanifolds.) Unlike conventional discriminative methods that only aim to predict class labels as one-hot vectors, the LDR aims to learn the likely multi-dimensional distribution of the data; hence, it is suitable for both discriminative and generative purposes. It has been shown that this can be achieved via maximizing the so-called "rate reduction" objective based on the rate distortion of subspace-like Gaussians [47].

**LDR via MCR<sup>2</sup>.** More precisely, consider a set of data samples *X* = [*x*<sub>1</sub>, ... , *x<sub>n</sub>*] ∈ ℝ<sup>*D*×*n*</sup> from *k* different classes. That is, we have *X* = ∪<sup>*k*</sup><sub>*j*=1</sub> *X<sub>j</sub>*, with each subset of samples *X<sub>j</sub>* belonging to one of the low-dimensional submanifolds: *X<sub>j</sub>* ⊂ ℳ<sub>*j*</sub>, *j* = 1, ... , *k*. Following the notation in [2], we use diagonal matrices **Π**<sup>*j*</sup> to encode class membership: **Π**<sup>*j*</sup>(*i*, *i*) = 1 if sample *i* belongs to class *j* (and **Π**<sup>*j*</sup>(*i*, *i*) = 0 otherwise). One seeks a continuous mapping *f*(·, *θ*) : *x* ↦ *z* from *X* to an optimal representation *Z* = [*z*<sub>1</sub>, ... , *z<sub>n</sub>*] ∈ ℝ<sup>*d*×*n*</sup>:

$$\mathbf{X} \xrightarrow{\ f(\mathbf{x}, \boldsymbol{\theta})\ } \mathbf{Z},\tag{4}$$

which maximizes the following coding rate-reduction objective, known as *the MCR*<sup>2</sup> *principle* [13]:

$$\max_{\mathbf{Z}} \; \Delta R(\mathbf{Z} \mid \boldsymbol{\Pi}, \varepsilon) \doteq \underbrace{\frac{1}{2} \log \det \left( \mathbf{I} + \alpha \mathbf{Z} \mathbf{Z}^{*} \right)}_{R(\mathbf{Z} \mid \varepsilon)} - \underbrace{\sum_{j=1}^{k} \frac{\gamma_j}{2} \log \det \left( \mathbf{I} + \alpha_j \mathbf{Z} \boldsymbol{\Pi}^{j} \mathbf{Z}^{*} \right)}_{R_c(\mathbf{Z} \mid \boldsymbol{\Pi}, \varepsilon)},\tag{5}$$

where *α* = *d*/(*n* *ε*<sup>2</sup>), *α<sub>j</sub>* = *d*/(tr(**Π**<sup>*j*</sup>) *ε*<sup>2</sup>), and *γ<sub>j</sub>* = tr(**Π**<sup>*j*</sup>)/*n* for *j* = 1, ... , *k*. In this paper, for simplicity, we denote Δ*R*(*Z* | **Π**, *ε*) as Δ*R*(*Z*), assuming **Π** and *ε* are known and fixed. The first term *R*(*Z* | *ε*), or *R*(*Z*) for short, is the coding rate of the whole feature set *Z* (coded as a Gaussian source) with a prescribed precision *ε*; the second term *R<sub>c</sub>*(*Z* | **Π**, *ε*), or simply *R<sub>c</sub>*(*Z*), is the average coding rate of the *k* subsets of features *Z<sub>j</sub>* = *f*(*X<sub>j</sub>*) (each coded as a Gaussian).

As has been shown by [13], maximizing the difference between the two terms will expand the whole feature set while compressing and linearizing the features of each of the *k* classes. If the mapping *f* maximizes the rate reduction, it maps the features of different classes into independent (orthogonal) subspaces in ℝ<sup>*d*</sup>. Figure 1 illustrates a simple example of data with *k* = 2 classes (on two submanifolds) mapped to two incoherent subspaces (solid black lines). Notice that, compared to AE (2) and GAN (3), the above mapping (4) is only one-sided: from the data *X* to the feature *Z*. In this work, we will see how to use the rate-reduction metric to establish an inverse mapping from the feature *Z* back to the data *X*, while still preserving the subspace structures in the feature space.
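The rate-reduction objective (5) is straightforward to evaluate numerically. The sketch below is a minimal NumPy illustration under our own assumptions (unnormalized toy features, *ε* = 0.5, and the hypothetical helper names `coding_rate` and `rate_reduction`); it is not the implementation used in the experiments. It compares two classes that occupy orthogonal subspaces with two classes that share one subspace.

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    """R(Z | eps) = 1/2 logdet(I + d/(n eps^2) Z Z^T) for Z of shape (d, n)."""
    d, n = Z.shape
    _, logdet = np.linalg.slogdet(np.eye(d) + (d / (n * eps**2)) * Z @ Z.T)
    return 0.5 * logdet

def rate_reduction(Z, labels, eps=0.5):
    """Delta R(Z | Pi, eps): rate of the whole set minus the weighted class rates."""
    n = Z.shape[1]
    r_c = 0.0
    for j in np.unique(labels):
        Zj = Z[:, labels == j]                      # features of class j
        r_c += (Zj.shape[1] / n) * coding_rate(Zj, eps)  # gamma_j * R(Z_j)
    return coding_rate(Z, eps) - r_c

rng = np.random.default_rng(0)
n_per_class = 50
labels = np.repeat([0, 1], n_per_class)

# Class 0 lives in span(e1, e2) of R^4.
A = np.zeros((4, n_per_class))
A[:2] = rng.standard_normal((2, n_per_class))
# Class 1 either in the orthogonal subspace span(e3, e4) ...
B_orth = np.zeros((4, n_per_class))
B_orth[2:] = rng.standard_normal((2, n_per_class))
# ... or in the same subspace as class 0.
B_same = np.zeros((4, n_per_class))
B_same[:2] = rng.standard_normal((2, n_per_class))

dr_orth = rate_reduction(np.hstack([A, B_orth]), labels)
dr_same = rate_reduction(np.hstack([A, B_same]), labels)
print(f"orthogonal subspaces: {dr_orth:.3f}   shared subspace: {dr_same:.3f}")
```

The orthogonal configuration yields a clearly larger rate reduction than the shared-subspace one, consistent with the claim that maximizing (5) pushes different classes toward independent subspaces.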

**Figure 1. CTRL: A Closed-loop Transcription to an LDR.** The encoder *f* has dual roles: it learns an LDR *z* for the data *x* via maximizing the rate reduction of *z*, and it is also a "feedback sensor" for any discrepancy between the data *x* and the decoded *x̂*. The decoder *g* also has dual roles: it is a "controller" that corrects the discrepancy between *x* and *x̂*, and it also aims to minimize the overall coding rate for the learned LDR.

#### **2. Data Transcription via Rate Reduction**

#### *2.1. Closed-Loop Transcription to an LDR (CTRL)*

One issue with this one-sided LDR learning (4) is that maximizing the above objective (5) tends to expand the dimension of the learned subspace for features in each class. (If the dimension of the feature space *d* is too high, maximizing the rate reduction may overestimate the dimension of each class. Hence, to learn a good representation, one needs to pre-select a proper dimension for the feature space, as was done in the experiments in [13]. In fact, the same "model selection" problem persists even in the simplest single-subspace case, which is the classic PCA [48]. Selecting the correct number of principal components in a heterogeneous noisy situation remains an active research topic [49].) To verify that the learned features neither over-estimate nor under-estimate the data structure, we may consider learning a decoder *g*(·, *η*) : *z* ↦ *x* from the representation *Z* = *f*(*X*, *θ*) back to the data space *x*: *X̂* = *g*(*Z*, *η*), and check how close *X* and *X̂* are or how close their features *Z* and *Ẑ* = *f*(*X̂*, *θ*) are. In principle, the decoder *g* should examine whether all the features learned by the encoder *f* are both necessary and sufficient for achieving this task. The overall pipeline can be illustrated by the following "closed-loop" diagram:

$$\mathbf{X} \xrightarrow{\ f(\mathbf{x}, \boldsymbol{\theta})\ } \mathbf{Z} \xrightarrow{\ g(\mathbf{z}, \boldsymbol{\eta})\ } \hat{\mathbf{X}} \xrightarrow{\ f(\mathbf{x}, \boldsymbol{\theta})\ } \hat{\mathbf{Z}},\tag{6}$$

where the overall model has parameters Θ = {*θ*, *η*}.

Notice that in the above process, the segment from *X* to *X̂* resembles a typical *Auto-Encoding* process, although, as we will soon see, our MCR<sup>2</sup>-based encoder *f* plays an additional role as a discriminator. The segment from *Z* to *Ẑ* resembles the typical GAN process, although, in our context, the distribution of the latent variable *z* will be learned from the data *x*. Despite these connections, as we will soon see, this new closed-loop formulation will allow us to utilize the *error feedback* mechanism (widely practiced in control systems) and directly enforce loop consistency between the encoding and decoding (networks) *without* using any additional discriminator(s) that are typically needed in existing VAE/GAN architectures.

Here, in the specific context of rate reduction, we name this special auto-encoding process "*Transcription to an LDR*", since the maximal rate-reduction principle explicitly transcribes the data *X*, via *f*, to features *Z* on a linear discriminative representation (LDR), which can be subsequently decoded back to the data space as *X̂*, via *g*. (Our extensive experiments on diverse real-world visual datasets suggest that one does not lose any generality or expressiveness by restricting to this special but rich class of models. On the contrary, the restriction significantly simplifies and improves the learning process.) Hence, the encoding and decoding maps *f* and *g* together form a "closed-loop" process, as illustrated in Figure 1. We hope that this closed-loop transcription to an LDR (CTRL) has the following good properties:


Mathematically, we seek an *embedding* of the data *x* supported on certain nonlinear submanifolds ∪<sup>*k*</sup><sub>*j*=1</sub> ℳ<sub>*j*</sub> in the space ℝ<sup>*D*</sup> to the feature *z* on a set of (discriminative) linear subspaces ∪<sup>*k*</sup><sub>*j*=1</sub> 𝒮<sub>*j*</sub> in the feature space ℝ<sup>*d*</sup>. Ideally, both *f* and *g* should be embeddings [1] when restricted to the support of the data distribution or that of the features. (That is, we hope *f*|<sub>ℳ<sub>*j*</sub></sub> and *g*|<sub>𝒮<sub>*j*</sub></sub> are embeddings for all *j* = 1, ... , *k*.) Even more ideally, we hope *f* and *g* are mutually inverse embeddings: *g* ∘ *f* = Id (when restricted to the submanifolds). Nevertheless, if we are only interested in learning the distribution, embeddings of the support often suffice for the intended purposes (e.g., classification or generation). Notice that the above goals are similar to those of many VAE+GAN-related methods in the machine-learning literature, such as BiGAN [38] and ALI [39]. We will discuss the differences between our approach and these existing methods in Section 2.3 (as well as providing some experimental comparisons in Appendix A).

At first sight, this is a rather daunting task, since we are trying to learn over a (seemingly infinite-dimensional) functional space of all embeddings and distributions from finite samples. In this work, we will take a more pragmatic approach and show how one can learn a good encoding, decoding, and representation tuple: (*f* , *g*, *z*) from *X* via tractable computational means. In particular, we will convert the above goals to certain feasible programs that optimize a sensible measure of goodness for the learned representations *Z*.

#### *2.2. Measuring Distances in the Feature Space and Data Space*

**Contractive measure for the decoder.** For the *second* item in the above wishlist, as the representations in the feature space *z* are by design linear subspaces or (degenerate) Gaussians, we have geometrically or statistically meaningful metrics for both samples and distributions in the feature space *z*. For example, we care about the distance between the distributions of the features of the original data *Z* and of the transcribed *Ẑ*. Since the features of each class, *Z<sub>j</sub>* and *Ẑ<sub>j</sub>*, are similar to subspaces/Gaussians, their "distance" can be measured by the rate reduction, with (5) restricted to two sets of equal size:

$$\Delta R(\mathbf{Z}_j, \hat{\mathbf{Z}}_j) \doteq R(\mathbf{Z}_j \cup \hat{\mathbf{Z}}_j) - \frac{1}{2} \left( R(\mathbf{Z}_j) + R(\hat{\mathbf{Z}}_j) \right).\tag{7}$$

According to the interpretation of the rate reduction given in [13], the above quantity precisely measures the volume of the space between *Z<sub>j</sub>* and *Ẑ<sub>j</sub>*, illustrated as a pair of black and blue lines in Figure 1. Then, for the "distance" of all, say *k*, classes, we simply sum the rate reduction for all pairs:

$$d(\mathbf{Z}, \hat{\mathbf{Z}}) \doteq \min_{\eta} \sum_{j=1}^{k} \Delta R\left( \mathbf{Z}_j, \hat{\mathbf{Z}}_j \right) = \min_{\eta} \sum_{j=1}^{k} \Delta R\left( \mathbf{Z}_j, f(g(\mathbf{Z}_j, \eta), \theta) \right), \tag{8}$$

where *Z<sub>j</sub>* = *f*(*X<sub>j</sub>*, *θ*) and *Ẑ<sub>j</sub>* = *f*(*X̂<sub>j</sub>*, *θ*). Clearly, a main goal of the learned decoder *g*(·, *η*) is to *minimize* the distance between these distributions. Notice that if the encoder *f* preserves (i.e., is injective for) the intrinsic structures of the original data *X* (this is typically the case for MCR<sup>2</sup>-based feature representations [13]), this criterion essentially aims to ensure that there will be some decoded sample *x̂* close to every data sample *x*; hence, the decoder *g* should be "surjective". According to the ideas of IMLE [23], such a requirement could effectively help to avoid mode-collapsing or mode-dropping.
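As a numerical illustration of (7) and of the sum inside (8), the following sketch (a toy example under our own assumptions, with hypothetical helper names and *ε* = 0.5) measures the "distance" between features on a one-dimensional subspace and a copy of those features rotated by a given angle.

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    """R(Z | eps) = 1/2 logdet(I + d/(n eps^2) Z Z^T) for Z of shape (d, n)."""
    d, n = Z.shape
    _, logdet = np.linalg.slogdet(np.eye(d) + (d / (n * eps**2)) * Z @ Z.T)
    return 0.5 * logdet

def pair_rate_reduction(Zj, Zj_hat, eps=0.5):
    """Eq. (7): R(Zj u Zj_hat) - (R(Zj) + R(Zj_hat)) / 2 for two equal-size sets."""
    assert Zj.shape == Zj_hat.shape
    union = np.hstack([Zj, Zj_hat])
    return coding_rate(union, eps) - 0.5 * (coding_rate(Zj, eps)
                                            + coding_rate(Zj_hat, eps))

def feature_distance(Z_by_class, Z_hat_by_class, eps=0.5):
    """Sum of pairwise rate reductions over classes, as inside (8) and (9)."""
    return sum(pair_rate_reduction(Z, Zh, eps)
               for Z, Zh in zip(Z_by_class, Z_hat_by_class))

def rotation(angle):
    """Rotation by `angle` in the (e1, e2) plane of R^3."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

rng = np.random.default_rng(0)
Z = np.zeros((3, 30))
Z[0] = rng.standard_normal(30)        # features on the line spanned by e1

distances = [feature_distance([Z], [rotation(a) @ Z])
             for a in (0.0, np.pi / 4, np.pi / 2)]
print(distances)
```

The quantity vanishes when the two feature sets coincide and increases as the transcribed subspace rotates away from the original one, which is the behavior a decoder minimizing (8) exploits.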

**Contrastive measure for the encoder.** For the *first* item in our wishlist, however, we normally do not have a natural metric or "distance" for the similarity of samples or distributions in the original data space *x* for data such as images. As mentioned before, finding proper metrics or distance functions on natural images has always been an elusive and challenging task [7]. To alleviate this difficulty, we can measure the similarity or difference between *X̂* and *X* through their mapped features *Ẑ* and *Z* in the feature space (again assuming *f* is structure-preserving). If we are interested in discerning *any* differences in the distributions of the original and transcribed samples, we may view the MCR<sup>2</sup> feature encoder *f*(·, *θ*) as a "discriminator" to *magnify* any difference between all pairs of *X<sub>j</sub>* and *X̂<sub>j</sub>*, by simply maximizing, instead of minimizing, the *same quantity* in (8):

$$d(\mathbf{X}, \hat{\mathbf{X}}) \doteq \max_{\theta} \sum_{j=1}^{k} \Delta R\left(\mathbf{Z}_j, \hat{\mathbf{Z}}_j\right) = \max_{\theta} \sum_{j=1}^{k} \Delta R\left(f(\mathbf{X}_j, \theta), f(\hat{\mathbf{X}}_j, \theta)\right). \tag{9}$$

That is, a "distance" between *X* and *X̂* can be measured as the maximally achievable rate reduction between all pairs of classes in these two sets. In a way, this measures how well or badly the decoded *X̂* aligns with the original data *X*, and hence measures the goodness of the "injectivity" of the encoder *f*. Notice that such a discriminative measure is consistent with the idea of GAN [9], which tries to separate *X* and *X̂* into two classes, measured by the cross entropy. Nevertheless, here the MCR<sup>2</sup>-based discriminator *f* naturally generalizes to cases in which the data distributions are multi-class and multi-modal, and the discriminativeness is measured with a more refined measure, the rate reduction, instead of the typical two-class loss (e.g., cross entropy) used in GANs. See Appendix A.8 for comparisons with some ablation studies.

One may wonder why we need the mapping *f*(·, *θ*) to function as a discriminator between *X* and *X̂* by solving max<sub>*θ*</sub> Δ*R*(*f*(*X*, *θ*), *f*(*X̂*, *θ*)). Figure 2 gives a simple illustration: there might be many decoders *g* such that *f* ∘ *g* is an identity (Id) mapping, that is, *f* ∘ *g*(*z*) = *z* for all *z* in the subspace *S<sub>z</sub>* in the feature space. (Here, we use the notion of "identity mapping" in a loose sense: depending on the context, it could simply mean an embedding from *S<sub>z</sub>* to *S<sub>z</sub>*.) However, *g* ∘ *f* is not necessarily an auto-encoding map for *x* in the original distribution *S<sub>x</sub>* (here for simplicity drawn as a subspace). That is, *g* ∘ *f*(*S<sub>x</sub>*) ⊄ *S<sub>x</sub>*, let alone *g* ∘ *f*(*S<sub>x</sub>*) = *S<sub>x</sub>* or *g* ∘ *f*(*x*) = *x*. One should expect that, without careful control of the image of *g*, this would be the case with high probability, especially when the support of the distribution of *x* is extremely low-dimensional in the original high-dimensional data space. For example, as we will see in the experiments, the intrinsic dimension of the submanifold associated with each image category is about a dozen, whereas images are embedded in a (pixel) space of thousands or tens of thousands of dimensions.

**Figure 2. Embeddings of Low-Dimensional Submanifolds in High-Dimensional Spaces.** *S<sub>x</sub>* (blue) is the submanifold for the original data *x*; *S<sub>z</sub>* (red) is the image of *S<sub>x</sub>* under the mapping *f*, representing the learned feature *z*; and the green curve is the image of the feature *z* under the decoding mapping *g*.

**Remark: representing the encoding and decoding mappings.** Some practical questions arise immediately: how rich should the families of functions for the encoder *f* and decoder *g* be so that they can optimize the above rate-reduction-type objectives? In fact, similar questions exist for the formulation of GAN, regarding the realizability of the data distribution by the generator; see [50]. Conceptually, here we know that the encoder *f* needs to be rich enough to discriminate (small) deviations from the true data support ℳ<sub>*j*</sub>, while the decoder *g* needs to be expressive enough to generate the data distribution from the learned mixture of subspace-Gaussians. How should we represent or parameterize them so that our objectives become computable and optimizable? For the most general cases, these remain widely open and challenging mathematical and computational problems. As we mentioned earlier, in this work, we will take a more pragmatic approach by simply representing these mappings with popular neural networks that have empirically proven to be good at approximating distributions of practical (visual) datasets or at achieving the maximum of the rate-reduction-type objectives [13]. Nevertheless, our experiments indicate that our formulation and objectives are *not so sensitive* to particular choices of network structure or to many of the tricks used to train them. In addition, in the special cases in which the real data distribution is benignly deformed from an LDR, the work of [2] has shown that one can explicitly construct these mappings from the rate-reduction objectives in the form of a deep network known as ReduNet. However, it remains unclear how such constructions could be generalized to closed-loop settings. Regardless, answers to these questions are beyond the scope of this work, as our purpose here is mainly to empirically verify the validity of the proposed closed-loop data transcription framework.

#### *2.3. Encoding and Decoding as a Two-Player MiniMax Game*

Comparing the contractive and contrastive nature of (8) and (9) on the same utility, we see the roles of the encoder *f*(·, *θ*) and the decoder *g*(·, *η*) naturally as "**a two-player game**": *while the encoder f tries to magnify the difference between the original data and their transcribed data, the decoder g aims to minimize the difference.* Now for convenience, let us define the "closed-loop encoding" function:

$$h(\mathbf{x}, \theta, \eta) \doteq f\left(g\left(f(\mathbf{x}, \theta), \eta\right), \theta\right) : \mathbf{x} \mapsto \hat{\mathbf{z}}. \tag{10}$$

Ideally, we want this function to be very close to *f*(*x*, *θ*), or at least the distributions of their images should be close. With this notation, combining (8) and (9), a closed-loop notion of "distance" between *X* and *X̂* can be computed as *an equilibrium point* to the following Min-Max (or Max-Min) program for the same utility in terms of rate reduction (theoretically, there might be a significant difference between formulating and seeking the desired solution as the equilibrium point of a min-max game and of a max-min game; in practice, we do not see major differences, as we optimize the program by simply alternating between minimization and maximization, and we leave a more careful investigation to future work):

$$\mathcal{D}(\mathbf{X}, \hat{\mathbf{X}}) \doteq \min\_{\eta} \max\_{\theta} \sum\_{j=1}^{k} \Delta R\left(f(\mathbf{X}\_{j}, \theta), h(\mathbf{X}\_{j}, \theta, \eta)\right). \tag{11}$$
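To make the quantities in (10) and (11) concrete, the following NumPy sketch evaluates the rate-reduction "distance" between features and their closed-loop counterparts for one class. It assumes the coding rate *R*(*Z*) = ½ log det(*I* + *d*/(*nε*²) *ZZ*<sup>⊤</sup>) and the pairwise measure Δ*R*(*Z*, *Ẑ*) = *R*(*Z* ∪ *Ẑ*) − ½(*R*(*Z*) + *R*(*Ẑ*)) behind (8) and (9). The function names, the precision `eps`, and the toy linear encoder and decoder are illustrative assumptions, not the networks used in the experiments.

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    """Assumed coding rate R(Z) = 1/2 logdet(I + d/(n eps^2) Z Z^T); Z has shape (d, n)."""
    d, n = Z.shape
    _, logdet = np.linalg.slogdet(np.eye(d) + (d / (n * eps**2)) * (Z @ Z.T))
    return 0.5 * logdet

def rate_reduction_distance(Z, Z_hat, eps=0.5):
    """Assumed pairwise measure: R(Z u Z_hat) - (R(Z) + R(Z_hat)) / 2."""
    union = np.concatenate([Z, Z_hat], axis=1)
    return coding_rate(union, eps) - 0.5 * (coding_rate(Z, eps) + coding_rate(Z_hat, eps))

def closed_loop_features(X, f, g):
    """Return Z = f(X) and Z_hat = h(X) = f(g(f(X))), as in (10)."""
    Z = f(X)
    return Z, f(g(Z))

# Hypothetical stand-ins for the encoder and decoder: a random linear map
# followed by column normalization, and its transpose.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 10))

def f(X):
    Y = W @ X
    return Y / np.linalg.norm(Y, axis=0, keepdims=True)

def g(Z):
    return W.T @ Z

X = rng.standard_normal((10, 50))
Z, Z_hat = closed_loop_features(X, f, g)
distance = rate_reduction_distance(Z, Z_hat)  # one class-wise summand of (11)
```

In this sketch the distance is non-negative and is zero when both arguments share the same second moments, which is why the decoder can drive it down while the encoder pushes it up.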

Notice that this only measures the difference between (features of) the original data and its transcribed version. It does not measure how good the representation *Z* (or *Ẑ*) is for the multiple classes within *X* (or *X̂*). To this end, we may combine the above distance with the original MCR<sup>2</sup>-type objectives (5): namely, the rate reduction Δ*R*(*Z*) and Δ*R*(*Ẑ*) for the learned LDR *Z* for *X* and *Ẑ* for the decoded *X̂*. Notice that although the encoder *f* tries to *maximize* the multi-class rate reduction of the features *Z* of the data *X*, the decoder *g* should *minimize* the rate reduction of the multi-class features *Ẑ* of the decoded *X̂*. That is, the decoder *g* tries to use a minimal coding rate needed to achieve a good decoding quality.

Hence, the overall "multi-class" Min-Max program for learning the Closed-loop Transcription to an LDR, named CTRL-Multi, is given by (12). In this work, we only consider the simple case of adding these rate-reduction quantities together. Of course, in the future, one may consider other more delicate formulations. For instance, we may consider a Min-Max game on the third term (11), subject to certain constraints (upper or lower bounds) on the first term and the second term. Such constrained minimax games have also started to draw attention lately [51].

$$\min\_{\eta} \max\_{\theta} \mathcal{T}\_{\mathbf{X}}(\theta, \eta) \doteq \underbrace{\Delta R\left(f(\mathbf{X}, \theta)\right)}\_{\text{Expansive encoder}} + \underbrace{\Delta R\left(h(\mathbf{X}, \theta, \eta)\right)}\_{\text{Compressive decoder}} + \sum\_{j=1}^{k} \underbrace{\Delta R\left(f(\mathbf{X}\_{j}, \theta), h(\mathbf{X}\_{j}, \theta, \eta)\right)}\_{\text{Contrastive encoder \& contractive decoder}}$$

$$= \Delta R\left(\mathbf{Z}(\theta)\right) + \Delta R\left(\hat{\mathbf{Z}}(\theta, \eta)\right) + \sum\_{j=1}^{k} \Delta R\left(\mathbf{Z}\_{j}(\theta), \hat{\mathbf{Z}}\_{j}(\theta, \eta)\right). \tag{12}$$

Empirically, we have evaluated the necessity of these terms in an ablation study (see Appendix A.8.3). Notice that, without the terms associated with the generative part *h* or with all such terms fixed as constant, the above objective is precisely the original MCR<sup>2</sup> objective proposed by [13]. In an unsupervised setting, if we view each sample (and its augmentations) as its own class, the above formulation remains exactly the same. The number of classes *k* is simply the number of independent samples. In addition, notice that the minimax objective function depends only on (features of) the data *X*, hence one can learn the encoder and decoder (parameters) without the need for sampling or matching any additional distribution (as typically needed in GANs or VAEs).

As a special case, if *X* only has one class, the above Min-Max program reduces (as the first two rate-reduction terms automatically become zero) to a special "two-class" or "binary" form, named CTRL-Binary, between *X* and the decoded *X̂* by viewing *X* and *X̂* as two classes {**0**, **1**}. Notice that this binary case resembles the formulation of the original GAN (3). Nevertheless, instead of using cross entropy, our formulation adopts a more refined rate-reduction measure, which has been shown to promote diversity in the learned representation [13].

$$\text{CTRL-Binary: } \min\_{\eta} \max\_{\theta} \mathcal{T}\_{\mathbf{X}}^{b}(\theta, \eta) \doteq \Delta R\left(f(\mathbf{X}, \theta), h(\mathbf{X}, \theta, \eta)\right) = \Delta R\left(\mathbf{Z}(\theta), \hat{\mathbf{Z}}(\theta, \eta)\right). \tag{13}$$

Sometimes, even when *X* contains multiple classes/modes, one could still view all classes together as one class. Then, the above binary objective is to align the union distribution of all classes with their decoded *X̂*. This is typically a simpler task to achieve than the multi-class one (12), since it does not require learning of a more refined multi-class CTRL for the data, as we will later see in experiments. Notice that one good characteristic of the above formulation is that *all quantities in the objectives are measured in terms of rate reduction for the learned features* (assuming features eventually become subspace Gaussians).
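The two objectives can be evaluated on given features with a few lines of NumPy. The sketch below assumes the multi-class rate reduction Δ*R*(*Z*) = *R*(*Z*) − ∑<sub>*j*</sub> (*n*<sub>*j*</sub>/*n*) *R*(*Z*<sub>*j*</sub>) from the MCR<sup>2</sup> objective (5) and the pairwise measure Δ*R*(*Z*, *Ẑ*) = *R*(*Z* ∪ *Ẑ*) − ½(*R*(*Z*) + *R*(*Ẑ*)); all function names and the precision `eps` are hypothetical, and the gradient-based training of *f* and *g* is not shown.

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    """Assumed coding rate R(Z) = 1/2 logdet(I + d/(n eps^2) Z Z^T); Z has shape (d, n)."""
    d, n = Z.shape
    _, logdet = np.linalg.slogdet(np.eye(d) + (d / (n * eps**2)) * (Z @ Z.T))
    return 0.5 * logdet

def class_rate_reduction(Z, labels, eps=0.5):
    """Assumed multi-class Delta R(Z): rate of all features minus weighted per-class rates."""
    within = sum(np.mean(labels == c) * coding_rate(Z[:, labels == c], eps)
                 for c in np.unique(labels))
    return coding_rate(Z, eps) - within

def pair_rate_reduction(Z, Z_hat, eps=0.5):
    """Assumed pairwise Delta R(Z, Z_hat) between two feature sets."""
    union = np.concatenate([Z, Z_hat], axis=1)
    return coding_rate(union, eps) - 0.5 * (coding_rate(Z, eps) + coding_rate(Z_hat, eps))

def ctrl_multi(Z, Z_hat, labels, eps=0.5):
    """Value of the CTRL-Multi utility (12) for features Z and closed-loop features Z_hat."""
    contrast = sum(pair_rate_reduction(Z[:, labels == c], Z_hat[:, labels == c], eps)
                   for c in np.unique(labels))
    return (class_rate_reduction(Z, labels, eps)
            + class_rate_reduction(Z_hat, labels, eps)
            + contrast)

def ctrl_binary(Z, Z_hat, eps=0.5):
    """Value of the CTRL-Binary utility (13): all samples are treated as one class."""
    return pair_rate_reduction(Z, Z_hat, eps)
```

With a single label for every sample, the first two terms of `ctrl_multi` vanish and its value coincides with `ctrl_binary`, mirroring the reduction from (12) to (13).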

In all of our subsequent experiments, we solve the above minimax programs using the most basic gradient descent–ascent (GDA) algorithm [52] that alternates between the minimization and maximization, with the same learning rate and without any timescale separation (as typically needed for training GANs [53]). Although more refined optimization schemes can likely further improve the efficiency and performance, we leave these for future investigations.
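As a minimal illustration of alternating GDA with a single shared learning rate, the sketch below runs the scheme on a toy scalar utility that is concave in the maximizing variable and convex in the minimizing one. The utility, the update order (ascent step first), the step size, and the function name are assumptions made for illustration; they are not the rate-reduction objectives or the training settings of the experiments.

```python
def alternating_gda(grad_theta, grad_eta, theta, eta, lr=0.1, steps=500):
    """Alternate one ascent step in theta and one descent step in eta, same learning rate."""
    for _ in range(steps):
        theta = theta + lr * grad_theta(theta, eta)  # maximizing player (encoder role)
        eta = eta - lr * grad_eta(theta, eta)        # minimizing player sees the updated theta
    return theta, eta

# Toy utility u(theta, eta) = -0.5 * theta**2 + theta * eta + 0.5 * eta**2,
# whose unique minimax equilibrium is (0, 0).
theta_star, eta_star = alternating_gda(
    grad_theta=lambda t, e: -t + e,  # du/dtheta
    grad_eta=lambda t, e: t + e,     # du/deta
    theta=1.0,
    eta=-1.0,
)
```

On this toy problem the iterates spiral into the equilibrium without any timescale separation; no such guarantee is implied for the nonconvex programs (12) and (13).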

**Remark: closed-loop error correction.** One may notice that our framework (see Figure 1) draws inspiration from closed-loop error correction widely practiced in feedback control systems. In the machine-learning and deep-learning literature, the idea of closed-loop error correction and closed-loop fixed points has been explored before to interpret the recursive error-correcting mechanism and explain stability in a forward (predictive) deep neural network, for example the *deep equilibrium networks* [54] and the *deep implicit networks* [55], again drawing inspiration from feedback control. Here, in our framework, the closed-loop mechanism is not used to interpret the encoding or decoding (forward) networks *f* and *g*. Instead, it is used to form an overall feedback system between the two encoding and decoding networks for correcting the "error" in the distributions between the data *x* and the decoded *x̂*. Using terminology from control theory, one may view the encoding network *f* as a "sensor" for error feedback and the decoding network *g* as a "controller" for error correction. However, notice that here the "target" for control is neither a scalar nor a finite-dimensional vector, but a continuous mapping, chosen so that the distribution of *x̂* matches that of the data *x*. This is in general a control problem in an infinite-dimensional space (the space of diffeomorphisms of submanifolds is infinite-dimensional [1]). Ideally, we hope that when the sensor *f* and the controller *g* are optimal, the distribution of *x* becomes a "fixed point" for the closed loop while the distribution of *z* reaches a compact LDR. Hence, the minimax programs (12) and (13) can also be interpreted as games between an error-feedback sensor and an error-reducing controller.

**Remark: relation to bi-directional or cycle consistency.** The notion of "bi-directional" and "cycle" consistency between encoding and decoding has been exploited in the works of BiGAN [38] and ALI [39] for mappings between the data and features and in the work of CycleGAN [56] for mappings between two different data distributions. In our context, the goal is similar: to promote *g* ◦ *f* and *f* ◦ *g* to be close to identity mappings (either for the distributions or for the samples). Interestingly, our new closed-loop formulation actually "decouples" the data *X*, say, observed from the external world, from their internally represented features *Z*. The objectives (12) and (13) are functions of *only* the internal features *Z*(*θ*) and *Ẑ*(*θ*, *η*), which can be learned and optimized by adjusting the neural networks *f*(·, *θ*) and *g*(·, *η*) alone. There is no need for any additional external metrics or heuristics to promote how "close" the decoded images *X̂* are to *X*. This is very different from most VAE/GAN-type methods such as BiGAN and ALI that require additional discriminators (networks) for the images and the features. Some experimental comparisons are given in Appendix A.2. In addition, in Appendix A.8.1, we provide an ablation study to illustrate the importance and benefit of a closed loop for enforcing the consistency between the encoder and decoder.

**Remark: transparent versus hidden distribution of the learned features.** Notice that in our framework, there is no need to explicitly specify a prior distribution either as a target distribution to map to for AE (2) or as an initial distribution to sample from for GAN (3). The common practice in AEs or GANs is to specify the prior distribution as a generic Gaussian. This is, however, particularly problematic when the data distribution is multi-modal and has multiple low-dimensional structures, which is commonplace for multi-class data. In this case, the common practice in AEs or GANs is to train a conditional GAN for different classes or different attributes. Here, in contrast, we only need to assume that the desired target distribution belongs to the family of LDRs. The specific optimal distribution of the features within this family is then learned from the data directly, and can then be represented *explicitly* as a mixture of independent subspace Gaussians (or equivalently, a mixture of PCAs on independent subspaces). We will give more details in the experimental Section 3 as well as more examples in Appendices A.2–A.4. Although many GAN + VAE-type methods can learn bidirectional encoding and decoding mappings, the distribution of the learned features inside the feature space remains *hidden* or even *entangled*. This makes it difficult to sample the feature space for generative purposes or to use the features for discriminative tasks. (For instance, one can typically only use so-learned features for nearest-neighbor-type classifiers [38], instead of the nearest-subspace classifier used in this work; see Section 3.3.)

#### **3. Empirical Verification on Real-World Imagery Datasets**

This experiment section serves three purposes. First, we empirically justify the proposed formulation for data transcription by demonstrating good properties of the learned encoder, decoder, and representation tuple (*f*, *g*, *z*) from *X*. Second, we compare our method with several representative methods from the GAN family and VAE family. The purpose of the comparison is *not* to compete for any state-of-the-art performance. Instead, we want to convincingly verify the validity of the proposed framework and its potential to go beyond these methods. Finally, we evaluate the so-learned CTRL through both generative tasks (controlled visualization) and discriminative (classification) tasks. More extensive experimental results, evaluations, and ablation studies can be found in Appendix A.

**Datasets.** We provide extensive qualitative and quantitative experimental results on the following datasets: MNIST [57], CIFAR-10 [58], STL-10 [59], CelebA [60], LSUN bedroom [61], and ImageNet ILSVRC 2012 [62]. The network architectures and implementation details can be found in Appendix A.1 and corresponding Appendix A for each dataset.

#### *3.1. Empirical Justification of CTRL Transcription*

To empirically validate our new framework, we conduct experiments from a small low-variety dataset (MNIST), to a small dataset of diverse real-world objects (CIFAR-10), to higher resolution images (STL-10, CelebA, LSUN-bedroom), to a large-scale diverse image set (ImageNet). The results are evaluated both quantitatively and qualitatively. Implementation details, more experimental results, and ablation studies are given in Appendix A.

**Comparison (IS and FID) with other formulations.** First, we conduct five experiments to fairly compare our formulation with GAN [63] and VAE(-GAN) [64] on MNIST and CIFAR-10. Except for the objective function, everything else is exactly the same for all methods (e.g., networks, training data, optimization method). These experiments are: (1) GAN; (2) GAN with its objective replaced by that of the CTRL-Binary (13); (3) VAE-GAN; (4) Binary CTRL (13); and (5) Multi-class CTRL (12). Some visual comparison is given in Figure 3. IS [65] and FID [66] scores are summarized in Table 1. Here, for simplicity, we have chosen a uniform feature dimension *d* = 128 for all datasets. If we choose a higher feature dimension, say *d* = 512, for the more complex CIFAR-10 dataset, the visual quality can be further improved; see Table A14 in Appendix A.11.

**Figure 3.** Qualitative comparison on (**a**) MNIST, (**b**) CIFAR-10 and (**c**) ImageNet. First row: original *X*; other rows: reconstructed *X̂* for different methods.


**Table 1.** Quantitative comparison on MNIST and CIFAR-10. Average Inception scores (IS) [65] and FID scores [66]. ↑ means higher is better. ↓ means lower is better.

As we see from Table 1, replacing the cross-entropy with the rate-reduction objective in Equation (13) can improve the generative quality. The two CTRL formulations are clearly on par with the others in terms of IS and significantly better in FID. Finally, with the same training datasets, the quality of CTRL-Multi is lower than that of CTRL-Binary. This is expected, as the multi-class task is more challenging. Nevertheless, as we will see soon, images decoded by CTRL-Multi align much better with their classes than those decoded by CTRL-Binary.

**Visualizing correlation of features** *Z* **and decoded features** *Ẑ***.** We visualize the cosine similarity between *Z* and *Ẑ* learned from the multi-class objective (12) on MNIST, CIFAR-10 and ImageNet (10 classes), which indicates how close *ẑ* = *f* ◦ *g*(*z*) is to *z*. Results in Figure 4 show that *Z* and *Ẑ* are aligned very well within each class. The block-diagonal patterns for MNIST are sharper than those for CIFAR-10 and ImageNet, as images in CIFAR-10 and ImageNet have more diverse visual appearances.

**Figure 4.** Visualizing the alignment between *Z* and *Ẑ*: |*Z*<sup>⊤</sup>*Ẑ*| in the feature space for (**a**) MNIST, (**b**) CIFAR-10, and (**c**) ImageNet-10-Class.
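A heatmap of this kind can be computed as the matrix of absolute cosine similarities between the columns of *Z* and *Ẑ*, with samples sorted by class. The sketch below uses synthetic features on two orthogonal subspaces as a stand-in for learned features; the function name and the synthetic data are assumptions for illustration only.

```python
import numpy as np

def alignment_matrix(Z, Z_hat):
    """Entry (i, j) is |cos(z_i, z_hat_j)|; Z and Z_hat have shape (d, n)."""
    Zn = Z / np.linalg.norm(Z, axis=0, keepdims=True)
    Zh = Z_hat / np.linalg.norm(Z_hat, axis=0, keepdims=True)
    return np.abs(Zn.T @ Zh)

# Synthetic stand-in: two classes of 20 samples on orthogonal 3-dimensional
# coordinate subspaces of R^6, already sorted by class.
rng = np.random.default_rng(0)
Z = np.zeros((6, 40))
Z[:3, :20] = rng.standard_normal((3, 20))
Z[3:, 20:] = rng.standard_normal((3, 20))
Z /= np.linalg.norm(Z, axis=0, keepdims=True)

# Perturb each feature only inside its own subspace to mimic z_hat = f(g(z)).
Z_hat = Z + 0.05 * rng.standard_normal(Z.shape) * (Z != 0)

A = alignment_matrix(Z, Z_hat)  # block-diagonal when classes occupy orthogonal subspaces
```

Plotting `A` as an image (for example with a standard heatmap routine) shows two bright diagonal blocks and dark off-diagonal blocks for this synthetic case.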

**Visualizing auto-encoding of the data** *X* **and the decoded** *X̂***.** We compare some representative *X* and *X̂* on MNIST, CIFAR-10 and ImageNet (10 classes) to verify how close *x̂* = *g* ◦ *f*(*x*) is to *x*. The results are shown in Figure 5, and visualizations are created from training samples. Visually, the auto-encoded *x̂* faithfully captures major visual features from its respective training sample *x*, especially the pose, shape, and layout. For simpler datasets such as MNIST, auto-encoded images are almost identical to the originals. The visual quality is clearly better than other GAN+VAE-type methods, such as VAE-GAN [34] and BiGAN [38]. We refer the reader to Appendices A.2, A.4 and A.7 for more visualization of results on these datasets, including similar results on transformed MNIST digits. More visualization results for learned models on real-life image datasets such as STL-10, CelebA, and LSUN can be found in Appendices A.5 and A.6.

**Figure 5.** Visualizing the auto-encoding property of the learned closed-loop transcription (*x* ≈ *x̂* = *g* ◦ *f*(*x*)) on MNIST, CIFAR-10, and ImageNet (zoom in for better visualization).

#### *3.2. Comparison to Existing Generative Methods*

Table 2 gives a quantitative comparison of the visual quality of our method with others on CIFAR-10, STL-10, and ImageNet. In general, there is a large difference in terms of FID and IS scores between the GAN family and the VAE family of models. SNGAN [31] is a commonly used method in most generative applications, while LOGAN [29] is the state-of-the-art method on ImageNet in terms of FID and IS. More comparisons with existing methods, including results on the higher-resolution ImageNet dataset, can be found in Table A10 of Appendix A.7.

As we see, even though the rate-reduction objectives (12) and (13) are not specifically designed or engineered for visual quality, and the networks and hyper-parameters adopted in our experiments are rather basic compared to many of the state-of-the-art generative methods, our method is still rather competitive in terms of these metrics. In our current implementation, the original objectives are used without any other heuristics or regularization. The simplicity of our framework and formulation suggests that there is significant room for further improvement. For instance, in all experiments on all datasets, we have chosen a feature dimension of *d* = 128 for simplicity and uniformity. In Appendix A.11, we have conducted an ablation study on using a higher feature dimension *d* = 512. The visual quality of the learned model can be significantly improved (as shown in Figure A22 and Table A14 of Appendix A.11).

In fact, compared to these methods, our method has learned not just any generative model. It has learned a *structured* generative model that has many additional beneficial properties that we now present.


**Table 2.** Comparison on CIFAR-10 and STL-10. Comparison with more existing methods and on ImageNet can be found in Table A10 in Appendix A. ↑ means higher is better. ↓ means lower is better.

#### *3.3. Benefits of the Learned LDR Transcription Model*

As we have argued before, the learned LDR transcription model (including the feature *z*, the encoder *f* , and the decoder *g*) can be used for both generative and discriminative purposes. In particular, unlike almost all existing generative methods, the internal structures or distribution of the learned *z* are no longer "hidden" as they have clear subspace structures. Hence, we can easily derive an explicit (parametrizable) model for the distribution of the learned features as a mixture of independent subspace-like Gaussians. This gives us full control in sampling the learned distribution for generative purposes.

**Principal subspaces and principal components for the feature.** To be more specific, given the learned *k*-class features ∪<sub>*j*=1</sub><sup>*k*</sup> *Z*<sub>*j*</sub> for the training data, we have observed that the leading singular subspaces for different classes are all approximately orthogonal to each other: *Z*<sub>*i*</sub> ⊥ *Z*<sub>*j*</sub> (see Figure 4). This corroborates our earlier discussion of the theoretical properties of the rate-reduction objective. They essentially span *k* independent principal subspaces. We can further calculate the mean *z̄*<sub>*j*</sub> and the singular vectors {*v*<sub>*j*</sub><sup>*l*</sup>}, *l* = 1, …, *r*<sub>*j*</sub> (or principal components), of the learned features *Z*<sub>*j*</sub> for each class. Although we conceptually view the support of each class as a subspace, the actual support of the features is close to being on the sphere due to feature (scale) normalization. Hence, it is more precise to find its mean and its support centered around the mean. Here, *r*<sub>*j*</sub> is a rank we may choose to model the dimension of each principal subspace (say, based on a common threshold on the singular values). Hence, we obtain an explicit model for how the feature *z* is distributed in each of the *k* principal subspaces in the feature space ℝ<sup>*d*</sup>:

$$\mathbf{z}\_{j} \sim \bar{\mathbf{z}}\_{j} + \sum\_{l=1}^{r\_{j}} n\_{j}^{l} \sigma\_{j}^{l} \mathbf{v}\_{j}^{l}, \quad \text{where} \quad n\_{j}^{l} \sim \mathcal{N}(0, 1), \ j = 1, \ldots, k. \tag{14}$$

Hence, this essentially gives an explicit mixture-of-subspace-like-Gaussians model for the learned features: statistical differences between different classes are modeled as *k* independent principal subspaces; statistical differences within each class *j* are modeled as *r*<sub>*j*</sub> independent principal components in the *j*th subspace.
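The model (14) can be fitted and sampled per class with a mean and an SVD. The following sketch is one possible reading: it centers the class features, keeps singular directions above a relative threshold, and scales the singular values by √(*n* − 1) to obtain standard deviations. The threshold value, this scaling, and the function names are assumptions, since the text only states that the rank is chosen from a common threshold on the singular values.

```python
import numpy as np

def fit_class_model(Zj, rel_threshold=0.05):
    """Fit mean, principal directions, and standard deviations for one class; Zj is (d, n)."""
    mean = Zj.mean(axis=1, keepdims=True)
    U, S, _ = np.linalg.svd(Zj - mean, full_matrices=False)
    rank = int(np.sum(S > rel_threshold * S[0]))   # hypothetical rank rule r_j
    sigma = S[:rank] / np.sqrt(Zj.shape[1] - 1)    # std of the features along each direction
    return mean, U[:, :rank], sigma

def sample_class_features(model, num, rng):
    """Draw num features z = mean + sum_l n_l * sigma_l * v_l with n_l ~ N(0, 1), as in (14)."""
    mean, V, sigma = model
    noise = rng.standard_normal((V.shape[1], num))
    return mean + V @ (sigma[:, None] * noise)
```

Sampled features can then be passed to the decoder *g* to generate images for class *j*; if the decoder expects normalized inputs, the samples would need to be rescaled first.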

**Decoding samples from the feature distribution.** Using the CIFAR-10 and CelebA datasets, we visualize images decoded from samples of the learned feature subspaces. For the CIFAR-10 dataset, for each class *j*, we first compute the top four principal components of the learned features *Z*<sub>*j*</sub> (via SVD). For each class *j*, we then compute |⟨*z*<sub>*j*</sub><sup>*i*</sup>, *v*<sub>*j*</sub><sup>*l*</sup>⟩|, the cosine similarity between the *l*-th principal direction *v*<sub>*j*</sub><sup>*l*</sup> and the feature sample *z*<sub>*j*</sub><sup>*i*</sup>. After finding the top five *z*<sub>*j*</sub><sup>*i*</sup> according to |⟨*z*<sub>*j*</sub><sup>*i*</sup>, *v*<sub>*j*</sub><sup>*l*</sup>⟩| for each class *j*, we reconstruct images *x̂*<sub>*j*</sub><sup>*i*</sup> = *g*(*z*<sub>*j*</sub><sup>*i*</sup>). Each row of Figure 6 is for one principal component. We observe that images in the same row share the same visual attributes; images in different rows differ significantly in visual characteristics such as shape, background, and style. See Figure A7 of Appendix A.4 for more visualization of principal components learned for all 10 classes of CIFAR-10. These results clearly demonstrate that the principal components in each subspace Gaussian disentangle different visual attributes. In addition, we do not observe any mode dropping for any of the classes, although the dimensions of the classes were not known a priori.

**Figure 6. CIFAR-10 dataset.** Visualization of top 5 reconstructed *x̂* = *g*(*z*) based on the closest distance of *z* to each row (top 4) of principal components of data representations for class 7 ('Horse') and class 8 ('Ship').
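The selection step behind this figure can be sketched as follows: take the leading left singular vectors of a class's feature matrix and, for each one, rank the samples by absolute cosine similarity. Whether the features are centered before the SVD is not stated in the text; the sketch applies the SVD to the features directly, and the function name and defaults are hypothetical.

```python
import numpy as np

def top_aligned_samples(Zj, num_components=4, top=5):
    """Return an index array of shape (num_components, top); Zj has shape (d, n).

    Row l lists the samples whose features have the largest |cosine similarity|
    with the l-th principal direction of Zj.
    """
    U, _, _ = np.linalg.svd(Zj, full_matrices=False)
    Zn = Zj / np.linalg.norm(Zj, axis=0, keepdims=True)
    scores = np.abs(U[:, :num_components].T @ Zn)  # (num_components, n)
    return np.argsort(-scores, axis=1)[:, :top]
```

Decoding the selected features with *g* then yields one row of images per principal component.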

**Disentangled visual attributes as principal components.** For the CelebA dataset, we calculate the principal components of all learned features in the latent space. Figure 7a shows some decoded images along these principal directions. Again, these principal components seem to clearly *disentangle* visual attributes/factors such as wearing a hat, changing hair color, and wearing glasses. More examples can be found in Appendix A.6. The results are consistent with *the property of MCR*<sup>2</sup> *that promotes diversity of the learned features*.

(**a**) Disentangled attributes as principal components (**b**) Interpolation between distinct samples

**Figure 7. CelebA dataset.** (**a**): Sampling along three principal components that seem to correspond to different visual attributes; (**b**): Samples decoded by interpolating along the line between features of two distinct samples.

**Linear interpolation between features of two distinct samples.** Figure 7b shows interpolating features between pairs of training image samples of the CelebA dataset, where for two training images *x*<sub>1</sub> and *x*<sub>2</sub>, we reconstruct based on their linearly interpolated feature representations by *x̂* = *g*(*α* *f*(*x*<sub>1</sub>) + (1 − *α*) *f*(*x*<sub>2</sub>)), *α* ∈ [0, 1]. The decoded images show continuous morphing from one sample to another in terms of visual characteristics, as opposed to merely a superposition of the two images. Similar interpolation results between two digits in the MNIST dataset can be found in Figure A3 of Appendix A.2.
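The interpolation rule is simple enough to state in code. In the sketch below, `f` and `g` are placeholders for a trained encoder and decoder; an orthogonal linear map and its transpose are used as hypothetical stand-ins so that the example runs on its own.

```python
import numpy as np

def interpolate_and_decode(x1, x2, f, g, num=5):
    """Decode g(alpha * f(x1) + (1 - alpha) * f(x2)) for alpha evenly spaced in [0, 1]."""
    z1, z2 = f(x1), f(x2)
    return [g(a * z1 + (1.0 - a) * z2) for a in np.linspace(0.0, 1.0, num)]

# Hypothetical stand-ins: an orthogonal map Q as encoder and Q^T as decoder,
# so that g(f(x)) = x exactly.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((6, 6)))
f = lambda x: Q @ x
g = lambda z: Q.T @ z

x1, x2 = rng.standard_normal(6), rng.standard_normal(6)
frames = interpolate_and_decode(x1, x2, f, g, num=5)  # frames[0] decodes f(x2), frames[-1] decodes f(x1)
```

With linear stand-ins the decoded path is just a straight line in data space; the morphing behavior described above arises only with the learned nonlinear mappings.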

**Encoded features for classification.** Notice that not only is the learned decoder good for generative purposes, but the encoder is also good for discriminative tasks. In this experiment, we evaluate the discriminativeness of the learned CTRL model by testing how well the encoded features can help classify the images. We use features of the training images to compute the learned subspaces for all classes, then classify features of the test images based on a simple nearest subspace classifier. Many other encoding methods train a classifier (say, with an additional layer) on top of the learned features. Results in Table 3 show that our model gives competitive classification accuracy on MNIST compared to some of the best VAE-based methods. We also tested the classification on CIFAR-10, and the accuracy is currently about 80.7%. As expected, the representation learned with the multi-class objective is very discriminative and good for classification tasks. Be aware that all generative models, including GANs, VAEs, and ours, are not specifically engineered for classification tasks. Hence, one should not yet expect the classification accuracy to compete with that of classifiers trained with supervision. Overall, this demonstrates that the learned CTRL model is not only generative but also discriminative.
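A nearest subspace classifier of the kind described here can be sketched in NumPy: estimate a low-dimensional basis per class from training features, then assign each test feature to the class whose subspace leaves the smallest projection residual. The fixed `rank`, the use of uncentered features, and the function names are assumptions for illustration.

```python
import numpy as np

def fit_subspaces(Z_train, y_train, rank):
    """Return {class: orthonormal basis (d, rank)} from training features of shape (d, n)."""
    bases = {}
    for c in np.unique(y_train):
        U, _, _ = np.linalg.svd(Z_train[:, y_train == c], full_matrices=False)
        bases[int(c)] = U[:, :rank]
    return bases

def nearest_subspace_predict(Z_test, bases):
    """Assign each column of Z_test to the class with the smallest projection residual."""
    classes = sorted(bases)
    residuals = np.stack([
        np.linalg.norm(Z_test - bases[c] @ (bases[c].T @ Z_test), axis=0)
        for c in classes
    ])
    return np.array(classes)[np.argmin(residuals, axis=0)]
```

No additional classifier layer is trained in this scheme; the class subspaces are read off from the encoded training features.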

**Table 3.** Classification accuracy on MNIST compared to classifier-based VAE methods [42]. Most of these VAE-based methods require auxiliary classifiers to boost classification performance.


#### **4. Open Theoretical Problems**

So far, we have given theoretical intuition and derivation for the formulation of closed-loop transcription, as well as empirical evidence to showcase both the performance and potential of this formulation. In this section, we take a step back to explore the theoretical underpinnings of the closed-loop LDR transcription. We organize this section by discussing three primary objectives associated with learning an LDR representation:

1. *Learn a simple linear discriminative representation f*(*X*) *of the data X*, which we can reliably use to classify the data.


These three objectives encompass the overarching principle of CTRL transcription, and indeed each of these objectives is tied to a wide array of mathematical and theoretical problems. We now outline some of the most important theoretical questions or hypotheses implicated by our results, which we leave for future work to study and to answer, likely by a broader range of research communities.

#### *4.1. Distributions of the LDR Representation*

Our primary mode of optimizing for a "simple representation" is through the LDR framework proposed in [2]. One important open theoretical problem is finding the right energy function to optimize in order to promote LDR. It was shown in [2] that an LDR can be learned for the multi-class data by maximizing the MCR<sup>2</sup> objective Δ*R*(*Z*) given in (5). This motivates the first two terms in our objective function (12): maximizing Δ*R*(*Z*) and Δ*R*(*Ẑ*) promotes their representations to be LDRs.

Although the authors in [2] have shown that the MCR<sup>2</sup> objective can promote the features learned to be in orthogonal subspaces and have characterized the optimal second moments of the distributions, there remain open questions regarding the optimal distributions within the subspaces. A standing hypothesis is that the optimal distributions should be Gaussian. There is indeed already theoretical work on similar energy functions: the Brascamp–Lieb inequalities [67], where the authors study a functional similar to the rate-reduction objective which, in certain contexts, is maximized uniquely by Gaussians. Hence, an important future theoretical direction for the CTRL transcription is to exactly characterize distributional properties of the extremals (both minima and maxima) of the MCR<sup>2</sup> objective or its variants. Such results can further justify the use of Gaussian models (14) to characterize the learned features within the subspaces.

We also notice that the so-learned LDR features have additional striking properties, as shown by examples in Figure 7. Distinctive visual attributes of the imagery data seem to be clearly disentangled by different principal components of the distribution, and along each principal direction, one can linearly interpolate the features, whereas the original data are nonlinear and cannot be directly interpolated. These results go beyond the guarantees given by [2], and an open theoretical problem is that of studying just how the CTRL transcription learns to disentangle and linearize such visual attributes. This understanding is crucial to extend the CTRL transcription framework beyond the 2D vision domain.

#### *4.2. Self-Consistency in the Learned Reconstruction*

If the learned encoder *Z* = *f*(*X*) is an embedding of the data submanifolds to the subspaces, it should admit an inverse (decoding) mapping *X̂* = *g*(*Z*). As distributional distance in the data space is hard to come by, the rate reduction Δ*R*(*Z*, *Ẑ*) gives a well-defined distributional distance between *Z* and *Ẑ*, which is used to enforce similarity between *X* and *X̂* in our formulation. Notice that, unlike the KL-divergence or the JS-divergence, the rate reduction is well-defined for degenerate distributions and easily computable in closed-form between mixtures of (degenerate) Gaussians. The third term of Equation (12), ∑<sub>*j*=1</sub><sup>*k*</sup> Δ*R*(*Z*<sub>*j*</sub>(*θ*), *Ẑ*<sub>*j*</sub>(*θ*, *η*)), is exactly this distributional distance, which is minimized only when the estimated second moments of *Z*<sub>*j*</sub> and *Ẑ*<sub>*j*</sub> are the same. While this distributional distance seems weaker than the sample-wise ℓ<sub>2</sub>-distance, we observe strong reconstruction performance nevertheless.
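The statement about matching second moments can be checked with a short calculation. The sketch below assumes the coding rate *R*(*Z*) = ½ log det(*I* + *d*/(*nε*²) *ZZ*<sup>⊤</sup>) and the pairwise measure Δ*R*(*Z*, *Ẑ*) = *R*(*Z* ∪ *Ẑ*) − ½(*R*(*Z*) + *R*(*Ẑ*)) from the earlier sections, with *Z* and *Ẑ* each holding *n* samples; the symbols **Σ**, **Σ̂**, and *α* are shorthand introduced only for this calculation.

$$\boldsymbol{\Sigma} \doteq \tfrac{1}{n}\mathbf{Z}\mathbf{Z}^{\top}, \qquad \hat{\boldsymbol{\Sigma}} \doteq \tfrac{1}{n}\hat{\mathbf{Z}}\hat{\mathbf{Z}}^{\top}, \qquad \alpha \doteq d/\epsilon^{2},$$

$$\Delta R(\mathbf{Z}, \hat{\mathbf{Z}}) = \tfrac{1}{2}\log\det\Big(\mathbf{I} + \alpha\,\tfrac{\boldsymbol{\Sigma} + \hat{\boldsymbol{\Sigma}}}{2}\Big) - \tfrac{1}{4}\log\det\big(\mathbf{I} + \alpha\boldsymbol{\Sigma}\big) - \tfrac{1}{4}\log\det\big(\mathbf{I} + \alpha\hat{\boldsymbol{\Sigma}}\big) \;\geq\; 0.$$

The inequality follows from the strict concavity of log det on positive definite matrices, with equality exactly when **Σ** = **Σ̂**. Every term stays finite even when **Σ** or **Σ̂** is rank-deficient, because the identity matrix keeps each argument positive definite.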

Notice that the current objectives (12) or (13) do not impose any constraints on the mappings of individual samples. That is, they do not explicitly specify how an individual sample *x* should be related to its decoded version *x̂* = *g*(*f*(*x*)), or how their corresponding features *z* and *ẑ* are related. Hence, theoretically, nothing is known about relationships between individual samples and their features. However, somewhat surprisingly, experimental results with the multi-class objective (12) in Section 3 suggest that they actually can be rather close, at least for the given training samples *X*. For example, see Figure 5. Of course, one could consider explicitly imposing certain sample-wise requirements in the objectives, such as enforcing *x*<sup>*i*</sup> to be close to *x̂*<sup>*i*</sup> = *g*(*f*(*x*<sup>*i*</sup>)). It has been observed empirically in GANs or VAEs that imposing such sample-wise similarity or dissimilarity would improve visual quality around samples of interest, as in DC-VAE [42] and OpenGAN [68]. However, theoretically, how such sample-wise distances or constraints may affect the difficulty or accuracy of learning the correct support and density of the distributions remains an open problem.

#### *4.3. Properties of the Closed-Loop Minimax Game*

Above are the two primary objectives for CTRL transcription: while the encoder *f* tries to maximize the expressiveness and discriminativeness of the learned LDR representation, the decoder *g* tries to minimize the reconstruction error and coding rates. The competing objectives of the encoder *f* and the decoder *g* naturally lead to a two-player game. In this paper, we have formulated this game as a zero-sum game, namely Equation (12). Likewise, we have also implemented the most straightforward algorithm for solving this zero-sum game: gradient descent–ascent (GDA) [52], where the minimizer and maximizer take alternating gradient steps. These simplifications into a GDA-optimized zero-sum game were made in order to create a concrete algorithm for our experimentation. However, simplifying to a zero-sum game and GDA is certainly not the only way to solve the more general game described above. This game-theoretic formulation puts CTRL transcription outside of the theoretical realm of [2], since we are no longer finding pure maximizers of Δ*R*(*Z*), but rather stable minimax equilibria.

As is the case with GANs, these equilibria may not necessarily be Nash equilibria [50], but rather equilibria in the more general sense of Stackelberg [69]. So, the problem of studying minimax equilibria of (12) is likely, in its most general form, quite challenging. Nevertheless, our experiments suggest such equilibria tend to be well-behaved, e.g., having a large range of attraction. Our extensive empirical experiments and ablation studies indicate that, in general, the minimax objective converges rather stably to good equilibria for all the real datasets without any special optimization tricks or particular requirements on the networks. The only important factor for the stability of the optimization seems to be a large enough batch size (see Appendix A.10). These observations can be further corroborated with analysis on simpler models: our ongoing work suggests that if we restrict our attention to simplified data structures (e.g., *X* distributed on a linear subspace), then one can provide theoretical guarantees that the equilibria become efficiently and correctly solvable by the minimax formulation. Extending such analysis to more sophisticated data structures (multiple subspaces, nonlinear submanifolds) remains an exciting new direction for future research.

Despite many possible pathological solutions to the minimax game, empirically, as we have presented in the previous section (alongside many examples in the Appendix A), the solution found by the simple GDA algorithm generally strikes a good trade-off between expressiveness and parsimony of the learned model. The solution automatically determines the proper dimensions for different classes. Ablation studies in Appendix A.10 on the large ImageNet dataset further suggest that this formulation is insensitive to overparameterization by increasing network width, as long as the batch size grows accordingly. However, a rigorous justification for such good model-selection properties remains widely open.

#### **5. Conclusions and Future Work**

This work provides a novel formulation for learning a *both generative and discriminative* representation for a multi-class, multi-dimensional, possibly nonlinear, distribution of real-world data. We have provided compelling empirical evidence that the distribution of most datasets can be effectively mapped to an LDR, a union of independent principal subspaces and principal components. The objective function is entirely based on an intrinsic information-theoretic measure, the rate reduction, without any other heuristics or regularizing terms. The objective can be achieved with a closed-loop minimax game between the encoder and the decoder networks, without any additional network(s).

The main purpose of this paper is to demonstrate the conceptual simplicity and practical potential of this new framework for distribution/representation learning, instead of striving for state-of-the-art performance with heavy engineering. Nevertheless, with our preliminary implementation, a more informative LDR of the data can be effectively learned with a simple closed-loop transcription for a variety of real-world, multi-class, multi-modal visual datasets, from small to large, from low-resolution to higher-resolution, from domain-specific to diverse categories. The so-learned encoder *f* already enjoys the benefits of AE/VAEs for their discriminative property, and the decoder *g* enjoys the benefits of GANs for their good generative visual quality. However, probably more importantly, the internal structures of the learned feature representation have now become transparent, hence *fully interpretable and controllable* (for generative purposes): visual differences between classes are naturally "disentangled" as independent subspaces, while diverse visual attributes within each class are "disentangled" as principal components within each subspace. From extensive ablation studies given in Appendix A, we see that the rate-reduction-based objective can be stably optimized across a wide range of datasets and network architectures without any additional regularizations or engineering tricks. Both the *feedback closed-loop* and the *rate-reduction measure* play indispensable roles in fostering the ease and success of finding the CTRL transcription.

One may notice that there are many ways this simple formulation can be significantly improved or extended. Firstly, in this work, we have simply adopted networks that were designed for GANs, but they may not be optimal for the rate-reduction-type objectives. For example, our ablation study already suggests that some of the components of such networks such as spectral normalization are not quite essential. Characteristics from the white-box ReduNet [2] derived from optimizing rate reduction can be explored in the future. Secondly, notice that our rate-reduction objectives do not impose any requirements on how individual samples should be encoded or decoded although the results from the multi-class objective indicate a certain level of alignment on the individual samples. Recent studies such as DC-VAE [42] or OpenGAN [68] suggest that imposing additional regularization on individual samples may further improve decoded visual quality. Such regularization can certainly be incorporated into this new framework. Last but not the least, compared to GANs and VAEs, our method leads to an *explicit* structured model for the feature distribution: a mixture of incoherent subspace Gaussians. Such an explicit model has the potential of making many subsequent tasks easier and better: better control of feature sampling for decoding and synthesis [70], designing more robust generators and classifiers for noise and corruptions based on the low-dimensional structures identified, or even extending to the settings of incremental and online learning [71,72]. We leave all these new directions, together with all the open theoretical problems posed in Section 4, for future investigation.

**Author Contributions:** This work has been the result of a successful team effort. In particular, the first four authors have contributed almost equally to this work. X.D.: investigation, methodology, project administration, software, writing—original draft preparation; S.T.: investigation, methodology, software, visualization, writing—original draft preparation; M.L.: investigation, software, visualization, writing—original draft preparation; Z.W.: investigation, software, visualization, writing original draft preparation; M.P.: formal analysis, writing—original draft preparation; K.H.R.C.: validation, writing—review and editing; P.Z.: formal analysis, writing—review and editing; Y.Y.: validation, writing—review and editing; X.Y.: resources, writing—review and editing; H.-Y.S.: resources, writing—review and editing; Y.M.: conceptualization, formal analysis, funding acquisition, methodology, supervision, writing—original draft preparation, writing—review and editing; All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by ONR grants N00014-20-1-2002 and N00014-22-1-2102, the joint Simons Foundation-NSF DMS grant #2031899, as well as partial support from Berkeley FHL Vive Center for Enhanced Reality and Berkeley Center for Augmented Cognition, Tsinghua-Berkeley Shenzhen Institute (TBSI) Research Fund, and Berkeley AI Research (BAIR).

**Data Availability Statement:** Data and results can be found in Section 3 and Appendix A.

**Acknowledgments:** Earliest ideas of this work were germinated during a hiking event of Ma's group on Berkeley hills during the summer of 2020. Former group members Chong You (now at Google) and Yichao Zhou (now at Apple) were part of a stimulating discussion on possible extensions or applications of a new rate-reduction framework being developed then. During the preparation of this work, we consulted several experts on some of the related topics. The authors would like to thank Jiantao Jiao of UC Berkeley for discussions about the theoretical conditions for learning distributions via GANs. We thank Benjamin Haeffele of Johns Hopkins University for sharing thoughts on how to learn subspaces correctly and on how to optimize the rate-reduction objectives efficiently. We would also like to thank Shankar Sastry and Manxi Wu of UC Berkeley and Chaobing Song of Univ. of Wisconsin-Madison for informative discussions on how to solve minimax games correctly and efficiently, as well as Chih-Yuan Chiu and Druv Pai of UC Berkeley for engaging discussions on theoretical directions for the CTRL transcription. Last but not the least, we would like to thank Stefano Soatto of UCLA for stimulating discussions and sometimes heated debates on how information can be efficiently and effectively encoded in deep networks.

**Conflicts of Interest:** The authors declare no conflicts of interest.

#### **Appendix A**

*Appendix A.1. Experiment Settings and Implementation Details*

**Network backbones.** For MNIST, we use the standard CNN models in Tables A1 and A2, following the DCGAN architecture [63]. We resize the MNIST images from 28 × 28 to 32 × 32 to fit the DCGAN architecture. All *α* in lReLU (lReLU is short for Leaky-ReLU) of the encoder are set to 0.2.

We adopt ResNet architectures for CIFAR-10, shown in Tables A3 and A4, and for STL-10, shown in Tables A5 and A6. Each ResBlock up is the same as in ResNet, but adds an up-sampler after the first conv layer. All batch normalization layers of the ResBlocks in the encoder are replaced with spectral normalization layers.

Finally, we use the same architecture for CelebA, LSUN-bedroom, and ImageNet-128 (see Tables A7 and A8), as all three datasets have the same 128 × 128 resolution. Again, each ResBlock up is the same as in ResNet, but adds an up-sampler after the first conv layer. All batch normalization layers in the encoder are replaced with spectral normalization layers. All experiments use the lightweight PyTorch library "mimicry" [73], which provides implementations of some popular state-of-the-art GANs and evaluation metrics.


**Table A1.** Decoder for MNIST.

**Table A2.** Encoder for MNIST.

**Table A3.** Decoder for CIFAR-10.

**Table A4.** Encoder for CIFAR-10.

**Table A5.** Decoder for STL-10.

**Table A6.** Encoder for STL-10.

**Table A7.** Decoder for CelebA-128, LSUN-bedroom-128, and ImageNet-128.

**Table A8.** Encoder for CelebA-128, LSUN-bedroom-128, and ImageNet-128.


**Optimization and training details.** Across all of our experiments, we use Adam [74] as our optimizer, with hyperparameters *β*<sub>1</sub> = 0.5 and *β*<sub>2</sub> = 0.999. We adopt the simple gradient descent–ascent algorithm, alternating between minimizing and maximizing the objectives. The initial learning rate is set to 0.00015 and is scheduled with linear decay. We choose *ε*<sup>2</sup> = 0.5 for both Equations (12) and (13) in all CTRL experiments. For all CTRL-Multi experiments on ImageNet, we only choose 10 classes. The details of the 10 classes are shown in Table A9. Most experiments are trained on RTX 3090 GPUs.
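The linear learning-rate decay can be sketched as a small helper. The text only states an initial value of 0.00015 with linear decay, so the final value of zero, the per-iteration granularity, and the function name `linear_decay_lr` are assumptions made here for illustration.

```python
def linear_decay_lr(step, total_steps, base_lr=1.5e-4, final_lr=0.0):
    """Linearly interpolate from base_lr (step 0) to final_lr (step total_steps).

    final_lr = 0.0 is an assumption; the text does not state the end value.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)  # clamp progress to [0, 1]
    return base_lr + frac * (final_lr - base_lr)


# Example with the 450,000 iterations used for CIFAR-10 in Appendix A.4.
for step in (0, 225_000, 450_000):
    print(step, linear_decay_lr(step, total_steps=450_000))
```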


**Table A9.** ID and corresponding category for the 10 selected classes of ImageNet.

#### *Appendix A.2. MNIST*

**Settings.** On the MNIST dataset, we train our model using the DCGAN [63] architecture with our proposed objectives CTRL-Multi (12) and CTRL-Binary (13). The learning rate is set to 10<sup>−4</sup> and the batch size is set to 2048. We train our model for 15,000 iterations.

**More results illustrating auto-encoding.** Here, we give more reconstruction results, or *X*ˆ , from the CTRL-Multi and CTRL-Binary objectives, compared to their corresponding original input *X*. As shown in Figure A1, the CTRL-Binary objective can generate clean digit-like images, but the decoded *X*ˆ might resemble digits from classes that are similar to, but different from, those of the input data *X*, since CTRL-Binary tends to align only the overall distribution of all digits.

In contrast, with the CTRL-Multi objective, the decoded *X*ˆ not only are consistent with the correct class of the input data *X*, but also show a very clear one-to-one mapping between individual samples *x* and *x*ˆ, although the objective (12) does not enforce that. Compared with the results from VAE-GAN [34] and BiGAN [38], our decoded images make fewer errors in reconstruction and preserve the individual characteristics of the original samples much better.

**Figure A1.** The comparison of the reconstruction results of different methods with the input data.

**Images decoded from random samples on the learned multi-class LDR.** Since our CTRL-Multi objective function maps input data of each class into a different (orthogonal) subspace in the feature space, we can generate images conditioned on each class by random sampling *z* in the subspace of each class and then decode them back to the input space as *x*ˆ.

To perform random sampling in the learned subspace, we first calculate the mean feature *z̄*<sub>*j*</sub> and the singular vectors *v*<sub>*j*</sub><sup>*i*</sup> from the SVD (or principal components) of the learned features *Z*<sub>*j*</sub> of the training data in class *j*, where the index *i* denotes the *i*-th principal component. We only use the top *r* = 8 principal components of each class on the MNIST dataset. These statistics of the subspace can be used for guiding the random sampling. Then, we sample *z* randomly along the principal components and around the mean feature as

$$\mathbf{z}_{random\_j} = \bar{\mathbf{z}}_j + \alpha \sum_{i=1}^{r} n_i \, \sigma_j^i \, \mathbf{v}_j^i, \tag{A1}$$

where *z̄*<sub>*j*</sub> is the mean feature of class *j*, *σ*<sub>*j*</sub><sup>*i*</sup> and *v*<sub>*j*</sub><sup>*i*</sup> are the *i*-th singular value and principal component of class *j*, and *n*<sub>*i*</sub> are i.i.d. Gaussian 𝒩(0, 1) random variables. That is, the feature in each subspace/class is modeled by an *r*-dimensional multivariate Gaussian whose spread along the *i*-th principal direction is governed by *σ*<sub>*j*</sub><sup>*i*</sup>, which characterizes the variance of the training data in the feature space. Here, *α* is a hyperparameter that controls the sampling range. As for the visualization of randomly generated images *g*(*z*<sub>*random*\_*j*</sub>) conditioned on the given class, we compare our method with some other conditional generation methods such as ACGAN [25] and InfoGAN [21] (for ACGAN and InfoGAN, we generate images conditioned on class labels with randomly sampled latent *z* according to the procedures mentioned in their respective works). Our model can give realistic and correct conditional generation results with high diversity in each class, while other methods may make mistakes in the generation between some similar classes, such as classes 3 and 5 for InfoGAN.
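The sampling rule (A1) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name, the column-per-sample layout of the feature matrix, and the choice to center the features before taking the SVD are all assumptions.

```python
import numpy as np


def sample_class_features(Z_j, r=8, alpha=1.0, num_samples=16, rng=None):
    """Draw features around the class mean along the top-r principal directions.

    Z_j: array of shape (d, n_j), one column per training feature of class j.
    Returns an array of shape (d, num_samples).
    """
    rng = np.random.default_rng() if rng is None else rng
    z_bar = Z_j.mean(axis=1, keepdims=True)            # mean feature, shape (d, 1)
    # Assumption: the SVD is taken of the centered features.
    U, S, _ = np.linalg.svd(Z_j - z_bar, full_matrices=False)
    V = U[:, :r]                                       # principal directions v_j^i
    sigma = S[:r]                                      # singular values sigma_j^i
    n = rng.standard_normal((r, num_samples))          # i.i.d. N(0, 1) coefficients
    return z_bar + alpha * V @ (sigma[:, None] * n)    # Equation (A1)


# Toy usage on synthetic features; a trained decoder g would then map each
# column back to image space.
rng = np.random.default_rng(0)
Z_toy = rng.standard_normal((16, 50))
print(sample_class_features(Z_toy, r=8, alpha=0.1, num_samples=4, rng=rng).shape)
```

Because the raw singular values grow with the number of samples, the hyperparameter `alpha` has to be tuned to the dataset size in this sketch.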

**Figure A2.** Comparison of randomly generated images conditioned on each class.

**Interpolation between samples in different classes.** We randomly sample some images from each class. For each image *x*<sub>1</sub>, we randomly sample another image *x*<sub>2</sub> from a different class. For such a pair of images *x*<sub>1</sub> and *x*<sub>2</sub>, we reconstruct them based on their linearly interpolated feature representations by *x*ˆ = *g*(*α f*(*x*<sub>1</sub>) + (1 − *α*) *f*(*x*<sub>2</sub>)), *α* ∈ [0, 1], the results of which are shown in Figure A3. For each row in the figure, from left to right, the reconstructed images continuously morph from one digit to a different digit with a natural transition in shape rather than a simple superposition of the two images. This also confirms that the space between the subspaces for the digits does not represent valid digits but only shapes with digit-like strokes. Hence, for generative purposes, knowing the supports of valid digits is extremely important.
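The interpolation above amounts to decoding convex combinations of two encoded features. The sketch below assumes the encoder `f` and decoder `g` are available as plain callables; the linear stand-ins used in the usage lines are hypothetical placeholders, not the trained networks.

```python
import numpy as np


def interpolate_decodings(f, g, x1, x2, num_steps=8):
    """Decode g(alpha * f(x1) + (1 - alpha) * f(x2)) for alpha evenly spaced in [0, 1]."""
    z1, z2 = f(x1), f(x2)
    alphas = np.linspace(0.0, 1.0, num_steps)
    return [g(a * z1 + (1.0 - a) * z2) for a in alphas]


# Hypothetical stand-ins for the trained encoder/decoder, for illustration only.
toy_f = lambda x: 2.0 * x
toy_g = lambda z: 0.5 * z

x1 = np.array([1.0, 0.0])
x2 = np.array([0.0, 1.0])
frames = interpolate_decodings(toy_f, toy_g, x1, x2, num_steps=5)
print(len(frames))
```

With `alpha = 0` the decoding corresponds to the second sample and with `alpha = 1` to the first, so the returned list runs from *x*<sub>2</sub> to *x*<sub>1</sub>.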

**Figure A3.** Images generated from the interpolation between samples in different classes.

#### *Appendix A.3. Transformed MNIST*

**Settings.** In this experiment, we verify that the CTRL-Multi objective can preserve diverse data modes in the learned feature embeddings. We construct a transformed MNIST dataset with five modes: normal, large (1.5×), small (0.5×), rotate 45° left, and rotate 45° right. Each image data point will be randomly transformed to one of the modes. Representative examples of such training data can be found in Figure A4a. We train the model with learning rate 1 × 10<sup>−4</sup> and batch size 2048 for 15,000 iterations.

**Auto-encoding results.** Figure A4b gives the decoded results of the training data with different modes. Even though the data are now much more diverse for each class, the decoder learned from the CTRL-Multi objective can still achieve high sample-wise similarity to the original images.

(**a**) Original *X* (**b**) Decoded *X*ˆ

**Figure A4.** Original (training) data *X* and their decoded version *X*ˆ on the transformed MNIST.

**Identifying different modes.** Similar to the earlier experiments of Figure 6 for CIFAR-10 in the main paper, we find the top principal components of features of each class *Z<sup>j</sup>* (via SVD) and generate new images using the learned decoder *g* from features of the training images aligned the best with these components.

In Figure A5, we select three classes 0, 1, 2 and visualize samples from the top *r* = 8 principal components for each class. Each row represents one principal component direction. As can be seen in the figure, the decoded images along each principal component show a similar mode, and the modes along different component directions are rather incoherent. All major modes of the original data can be identified as one of these principal component directions. This clearly shows that the CTRL-Multi objective can keep the different modes within each class of the data *X*<sub>*j*</sub> as the principal component directions of *Z*<sub>*j*</sub>, and these modes can also be retained in the decoded images *X*ˆ<sub>*j*</sub>.

**Figure A5.** The reconstructed images *X*ˆ from the features *Z* best aligned along top-8 principal components on the transformed MNIST dataset. Each row represents a different principal component.

#### *Appendix A.4. CIFAR-10*

**Settings.** For all experiments on CIFAR-10, we follow the common training hyperparameters in Appendix A.1. Beyond that, for each experiment, we run 450,000 iterations with batch size 1600.

**Images decoded from random samples on the CTRL-Multi.** We sample *z* in the feature space randomly along the principal components and around the mean feature of each class *Z<sup>j</sup>* as in the MNIST case, according to Equation (A1). The generated images from the sampled features are illustrated in Figure A6, one row per class. As we see, the generator learned from the CTRL-Multi objective is capable of generating diverse images for each class.

Further, for visualization of randomly generated images *g*(*z*<sub>*random*\_*j*</sub>) conditioned on the given class, we compare our method with some other conditional generation methods such as ACGAN [25] and InfoGAN [21]. For all three experiments, we have randomly sampled 8 images per class in CIFAR-10. For more complex datasets such as CIFAR-10, our model can give more realistic conditional generation results for different classes with high diversity within each class.

**Figure A6.** Comparison of randomly generated images conditioned on each class.

**Generating images along different PCA components for each class.** For each class, we first compute the top 10 principal components (singular vectors of the SVD) of *Z*, and then, for each of the top singular vectors, we display in each row the top 10 reconstructed images *X*ˆ whose *Z* are closest to the singular vector, using the methods described in the main body of the paper, Section 3.3. The results are given in Figure A7. Notice that images in each row are very similar as they are sampled along the same principal component, whereas images in different rows are very different as they are orthogonal in the feature space. These results indicate that the features learned by our method can not only disentangle different classes as orthogonal subspaces but can also disentangle different visual attributes within each class as (orthogonal) principal components within each subspace.
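One way to select the samples "closest" to each singular vector is sketched below. The closeness measure used here, the absolute inner product between a feature and the singular vector, is an assumption; the exact criterion is the one given in Section 3.3, and the function name is hypothetical.

```python
import numpy as np


def top_aligned_indices(Z_j, num_components=10, num_per_component=10):
    """For each top singular vector of Z_j, return indices of the best-aligned samples.

    Z_j: array of shape (d, n_j), one column per feature of class j.
    Returns an integer array of shape (num_components, num_per_component).
    """
    U, _, _ = np.linalg.svd(Z_j, full_matrices=False)
    # Assumption: alignment is scored by |u_i^T z| for each sample z.
    scores = np.abs(U[:, :num_components].T @ Z_j)     # (num_components, n_j)
    order = np.argsort(-scores, axis=1)                # best-aligned first
    return order[:, :num_per_component]


# Toy usage: two dominant directions, three samples.
Z_toy = np.array([[3.0, 0.0, 0.1],
                  [0.0, 2.0, 0.1]])
print(top_aligned_indices(Z_toy, num_components=2, num_per_component=1))
```

Each row of the returned index array corresponds to one row of Figure A7; decoding the selected training features with *g* would give the displayed images.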

**Figure A7.** Reconstructed images *X*ˆ from features *Z* close to the principal components learned for the 10 classes of CIFAR-10.

#### *Appendix A.5. STL-10*

**Settings.** For all experiments on STL-10, we follow the common training hyperparameters in Appendix A.1. For the CTRL-Binary setting, we train for 150,000 iterations. For the CTRL-Multi setting, we initialize the weights from the CTRL-Binary checkpoint at the 20,000th iteration and train for another 80,000 iterations (with the CTRL-Multi objective). The IS and FID scores on the STL-10 dataset are reported in Table A10; they are on par with or even better than those of existing methods such as SNGAN [31] or DC-VAE [42].

**Visualizing auto-encoding property for the CTRL-Binary.** We visualize the original images *x* and their decoded *x*ˆ generated by the LDR model learned from the CTRL-Binary objective. The results are shown in Figure A8 for STL-10.

**Figure A8.** Visualizing the original *x* and corresponding decoded *x*ˆ results on STL-10 dataset. Note the model is trained from the CTRL-Binary objective hence sample- or class-wise correspondence is relatively poor, but the decoded image quality is very good.

#### *Appendix A.6. Celeb-A and LSUN*

To verify that our formulation works on images of higher resolution, we conduct experiments on the Celeb-A and LSUN datasets, which have a resolution of 128 × 128.

**Settings.** For all experiments on these datasets, we follow the common training hyperparameters in Appendix A.1. We use a batch size of 300 for Celeb-A and LSUN. Both models are trained with the CTRL-Binary objective for 450,000 iterations.

**Generating images along different PCA components.** We calculate the principal components of the learned features *Z* in the latent subspace. We manually choose three principal components which are related to hat, hair color, and glasses (see Figure A9). The three components are the 9th, 19th, and 23rd, respectively, of the overall 128 principal components. These principal directions seem to clearly disentangle visual attributes/factors such as wearing a hat, changing hair color, and wearing glasses.

**Images generated from random sampling of the feature space.** We sample *z* randomly according to the following Gaussian model:

$$\mathbf{z}_{random} = \bar{\mathbf{z}} + \alpha \sum_{i=1}^{r} n_i \, \sigma_i \, \mathbf{v}_i, \tag{A2}$$

where *z̄* is the mean feature, *σ*<sub>*i*</sub> and *v*<sub>*i*</sub> are the *i*-th singular value and singular vector, respectively, and *n*<sub>*i*</sub> are i.i.d. Gaussian 𝒩(0, 1) random variables. As before, *α* is a hyperparameter that controls the sampling range. We use the top *r* = 100 principal components for random sampling. The randomly generated images are realistic and diverse (see Figure A10).

**Visualizing auto-encoding property for CTRL-Binary.** We visualize the original images *x* and their decoded *x*ˆ using the LDR model learned from the CTRL-Binary objective. The results are shown in Figures A11 and A12 for the Celeb-A dataset and the LSUN dataset, respectively. The CTRL-Binary objective can give very good visual quality for *x*ˆ but cannot ensure sample-to-sample alignment. Nevertheless, the decoded *x*ˆ seems to be very similar to the original *x* in some main visual attributes. We believe the binary objective manages to align only the dominant principal component(s) associated with the most salient visual attributes, say, pose of the face for Celeb-A or layout of the room for LSUN, between features of *X* and *X*ˆ .

(**a**) Hat (**b**) Hair Color (**c**) Glasses

**Figure A9.** Sampling along the 9th, 19th, and 23rd principal components of the learned features *Z* seems to manipulate the visual attributes for generated images on the CelebA dataset.

**Figure A10.** Images decoded from randomly sampled features, as a learned Gaussian distribution (A2), for the CelebA dataset.

(**a**) Original *X* (**b**) Decoded *X*ˆ

**Figure A11.** Visualizing the original *x* and corresponding decoded *x*ˆ results on Celeb-A dataset. The LDR model is trained from the CTRL-Binary objective.

(**a**) Original *X* (**b**) Decoded *X*ˆ

**Figure A12.** Visualizing the original *x* and corresponding decoded *x*ˆ results on LSUN-bedroom dataset. The LDR model is trained from the CTRL-Binary objective.

#### *Appendix A.7. ImageNet*

**Settings.** To verify that the CTRL works on large-scale datasets, we train it on the ImageNet. For all experiments on the ImageNet, we follow the common training hyperparameters in Appendix A.1.

We first train our model with the CTRL-Binary objective with batch size of 1800 on the whole ImageNet ILSVRC 2012 dataset. The number of training iterations is 450,000.

After that, we fine-tune the pretrained model with the CTRL-Multi objective, on 10 selected classes. Information about the 10 classes can be found in Table A9. The fine-tune batch size is 1024, and we train another 35,000 iterations for it. This experiment takes 120 GPU hours on 8 A100-SXM4 GPUs. Note that our choice of batch size is substantially larger than those commonly adopted in other works while training on the ImageNet (e.g., 128 in [31]). We empirically observe that training with a larger batch size generates images of better quality and clearer class alignment. This is consistent with the proposed CTRL-Multi objective as it explicitly encourages alignment of class distributions, therefore benefiting from a larger batch that better captures overall data distributions. We leave a more rigorous study of the effect of batch size for future work.

Due to the heavy computation required by such a large batch size, we present the intermediate result obtained at the early iteration 35,000, whereas most existing methods run for a significantly larger number of iterations. Nevertheless, the intermediate result already verifies the efficacy of our framework. In addition, we present the full version of the comparison with existing generative methods in Table A10. We see that the IS and FID scores for CTRL-Multi degrade a little after the fine-tuning. This is expected, as learning a more refined separation and alignment of 10 classes is a more challenging task than for 2 classes. This is consistently observed in experiments on other datasets too.

**Visualizing feature similarity for CTRL-Multi.** We visualize the cosine similarity among features *Z* of different classes learned from the CTRL-Multi objective in Figure A13. In addition, we provide the visualization of alignment between features *Z* and decoded features *Z*ˆ. These results demonstrate that not only has the encoder already learnt to discriminate between classes, but the learned *Z* and *Z*ˆ are also clearly aligned within each class.
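The similarity heatmaps described here can be computed as the absolute cosine similarity between feature columns. The following is a minimal sketch under the assumption that features are stored one per column and are nonzero; the function name is hypothetical, and sorting the columns by class label before plotting is what would make a block-diagonal structure visible.

```python
import numpy as np


def abs_cosine_similarity(Z, Z_other=None):
    """Return |Zn^T Zo|, where Zn and Zo have unit-norm columns.

    Z, Z_other: arrays of shape (d, n). If Z_other is None, Z is compared with itself.
    """
    Z_other = Z if Z_other is None else Z_other
    Zn = Z / np.linalg.norm(Z, axis=0, keepdims=True)
    Zo = Z_other / np.linalg.norm(Z_other, axis=0, keepdims=True)
    return np.abs(Zn.T @ Zo)


# Toy usage: columns 0 and 2 are parallel, column 1 is orthogonal to both.
Z_toy = np.array([[1.0, 0.0, 2.0],
                  [0.0, 1.0, 0.0]])
print(abs_cosine_similarity(Z_toy))
```

Passing the decoded features as `Z_other` gives the second kind of heatmap, between *Z* and *Z*ˆ.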


**Table A10.** Comparison on CIFAR-10, STL-10, and ImageNet. ↑ means higher is better. ↓ means lower is better.

**Figure A13.** Visualizing feature alignment: (**a**) among features |*Z*<sup>⊤</sup>*Z*|, (**b**) between features and decoded features |*Z*<sup>⊤</sup>*Z*ˆ|. These results are obtained after 200,000 iterations.

**Visualizing auto-encoding property for CTRL-Multi.** We visualize the original images *X* and their decoded *X*ˆ using the LDR model fine-tuned with the CTRL-Multi objective. The results are shown in Figure A14 for the selected 10 classes in ImageNet. The CTRL-Multi objective can give good visual quality for *X*ˆ as well as sample-to-sample alignment.

(**a**) Original *X* (**b**) Decoded *X*ˆ

**Figure A14.** Visualizing the original *X* and corresponding decoded *X*ˆ results on ImageNet (10 classes). The LDR model is fine-tuned using the CTRL-Multi objective. These visualizations are obtained after 35,000 iterations.

#### *Appendix A.8. Ablation Study on Closed-Loop Transcription and Objective Functions*

To empirically validate the necessity and respective roles of the closed-loop transcription and the rate reduction (Δ*R*) objective, we conduct two sets of experiments. For the first set of experiments, we modify our closed-loop architecture by instantiating more than two networks while keeping the objective function (12) unchanged. For the second set of experiments, we keep the closed-loop architecture but replace all rate reduction (Δ*R*) terms in (12) with corresponding cross-entropy, or remove some of the terms. Experiments here shed insight onto how the closed-loop transcription and the rate reduction affect separately the performance, including sample-wise reconstruction, the alignment of *Z* and *Z*ˆ space, and the diversity of intra-class features.

#### Appendix A.8.1. The Importance of the Closed-Loop

To evaluate the importance of the closed-loop transcription, we experiment on modified versions of the closed-loop architecture (A3). Notice that many architectures have been proposed and experimented before to promote the encoder *f* and decoder *g* to be mutually inverse or cycle consistent (at least for mappings between the data and feature distributions), such as BiGAN [38], VAE-GAN [34], and CycleGAN [56]. However, the cycle consistency is typically enforced through a third discriminator network. (In the case of CycleGAN [56], one needs two additional discriminator networks, one for each domain).

Here, we experiment on whether similar ideas work with the rate-reduction objective. First, we break the closed loop and use a separate encoder network *f*<sup>2</sup> : *X*ˆ → *Z*ˆ to replace the original encoder *f*. The revised architecture is summarized in the diagram (A4). Second, to emulate the architecture of VAE-GAN [34], we also instantiate an extra encoder network *f*<sup>2</sup> and compute the CTRL-Multi objective using *Z*˜ and *Z*ˆ. The resulting architecture is also summarized in the diagram (A5).

$$\mathbf{X} \xrightarrow{f(\mathbf{x}, \theta)} \mathbf{Z} \xrightarrow{g(\mathbf{z}, \eta)} \hat{\mathbf{X}} \xrightarrow{f(\mathbf{x}, \theta)} \hat{\mathbf{Z}}; \tag{A3}$$

$$\mathbf{X} \xrightarrow{f^1(\mathbf{x}, \theta^1)} \mathbf{Z} \xrightarrow{g(\mathbf{z}, \eta)} \hat{\mathbf{X}} \xrightarrow{f^2(\mathbf{x}, \theta^2)} \hat{\mathbf{Z}}; \tag{A4}$$

$$\mathbf{X} \xrightarrow{f^1(\mathbf{x}, \theta^1)} \mathbf{Z} \xrightarrow{g(\mathbf{z}, \eta)} \hat{\mathbf{X}}, \mathbf{X} \xrightarrow{f^2(\mathbf{x}, \theta^2)} \hat{\mathbf{Z}}, \tilde{\mathbf{Z}}. \tag{A5}$$

We run experiments on MNIST with the three different architectures, using the network from Table A2 for the encoder and the network from Table A1 for the decoder; the training hyperparameters follow Appendix A.1. The qualitative results are shown in Figure A15. Both architectures (A4) and (A5) failed to generate meaningful images. These experiments show that directly applying rate-reduction objectives without the closed loop, or with architectures that only loosely enforce cycle consistency, fails to work. Instead, the closed-loop formulation allows us to use only two networks, without the need for any extra network.

**Figure A15.** Qualitative results for ablation study with alternative architectures to the proposed CTRL.

#### Appendix A.8.2. The Importance of Rate Reduction

To replace the rate reduction (Δ*R*) terms in the objective function (12) with cross-entropy, we introduce a linear mapping *W* ∈ ℝ<sup>*d*×*k*</sup> to map *Z* ∈ ℝ<sup>*d*×*n*</sup> from the feature space to logits *γ* = *Z*<sup>⊤</sup>*W*. We then calculate the softmax cross-entropy function on the logits *γ* and the one-hot label matrix *Y* ∈ ℝ<sup>*n*×*k*</sup>:

$$\mathcal{H}(\boldsymbol{\gamma}, \mathbf{Y}) = \sum_{i=1}^{n} \sum_{j=1}^{k} Y_{ij} \log \frac{e^{\gamma_{ij}}}{\sum_{l=1}^{k} e^{\gamma_{il}}}.$$

Then, we can replace the first two terms of (12) (Δ*R*(*Z*) and Δ*R*(*Z*ˆ)) with ℋ(*Z*<sup>⊤</sup>*W*, *Y*) and ℋ(*Z*ˆ<sup>⊤</sup>*W*, *Y*). For the third term of (12), we extract the *j*-th class one-hot features *γ*<sub>*j*</sub> = *Z*<sub>*j*</sub><sup>⊤</sup>*W* and *γ*ˆ<sub>*j*</sub> = *Z*ˆ<sub>*j*</sub><sup>⊤</sup>*W* from *Z* and *Z*ˆ, and define the distance between them as

$$\mathcal{D}(\gamma_j, \hat{\gamma}_j) = \frac{e^{\gamma_j}}{e^{\gamma_j} + e^{\hat{\gamma}_j}}.$$

For the third term of (12), we further introduce *k* linear layers as discriminators {𝒟<sub>*j*</sub>}<sub>*j*=1</sub><sup>*k*</sup>, one for each class. Then, we replace the third term with the GAN objective function ∑<sub>*j*=1</sub><sup>*k*</sup> 𝔼[log 𝒟<sub>*j*</sub>(*Z*<sub>*j*</sub>)] + 𝔼[log(1 − 𝒟<sub>*j*</sub>(*Z*ˆ<sub>*j*</sub>))] (𝔼[*X*] denotes the expectation of *X*). Now, we have the cross-entropy version of the objective function, (A6), for the closed-loop framework. We denote the closed-loop framework with cross-entropy as Closed-loop-CE.

$$\min\_{\eta} \max\_{\theta, \mathbf{W}, \mathcal{D}} \mathcal{T}\_{\mathbf{X}}(\theta, \eta, \mathbf{W}, \mathcal{D}) \ \doteq \mathcal{H}(\mathbf{Z}^{\top} \mathbf{W}, \mathbf{Y}) + \mathcal{H}(\hat{\mathbf{Z}}^{\top} \mathbf{W}, \mathbf{Y}) +$$

$$\sum\_{j=1}^{k} \mathbb{E}[\log \mathcal{D}\_{j}(\mathbf{Z}\_{j})] + \mathbb{E}[\log(1 - \mathcal{D}\_{j}(\hat{\mathbf{Z}}\_{j}))].\tag{A6}$$
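As a concrete reading of the cross-entropy terms in (A6), the following minimal sketch evaluates $\mathcal{H}(Z^\top W, Y)$ in plain Python. It assumes that features are stored column-wise ($Z$ is $d \times n$) and follows the sign convention exactly as written in the text (no leading minus). The function names are illustrative and are not taken from the authors' code.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of one row of logits."""
    top = max(row)
    log_norm = top + math.log(sum(math.exp(v - top) for v in row))
    return [v - log_norm for v in row]


def logits_from_features(Z, W):
    """gamma = Z^T W for Z of shape d x n and W of shape d x k (nested lists)."""
    d, n, k = len(Z), len(Z[0]), len(W[0])
    return [[sum(Z[t][i] * W[t][j] for t in range(d)) for j in range(k)]
            for i in range(n)]


def softmax_cross_entropy(logits, onehot):
    """H(gamma, Y) = sum_i sum_j Y_ij * log softmax(gamma_i)_j, as in the text."""
    total = 0.0
    for g_row, y_row in zip(logits, onehot):
        total += sum(y * l for y, l in zip(y_row, log_softmax(g_row)))
    return total


# Toy example (assumed values): d = n = k = 2.
Z = [[1.0, 0.0], [0.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
Y = [[1, 0], [0, 1]]
H = softmax_cross_entropy(logits_from_features(Z, W), Y)
```

With this convention the value is never positive, and it approaches zero as the logits of the correct classes dominate.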

We run the experiments on MNIST and CIFAR10. The network architectures for MNIST and CIFAR10 are given in Tables A1–A4 (in this section, we use the terms Decoder and Generator interchangeably, and likewise Encoder and Discriminator).

**Results on MNIST.** The training hyper-parameters of CTRL-Multi and Closed-loop-CE on MNIST follow Appendix A.1. Comparisons between CTRL-Multi and Closed-loop-CE are shown in Figures A16–A18.

Figure A16b,c show the reconstructed images $\hat{X}$ from Closed-loop-CE and CTRL-Multi. Both methods can give sample-wise reconstruction results due to the closed-loop transcription framework. However, comparing training images whose features are best aligned with the principal components of class '2' in Figure A17, we see that the principal components of CE features do not correspond to consistent visual attributes of the images, whereas ours do.

From the heatmaps in Figure A18a,b, we see that the features learned by rate reduction possess clear orthogonal subspace structures, whereas those learned by Closed-loop-CE do not. Moreover, Figure A18c,d show that the learned features of CTRL-Multi have higher singular values for the top principal components of each class, corresponding to a more linearized and diverse feature distribution, whereas those learned by Closed-loop-CE do not.
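The two diagnostics discussed above can be computed along the following lines. This is a hedged sketch rather than the authors' evaluation code: it assumes that features are stored as columns of $d \times n$ arrays, that the heatmap is the entry-wise absolute value of $Z^\top \hat{Z}$, and that the per-class spectra are plain singular values of each class's feature columns. The helper names are hypothetical.

```python
import numpy as np


def alignment_heatmap(Z, Z_hat):
    """Entry-wise |Z^T Z_hat|: an n x n matrix of absolute inner products."""
    return np.abs(Z.T @ Z_hat)


def per_class_singular_values(Z, labels):
    """Singular values of the feature columns that belong to each class."""
    return {int(c): np.linalg.svd(Z[:, labels == c], compute_uv=False)
            for c in np.unique(labels)}


# Toy features (assumed values): two classes in orthogonal 1-D subspaces.
Z = np.array([[1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0]])
labels = np.array([0, 0, 1, 1])
heat = alignment_heatmap(Z, Z)
spectra = per_class_singular_values(Z, labels)
```

In this toy case the heatmap is block diagonal, which is the kind of orthogonal subspace structure the text attributes to rate reduction.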

**Figure A16.** The comparison of sample-wise reconstruction between the Closed-loop-CE objective and the CTRL-Multi objective.

**Figure A17.** Training samples along different principal components of the learned features of digit '2'.

**Figure A18.** Comparison of Closed-loop-CE and CTRL-Multi on $|Z^\top \hat{Z}|$ and PCA singular values. (**a**) $|Z^\top \hat{Z}|$ from Closed-loop-CE. (**b**) $|Z^\top \hat{Z}|$ from CTRL-Multi. (**c**) PCA of learned features by the Closed-loop-CE objective for each class. (**d**) PCA of learned features by the CTRL-Multi objective for each class.

**Failed Attempts on CIFAR-10 with Cross Entropy.** The training hyper-parameters of Closed-loop-CE on CIFAR10 follow Appendix A.1. We perform a grid search over three hyper-parameters: learning rate ($1.5 \times 10^{-2}$, $1.5 \times 10^{-3}$, or $1.5 \times 10^{-4}$), batch size (800 or 1600), and inner loop (1, 2, 3, or 4), for 24 experiments in total. In all cases, Closed-loop-CE fails to converge or experiences model collapse on the CIFAR-10 dataset.
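The size of this grid can be checked by enumerating it. The snippet below is only an illustration of the search space described above; the dictionary keys are hypothetical names, and no training code is implied.

```python
from itertools import product

learning_rates = [1.5e-2, 1.5e-3, 1.5e-4]
batch_sizes = [800, 1600]
inner_loops = [1, 2, 3, 4]

# One dictionary per configuration: 3 * 2 * 4 combinations.
grid = [
    {"learning_rate": lr, "batch_size": bs, "inner_loop": il}
    for lr, bs, il in product(learning_rates, batch_sizes, inner_loops)
]
print(len(grid))  # 24
```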

Appendix A.8.3. Ablation Study on the CTRL-Multi Objectives

In this section, we investigate the influence of each term of the objective function (12) and how it affects the learned features $Z$, $\hat{Z}$ and the sample-wise reconstruction. We follow the same experimental setting as CTRL-Multi on MNIST (Appendix A.1) and conduct three experiments, each with a modified version of the original objective. Objective I is the original objective with all three terms, Objective II removes the second term $\Delta R(\hat{Z})$, and Objective III keeps only the third term $\Delta R(Z, \hat{Z})$. The results in Figure A19 show that Objective II still maintains the sample-wise reconstruction property, but the image quality is lower than that of the images reconstructed with Objective I (Figure A19b vs. Figure A19c). Objective III loses the sample-wise reconstruction property (Figure A19a vs. Figure A19d). Finally, the results in Figures A20 and A21 show that without the first two terms, the learned features $Z$ and $\hat{Z}$ have poor class-to-class alignment, and their principal components do not show a clear subspace structure with higher singular values within each class.



**Figure A19.** The influence of the choice of objective function on the reconstruction: decoded images $\hat{X}$ from Objective I, II, or III.

**Figure A20.** Correlation $|Z^\top \hat{Z}|$ between features $Z$ and $\hat{Z}$ learned with Objective I, II, or III.

**Figure A21.** PCAs of the features learned with Objective I, II, or III.

*Appendix A.9. Ablation Study on Sensitivity to Spectral Normalization*

It is known that spectral normalization is important for improving the stability of GAN training. Here, we test our formulation with and without spectral normalization. We follow the setting from Appendix A.1 and test on CIFAR10, using the network architecture from Tables A3 and A4. The two experiments use exactly the same settings except for the presence or absence of spectral normalization. We see that our formulation is stable in both settings and generates similar images. The only difference is that the quantitative scores in terms of IS and FID are higher with spectral normalization.

**Table A12.** Ablation study on the influence of spectral normalization. ↑ means higher is better. ↓ means lower is better.


#### *Appendix A.10. Ablation Study on Trade-Off between Network Width and Batch Size*

Empirically, we observed that for our formulation, the larger the batch size, the better the results. To justify our use of a batch size larger than those adopted in previous works such as [31], we conduct the following experiment, which studies the training behavior of our proposed CTRL-Multi objective. Specifically, we train on the selected 10 classes of ImageNet while varying the batch size and the number of channels in the widest layer of our chosen architecture (specified in Appendix A.1). We train both the encoder and decoder from scratch without fine-tuning. The other hyper-parameter settings detailed in Appendix A.7 are fixed. We present the results in Table A13. In the table, we denote training sessions that do not produce meaningful images as "failure" and those that do as "success". In the "failure" scenario, we noticed that the second term in the CTRL-Multi objective (12) would collapse to near 0 and could not be recovered, implying that the decoder has essentially lost the minimax game. In the "success" scenario, the first two terms of (12) stay close to each other and neither collapses to near 0. The results present an interesting diagonal pattern that captures the relationship between batch size and network width. A wider network with more channels has greater capacity but requires a larger batch to stabilize training. This experiment justifies our use of a larger batch in the experiments in Appendix A.7 and also reveals a trade-off between network capacity and batch size for training.

**Table A13.** Ablation study on ImageNet on the trade-off between batch size (BS) and network width (Channel #).


#### *Appendix A.11. Ablation Study on Feature Dimension*

In this paper so far, for simplicity and uniformity, we have chosen the feature dimension $d = n_z$ to be 128 for all experiments. In practice, however, the choice of feature dimension may affect the performance of the learned features: common practice suggests that the larger the model, the better the performance could be. Hence, in this last section, we conduct experiments to show how the feature dimension affects the performance. It is not our intention to find the best feature dimension (nor the best network) in this work. We only want to show that there is room to improve the results presented in this paper.

The baseline experiment is conducted on CIFAR-10 with the architectures from Table A2 and Table A1, and the training hyper-parameters follow the setting in Appendix A.1. Here, we change the feature dimension $n_z$, batch size, and learning rate to 512, 8196, and $0.5 \times 10^{-4}$, respectively. Figure A22 compares reconstructed images (randomly selected, not cherry-picked) with the original ones. We observe a significant improvement in visual quality over the results with a lower feature dimension. The IS and FID scores reported in Table A14 also confirm the improvement.

**Table A14.** IS and FID scores of images reconstructed by LDR models learned with different feature dimensions. ↑ means higher is better. ↓ means lower is better.


**Figure A22.** Reconstruction results by LDR models learned with different feature dimensions.

#### **References**


## *Article* **An Information Theoretic Interpretation to Deep Neural Networks †**

**Xiangxiang Xu <sup>1</sup>, Shao-Lun Huang <sup>1,</sup>\*, Lizhong Zheng <sup>2</sup> and Gregory W. Wornell <sup>2</sup>**


**Abstract:** With the unprecedented performance achieved by deep learning, it is commonly believed that deep neural networks (DNNs) attempt to extract informative features for learning tasks. To formalize this intuition, we apply the local information geometric analysis and establish an information-theoretic framework for feature selection, which demonstrates the information-theoretic optimality of DNN features. Moreover, we conduct a quantitative analysis to characterize the impact of network structure on the feature extraction process of DNNs. Our investigation naturally leads to a performance metric for evaluating the effectiveness of extracted features, called the H-score, which illustrates the connection between the practical training process of DNNs and the information-theoretic framework. Finally, we validate our theoretical results by experimental designs on synthesized data and the ImageNet dataset.

**Keywords:** deep neural network; information theory; local information geometry; feature extraction

#### **1. Introduction**

Due to the striking performance of deep learning in various application fields, deep neural networks (DNNs) have gained great attention in modern computer science. While it is commonly understood that the features extracted from the hidden layers of DNNs are "informative" for learning tasks, the mathematical meaning of informative features in DNNs is generally not clear. From a practical perspective, DNN models have achieved unprecedented performance in a variety of tasks, such as image recognition [1], language processing [2,3], and games [4,5]. However, the understanding of the feature extraction behind these models is relatively lacking, which poses challenges for their application in security-sensitive tasks, such as autonomous vehicles.

To address this problem, there have been numerous research efforts, including both experimental and theoretical studies [6]. The experimental studies usually focus on some empirical properties of the feature extracted by DNNs, by visualizing the feature [7] or testing its performance on specific training settings [8] or learning tasks [9]. Though such empirical methods have provided some intuitive interpretations, the performance can highly depend on the data and network architecture used. For example, while the feature visualization works well on convolutional neural networks, its application to other networks is typically less effective [10].

In contrast, theoretical studies focus on the analytical properties of the extracted feature or the learning process in DNNs. Due to the complicated structure of DNNs, existing studies were often restricted to networks of specific structures, e.g., networks with infinite width [11] or two-layer networks [12,13], to characterize the theoretical behaviors. However, the interpretation of the optimal feature remains unclear, which limits their further applications. To obtain better interpretability, tools and measures from information theory [14] have recently been applied to connect DNNs with general information processing problems [15]. For instance, the information bottleneck [16,17] employs the mutual information as the metric to quantify the informativeness of features in DNNs, and other information metrics, such as the Kullback–Leibler (KL) divergence [18] and the Wasserstein distance [19], are also used in different problems. However, there is still a disconnection between these information metrics and the performance objectives of the inference tasks that DNNs want to solve [20]. Therefore, it is, in general, difficult to match DNN learning with the optimization of a particular information metric.

**Citation:** Xu, X.; Huang, S.-L.; Zheng, L.; Wornell, G.W. An Information Theoretic Interpretation to Deep Neural Networks. *Entropy* **2022**, *24*, 135. https://doi.org/10.3390/e24010135

Academic Editor: Raúl Alcaraz

Received: 7 December 2021 Accepted: 12 January 2022 Published: 17 January 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

This paper aims to provide an information-theoretic interpretation of the feature extraction process in DNNs, to bridge the gap between practical deep learning implementations and information-theoretic characterizations. To this end, we first propose an information-theoretic feature selection framework, which establishes an information metric to measure the performance of each given feature in inference tasks. In addition, we demonstrate that the optimal features extracted by DNNs coincide with the solutions of the information-theoretic feature selection problem, which share the same performance metric. Therefore, our results give an explicit interpretation of the learning goal of the back-propagation (BackProp) and stochastic gradient descent (SGD) operations in deep learning [21], which also leads to a performance metric for evaluating the effectiveness of the extracted features. Finally, we validate our theoretical characterizations using numerical experiments on both synthesized data and the ImageNet [22] dataset for image classification.

#### **2. Preliminaries and Methods**

#### *2.1. Methodological Background*

The main method used in our development is local information geometry [23,24], which characterizes the local geometric properties of the probability distribution space. The local information geometric method is closely related to the conventional Hirschfeld–Gebelein–Rényi (HGR) maximal correlation [25–27] problem, which has attracted increasing interest in the information theory community [28–33], and has also been applied in data analysis [34] and privacy studies [35].

Specifically, we use the local information geometric method to construct and investigate an information-theoretic feature selection problem in Section 3.1, which leads to an information metric of features and also demonstrates an SVD (singular value decomposition) structure of the feature selection process. Following the same analysis framework, we characterize the optimal feature extracted by DNNs in Section 3.2, and demonstrate that the same SVD structure is shared by DNNs. Based on the established connection, we then propose an effectiveness measure for DNNs, with details presented in Section 3.3.

#### *2.2. Notations*

Throughout this paper, we use $X$, $\mathcal{X}$, $P_X$, and $x$ to represent a discrete random variable, its range, its probability distribution, and a value of $X$, respectively. In addition, for any function $s(X) \in \mathbb{R}^k$ of $X$, we use $\mu_s$ to denote the mean of $s(X)$, and "˜" to denote the centered variable with the mean subtracted, e.g., $\tilde{s}(X) \triangleq s(X) - \mu_s$. Moreover, we use $\|\cdot\|$ and $\|\cdot\|_{\mathrm{F}}$ to denote the $\ell_2$-norm and the Frobenius norm, respectively. All logarithms in our analyses are base $e$, i.e., natural.

#### *2.3. Local Information Geometry*

The following concepts from local information geometry will be useful in our development.

**Definition 1** ($\epsilon$-Neighborhood)**.** *Let* $\mathcal{P}^{\mathcal{X}}$ *denote the space of distributions on some finite alphabet* $\mathcal{X}$*, and let* $\mathrm{relint}(\mathcal{P}^{\mathcal{X}})$ *denote the subset of strictly positive distributions. For a given* $\epsilon > 0$*, the* $\epsilon$*-neighborhood of a distribution* $P_X \in \mathrm{relint}(\mathcal{P}^{\mathcal{X}})$ *is defined by the* $\chi^2$*-divergence as*

$$\mathcal{N}\_{\epsilon}^{\mathcal{X}}(P\_X) \triangleq \left\{ P \in \mathcal{P}^{\mathcal{X}} : \sum\_{x \in \mathcal{X}} \frac{\left(P(x) - P\_X(x)\right)^{2}}{P\_X(x)} \leq \epsilon^{2} \right\}.$$

**Definition 2** ($\epsilon$-Dependence)**.** *The random variables* $X, Y$ *are called* $\epsilon$*-dependent if* $P_{XY} \in \mathcal{N}_{\epsilon}^{\mathcal{X} \times \mathcal{Y}}(P_X P_Y)$*.*

**Definition 3** ($\epsilon$-Attribute)**.** *A random variable* $U$ *is called an* $\epsilon$*-attribute of* $X$ *if* $P_{X|U}(\cdot|u) \in \mathcal{N}_{\epsilon}^{\mathcal{X}}(P_X)$*, for all* $u \in \mathcal{U}$*.*
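To make Definitions 1 and 2 concrete, here is a minimal sketch that checks neighborhood membership and $\epsilon$-dependence for small discrete distributions stored as dictionaries. The function names and the toy joint distribution are assumptions made for illustration, and the reference distribution is assumed to be strictly positive, as the definitions require.

```python
def chi2_divergence(p, ref):
    """Sum over x of (P(x) - P_X(x))^2 / P_X(x); ref must be strictly positive."""
    return sum((p[x] - ref[x]) ** 2 / ref[x] for x in ref)


def in_eps_neighborhood(p, ref, eps):
    """Definition 1: P lies in the eps-neighborhood of the reference."""
    return chi2_divergence(p, ref) <= eps ** 2


def is_eps_dependent(p_xy, eps):
    """Definition 2: P_XY lies in the eps-neighborhood of P_X P_Y.

    p_xy maps (x, y) pairs to probabilities.
    """
    p_x, p_y = {}, {}
    for (x, y), mass in p_xy.items():
        p_x[x] = p_x.get(x, 0.0) + mass
        p_y[y] = p_y.get(y, 0.0) + mass
    independent = {(x, y): p_x[x] * p_y[y] for x in p_x for y in p_y}
    joint = {xy: p_xy.get(xy, 0.0) for xy in independent}
    return in_eps_neighborhood(joint, independent, eps)


# Toy joint distribution (assumed values) that is close to independent.
p_xy = {(0, 0): 0.26, (0, 1): 0.24, (1, 0): 0.24, (1, 1): 0.26}
weakly_dependent = is_eps_dependent(p_xy, eps=0.05)
```

Definition 3 can be checked the same way by calling `in_eps_neighborhood` on each conditional distribution $P_{X|U}(\cdot|u)$.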

We will focus on the small-$\epsilon$ regime, which we refer to as the *local analysis regime*. In addition, for any $P \in \mathcal{P}^{\mathcal{X}}$, we define the *information vector* $\phi$ and *feature function* $L(x)$ corresponding to $P$, with respect to a reference distribution $P_X \in \mathrm{relint}(\mathcal{P}^{\mathcal{X}})$, as

$$\phi(x) \triangleq \frac{P(x) - P\_X(x)}{\sqrt{P\_X(x)}}, \quad L(x) \triangleq \frac{\phi(x)}{\sqrt{P\_X(x)}}.\tag{1}$$

This gives a three-way correspondence $P \leftrightarrow \phi \leftrightarrow L$ for all distributions in $\mathcal{N}_{\epsilon}^{\mathcal{X}}(P_X)$, which will be useful in our derivations.
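The correspondence in (1) can be sketched directly. The helper names below are hypothetical; the sketch also checks two facts that follow from the definitions: the squared $\ell_2$-norm of the information vector equals the $\chi^2$-divergence used in Definition 1, and the feature function has zero mean under the reference distribution.

```python
import math


def information_vector(p, ref):
    """phi(x) = (P(x) - P_X(x)) / sqrt(P_X(x))."""
    return {x: (p[x] - ref[x]) / math.sqrt(ref[x]) for x in ref}


def feature_function(p, ref):
    """L(x) = phi(x) / sqrt(P_X(x))."""
    phi = information_vector(p, ref)
    return {x: phi[x] / math.sqrt(ref[x]) for x in ref}


def distribution_from_phi(phi, ref):
    """Inverse map: P(x) = P_X(x) + sqrt(P_X(x)) * phi(x)."""
    return {x: ref[x] + math.sqrt(ref[x]) * phi[x] for x in ref}


# Toy distributions (assumed values).
ref = {"a": 0.5, "b": 0.25, "c": 0.25}
p = {"a": 0.4, "b": 0.35, "c": 0.25}

phi = information_vector(p, ref)
L = feature_function(p, ref)
squared_norm = sum(v ** 2 for v in phi.values())   # chi-square divergence of p from ref
mean_of_L = sum(ref[x] * L[x] for x in ref)        # zero up to rounding
```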

#### *2.4. Modal Decomposition*

Given a pair of discrete random variables $X, Y$ with the joint distribution $P_{XY}(x, y)$, the $|\mathcal{Y}| \times |\mathcal{X}|$ matrix $\tilde{\mathbf{B}}$ is defined as

$$\tilde{\mathbf{B}}(y, x) \triangleq \frac{P\_{XY}(x, y) - P\_X(x)P\_Y(y)}{\sqrt{P\_X(x)P\_Y(y)}},\tag{2}$$

where $\tilde{\mathbf{B}}(y, x)$ is the $(y, x)$th entry of $\tilde{\mathbf{B}}$. The matrix $\tilde{\mathbf{B}}$ is referred to as the canonical dependence matrix (CDM) [24]. The SVD of $\tilde{\mathbf{B}}$ is referred to as the *modal decomposition* [24] of the joint distribution $P_{XY}$, which has the following property [18].

**Lemma 1.** *The SVD of* $\tilde{\mathbf{B}}$ *can be written as* $\tilde{\mathbf{B}} = \sum_{i=1}^{K} \sigma_i \psi_i^Y (\psi_i^X)^{\mathrm{T}}$*, where* $K \triangleq \min\{|\mathcal{X}|, |\mathcal{Y}|\}$*, and* $\sigma_i$ *denotes the* $i$*th singular value with the ordering* $1 \geq \sigma_1 \geq \cdots \geq \sigma_K = 0$*, and* $\psi_i^Y$ *and* $\psi_i^X$ *are the corresponding left and right singular vectors with* $\psi_K^X(x) = \sqrt{P_X(x)}$ *and* $\psi_K^Y(y) = \sqrt{P_Y(y)}$*.*
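Lemma 1 can be checked numerically on a small example. The sketch below builds the CDM of (2) with NumPy and inspects its singular values; the joint distribution is an arbitrary toy choice with strictly positive marginals, and the function name is illustrative.

```python
import numpy as np


def canonical_dependence_matrix(p_xy):
    """CDM of equation (2).

    p_xy has shape (|X|, |Y|); the result has shape (|Y|, |X|).
    """
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)
    return (p_xy.T - np.outer(p_y, p_x)) / np.sqrt(np.outer(p_y, p_x))


# Toy joint distribution (assumed values); rows index x, columns index y.
p_xy = np.array([[0.20, 0.05, 0.05],
                 [0.05, 0.20, 0.05],
                 [0.05, 0.05, 0.30]])

B = canonical_dependence_matrix(p_xy)
U, sigma, Vt = np.linalg.svd(B)

# sqrt(P_X) lies in the null space of the CDM, so the last singular value is zero.
residual = B @ np.sqrt(p_xy.sum(axis=1))
```

For this example the singular values are 7/12, 1/2, and 0, consistent with the ordering stated in the lemma.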

This SVD decomposes the feature spaces of $X, Y$ into maximally correlated features. To see that, consider the generalized canonical correlation analysis (CCA) problem:

$$\max\_{\begin{subarray}{c}\mathbb{E}[f\_i(X)] = \mathbb{E}[g\_i(Y)] = 0\\ \mathbb{E}\left[f\_i(X)f\_j(X)\right] = \mathbb{E}\left[g\_i(Y)g\_j(Y)\right] = \delta\_{ij} \end{subarray}} \sum\_{i=1}^{k} \mathbb{E}\left[f\_i(X)g\_i(Y)\right],\tag{3}$$

where $\delta_{ij}$ denotes the Kronecker delta function. It can be shown that for any $1 \leq k \leq K - 1$, the optimal features are $f_i(x) = \psi_i^X(x)/\sqrt{P_X(x)}$ and $g_i(y) = \psi_i^Y(y)/\sqrt{P_Y(y)}$, for $i = 1, \dots, k$, where $\psi_i^X(x)$ and $\psi_i^Y(y)$ are the $x$th and $y$th entries of $\psi_i^X$ and $\psi_i^Y$, respectively [18]. The special case $k = 1$ corresponds to the HGR maximal correlation [25–27], and the optimal features can be computed from the ACE (Alternating Conditional Expectation) algorithm [36].
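The construction of the optimal features from the singular vectors can be sketched as follows. This is a direct SVD-based illustration, not the ACE algorithm; the function name and the toy distribution are assumptions. The sketch also evaluates the constraints of (3) and the achieved correlations, which should equal the leading singular values when those are strictly positive.

```python
import numpy as np


def maximal_correlation_features(p_xy, k):
    """Top-k features f (|X| x k) and g (|Y| x k) from the modal decomposition.

    Assumes strictly positive marginals and k strictly positive singular values.
    """
    p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
    B = (p_xy.T - np.outer(p_y, p_x)) / np.sqrt(np.outer(p_y, p_x))
    U, sigma, Vt = np.linalg.svd(B)
    f = Vt[:k].T / np.sqrt(p_x)[:, None]   # f_i(x) = psi_i^X(x) / sqrt(P_X(x))
    g = U[:, :k] / np.sqrt(p_y)[:, None]   # g_i(y) = psi_i^Y(y) / sqrt(P_Y(y))
    return f, g, sigma[:k]


# Toy joint distribution (assumed values); rows index x, columns index y.
p_xy = np.array([[0.20, 0.05, 0.05],
                 [0.05, 0.20, 0.05],
                 [0.05, 0.05, 0.30]])
f, g, sigma = maximal_correlation_features(p_xy, k=2)

p_x = p_xy.sum(axis=1)
mean_f = p_x @ f                               # E[f_i(X)]
cov_f = f.T @ (p_x[:, None] * f)               # E[f_i(X) f_j(X)]
corr = np.einsum("xy,xi,yi->i", p_xy, f, g)    # E[f_i(X) g_i(Y)]
```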

#### *2.5. Deep Neural Networks*

The architecture of deep neural networks (under log-loss) can be depicted as in Figure 1, where $X$ is the input data, e.g., images, audio, or natural language. Moreover, $Y$ is the target to predict, which can represent a discrete label in classification tasks, or the target natural language in machine translation [37]. Specifically, for given data $X$, the network applies a (trainable) feature mapping to generate the $k$-dimensional feature $s(x) = (s_1, \dots, s_k)^{\mathrm{T}}$. In practice, the feature mapping block (depicted as the gray block in Figure 1) is typically composed of hundreds or thousands of functional components (e.g., residual blocks [1]) with different types of layers, and may contain recurrent structures, e.g., LSTM (Long Short-Term Memory) [38]. In general, the internal structure of the feature mapping can take many different designs, depending on the learning task.

**Figure 1.** A deep neural network that uses data $X$ to predict $Y$. All hidden layers together map the input data $X$ to the $k$-dimensional feature $s(x) = (s_1, \dots, s_k)^{\mathrm{T}}$. Then, the probabilistic prediction $\tilde{P}_{Y|X}$ of $Y$ is computed from $s(x)$, $v(y)$, and $b(y)$, where $v$ and $b$ are the weights and bias in the last layer.

After obtaining the feature $s(X)$, $Y$ is predicted by the probability distribution $\tilde{P}^{(s,v,b)}_{Y|X}$ of the form

$$\tilde{P}\_{Y|X}^{(s,v,b)}(y|x) \triangleq \frac{e^{v^{\mathrm{T}}(y)s(x) + b(y)}}{\sum\_{y' \in \mathcal{Y}} e^{v^{\mathrm{T}}(y')s(x) + b(y')}}, \tag{4}$$

which is obtained by applying the softmax function [39] to $v^{\mathrm{T}}(y)s(x) + b(y)$, where $v(\cdot)$ and $b(\cdot)$ are the weights and biases in the last layer, respectively. This is equivalent to the common practice of denoting the weights and biases by the matrix $[v(1), \dots, v(|\mathcal{Y}|)]^{\mathrm{T}}$ and the vector $[b(1), \dots, b(|\mathcal{Y}|)]^{\mathrm{T}}$, respectively; however, as we will show later, expressing the weights $v$ and biases $b$ as mappings of $y$ better illustrates their roles in feature selection. We will use $\tilde{P}_{Y|X}$ to refer to $\tilde{P}^{(s,v,b)}_{Y|X}$ when there is no ambiguity.

Then, for a given training set of labeled samples $(x_i, y_i)$, for $i = 1, \dots, N$, all the parameters in the network, including $v$, $b$, as well as those in the feature mapping block, are chosen to maximize the log-likelihood function (or, equivalently, minimize the log-loss)

$$\frac{1}{N} \sum\_{i=1}^{N} \log \tilde{P}\_{Y|X}(y\_i|x\_i). \tag{5}$$

The procedure of choosing such parameters is called training the network, which can be performed by stochastic gradient descent (SGD) or its variants [21]. With a trained network, the label $\hat{y}$ for a new data sample $x$ can be predicted by maximum a posteriori (MAP) estimation, i.e., $\hat{y} = \arg\max_{y \in \mathcal{Y}} \tilde{P}_{Y|X}(y|x)$. Specifically, when we make predictions for samples in a test dataset, the proportion of samples with a correct prediction (i.e., $\hat{y} = y$) over all samples is called the test accuracy.
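The prediction rule (4), the objective (5), and the MAP decision can be sketched in a few lines of plain Python. The weights $v$ and biases $b$ are stored as mappings of $y$, mirroring the notation above; the toy feature map, parameter values, and function names are all assumptions made for illustration, and no training is performed.

```python
import math


def predict_proba(s_x, v, b):
    """Equation (4): softmax over y of v(y)^T s(x) + b(y)."""
    scores = {y: sum(vi * si for vi, si in zip(v[y], s_x)) + b[y] for y in v}
    top = max(scores.values())
    norm = sum(math.exp(t - top) for t in scores.values())
    return {y: math.exp(t - top) / norm for y, t in scores.items()}


def average_log_likelihood(samples, s, v, b):
    """Equation (5): empirical average of log P(y_i | x_i)."""
    return sum(math.log(predict_proba(s(x), v, b)[y]) for x, y in samples) / len(samples)


def map_predict(x, s, v, b):
    """MAP estimate: the label with the largest predicted probability."""
    probs = predict_proba(s(x), v, b)
    return max(probs, key=probs.get)


def accuracy(samples, s, v, b):
    """Fraction of samples whose MAP prediction matches the label."""
    return sum(map_predict(x, s, v, b) == y for x, y in samples) / len(samples)


# Toy setup (assumed values): scalar input, 2-dimensional feature, two labels.
s = lambda x: [x, 1.0 - x]
v = {0: [0.0, 1.0], 1: [1.0, 0.0]}
b = {0: 0.0, 1: 0.0}
samples = [(1.0, 1), (0.0, 0), (1.0, 0)]

objective = average_log_likelihood(samples, s, v, b)
acc = accuracy(samples, s, v, b)
```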

#### **3. Results**

#### *3.1. Information-Theoretic Feature Selection*

Suppose that, given random variables $X, Y$ with joint distribution $P_{XY}$, we want to infer an attribute $V$ of $Y$ from observed i.i.d. samples $x_1, \dots, x_n$ of $X$. When the statistical model $P_{X|V}$ is known, the optimal decision rule is the log-likelihood ratio test, where the log-likelihood function can be viewed as the optimal feature for inference. However, in many practical situations [18], it is hard to identify the model of the targeted attribute, and it is necessary to select low-dimensional informative features of $X$ for inference tasks before knowing the model. An information-theoretic formulation of such a feature selection problem is the universal feature selection problem [24], which we formalize as follows.

To begin, for an attribute $V$, we refer to $\mathcal{C}_{\mathcal{Y}} = \left\{ \mathcal{V}, \{P_V(v), v \in \mathcal{V}\}, \{\phi_v^{Y|V}, v \in \mathcal{V}\} \right\}$ as the *configuration* of $V$, where $\phi_v^{Y|V} \leftrightarrow P_{Y|V}(\cdot|v)$ is the information vector specifying the corresponding conditional distribution $P_{Y|V}(\cdot|v)$. The configuration of $V$ models the statistical correlation between $V$ and $Y$. In the sequel, we focus on the local analysis regime, for which we assume that all the attributes $V$ of interest are $\epsilon$-attributes of $Y$. As a result, the corresponding configuration satisfies $\|\phi_v^{Y|V}\| \leq \epsilon$, for all $v \in \mathcal{V}$. We refer to such configurations as $\epsilon$*-configurations*. The configuration of $V$ is unknown in advance but assumed to be generated from a *rotational invariant ensemble (RIE)*.

**Definition 4** (RIE)**.** *Two configurations* $\mathcal{C}_{\mathcal{Y}}$ *and* $\tilde{\mathcal{C}}_{\mathcal{Y}}$ *defined as*

$$\begin{aligned} \mathcal{C}\_{\mathcal{Y}} & \triangleq \left\{ \mathcal{V}, \{ P\_{V}(v), \ v \in \mathcal{V} \}, \; \{ \phi\_{v}^{Y|V}, v \in \mathcal{V} \} \right\}, \\\\ \tilde{\mathcal{C}}\_{\mathcal{Y}} & \triangleq \left\{ \mathcal{V}, \{ P\_{V}(v), \ v \in \mathcal{V} \}, \; \{ \tilde{\phi}\_{v}^{Y|V}, v \in \mathcal{V} \} \right\} \end{aligned}$$

*are called rotationally equivalent, if there exists a unitary matrix* $\mathbf{Q}$ *such that* $\tilde{\phi}_v^{Y|V} = \mathbf{Q}\,\phi_v^{Y|V}$*, for all* $v \in \mathcal{V}$*. Moreover, a probability measure defined on a set of configurations is called an RIE, if all rotationally equivalent configurations have the same measure.*

The RIE can be interpreted as assigning a uniform measure to the attributes with the same level of distinguishability. To infer the attribute $V$, we construct a $k$-dimensional feature vector $h^k = (h_1, \dots, h_k)$, for some $1 \leq k \leq K - 1$, of the form

$$h\_i = \frac{1}{n} \sum\_{l=1}^{n} f\_i(\mathbf{x}\_l), \quad i = 1, \dots, k,\tag{6}$$

for some choices of feature functions $f_i$. Our goal is to determine the $f_i$ such that the optimal decision rule based on $h^k$ achieves the smallest possible error probability, where the performance is averaged over the possible $\mathcal{C}_{\mathcal{Y}}$ generated from an RIE. In turn, we denote $\xi_i^X \leftrightarrow f_i$ as the corresponding information vector, and define the matrix $\boldsymbol{\Xi}^X \triangleq [\xi_1^X \ \cdots \ \xi_k^X]$.

**Theorem 1** (Universal Feature Selection)**.** *For* $v, v' \in \mathcal{V}$*, let* $E_{h^k}(v, v')$ *be the error exponent associated with the pairwise error probability of distinguishing* $v$ *and* $v'$ *based on* $h^k$*. Then the expected error exponent over a given RIE defined on the set of* $\epsilon$*-configurations is given by*

$$\mathbb{E}\left[E\_{h^k}(v, v')\right] = \frac{C\_0}{2} \cdot \left\| \tilde{\mathbf{B}} \boldsymbol{\Xi}^X \left( \left(\boldsymbol{\Xi}^X\right)^\mathrm{T} \boldsymbol{\Xi}^X \right)^{-\frac{1}{2}} \right\|\_{\mathrm{F}}^2 + o(\epsilon^2), \tag{7}$$

*where* $C_0 \triangleq \frac{1}{4|\mathcal{Y}|} \cdot \mathbb{E}\left[ \left\| \phi_v^{Y|V} - \phi_{v'}^{Y|V} \right\|^2 \right]$ *is independent of the choices of the* $f_i$*, and the expectations* $\mathbb{E}[\cdot]$ *are taken over this RIE.*

**Proof.** See Appendix A.

As a result of (7), designing the $\xi_i^X$ as the singular vectors $\psi_i^X$ of $\tilde{\mathbf{B}}$, for $i = 1, \dots, k$, optimizes (7) for all RIEs, pairs of $(v, v')$, and $\epsilon$-configurations. Thus, the feature functions corresponding to $\psi_i^X$ are *universally optimal* for inferring the unknown attribute $V$. Moreover, (7) naturally leads to an information metric $\left\| \tilde{\mathbf{B}} \boldsymbol{\Xi}^X \left( (\boldsymbol{\Xi}^X)^{\mathrm{T}} \boldsymbol{\Xi}^X \right)^{-\frac{1}{2}} \right\|_{\mathrm{F}}^2$ for any feature $\boldsymbol{\Xi}^X$ of $X$, measured by projecting the normalized $\boldsymbol{\Xi}^X$ through a linear projection $\tilde{\mathbf{B}}$. This information metric quantifies how informative a feature of $X$ is when solving inference problems with respect to $Y$, and it is maximized when the features are designed from the singular vectors of $\tilde{\mathbf{B}}$. Thus, we can interpret universal feature selection as solving for the most informative features for data inference via the SVD of $\tilde{\mathbf{B}}$, which also coincides with the maximally correlated features in (3). Later, we will show that the feature selection in DNNs shares the same information metric as universal feature selection in the local analysis regime.
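The information metric that appears in (7) is straightforward to evaluate numerically. The sketch below computes it for a toy CDM and compares the leading right singular vector with another zero-mean feature direction; the function name and the example values are assumptions, and the inverse square root is taken through an eigendecomposition of the Gram matrix.

```python
import numpy as np


def information_metric(B, Xi):
    """|| B Xi (Xi^T Xi)^{-1/2} ||_F^2 for a full-column-rank |X| x k matrix Xi."""
    evals, evecs = np.linalg.eigh(Xi.T @ Xi)
    inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    return np.linalg.norm(B @ Xi @ inv_sqrt, "fro") ** 2


# Toy joint distribution (assumed values); rows index x, columns index y.
p_xy = np.array([[0.20, 0.05, 0.05],
                 [0.05, 0.20, 0.05],
                 [0.05, 0.05, 0.30]])
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
B = (p_xy.T - np.outer(p_y, p_x)) / np.sqrt(np.outer(p_y, p_x))

_, sigma, Vt = np.linalg.svd(B)
best = information_metric(B, Vt[:1].T)                          # leading singular vector
other = information_metric(B, np.array([[1.0], [-1.0], [0.0]])) # another feature direction
```

Because of the normalization by $(\boldsymbol{\Xi}^{X\mathrm{T}} \boldsymbol{\Xi}^X)^{-1/2}$, the metric depends only on the subspace spanned by the columns of $\boldsymbol{\Xi}^X$, so rescaling a feature leaves it unchanged.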

#### *3.2. Feature Extraction in Deep Neural Networks*

#### 3.2.1. Network with Ideal Expressive Power

For convenience of analysis, we first consider the ideal case where the neural network can express any feature mapping $s(\cdot)$ as desired. While this assumption can be rather strong, the existence of such ideal networks is guaranteed by the universal approximation theorem [40]. In addition, one goal of practical network designs is to approximate the ideal networks and obtain sufficient expressive power. For such networks, we will show that when $X, Y$ are $\epsilon$-dependent, the extracted feature $s(x)$ and weights $v(y)$ coincide with the solutions of the universal feature selection.

To begin, we use $P_{XY}$ to denote the joint empirical distribution of the labeled samples $(x_i, y_i)$, $i = 1, \dots, N$, and $P_X$, $P_Y$ to denote the corresponding marginal distributions. Then, the objective function of (5) is the empirical average of the log-likelihood function

$$\frac{1}{N} \sum\_{i=1}^{N} \log \tilde{P}\_{Y|X}(y\_i|x\_i) = \mathbb{E}\_{P\_{XY}} \left[ \log \tilde{P}\_{Y|X}(Y|X) \right].$$

Therefore, maximizing this empirical average is equivalent to minimizing the KL divergence:

$$(s^\*, v^\*, b^\*) = \underset{(s, v, b)}{\arg\min}\ D(P\_{XY} \| P\_X \tilde{P}\_{Y|X}^{(s, v, b)}).\tag{8}$$

This can be interpreted as finding the best fit to the empirical joint distribution $P_{XY}$ among distributions of the form $P_X \tilde{P}^{(s,v,b)}_{Y|X}$. In our development, it is more convenient to denote the bias by $d(y) = b(y) - \log P_Y(y)$, for $y \in \mathcal{Y}$. Then, the following lemma illustrates the explicit constraint on the problem (8) in the local analysis regime.

**Lemma 2.** *If* $X, Y$ *are* $\epsilon$*-dependent, then the optimal* $v, d$ *for* (8) *satisfy*

$$|\tilde{v}^{\mathrm{T}}(y)s(x) + \tilde{d}(y)| = O(\epsilon), \quad \text{for all } x \in \mathcal{X}, \; y \in \mathcal{Y}.\tag{9}$$

**Proof.** See Appendix B.
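The equivalence between maximizing the empirical log-likelihood (5) and minimizing the KL divergence in (8) rests on the identity $\mathbb{E}_{P_{XY}}[\log \tilde{P}_{Y|X}(Y|X)] = -D(P_{XY} \| P_X \tilde{P}_{Y|X}) - H(Y|X)$, where the conditional entropy $H(Y|X)$ does not depend on the model. The following sketch checks this identity numerically; the arrays are arbitrary toy values, the function names are illustrative, and all probabilities are assumed strictly positive so that the logarithms are finite.

```python
import numpy as np


def expected_log_likelihood(p_xy, q):
    """E_{P_XY}[log Q(Y|X)] for a model conditional q of shape (|X|, |Y|)."""
    return float(np.sum(p_xy * np.log(q)))


def kl_to_model(p_xy, q):
    """D(P_XY || P_X Q), with P_X the x-marginal of p_xy."""
    p_x = p_xy.sum(axis=1, keepdims=True)
    return float(np.sum(p_xy * np.log(p_xy / (p_x * q))))


def conditional_entropy(p_xy):
    """H(Y|X) under P_XY, in nats."""
    p_x = p_xy.sum(axis=1, keepdims=True)
    return float(-np.sum(p_xy * np.log(p_xy / p_x)))


# Toy values (assumed): rows index x, columns index y; rows of q sum to one.
p_xy = np.array([[0.3, 0.2],
                 [0.1, 0.4]])
q = np.array([[0.6, 0.4],
              [0.5, 0.5]])

gap = expected_log_likelihood(p_xy, q) + kl_to_model(p_xy, q) + conditional_entropy(p_xy)
```

The quantity `gap` vanishes up to rounding for any valid `q`, so the model that maximizes the log-likelihood is the one that minimizes the divergence.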

In turn, we take (9) as the constraint for solving the problem (8) in the local analysis regime. Moreover, we define the information vectors for the zero-mean vectors $\tilde{s}$, $\tilde{v}$ as $\xi^X(x) = \sqrt{P_X(x)}\,\tilde{s}(x)$ and $\xi^Y(y) = \sqrt{P_Y(y)}\,\tilde{v}(y)$, and define the matrices

$$\Xi^Y \triangleq \begin{bmatrix} \xi^Y(1) & \cdots & \xi^Y(|\mathcal{Y}|) \end{bmatrix}^T, \quad \Xi^X \triangleq \begin{bmatrix} \xi^X(1) & \cdots & \xi^X(|\mathcal{X}|) \end{bmatrix}^T.$$

**Lemma 3.** *The KL divergence* (8) *in the local analysis regime* (9) *can be expressed as*

$$D(P\_{XY} \,\|\, P\_X \tilde{P}\_{Y|X}^{(s,v,b)}) = \frac{1}{2} \left\| \tilde{\mathbf{B}} - \boldsymbol{\Xi}^Y \left( \boldsymbol{\Xi}^X \right)^{\mathrm{T}} \right\|\_{\mathrm{F}}^2 + \frac{1}{2} \eta^{(v,b)}(s) + o(\epsilon^2),\tag{10}$$

*where*

$$\eta^{(v,b)}(s) \triangleq \mathbb{E}\_{P\_Y} \left[ \left( \mu\_s^{\mathrm{T}} \tilde{v}(Y) + \tilde{d}(Y) \right)^2 \right].$$

**Proof.** See Appendix C.
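As a numerical sanity check of the local approximation (10), the sketch below compares the exact KL divergence with the quadratic expression on a small synthetic problem. It assumes the softmax model *P̃*(*y*|*x*) ∝ exp(*v*<sup>T</sup>(*y*)*s*(*x*) + *b*(*y*)) from (5) and a matrix **B̃** with entries (*P<sub>XY</sub>*(*x*, *y*) − *P<sub>X</sub>*(*x*)*P<sub>Y</sub>*(*y*))/√(*P<sub>X</sub>*(*x*)*P<sub>Y</sub>*(*y*)); both are assumptions here, taken to be the definitions given earlier in the paper. The rank-one perturbation used to build an *ϵ*-dependent joint distribution and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
nx, ny, k, eps = 8, 6, 2, 1e-3

# Marginals, and an eps-dependent joint P_XY = P_X P_Y (1 + eps * a(x) b(y)).
p_x = 1.0 + rng.random(nx)
p_x /= p_x.sum()
p_y = 1.0 + rng.random(ny)
p_y /= p_y.sum()
a = rng.normal(size=nx)
a -= p_x @ a                                  # zero mean under P_X
b = rng.normal(size=ny)
b -= p_y @ b                                  # zero mean under P_Y
p_xy = np.outer(p_x, p_y) * (1.0 + eps * np.outer(a, b))    # shape (nx, ny)

# Feature of order 1; weights and bias term d of order eps, so (9) holds.
s = rng.normal(size=(nx, k))
v = eps * rng.normal(size=(ny, k))
d = eps * rng.normal(size=ny)

# Exact KL divergence for the softmax model with bias b(y) = d(y) + log P_Y(y).
logits = s @ v.T + d + np.log(p_y)            # shape (nx, ny)
p_model = np.exp(logits)
p_model /= p_model.sum(axis=1, keepdims=True)
kl_exact = np.sum(p_xy * np.log(p_xy / (p_x[:, None] * p_model)))

# Local approximation (10).
mu_s = p_x @ s
s_t = s - mu_s                                # centered feature
v_t = v - p_y @ v                             # centered weights
d_t = d - p_y @ d                             # centered bias term
B = (p_xy - np.outer(p_x, p_y)).T / np.sqrt(np.outer(p_y, p_x))   # (ny, nx)
Xi_x = np.sqrt(p_x)[:, None] * s_t
Xi_y = np.sqrt(p_y)[:, None] * v_t
eta = p_y @ (v_t @ mu_s + d_t) ** 2
kl_local = 0.5 * np.linalg.norm(B - Xi_y @ Xi_x.T, "fro") ** 2 + 0.5 * eta

print(kl_exact, kl_local)   # both of order eps**2, differing by higher-order terms
```

Shrinking `eps` further should shrink the relative gap between the two values, which is what the *o*(*ϵ*<sup>2</sup>) term in (10) expresses.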

Lemma 3 reveals key insights for feature selection in neural networks. To see this, we consider the following two learning problems: learning the optimal weight *v* for given *s* and learning the optimal feature *s* for given *v*.

For the case that *s* is fixed, we can optimize (10) with **Ξ***<sup>X</sup>* fixed and obtain the following optimal weights:

**Theorem 2.** *For fixed* **Ξ***<sup>X</sup> and μ<sub>s</sub>, the optimal* **Ξ**<sup>*Y*∗</sup> *to minimize* (10) *is given by*

$$
\boldsymbol{\Xi}^{Y\*} = \tilde{\mathbf{B}} \boldsymbol{\Xi}^{X} \left( \left( \boldsymbol{\Xi}^{X} \right)^{\mathrm{T}} \boldsymbol{\Xi}^{X} \right)^{-1}, \tag{11}
$$

*and the optimal weights ṽ*<sup>∗</sup> *and bias d̃*<sup>∗</sup> *are*

$$\tilde{v}^\*(y) = \mathbb{E}\_{P\_{X|Y}} \left[ \boldsymbol{\Lambda}\_{\tilde{s}(X)}^{-1} \tilde{s}(X) \; \middle| \; Y = y \right], \quad \tilde{d}^\*(y) = -\mu\_s^{\mathrm{T}} \, \tilde{v}^\*(y), \tag{12}$$

*where* **Λ**<sub>*s̃*(*X*)</sub> *denotes the covariance matrix of s̃*(*X*)*.*

**Proof.** See Appendix D.

Specifically, when *s*(*x*) = *x*, Theorem 2 gives the optimal weights for softmax regression. Note that Equation (11) can be viewed as a projection of the input feature *s̃*(*x*) to a feature *v*(*y*) computable from the value of *y*, which is the feature most correlated with *s̃*(*x*). The solution is obtained by left-multiplying by the matrix **B̃**, an operation we refer to as *forward feature projection*.
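The two forms of the forward feature projection, the matrix expression (11) and the conditional expectation (12), can be compared directly on a small discrete example. The sketch below assumes **B̃** has entries (*P<sub>XY</sub>* − *P<sub>X</sub>P<sub>Y</sub>*)/√(*P<sub>X</sub>P<sub>Y</sub>*), as we take it to be defined earlier in the paper; the random joint distribution, the random feature table, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
nx, ny, k = 8, 6, 3

p_xy = rng.random((nx, ny)) + 0.1
p_xy /= p_xy.sum()                          # joint distribution, shape (nx, ny)
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

s = rng.normal(size=(nx, k))                # an arbitrary fixed feature s(x)
s_t = s - p_x @ s                           # centered feature

# Matrix form (11): Xi^{Y*} = B Xi^X ((Xi^X)^T Xi^X)^{-1}.
B = (p_xy - np.outer(p_x, p_y)).T / np.sqrt(np.outer(p_y, p_x))   # (ny, nx)
Xi_x = np.sqrt(p_x)[:, None] * s_t
Xi_y_star = B @ Xi_x @ np.linalg.inv(Xi_x.T @ Xi_x)
v_from_matrix = Xi_y_star / np.sqrt(p_y)[:, None]   # undo the sqrt(P_Y) scaling

# Conditional-expectation form (12): v*(y) = E[Lambda^{-1} s~(X) | Y = y].
cov_s = (s_t * p_x[:, None]).T @ s_t                # covariance of s~(X)
p_x_given_y = (p_xy / p_y).T                        # row y holds P_{X|Y}(.|y)
v_from_expectation = p_x_given_y @ s_t @ np.linalg.inv(cov_s)

print(np.max(np.abs(v_from_matrix - v_from_expectation)))
```

The agreement is exact up to rounding, because (**Ξ***<sup>X</sup>*)<sup>T</sup>**Ξ***<sup>X</sup>* equals the covariance matrix of *s̃*(*X*) and row *y* of **B̃Ξ***<sup>X</sup>* equals √(*P<sub>Y</sub>*(*y*)) times the conditional mean of *s̃*(*X*) given *Y* = *y*.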

**Remark 1.** *While we assume the continuous input s*(*x*) *is a function of a discrete variable X, we only need the labeled samples between s and Y to compute the weights and bias from the conditional expectation* (12)*, and the correlation between X and s is irrelevant. Thus, our analysis for weights and bias can be applied to continuous input networks by simply ignoring X and taking s as the real input to the network.*

We then consider the "backward feature projection" problem, which attempts to find the informative feature *s*<sup>∗</sup>(*X*) that minimizes the loss (10) for given weights and bias. In particular, we can show that the solution of this backward feature projection is precisely symmetric to the forward one.

**Theorem 3.** *For fixed* **Ξ***<sup>Y</sup> and d̃, the optimal* **Ξ**<sup>*X*∗</sup> *to minimize* (10) *is given by*

$$
\boldsymbol{\Xi}^{X\*} = \tilde{\mathbf{B}}^{\mathrm{T}} \boldsymbol{\Xi}^{Y} \left( \left( \boldsymbol{\Xi}^{Y} \right)^{\mathrm{T}} \boldsymbol{\Xi}^{Y} \right)^{-1}, \tag{13}
$$

*and the optimal feature function s*<sup>∗</sup>*, which is decomposed into s̃*<sup>∗</sup> *and μ*<sup>∗</sup><sub>*s*</sub>*, is given by*

$$\begin{aligned} \tilde{s}^\*(x) &= \mathbb{E}\_{P\_{Y|X}} \left[ \boldsymbol{\Lambda}^{-1}\_{\tilde{v}(Y)} \, \tilde{v}(Y) \,\middle|\, X = x \right], \\ \mu^\*\_s &= -\boldsymbol{\Lambda}^{-1}\_{\tilde{v}(Y)} \, \mathbb{E}\_{P\_{Y}} \left[ \tilde{v}(Y) \, \tilde{d}(Y) \right], \end{aligned} \tag{14}$$

*where* **Λ**<sub>*ṽ*(*Y*)</sub> *denotes the covariance matrix of ṽ*(*Y*)*.*

**Proof.** See Appendix D.

Finally, when both *s* and (*v*, *b*) (and hence **Ξ***<sup>X</sup>*, **Ξ***<sup>Y</sup>*, *d*) can be designed, the optimal (**Ξ***<sup>Y</sup>*, **Ξ***<sup>X</sup>*) corresponds to the low-rank factorization of **B̃**, and the solutions coincide with the universal feature selection.

**Theorem 4.** *The optimal solutions for weights and bias to minimize* (10) *are given by d̃*(*y*) = −*μ<sub>s</sub>*<sup>T</sup> *ṽ*(*y*)*, and* (**Ξ***<sup>Y</sup>*, **Ξ***<sup>X</sup>*)<sup>∗</sup> *chosen as the left and right singular vectors of* **B̃** *associated with its k largest singular values.*

**Proof.** See Appendix E.

Therefore, we conclude that the learning of neural networks, when both *s* and (*v*, *b*) are designable, is to extract the most correlated aspects of the input data *X* and the label *Y* that are informative features for data inferences from universal feature selection.

In the practical learning process of a DNN, BackProp updates the weights of the softmax layer and those of the previous layer(s) in an iterative manner. As we have illustrated in Lemma 3, such iterative updates will converge to the same solution as alternating between the forward feature projection (11) and the backward feature projection (13), which is precisely the power method for computing the SVD of **B̃** [41], also known as the Alternating Conditional Expectation (ACE) algorithm [36].
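The alternation between (11) and (13) can be illustrated for *k* = 1 with a few lines of NumPy. This is only a sketch of the power-method view, not the BackProp training itself; the random joint distribution, the iteration count, and the variable names are illustrative, and **B̃** is assumed to have entries (*P<sub>XY</sub>* − *P<sub>X</sub>P<sub>Y</sub>*)/√(*P<sub>X</sub>P<sub>Y</sub>*).

```python
import numpy as np

rng = np.random.default_rng(2)
nx, ny = 8, 6
p_xy = rng.random((nx, ny)) + 0.1
p_xy /= p_xy.sum()
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
B = (p_xy - np.outer(p_x, p_y)).T / np.sqrt(np.outer(p_y, p_x))   # (ny, nx)

xi_x = rng.normal(size=(nx, 1))            # random initial Xi^X with k = 1
for _ in range(20000):
    xi_y = B @ xi_x / (xi_x.T @ xi_x)      # forward feature projection (11)
    xi_x = B.T @ xi_y / (xi_y.T @ xi_y)    # backward feature projection (13)

# Best rank-one approximation of B from the SVD, for comparison.
U, sing, Vt = np.linalg.svd(B)
rank_one_svd = sing[0] * np.outer(U[:, 0], Vt[0])
rank_one_alt = xi_y @ xi_x.T

print(np.max(np.abs(rank_one_alt - rank_one_svd)))
```

Only the product **Ξ***<sup>Y</sup>*(**Ξ***<sup>X</sup>*)<sup>T</sup> is compared, since the individual factors are determined only up to a common scaling. The convergence speed depends on the gap between the two largest singular values.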

**Remark 2.** *From Theorem 4, for a neural network with sufficient expressive power, the trained feature depends only on the distribution of input data rather than the training process. It is worth mentioning that this result does not contradict the practice that trained weights in hidden layers can be different during each training run. In fact, due to the over-parameterized nature of practical network designs, there exist multiple choices of weights in hidden layers to express the same optimal feature s*(*x*)*.*

#### 3.2.2. Network with Restricted Expressive Power

The analysis of the previous section has considered neural networks with ideal expressive power, where the feature *s*(*X*) can be selected as any desired function. In general, however, the form of feature functions that can be generated is often limited by the network structure. In the following, we consider networks with restricted expressive power to characterize the impacts of network structure on the extracted feature.

For illustration, we consider the neural network with a hidden layer of *k* nodes, and a zero-mean continuous input *t* = [*t*<sub>1</sub> ··· *t<sub>m</sub>*]<sup>T</sup> ∈ ℝ<sup>*m*</sup> to this hidden layer, where *t* is assumed to be a function *t*(*x*) of some discrete variable *X*. Our goal is to analyze the weights and bias in this layer with labeled samples (*t*(*x<sub>i</sub>*), *y<sub>i</sub>*). Assuming that the activation function of the hidden layer is a generally smooth function *σ*(·), the output *s<sub>z</sub>*(*X*) of the *z*-th hidden node is

$$s\_z(x) = \sigma \left( w^{\mathrm{T}}(z)t(x) + c(z) \right), \quad \text{for } z = 1, \dots, k, \ x \in \mathcal{X}, \tag{15}$$

where *w*(*z*) ∈ ℝ<sup>*m*</sup> and *c*(*z*) ∈ ℝ are the weights and bias from the input layer to the hidden layer, as shown in Figure 2. We denote *s* = [*s*<sub>1</sub> ··· *s<sub>k</sub>*]<sup>T</sup> as the input vector to the output classification layer.

**Figure 2.** A multi-layer neural network, where the expressive power of the feature mapping *s*(·) is restricted by the hidden representation *t*. All hidden layers previous to *t* are fixed, represented by the "pre-processing" module.

To interpret the feature selection in hidden layers, we fix (*v*(*y*), *b*(*y*)) at the output layer and consider the problem of designing (*w*(*z*), *c*(*z*)) to minimize the loss function (8) at the output layer. Ideally, we should have picked *w*(*z*) and *c*(*z*) to generate *s*(*x*) to match *s*∗(*x*) from (14), which minimizes the loss. However, here we have the constraint that *s*(*x*) must take the form of (15) and, intuitively, the network should select *w*(*z*), *c*(*z*) so that *s*(*x*) is close to *s*∗(*x*). Our goal is to quantify the notion of such closeness.

To develop insights on feature selection in hidden layers, we again focus on the local analysis regime, where the weights and bias are assumed to satisfy the local constraint

$$\left| \tilde{v}^{\mathrm{T}}(y)s(x) + \tilde{d}(y) \right| = O(\epsilon), \quad \left| w^{\mathrm{T}}(z)\tilde{t}(x) \right| = O(\epsilon), \quad \forall x, y, z. \tag{16}$$

Then, since *t* is zero-mean, we can express (15) as

$$s\_z(x) = \sigma\left(w^{\mathrm{T}}(z)t(x) + c(z)\right) = w^{\mathrm{T}}(z)\tilde{t}(x) \cdot \sigma'(c(z)) + \sigma(c(z)) + o(\epsilon).\tag{17}$$
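The expansion (17) is a first-order Taylor expansion of the activation around the bias *c*(*z*). A minimal numerical illustration for the sigmoid activation follows; the helper names `sigmoid` and `linearized_node`, the bias value, and the random vectors are illustrative assumptions.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def linearized_node(w, t, c):
    """First-order expansion (17) of sigmoid(w @ t + c) around c."""
    slope = sigmoid(c) * (1.0 - sigmoid(c))     # derivative of the sigmoid at c
    return (w @ t) * slope + sigmoid(c)

rng = np.random.default_rng(3)
t = rng.normal(size=4)            # hidden representation t(x) for one value of x
direction = rng.normal(size=4)    # fixed direction for the weight vector
c = 0.3

errors = []
for eps in (1e-2, 1e-3, 1e-4):
    w = eps * direction           # weights scaled so that |w^T t| = O(eps)
    exact = sigmoid(w @ t + c)
    errors.append(abs(exact - linearized_node(w, t, c)))

print(errors)   # the error shrinks roughly like eps**2
```

The approximation error decays faster than *ϵ*, consistent with the *o*(*ϵ*) remainder in (17).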

Moreover, we define a matrix **B̃**<sub>1</sub> with the (*z*, *x*)th entry

$$\tilde{\mathbf{B}}\_1(z, x) = \frac{\sqrt{P\_X(x)}}{\sigma'(c(z))} \, \tilde{s}^\*\_z(x),$$

which can be interpreted as a generalized CDM for the hidden layer. Furthermore, we denote *ξ<sup>X</sup>*<sub>1</sub>(*x*) = √(*P<sub>X</sub>*(*x*)) *t̃*(*x*) as the information vector of *t̃*(*x*), with the matrix **Ξ**<sup>*X*</sup><sub>1</sub> defined as

$$\boldsymbol{\Xi}\_1^X \triangleq \begin{bmatrix} \xi\_1^X(1) & \cdots & \xi\_1^X(|\mathcal{X}|) \end{bmatrix}^{\mathrm{T}},$$

and we also define

$$\mathbf{W} \triangleq \begin{bmatrix} w(1) & \cdots & w(k) \end{bmatrix}^{\mathrm{T}}, \tag{18}$$

$$\mathbf{J} \triangleq \text{diag}\{\sigma'(c(1)), \sigma'(c(2)), \dots, \sigma'(c(k))\}. \tag{19}$$

The following theorem characterizes the loss (8).

**Theorem 5.** *Given the weights and bias* (*v*, *b*) *at the output layer, and for any input feature s, we denote* ℒ(*s*) *as the loss* (8) *evaluated with respect to* (*v*, *b*) *and s. Then, with the constraints* (16)*,*

$$\mathcal{L}(s) - \mathcal{L}(s^\*) = \frac{1}{2} \left\| \Theta \tilde{\mathbf{B}}\_1 - \Theta \mathbf{W} \left( \boldsymbol{\Xi}\_1^X \right)^{\mathrm{T}} \right\|\_{\mathrm{F}}^2 + \frac{1}{2} \kappa^{(v,b)}(s, s^\*) + o(\epsilon^2), \tag{20}$$

*where*

$$\Theta \triangleq \left( \left( \boldsymbol{\Xi}^Y \right)^{\mathrm{T}} \boldsymbol{\Xi}^Y \right)^{1/2} \mathbf{J}, \qquad \kappa^{(v,b)}(s, s^\*) = (\mu\_s - \mu\_{s^\*})^{\mathrm{T}} \boldsymbol{\Lambda}\_{\tilde{v}(Y)} (\mu\_s - \mu\_{s^\*}).$$

**Proof.** See Appendix F.

Equation (20) quantifies the closeness between *s* and *s*<sup>∗</sup> in terms of the loss (8). Our goal is then to minimize (20), which can be separated into two optimization problems:

$$\mathbf{W}^\* = \underset{\mathbf{W}}{\arg\min} \left\| \Theta \tilde{\mathbf{B}}\_1 - \Theta \mathbf{W} \left( \boldsymbol{\Xi}\_1^X \right)^{\mathrm{T}} \right\|\_{\mathrm{F}}^2, \tag{21}$$

$$
\mu\_s^\* = \underset{\mu\_s}{\text{arg min }} \kappa^{(v,b)}(s, s^\*). \tag{22}
$$

Note that the optimization problem (21) is similar to the one that appeared in Lemma 3, and the optimal solution is given by

$$\mathbf{W}^\* = \tilde{\mathbf{B}}\_1 \boldsymbol{\Xi}\_1^X \left( \left( \boldsymbol{\Xi}\_1^X \right)^{\mathrm{T}} \boldsymbol{\Xi}\_1^X \right)^{-1}.$$

Therefore, solving for the optimal weights in the hidden layer can be interpreted as projecting *s̃*<sup>∗</sup>(*x*) onto the subspace of feature functions spanned by *t*(*x*) to find the closest expressible function. In addition, problem (22) is to choose *μ<sub>s</sub>* (and hence the bias *c*(*z*)) to minimize a quadratic term similar to *η*<sup>(*v*,*b*)</sup>(*s*) in (10). Similar to the analyses of parameters in the last layer, we can obtain analytical solutions for hidden-layer parameters, e.g., *μ*<sup>∗</sup><sub>*s*</sub> and *w*<sup>∗</sup>, with detailed discussions provided in Appendix G.
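The closed-form minimizer of (21) is an ordinary least-squares projection, and it does not depend on the weighting matrix Θ as long as Θ is invertible. The sketch below checks both points with random stand-in matrices; `B1`, `Xi1`, and `Theta` are hypothetical placeholders rather than quantities computed from a trained network.

```python
import numpy as np

rng = np.random.default_rng(4)
n_x, m, k = 8, 4, 3

B1 = rng.normal(size=(k, n_x))      # stand-in for the generalized CDM of the hidden layer
Xi1 = rng.normal(size=(n_x, m))     # stand-in for the information matrix of t(x)
Theta = rng.normal(size=(k, k))     # stand-in weighting, invertible with probability 1

def objective(W):
    """Weighted Frobenius objective of (21)."""
    return np.linalg.norm(Theta @ B1 - Theta @ W @ Xi1.T, "fro") ** 2

# Closed-form solution: project the rows of B1 onto the column space of Xi1.
W_star = B1 @ Xi1 @ np.linalg.inv(Xi1.T @ Xi1)

# The same solution from a generic least-squares solver on Xi1 @ W.T ~ B1.T.
W_lstsq = np.linalg.lstsq(Xi1, B1.T, rcond=None)[0].T

# Random perturbations of W_star never decrease the weighted objective.
perturbed = [objective(W_star + 1e-2 * rng.normal(size=(k, m))) for _ in range(100)]
print(objective(W_star), min(perturbed))
```

The reason is that the residual at the closed-form solution is orthogonal to the row space of (**Ξ**<sup>*X*</sup><sub>1</sub>)<sup>T</sup>, so left-multiplying by Θ leaves the cross term at zero.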

Overall, we observe the correspondence between (11), (14), and (21), (22), and interpret both operations as feature projections. Our argument can be generalized to any intermediate layer in a multi-layer network, with all the previous layers viewed as the fixed pre-processing that specifies *t*(*x*), and all the layers after determining *s*∗. Then, the iterative procedure in back-propagation can be viewed as alternating projection finding the fixed-point solution over the entire network. This final fixed-point solution, even under the local assumption, might not be the SVD solution as in Theorem 4. This is because the limited expressive power of the network often makes it impossible to generate the desired feature function. In such cases, the concept of feature projection can be used to quantify this gap, and thus to measure the quality of the selected features.

#### *3.3. Scoring Neural Networks*

Given a learning problem, it is useful to tell whether or not some extracted features are informative [42]. Our previous development naturally gives rise to a performance metric.

**Definition 5.** *Given a feature s*(*x*) ∈ ℝ<sup>*k*</sup> *and weight v*(*y*) ∈ ℝ<sup>*k*</sup> *with the corresponding information matrices* **Ξ***<sup>X</sup> and* **Ξ***<sup>Y</sup>, the H-score H*(*s*, *v*) *is defined as*

$$H(s,v) \triangleq \frac{1}{2} \left\| \tilde{\mathbf{B}} \right\|\_{\mathrm{F}}^2 - \frac{1}{2} \left\| \tilde{\mathbf{B}} - \boldsymbol{\Xi}^Y \left( \boldsymbol{\Xi}^X \right)^{\mathrm{T}} \right\|\_{\mathrm{F}}^2 = \mathbb{E}\_{P\_{XY}} \left[ \tilde{s}^{\mathrm{T}}(X) \, \tilde{v}(Y) \right] - \frac{1}{2} \operatorname{tr} \left( \boldsymbol{\Lambda}\_{\tilde{s}(X)} \boldsymbol{\Lambda}\_{\tilde{v}(Y)} \right). \tag{23}$$

*In addition, for given s*(*x*)*, we define the single-sided H-score H*(*s*) *as*

$$H(s) \triangleq \max\_{v} H(s, v) \tag{24}$$

$$=\frac{1}{2}\left\|\tilde{\mathbf{B}}\right\|\_{\mathrm{F}}^{2}-\frac{1}{2}\left\|\tilde{\mathbf{B}}-\tilde{\mathbf{B}}\,\boldsymbol{\Xi}^{X}\left(\left(\boldsymbol{\Xi}^{X}\right)^{\mathrm{T}}\boldsymbol{\Xi}^{X}\right)^{-1}\left(\boldsymbol{\Xi}^{X}\right)^{\mathrm{T}}\right\|\_{\mathrm{F}}^{2}\tag{25}$$

$$= \frac{1}{2} \left\| \tilde{\mathbf{B}} \boldsymbol{\Xi}^{X} \left( \left( \boldsymbol{\Xi}^{X} \right)^{\mathrm{T}} \boldsymbol{\Xi}^{X} \right)^{-\frac{1}{2}} \right\|\_{\mathrm{F}}^{2} = \frac{1}{2} \mathbb{E}\_{P\_{Y}} \left[ \left\| \mathbb{E}\_{P\_{X|Y}} \left[ \boldsymbol{\Lambda}\_{\tilde{s}(X)}^{-1/2} \, \tilde{s}(X) \, \middle|\, Y \right] \right\|^{2} \right]. \tag{26}$$

The H-score can be used to measure the quality of features generated at any intermediate layer of the network. It is related to (20) when choosing the optimal bias and Θ as the identity matrix. This can be understood as taking the output of this layer *s*(*x*) and directly feeding it to a softmax output layer with *v*(*y*) used as the weights, and *H*(*s*, *v*) measures the resulting performance. Note that *v*(*y*) here can be an arbitrary function of *Y*, not necessarily the weights on the next layer computed by the network. When the optimal *v*<sup>∗</sup>(*y*) as defined in (12) is used, the resulting performance becomes the single-sided H-score *H*(*s*), which measures the quality of *s*(*x*). In addition, by comparing (26) with (7), the performance measure *H*(*s*) also coincides with the information metric (7), up to a scale factor.
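A sample-based estimate of the single-sided H-score follows from the second expression in (26), which can be rewritten as one half of the trace of **Λ**<sup>−1</sup><sub>*s̃*(*X*)</sub> times the covariance of the class-conditional means of *s̃*(*X*). The sketch below is one possible implementation under that reading; the function name `h_score`, the use of a pseudo-inverse, and the synthetic data are assumptions, and the released code for the paper may differ. The result is cross-checked against the matrix expression for a discrete *X*.

```python
import numpy as np

def h_score(features, labels):
    """Single-sided H-score (26) estimated from samples.

    features : array (n, k), rows s(x_i)
    labels   : integer array (n,), entries y_i
    """
    centered = features - features.mean(axis=0)
    cov_s = centered.T @ centered / len(labels)      # covariance of s~(X)
    cov_g = np.zeros_like(cov_s)                     # covariance of E[s~(X) | Y]
    for y in np.unique(labels):
        idx = labels == y
        g = centered[idx].mean(axis=0)               # class-conditional mean
        cov_g += idx.mean() * np.outer(g, g)         # weighted by empirical P_Y(y)
    return 0.5 * np.trace(np.linalg.pinv(cov_s) @ cov_g)

# Synthetic discrete data: the feature is a lookup table s(x) = table[x].
rng = np.random.default_rng(5)
n, nx, ny, k = 5000, 8, 6, 2
x = rng.integers(0, nx, size=n)
y = (x + rng.integers(0, 3, size=n)) % ny            # labels that depend on x
table = rng.normal(size=(nx, k))
h_samples = h_score(table[x], y)

# Matrix form of (26) using the empirical joint distribution.
p_xy = np.zeros((nx, ny))
np.add.at(p_xy, (x, y), 1.0)
p_xy /= n
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
B = (p_xy - np.outer(p_x, p_y)).T / np.sqrt(np.outer(p_y, p_x))
Xi_x = np.sqrt(p_x)[:, None] * (table - p_x @ table)
h_matrix = 0.5 * np.trace(np.linalg.inv(Xi_x.T @ Xi_x) @ Xi_x.T @ B.T @ B @ Xi_x)

print(h_samples, h_matrix)
```

For continuous features from a trained network, only `h_score(features, labels)` is needed; the matrix form is included here purely as a consistency check.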

Specifically, for a given dataset and a feature extractor that generates *s*(·), the H-score *H*(*s*) can be efficiently computed from the second equality of (26). In addition, when we use the H-score to compare the performance of different feature extractors (models), the model complexity has to be taken into account to reduce overfitting. To this end, we adopt the Akaike information criterion (AIC) and define the *AIC-corrected H-score*

$$H\_{\rm AIC}(s) \triangleq H(s) - \frac{n\_{\rm p}}{n\_{\rm s}} \tag{27}$$

for comparing different models, where *n*<sub>p</sub> and *n*<sub>s</sub> represent the number of parameters in the model and the training sample size, respectively.
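The correction in (27) is a simple penalty proportional to the parameter count. The short sketch below shows how it can reorder two models; all numbers are hypothetical and chosen only for illustration, and they are not results from the paper.

```python
def aic_corrected_h_score(h, n_params, n_samples):
    """AIC-corrected H-score (27): H(s) - n_p / n_s."""
    return h - n_params / n_samples

# Hypothetical values for illustration only.
n_samples = 1_300_000
small_model = aic_corrected_h_score(h=40.0, n_params=5_000_000, n_samples=n_samples)
large_model = aic_corrected_h_score(h=41.0, n_params=25_000_000, n_samples=n_samples)

print(small_model, large_model)
```

In this made-up case the larger model has the higher raw H-score but the lower corrected score, because its penalty term is five times larger.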

In current practice, the cross-entropy 𝔼<sub>*P<sub>XY</sub>*</sub>[log *P̃*<sup>(*v*,*b*)</sup><sub>*Y*|*X*</sub>] is often used as the performance metric. One can, in principle, also use log-loss to measure the effectiveness of the selected feature at the output of an intermediate layer [42]. However, one problem of this metric is that, for a given problem, it is not clear what value of log-loss one should expect, as the log-loss is generally unbounded. In contrast, the H-score can be directly computed from the data samples and has a clear upper bound. Indeed, it follows from Lemma 1 that, for *k*-dimensional feature *s* and weights *v*, we have the sequence of inequalities

$$H(s,v) \le H(s) \le \frac{1}{2} \sum\_{i=1}^{k} \sigma\_i^2 \le \frac{k}{2} \tag{28}$$

where *σ<sub>i</sub>* denotes the *i*th singular value of **B̃**.

In particular, the first "≤" follows from the definition (24), and the gap between *H*(*s*, *v*) and *H*(*s*) measures the optimality of the weights *v*; the second "≤" follows from the first equality of (26), and the gap between the two sides characterizes the difference between the chosen feature and the optimal solution, which is a useful measure of how restrictive the network structure is (i.e., its lack of expressive power); the last "≤" follows from the fact that *σ<sub>i</sub>* ≤ 1 (cf. Lemma 1), which measures the dependency between the data variable and the label for the given dataset. In Section 3.4.3, we validate this metric on real data.
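The chain of inequalities (28) can be observed numerically on a random discrete example. The sketch below assumes **B̃** has entries (*P<sub>XY</sub>* − *P<sub>X</sub>P<sub>Y</sub>*)/√(*P<sub>X</sub>P<sub>Y</sub>*); the joint distribution, feature, and weights are random and purely illustrative.

```python
import numpy as np

def fro2(M):
    """Squared Frobenius norm."""
    return np.linalg.norm(M, "fro") ** 2

rng = np.random.default_rng(6)
nx, ny, k = 8, 6, 2
p_xy = rng.random((nx, ny)) + 0.1
p_xy /= p_xy.sum()
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
B = (p_xy - np.outer(p_x, p_y)).T / np.sqrt(np.outer(p_y, p_x))   # (ny, nx)

s = rng.normal(size=(nx, k))                   # arbitrary feature
v = rng.normal(size=(ny, k))                   # arbitrary (non-optimal) weights
Xi_x = np.sqrt(p_x)[:, None] * (s - p_x @ s)
Xi_y = np.sqrt(p_y)[:, None] * (v - p_y @ v)

h_sv = 0.5 * fro2(B) - 0.5 * fro2(B - Xi_y @ Xi_x.T)              # (23)
proj = Xi_x @ np.linalg.inv(Xi_x.T @ Xi_x) @ Xi_x.T               # projector onto span of Xi^X
h_s = 0.5 * fro2(B) - 0.5 * fro2(B - B @ proj)                    # (25)
sing = np.linalg.svd(B, compute_uv=False)
svd_bound = 0.5 * np.sum(sing[:k] ** 2)

chain = [h_sv, h_s, svd_bound, k / 2]
print(chain)   # non-decreasing from left to right
```

With arbitrary weights, *H*(*s*, *v*) can even be negative, while *H*(*s*) is always non-negative because it is half the squared norm of a projection of **B̃**.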

#### *3.4. Experiments*

This section presents experiments for validating our theoretical characterizations, with corresponding code available at https://github.com/XiangxiangXu/dnn (accessed on 7 December 2021). Specifically, all DNN models used in Section 3.4.3 are available at https://keras.io/applications/ (accessed on 7 December 2021).

#### 3.4.1. Experimental Validation of Theorem 4

We first validate Theorem 4, which characterizes the optimal feature extracted by a network with ideal expressive power. Here, we consider discrete data with alphabet sizes |𝒳| = 8 and |𝒴| = 6, and construct the network as shown in Figure 3. Specifically, the network input is the one-hot encoding of *X*, i.e., [**1**<sub>*X*</sub>(1), …, **1**<sub>*X*</sub>(|𝒳|)]<sup>T</sup>, where **1**<sub>*X*</sub>(*x*) takes one if and only if *X* = *x*, and takes zero otherwise. Then, the feature *s*(*X*) is generated by a linear layer, with the sigmoid function used as the activation function. For ease of comparison and presentation, we set the feature dimension to *k* = 1, since otherwise the optimal feature (cf. Theorem 4) lies in a subspace and is non-unique. It can be verified that this network has ideal expressive power, i.e., with proper weights in the first layer, *s*(*X*) can express any desired function up to scaling and shifting.

**Figure 3.** A simple neural network with ideal expressive power, which can generate any *k* = 1 dimensional feature *s* of *X* by tuning the weights in the first layer.

To compare the result trained by the neural network with that in Theorem 4, we first randomly generate a distribution *P<sub>XY</sub>*, and then independently draw *n* = 100,000 pairs of (*X*, *Y*) samples. We then train the network using batch gradient descent with Nesterov momentum [43], with the momentum hyperparameter set to 0.9. In addition, we set the learning rate to 4 with a decay factor of 0.01 and clip gradients with norm exceeding 0.5. After training, the learned values of *s*(*x*), *v*(*y*) and *b*(*y*) are shown in Figure 4 and compared with theoretical results. From the figure, we can observe that the training results match our theoretical analyses.

**Figure 4.** The trained feature *s*, weights *v*, and bias *b* of the network in Figure 3, which are compared with the corresponding theoretical results to show their coincidences.
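The theoretical curves for *k* = 1 can be obtained from the leading singular vectors of **B̃**, following Theorem 4 and (11). The sketch below shows one way to compute them for a randomly generated *P<sub>XY</sub>* with |𝒳| = 8 and |𝒴| = 6. It assumes **B̃** has entries (*P<sub>XY</sub>* − *P<sub>X</sub>P<sub>Y</sub>*)/√(*P<sub>X</sub>P<sub>Y</sub>*), fixes the arbitrary scaling by taking a unit-norm singular vector, and is not the authors' released script.

```python
import numpy as np

rng = np.random.default_rng(7)
nx, ny = 8, 6
p_xy = rng.random((nx, ny))
p_xy /= p_xy.sum()                           # a randomly generated P_XY
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
B = (p_xy - np.outer(p_x, p_y)).T / np.sqrt(np.outer(p_y, p_x))   # (ny, nx)

U, sing, Vt = np.linalg.svd(B)
# Centered optimal feature: leading right singular vector divided by sqrt(P_X).
s_star = Vt[0] / np.sqrt(p_x)
# Matching centered weights from the forward projection (11).
v_star = sing[0] * U[:, 0] / np.sqrt(p_y)

print(p_x @ s_star, p_x @ s_star**2)         # mean and variance of the feature
```

With this normalization the feature has zero mean and unit variance under *P<sub>X</sub>*; a trained network can match it only up to sign, scaling, and shifting, as noted above for the *k* = 1 setting.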

#### 3.4.2. Experimental Validation of Theorem 5

In addition, we validate Theorem 5 using the neural network depicted in Figure 5, with the same settings of *X*, *Y*. Specifically, the numbers of neurons in the hidden layers are set to *m* = 4 and *k* = 3, where *t*(*X*) is randomly generated from *X*, and we have chosen the sigmoid function as the activation function *σ*(·) to generate *s*(*x*). We then fix the weights and bias at the output layer and train the weights *w*(1), *w*(2), *w*(3) and bias *c* in the hidden layer to optimize the log-loss. Specifically, we use batch gradient descent with the Nesterov momentum hyperparameter set to 0.9. In addition, we set the learning rate to 4 with a decay factor of 10<sup>−6</sup> and clip gradients with norm exceeding 0.1. After training, Figure 6 shows the match between the learned results and the corresponding theoretical values.

**Figure 5.** The designed network for validating the impact of network structure on feature extraction, with *m* = 4 and *k* = 3 neurons in two hidden layers. Our goal is to compare the learned weights *w*(1), *w*(2), *w*(3) and bias *c* in the hidden layer with our theoretic characterizations in Section 3.2.2.

**Figure 6.** The trained weights *w* and bias *c* of the network in Figure 5, which are compared with the corresponding theoretical results to show their coincidences.

#### 3.4.3. Experimental Validation of H-Score

To validate the H-score as a performance measure for extracted features, we compare the H-score and classification accuracy of DNNs on image classification tasks. Specifically, we use the ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC2012) [22] dataset and extract features using several deep neural networks with representative architecture designs [44–49]. After training the feature extractors on the ILSVRC2012 training set, we compute the H-score of the feature in the last hidden layer, as well as the classification accuracies on the ILSVRC2012 validation set (here, we use the ILSVRC2012 validation set for testing, as the labels of the ILSVRC2012 testing set have not been publicly released). The results are summarized in Table 1, where *H*<sub>AIC</sub>(*s*) is the AIC-corrected H-score as defined in (27), with *n*<sub>p</sub> being the number of model parameters, and *n*<sub>s</sub> = 1,300,000 corresponding to the number of training samples in ImageNet. The AIC-corrected H-score is consistent with the classification accuracy, which validates the effectiveness of the H-score as a measure for scoring neural networks.


**Table 1.** Classification accuracy and H-score for different DNN models on ImageNet dataset, where "Paras" indicates the number of parameters (in millions) in the model and *H*AIC represents the AIC-corrected H-score.

#### **4. Discussion**

Our characterization gives an information-theoretic interpretation of the feature extraction process in DNNs, which also provides a practical performance measure for scoring neural networks. Different from empirical studies focusing on specific datasets [7], our development is based on the probability distribution space, which is more general and can also provide theoretic insights. Moreover, the information-theoretic framework allows us to obtain direct operational meaning and better interpretations for the solutions, compared with optimization-based theoretical characterizations, e.g., [11,13].

As a first step in establishing a rigorous framework for DNN analysis, the present work can be extended in both theoretical and practical aspects. From the theoretical perspective, one extension is to investigate the analytical properties for general DNNs, using the theoretic insights obtained from local analysis regime. For example, it was shown in [50] that the symmetry between feature and weights in DNNs established in the local analysis regime (cf. Section 3.2.1) also holds for general probability distributions. Another extension is to apply the framework to investigate the optimal feature for structured data or network, e.g., data with sparsity structure [51].

From the practical perspective, in addition to the demonstrated example of evaluating existing DNN models (cf. Section 3.4.3), the H-score can also be used as an objective function in designing learning algorithms. In particular, such usages have been illustrated in multi-modal learning [52] and transfer learning [53] tasks.

#### **5. Conclusions**

In this paper, we apply the local information geometric analysis and provide an information-theoretic interpretation of the feature extraction scheme in DNNs. We first establish an information metric for features in inference tasks by formalizing the information-theoretic feature selection problem. In addition, we demonstrate that the features extracted by DNNs coincide with the information-theoretically optimal feature, with the same metric, called the H-score, measuring the performance of features. Furthermore, we discuss the usage of the H-score for measuring the effectiveness of DNNs. Our framework demonstrates a connection between practical deep learning implementations and information-theoretic characterizations, which can provide theoretical insights for DNN analysis and learning algorithm designs.

**Author Contributions:** X.X., S.-L.H., L.Z. and G.W.W. contributed to the conceptualization, methodology, and writing of this paper. All authors have read and agreed to the published version of the manuscript.

**Funding:** The work of S.-L. Huang was supported in part by the National Natural Science Foundation of China under Grant 61807021 and the Shenzhen Science and Technology Program under Grant KQTD20170810150821146. The work of L. Zheng was supported in part by the National Science Foundation (NSF) under Award CNS-2002908 and the Office of Naval Research (ONR) under Grant N00014-19-1-2621.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

#### **Appendix A. Proof of Theorem 1**

We commence with the characterization of the error exponent.

**Lemma A1.** *Given a reference distribution P<sub>X</sub>* ∈ relint(𝒫<sup>𝒳</sup>)*, a constant ϵ* > 0*, and integers n and k, let x*<sub>1</sub>, …, *x<sub>n</sub> denote i.i.d. samples from one of P*<sub>1</sub> *or P*<sub>2</sub>*, where P*<sub>1</sub>, *P*<sub>2</sub> ∈ 𝒩<sub>*ϵ*</sub><sup>𝒳</sup>(*P<sub>X</sub>*)*. To decide whether P*<sub>1</sub> *or P*<sub>2</sub> *is the generating distribution, a sequence of k-dimensional statistics h<sup>k</sup>* = (*h*<sub>1</sub>, …, *h<sub>k</sub>*) *is constructed as*

$$h\_i = \frac{1}{n} \sum\_{l=1}^{n} f\_i(\mathbf{x}\_l), \quad i = 1, \dots, k,\tag{A1}$$

*where* (*f*<sub>1</sub>(*X*), …, *f<sub>k</sub>*(*X*)) *are zero-mean, unit-variance, and uncorrelated with respect to P<sub>X</sub>, i.e.,*

$$\mathbb{E}\_{P\_X}[f\_i(X)] = 0, \quad i \in \{1, \dots, k\} \tag{A2}$$

$$\mathbb{E}\_{P\_X}\left[f\_{i}(X)f\_{j}(X)\right] = \delta\_{ij}, \quad i, j \in \{1, \ldots, k\}. \tag{A3}$$

*Then, the error probability of the decision based on h<sup>k</sup> decays exponentially in n as n* → ∞*, with (Chernoff) exponent*

$$\lim\_{n \to \infty} \frac{-\log p\_{e}}{n} \triangleq E\_{h^k} = \sum\_{i=1}^k E\_{h\_i}, \tag{A4}$$

*where*

$$E\_{h\_i} = \frac{1}{8} \langle \phi\_1 - \phi\_2, \xi\_i \rangle^2 + o(\epsilon^2),\tag{A5}$$

*and φ*<sub>1</sub> ↔ *P*<sub>1</sub>, *φ*<sub>2</sub> ↔ *P*<sub>2</sub>, *ξ<sub>i</sub>* ↔ *f<sub>i</sub>*(*X*), *i* ∈ {1, …, *k*} *are the corresponding information vectors.*

**Proof of Lemma A1.** Since the rule is to decide based on comparing the projection

$$\sum\_{i=1}^{k} h\_i \left( \mathbb{E}\_{P\_1} \left[ f\_i(X) \right] - \mathbb{E}\_{P\_2} \left[ f\_i(X) \right] \right)^2$$

to a threshold, via Cramér's theorem [54], the error exponent under *P<sub>j</sub>* (*j* = 1, 2) is

$$E\_{j}(\lambda) = \min\_{P \in \mathcal{S}(\lambda)} D(P \,\|\, P\_{j}), \tag{A6}$$

where

$$\mathcal{S}(\lambda) \stackrel{\Delta}{=} \left\{ P \in \mathcal{P}^{\mathcal{X}} \colon \mathbb{E}\_P \left[ f^k(X) \right] = \lambda \operatorname{\mathbb{E}}\_{P\_1} \left[ f^k(X) \right] + (1 - \lambda) \operatorname{\mathbb{E}}\_{P\_2} \left[ f^k(X) \right] \right\}. \tag{A7}$$

Now, since (A2) holds, we obtain

$$\begin{split} \mathbb{E}\_{P\_{j}}[f\_{i}(X)] &= \sum\_{x \in \mathcal{X}} P\_{j}(x) f\_{i}(x) \\ &= \sum\_{x \in \mathcal{X}} P\_{X}(x) f\_{i}(x) + \sum\_{x \in \mathcal{X}} (P\_{j}(x) - P\_{X}(x)) f\_{i}(x) \\ &= \mathbb{E}\_{P\_{X}}[f\_{i}(X)] + \sum\_{x \in \mathcal{X}} \sqrt{P\_{X}(x)} \,\phi\_{j}(x) \cdot \frac{\xi\_{i}(x)}{\sqrt{P\_{X}(x)}} \\ &= \sum\_{x \in \mathcal{X}} \phi\_{j}(x) \,\xi\_{i}(x) \\ &= \langle \phi\_{j}, \xi\_i \rangle, \quad j = 1, 2 \text{ and } i = 1, \dots, k, \end{split}\tag{A8}$$

which we express compactly as

$$\mathbb{E}\_{P\_{j}}[f^k(X)] = \langle \phi\_{j}, \xi^k \rangle, \quad j = 1, 2$$

with *ξ<sup>k</sup>* ≜ (*ξ*<sub>1</sub>, …, *ξ<sub>k</sub>*).

Hence, the constraint (A7) is expressed in information vectors as

$$
\langle \phi, \xi\_{i} \rangle = \langle \lambda \, \phi\_{1} + (1 - \lambda) \, \phi\_{2}, \xi\_{i} \rangle, \quad i = 1, \dots, k,
$$

i.e.,

$$
\langle \phi, \xi^{k} \rangle = \langle \lambda \, \phi\_{1} + (1 - \lambda) \, \phi\_{2}, \xi^{k} \rangle. \tag{A9}
$$

In turn, the optimal *P* in (A6), which we denote by *P*<sup>∗</sup>, lies in the exponential family through *P<sub>j</sub>* with natural statistic *f<sup>k</sup>*(*x*), i.e., the *k*-dimensional family whose members are of the form

$$\log \tilde{P}\_{\theta^k}(x) = \sum\_{i=1}^k \theta\_i f\_i(x) + \log P\_{j}(x) - \alpha \left(\theta^k\right),$$

for which the associated information vector is

$$\tilde{\phi}\_{\theta^k}(x) = \sum\_{i=1}^k \theta\_i \xi\_i(x) + \phi\_j(x) - \alpha(\theta^k) \sqrt{P\_X(x)} + o(\epsilon), \tag{A10}$$

where we have used the fact that

$$\begin{aligned} \log Q\_{\mathcal{X}}(\mathbf{x}) &= \log P\_{\mathcal{X}}(\mathbf{x}) + \log \frac{Q\_{\mathcal{X}}(\mathbf{x})}{P\_{\mathcal{X}}(\mathbf{x})} \\ &= \log P\_{\mathcal{X}}(\mathbf{x}) + \log \left( 1 + \frac{1}{\sqrt{P\_{\mathcal{X}}(\mathbf{x})}} \phi(\mathbf{x}) \right) \\ &= \log P\_{\mathcal{X}}(\mathbf{x}) + \frac{1}{\sqrt{P\_{\mathcal{X}}(\mathbf{x})}} \phi(\mathbf{x}) + o(\epsilon) \end{aligned}$$

for all *Q<sub>X</sub>* ∈ 𝒩<sub>*ϵ*</sub><sup>𝒳</sup>(*P<sub>X</sub>*) with the information vector *φ* ↔ *Q<sub>X</sub>*. As a result,

$$
\langle \tilde{\phi}\_{\theta^k}, \xi\_i \rangle = \theta\_i + \langle \phi\_{j}, \xi\_i \rangle + o(\epsilon),
$$

where we have used (A3). Hence, via (A9), we obtain that the intersection with the linear family (A7) is at *P*<sup>∗</sup> = *P̃*<sub>*θ*<sup>*k*∗</sup></sub> with

$$
\theta\_i^\* = \langle \lambda \phi\_1 + (1 - \lambda)\phi\_2 - \phi\_{j}, \xi\_i\rangle + o(\epsilon)
$$

and thus

$$\begin{split} E\_{j}(\lambda) &= D(P^\* \,\|\, P\_{j}) \\ &= \frac{1}{2} \left\| \tilde{\phi}\_{\theta^{k^\*}} - \phi\_{j} \right\|^2 + o(\epsilon^2) \end{split} \tag{A11}$$

$$= \frac{1}{2} \left\| \sum\_{i=1}^{k} \theta\_i^\* \xi\_i \right\|^2 + \frac{1}{2} \alpha \left( \theta^{k^\*} \right)^2 + o(\epsilon^2) \tag{A12}$$

$$= \frac{1}{2} \sum\_{i=1}^{k} (\theta\_i^\*)^2 + \frac{1}{2} \alpha \left(\theta^{k^\*}\right)^2 + o(\epsilon^2) \tag{A13}$$

$$=\frac{1}{2}\sum\_{i=1}^{k}\langle\lambda\phi\_1+(1-\lambda)\phi\_2-\phi\_{j},\xi\_i\rangle^2+o(\epsilon^2),\tag{A14}$$

where to obtain (A11) we have exploited the local approximation of KL divergence [18], to obtain (A12) we have exploited (A10), to obtain (A13) we have again exploited (A3), and to obtain (A14) we have used that

$$
\alpha\left(\theta^{k^\*}\right) = o(\epsilon^2)
$$

since *θ*<sup>*k*∗</sup> = *O*(*ϵ*) and

$$
\alpha(0) = 0, \quad \text{and} \quad \nabla \alpha(0) = \mathbb{E}\_{P\_j} [f^k(X)] = \langle \phi\_j, \xi^k \rangle = O(\epsilon).
$$

Finally, *E*<sub>1</sub>(*λ*) = *E*<sub>2</sub>(*λ*) when *λ* = 1/2, so the overall error probability has exponent (A5).
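The local approximation of KL divergence used for (A11) can be illustrated numerically. The following is a minimal sketch, assuming that the information vector of a distribution *Q* relative to a reference *P<sub>X</sub>* is *φ*(*x*) = (*Q*(*x*) − *P<sub>X</sub>*(*x*))/√*P<sub>X</sub>*(*x*), as implied by the expansion of log *Q<sub>X</sub>* above. The helper names `kl` and `info_vector` and the example distributions are illustrative choices, not taken from the original.

```python
import numpy as np

def kl(p, q):
    """KL divergence D(p || q) in nats for strictly positive distributions."""
    return float(np.sum(p * np.log(p / q)))

def info_vector(q, p_ref):
    """Information vector of q relative to the reference distribution p_ref."""
    return (q - p_ref) / np.sqrt(p_ref)

# Reference distribution and two zero-sum perturbation directions (hypothetical values).
p_ref = np.array([0.1, 0.2, 0.3, 0.4])
d1 = np.array([1.0, -2.0, 0.5, 0.5])
d2 = np.array([-1.0, 1.0, 1.0, -1.0])

def approximation_gap(eps):
    """|D(P1||P2) - 0.5 * ||phi1 - phi2||^2| normalized by eps^2."""
    p1 = p_ref + eps * d1
    p2 = p_ref + eps * d2
    local = 0.5 * np.sum((info_vector(p1, p_ref) - info_vector(p2, p_ref)) ** 2)
    return abs(kl(p1, p2) - local) / eps**2

# The normalized gap shrinks with eps, consistent with an o(eps^2) remainder.
gaps = [approximation_gap(eps) for eps in (1e-2, 1e-3, 1e-4)]
print(gaps)
```

The normalized gap decreasing with the perturbation size is what an *o*(*ϵ*<sup>2</sup>) remainder predicts; the sketch does not replace the formal argument in [18].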

Next, the following lemma establishes a property of information vectors in a Markov chain.

**Lemma A2.** *Given the Markov relation X ↔ Y ↔ V and any v ∈ 𝒱, let φ<sub>v</sub><sup>X|V</sup> and φ<sub>v</sub><sup>Y|V</sup> denote the associated information vectors for P<sub>X|V</sub>(·|v) and P<sub>Y|V</sub>(·|v); then we have*

$$
\phi\_{v}^{X|V} = \tilde{\mathbf{B}}^{\mathrm{T}} \phi\_{v}^{Y|V}.\tag{A15}
$$

**Proof of Lemma A2.** From the Markov relation we have

$$P\_X(x) = \sum\_{y \in \mathcal{Y}} P\_{X|Y}(x|y) P\_Y(y)$$

and

$$P\_{X|V}(\mathbf{x}|\upsilon) = \sum\_{\mathbf{y}\in\mathcal{Y}} P\_{X|Y,V}(\mathbf{x}|\mathbf{y},\upsilon) P\_{Y|V}(\mathbf{y}|\upsilon) = \sum\_{\mathbf{y}\in\mathcal{Y}} P\_{X|Y}(\mathbf{x}|\mathbf{y}) P\_{Y|V}(\mathbf{y}|\upsilon).$$

As a result,

$$P\_{X|V}(x|v) - P\_X(x) = \sum\_{y \in \mathcal{Y}} P\_{X|Y}(x|y) \left[P\_{Y|V}(y|v) - P\_Y(y)\right],$$

from which we obtain the corresponding information vector

$$\begin{split} \phi\_{v}^{X|V}(x) &= \frac{1}{\sqrt{P\_X(x)}} \sum\_{y \in \mathcal{Y}} P\_{X|Y}(x|y) \sqrt{P\_Y(y)}\, \phi\_{v}^{Y|V}(y) \\ &= \sum\_{y \in \mathcal{Y}} \left[ \tilde{\mathbf{B}}(y, x) + \sqrt{P\_X(x) P\_Y(y)} \right] \phi\_{v}^{Y|V}(y) \\ &= \sum\_{y \in \mathcal{Y}} \tilde{\mathbf{B}}(y, x)\, \phi\_{v}^{Y|V}(y), \end{split} \tag{A16}$$

where the last equality follows from the fact that

$$\sum\_{y \in \mathcal{Y}} \sqrt{P\_Y(y)} \phi\_v^{Y|V}(y) = \sum\_{y \in \mathcal{Y}} \left[ P\_{Y|V}(y|v) - P\_Y(y) \right] = 0.$$

Finally, rewriting (A16) in matrix form yields (A15).
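Lemma A2 is an exact identity and can therefore be checked on a randomly generated Markov chain. The sketch below assumes the entrywise definition **B̃**(*y*, *x*) = (*P<sub>XY</sub>*(*x*, *y*) − *P<sub>X</sub>*(*x*)*P<sub>Y</sub>*(*y*))/√(*P<sub>X</sub>*(*x*)*P<sub>Y</sub>*(*y*)) that is used elsewhere in these appendices; the alphabet sizes, the random seed, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
nx, ny, nv = 4, 5, 3

# Random Markov chain X <-> Y <-> V, specified by P(v), P(y|v) and P(x|y).
p_v = rng.dirichlet(np.ones(nv))
p_y_given_v = rng.dirichlet(np.ones(ny), size=nv)   # shape (nv, ny)
p_x_given_y = rng.dirichlet(np.ones(nx), size=ny)   # shape (ny, nx)

p_y = p_v @ p_y_given_v                  # marginal of Y
p_xy = p_x_given_y * p_y[:, None]        # joint distribution, indexed (y, x)
p_x = p_xy.sum(axis=0)                   # marginal of X
p_x_given_v = p_y_given_v @ p_x_given_y  # valid because of the Markov relation

# B_tilde(y, x) = (P_XY(x, y) - P_X(x) P_Y(y)) / sqrt(P_X(x) P_Y(y))
b_tilde = (p_xy - np.outer(p_y, p_x)) / np.sqrt(np.outer(p_y, p_x))

# Information vectors of P_{Y|V}(.|v) and P_{X|V}(.|v), one row per value of v.
phi_y = (p_y_given_v - p_y) / np.sqrt(p_y)
phi_x = (p_x_given_v - p_x) / np.sqrt(p_x)

# (A15) for every v at once: phi_x[v] = B_tilde^T phi_y[v].
max_error = np.max(np.abs(phi_x - phi_y @ b_tilde))
print(max_error)
```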

In addition, the following lemma is useful for dealing with the expectation over an RIE.

**Lemma A3.** *Let* **z** *be a spherically symmetric random vector of dimension M, i.e., for any orthogonal* **Q***,* **z** *and* **Qz** *have the same distribution. If* **A** *is a fixed matrix of compatible dimensions, then*

$$\mathbb{E}\left[\lVert\mathbf{z}^{\mathrm{T}}\mathbf{A}\rVert^{2}\right] = \frac{1}{M}\,\mathbb{E}\left[\lVert\mathbf{z}\rVert^{2}\right]\lVert\mathbf{A}\rVert\_{\mathrm{F}}^{2}.\tag{A17}$$

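**Proof of Lemma A3.** By definition we have **Λ**<sub>**z**</sub> = **QΛ**<sub>**z**</sub>**Q**<sup>T</sup> for any orthogonal **Q**; hence, **Λ**<sub>**z**</sub> is diagonal with identical diagonal entries. Writing **Λ**<sub>**z**</sub> = *λ***I**, from

$$\operatorname{tr}(\boldsymbol{\Lambda}\_{\mathbf{z}}) = \mathbb{E}\left[\lVert\mathbf{z}\rVert^{2}\right] = \lambda M$$

we obtain

$$
\lambda = \frac{1}{M} \operatorname{tr}(\boldsymbol{\Lambda}\_{\mathbf{z}}) = \frac{1}{M}\,\mathbb{E}\left[\lVert\mathbf{z}\rVert^{2}\right].
$$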

As a result, we have

$$\mathbb{E}\left[\lVert\mathbf{z}^{\mathrm{T}}\mathbf{A}\rVert^{2}\right] = \operatorname{tr}\left(\mathbf{A}^{\mathrm{T}}\boldsymbol{\Lambda}\_{\mathbf{z}}\mathbf{A}\right) = \lambda \operatorname{tr}\left(\mathbf{A}^{\mathrm{T}}\mathbf{A}\right) = \frac{1}{M}\mathbb{E}\left[\lVert\mathbf{z}\rVert^{2}\right]\lVert\mathbf{A}\rVert\_{\mathrm{F}}^{2}.$$
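As an illustration of Lemma A3, the identity (A17) can be approximated by Monte Carlo sampling. The sketch below uses a standard Gaussian vector as one convenient example of a spherically symmetric **z**; the dimensions, sample size, and seed are arbitrary choices, and the agreement is only up to sampling noise.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_cols, n_samples = 5, 3, 200_000

a = rng.standard_normal((m, n_cols))     # fixed matrix A
z = rng.standard_normal((n_samples, m))  # each row is one draw of a spherically symmetric z

# Left-hand side of (A17): empirical mean of ||z^T A||^2.
lhs = np.mean(np.sum((z @ a) ** 2, axis=1))

# Right-hand side of (A17): (1/M) * E[||z||^2] * ||A||_F^2, with E estimated empirically.
rhs = np.mean(np.sum(z ** 2, axis=1)) / m * np.sum(a ** 2)

rel_gap = abs(lhs - rhs) / rhs
print(lhs, rhs, rel_gap)
```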

Proceeding to our proof of Theorem 1, by definition of feature functions, we have 𝔼<sub>*P<sub>X</sub>*</sub>[*f<sub>i</sub>*(*X*)] = 0, *i* = 1, ..., *k*. Suppose *f* is the vector representation of *f*<sup>*k*</sup> and denote by *f̃* ≜ **Λ**<sub>*f*</sub><sup>−1/2</sup> *f* the normalized *f*, with **Λ**<sub>*f*</sub><sup>1/2</sup> denoting any square root matrix of **Λ**<sub>*f*</sub>. Then, the corresponding statistics *f̃*<sup>*k*</sup> = (*f̃*<sub>1</sub>, ..., *f̃*<sub>*k*</sub>) satisfy the constraints (A2) and (A3). In addition, we construct the statistic *h̃*<sup>*k*</sup> = (*h̃*<sub>1</sub>, ..., *h̃*<sub>*k*</sub>) as [cf. (A1)]

$$\tilde{h}\_i = \frac{1}{n} \sum\_{l=1}^n \tilde{f}\_i(x\_l), \quad i = 1, \ldots, k. \tag{A18}$$

Then, from Lemma A1, the error exponent of distinguishing *v* and *v*′ based on *h̃*<sup>*k*</sup> is

$$\begin{split} E\_{\tilde{h}^k}(v, v') &= \frac{1}{8} \sum\_{i=1}^k \left[ \left( \phi\_{v}^{X|V} - \phi\_{v'}^{X|V} \right)^{\mathrm{T}} \tilde{\xi}\_i^X \right]^2 + o(\epsilon^2) \\ &= \frac{1}{8} \left\lVert \left( \phi\_{v}^{X|V} - \phi\_{v'}^{X|V} \right)^{\mathrm{T}} \tilde{\boldsymbol{\Xi}}^X \right\rVert^2 + o(\epsilon^2), \end{split}$$

where *φ*<sub>*v*</sub><sup>*X*|*V*</sup> denotes the associated information vector for *P*<sub>*X*|*V*</sub>(·|*v*), *ξ̃*<sub>*i*</sub><sup>*X*</sup> denotes the information vector of *f̃<sub>i</sub>*, and **Ξ̃**<sup>*X*</sup> ≜ [*ξ̃*<sub>1</sub><sup>*X*</sup>, ..., *ξ̃*<sub>*k*</sub><sup>*X*</sup>]. Since the optimal decision rule is linear, the error exponent is invariant under linear transformations of the statistics, i.e.,

$$\begin{split} E\_{h^k}(v, v') &= E\_{\tilde{h}^k}(v, v') = \frac{1}{8} \left\lVert \left( \phi\_{v}^{X|V} - \phi\_{v'}^{X|V} \right)^{\mathrm{T}} \tilde{\boldsymbol{\Xi}}^{X} \right\rVert^2 + o(\epsilon^2) \\ &= \frac{1}{8} \left\lVert \left( \phi\_{v}^{Y|V} - \phi\_{v'}^{Y|V} \right)^{\mathrm{T}} \tilde{\mathbf{B}} \tilde{\boldsymbol{\Xi}}^{X} \right\rVert^2 + o(\epsilon^2), \end{split} \tag{A19}$$

where the last equality follows from Lemma A2.

As a result, taking the expectation of (A19) over a given RIE yields

$$\begin{split} \mathbb{E}\left[E\_{h^{k}}(v,v')\right] &= \frac{1}{8} \mathbb{E}\left[\left\lVert\left(\phi\_{v}^{Y|V} - \phi\_{v'}^{Y|V}\right)^{\mathrm{T}} \tilde{\mathbf{B}} \tilde{\boldsymbol{\Xi}}^{X}\right\rVert^{2}\right] + o(\epsilon^{2}) \\ &= \frac{\mathbb{E}\left[\left\lVert\phi\_{v}^{Y|V} - \phi\_{v'}^{Y|V}\right\rVert^{2}\right]}{8|\mathcal{Y}|} \left\lVert\tilde{\mathbf{B}} \tilde{\boldsymbol{\Xi}}^{X}\right\rVert\_{\mathrm{F}}^{2} + o(\epsilon^{2}), \end{split}$$

where we have exploited Lemma A3. Finally, the error exponent (7) can be obtained by noting from the definition of *f̃*<sup>*k*</sup> that

$$
\tilde{\boldsymbol{\Xi}}^{X} = \boldsymbol{\Xi}^{X} \left( (\boldsymbol{\Xi}^{X})^{\mathrm{T}} \boldsymbol{\Xi}^{X} \right)^{-\frac{1}{2}}.
$$
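The normalization in the last display can be illustrated with a short sketch. It assumes that the information vector of a zero-mean feature *f* is *ξ*(*x*) = √*P<sub>X</sub>*(*x*) *f*(*x*), and it uses the symmetric inverse square root, which is one of the admissible square-root choices; the distribution, features, and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
nx, k = 6, 3
p_x = rng.dirichlet(np.ones(nx))

# Zero-mean features f_1..f_k and their information vectors xi_i(x) = sqrt(P_X(x)) f_i(x).
f = rng.standard_normal((nx, k))
f -= p_x @ f
xi = np.sqrt(p_x)[:, None] * f

# Symmetric inverse square root of Xi^T Xi via an eigendecomposition.
lam_f = xi.T @ xi
eigval, eigvec = np.linalg.eigh(lam_f)
inv_sqrt = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T

# Normalized information matrix: Xi_tilde = Xi (Xi^T Xi)^{-1/2}.
xi_tilde = xi @ inv_sqrt
gram = xi_tilde.T @ xi_tilde
print(np.round(gram, 10))
```

The Gram matrix of the normalized information vectors is the identity, and the columns remain orthogonal to √*P<sub>X</sub>*, so the normalized features are still zero-mean.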

#### **Appendix B. Proof of Lemma 2**

We first prove two useful lemmas.

**Lemma A4.** *For distributions P ∈ relint(𝒫<sup>𝒳</sup>), Q, R ∈ 𝒫<sup>𝒳</sup>, and sufficiently small ϵ, if D(P‖Q) ≤ ϵ<sup>2</sup> and D(P‖R) ≤ ϵ<sup>2</sup>, then there exists a constant C > 0 independent of ϵ, such that D(Q‖R) ≤ Cϵ<sup>2</sup>.*

**Proof of Lemma A4.** Denote by ‖·‖<sub>1</sub> the ℓ<sub>1</sub>-distance between distributions, i.e., ‖*P* − *Q*‖<sub>1</sub> ≜ ∑<sub>*x*∈𝒳</sub> |*P*(*x*) − *Q*(*x*)|. Then, from Pinsker's inequality [14], we have

$$\lVert P - Q\rVert\_{1} \le \sqrt{2D(P \,\Vert\, Q)} \le \sqrt{2}\epsilon,\tag{A20}$$

$$\lVert P - R\rVert\_1 \le \sqrt{2D(P \,\Vert\, R)} \le \sqrt{2}\epsilon, \tag{A21}$$

which implies

$$\lVert Q - R\rVert\_{1} \le \lVert P - Q\rVert\_{1} + \lVert P - R\rVert\_{1} \le 2\sqrt{2}\epsilon. \tag{A22}$$

In addition, with *p*<sub>min</sub> ≜ min<sub>*x*∈𝒳</sub> *P*(*x*), for all *x* ∈ 𝒳 we have

$$R(x) \ge P(x) - |P(x) - R(x)| \tag{A23}$$

$$\ge \min\_{x' \in \mathcal{X}} P(x') - \sqrt{2}\epsilon \tag{A24}$$

$$=p\_{\min} - \sqrt{2}\epsilon, \tag{A25}$$

where to obtain (A24) we have used (A21). Note that since *P* ∈ relint(𝒫<sup>𝒳</sup>) we have *p*<sub>min</sub> > 0, and thus *R*(*x*) > *p*<sub>min</sub>/2 for sufficiently small *ϵ*. As a result,

$$D(Q||R) \le \sum\_{\mathbf{x} \in \mathcal{X}} \frac{(Q(\mathbf{x}) - R(\mathbf{x}))^2}{R(\mathbf{x})} \tag{A26}$$

$$\leq \frac{2}{p\_{\min}} \sum\_{x \in \mathcal{X}} [Q(x) - R(x)]^2 \tag{A27}$$

$$\leq \frac{2\|Q - R\|\_1^2}{p\_{\min}}\tag{A28}$$

$$\leq \frac{16}{p\_{\text{min}}} \epsilon^2,\tag{A29}$$

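where to obtain (A26) we have used the fact that the KL divergence is upper bounded by the corresponding *χ*<sup>2</sup>-divergence [55], to obtain (A27) we have used *R*(*x*) > *p*<sub>min</sub>/2, and to obtain (A29) we have used (A22). Hence, the claim holds with *C* = 16/*p*<sub>min</sub>.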
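The chain of bounds in the proof of Lemma A4 can be traced on a small example. The following sketch picks a fixed distribution *P* and two nearby distributions *Q* and *R* (all values are hypothetical), takes *ϵ*<sup>2</sup> as the larger of the two divergences so that both hypotheses hold, and evaluates the intermediate quantities of (A22), (A26), and (A29).

```python
import numpy as np

def kl(p, q):
    """KL divergence D(p || q) in nats for strictly positive distributions."""
    return float(np.sum(p * np.log(p / q)))

# Hypothetical example: P in the relative interior, Q and R small perturbations of P.
p = np.array([0.1, 0.2, 0.3, 0.4])
q = p + 0.01 * np.array([1.0, -1.0, 0.5, -0.5])
r = p + 0.01 * np.array([-0.5, 1.0, -1.0, 0.5])

eps_sq = max(kl(p, q), kl(p, r))   # smallest eps^2 for which both hypotheses hold
p_min = p.min()

l1 = np.sum(np.abs(q - r))         # ||Q - R||_1, bounded in (A22)
chi_sq = np.sum((q - r) ** 2 / r)  # chi-square divergence, upper bound in (A26)
d_qr = kl(q, r)                    # quantity of interest D(Q || R)
bound = 16.0 * eps_sq / p_min      # final bound (A29)

print(d_qr, chi_sq, bound)
```

In this example the final bound is loose by more than an order of magnitude, which is expected since the proof only aims at the *O*(*ϵ*<sup>2</sup>) scaling.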

**Lemma A5.** *For all* (*x*, *y*) ∈ 𝒳 × 𝒴*, we have*

$$D(P\_X P\_Y \,\Vert\, P\_X \tilde{P}\_{Y|X}^{(s,v,b)}) \ge P\_X(x) \log \left[ P\_Y(y) e^{\tau(x,y)} + (1 - P\_Y(y)) e^{-\frac{P\_Y(y)}{1 - P\_Y(y)} \tau(x,y)} \right],$$

*where P̃*<sub>*Y*|*X*</sub><sup>(*s*,*v*,*b*)</sup> *is as defined in* (4)*, and where we have defined τ*(*x*, *y*) ≜ *ṽ*<sup>T</sup>(*y*)*s*(*x*) + *d̃*(*y*)*.*

**Proof of Lemma A5.** First, we can rewrite the conditional distribution *P̃*<sub>*Y*|*X*</sub><sup>(*s*,*v*,*b*)</sup>(*y*|*x*) as

$$\begin{split} \tilde{P}\_{Y|X}^{(s,v,b)}(y|x) &= \frac{e^{v^{\mathrm{T}}(y)s(x) + b(y)}}{\sum\_{y' \in \mathcal{Y}} e^{v^{\mathrm{T}}(y')s(x) + b(y')}} = \frac{P\_{Y}(y)e^{v^{\mathrm{T}}(y)s(x) + d(y)}}{\sum\_{y' \in \mathcal{Y}} P\_{Y}(y')e^{v^{\mathrm{T}}(y')s(x) + d(y')}} \\ &= \frac{P\_{Y}(y)e^{\tilde{v}^{\mathrm{T}}(y)s(x) + \tilde{d}(y)}}{\sum\_{y' \in \mathcal{Y}} P\_{Y}(y')e^{\tilde{v}^{\mathrm{T}}(y')s(x) + \tilde{d}(y')}} \\ &= \frac{P\_{Y}(y)e^{\tau(x,y)}}{\sum\_{y' \in \mathcal{Y}} P\_{Y}(y')e^{\tau(x,y')}}. \end{split} \tag{A30}$$

Then, the KL divergence *D*(*P<sub>X</sub>P<sub>Y</sub>* ‖ *P<sub>X</sub>P̃*<sub>*Y*|*X*</sub><sup>(*s*,*v*,*b*)</sup>) can be expressed as

$$D\left(P\_{X}P\_{Y} \,\Vert\, P\_{X}\tilde{P}\_{Y|X}^{(s,v,b)}\right) = \sum\_{(x,y) \in \mathcal{X} \times \mathcal{Y}} P\_{X}(x)P\_{Y}(y) \log \frac{\sum\_{y' \in \mathcal{Y}} P\_{Y}(y')e^{\tau(x,y')}}{e^{\tau(x,y)}}$$

$$= \sum\_{x \in \mathcal{X}} P\_{X}(x) \log \left[\sum\_{y' \in \mathcal{Y}} P\_{Y}(y')e^{\tau(x,y')}\right] - \mathbb{E}\_{P\_{X}P\_{Y}}[\tau(X,Y)]$$

$$= \sum\_{x \in \mathcal{X}} P\_{X}(x) \log \left[\sum\_{y' \in \mathcal{Y}} P\_{Y}(y')e^{\tau(x,y')}\right],\tag{A31}$$

where to obtain the last equality we have used the fact 𝔼<sub>*P<sub>X</sub>P<sub>Y</sub>*</sub>[*τ*(*X*, *Y*)] = 0. As a result, we have

$$D(P\_X P\_Y \,\Vert\, P\_X \tilde{P}\_{Y|X}^{(s,v,b)}) \ge P\_X(x) \log \left[ \sum\_{y' \in \mathcal{Y}} P\_Y(y') e^{\tau(x,y')} \right] \tag{A32}$$

$$\ge P\_X(x) \log \left[ P\_Y(y) e^{\tau(x,y)} + (1 - P\_Y(y)) e^{-\frac{P\_Y(y)}{1 - P\_Y(y)} \tau(x,y)} \right], \tag{A33}$$

where the last inequality follows from Jensen's inequality:

$$\begin{split} \sum\_{y' \in \mathcal{Y}} P\_Y(y') e^{\tau(\mathbf{x}, y')} &= P\_Y(y) e^{\tau(\mathbf{x}, y)} + (1 - P\_Y(y)) \sum\_{y' \neq y} \frac{P\_Y(y')}{1 - P\_Y(y)} e^{\tau(\mathbf{x}, y')} \\ &\ge P\_Y(y) e^{\tau(\mathbf{x}, y)} + (1 - P\_Y(y)) \exp\left(\frac{1}{1 - P\_Y(y)} \sum\_{y' \neq y} P\_Y(y') \tau(\mathbf{x}, y')\right) \\ &= P\_Y(y) e^{\tau(\mathbf{x}, y)} + \left(1 - P\_Y(y)\right) e^{-\frac{P\_Y(y)}{1 - P\_Y(y)} \tau(\mathbf{x}, y)} .\end{split}$$
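The rewriting (A30), the expression (A31), and the bound (A33) can be checked on random parameters. The sketch below assumes the softmax form of (4) with weights *v*(*y*), bias *b*(*y*), and feature *s*(*x*), takes *d*(*y*) = *b*(*y*) − log *P<sub>Y</sub>*(*y*) as inferred from the second equality of (A30), and interprets the tilde as centering with respect to *P<sub>Y</sub>*. These readings, as well as all sizes and variable names, are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
nx, ny, k = 4, 3, 2
p_x = rng.dirichlet(np.ones(nx))
p_y = rng.dirichlet(np.ones(ny))

s = rng.standard_normal((nx, k))   # feature s(x)
v = rng.standard_normal((ny, k))   # weights v(y)
b = rng.standard_normal(ny)        # biases b(y)

# Softmax model: rows indexed by x, columns by y.
logits = s @ v.T + b
p_model = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Centered quantities (assumption: d(y) = b(y) - log P_Y(y), tilde = centering under P_Y).
d = b - np.log(p_y)
v_c = v - p_y @ v
d_c = d - p_y @ d
tau = s @ v_c.T + d_c              # tau(x, y), shape (nx, ny)

# (A30): the same conditional distribution written through tau.
p_tau = p_y * np.exp(tau)
p_tau /= p_tau.sum(axis=1, keepdims=True)

# (A31): D(P_X P_Y || P_X P_model) computed directly and via the log-sum form.
kl_direct = np.sum(np.outer(p_x, p_y) * np.log(p_y / p_model))
kl_a31 = np.sum(p_x * np.log(np.exp(tau) @ p_y))

# (A33): lower bound evaluated for every pair (x, y).
ratio = p_y / (1.0 - p_y)
lower = p_x[:, None] * np.log(p_y * np.exp(tau) + (1 - p_y) * np.exp(-ratio * tau))

print(kl_direct, kl_a31, lower.max())
```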

Proceeding to our proof of Lemma 2, first note that when *v* = *d* = 0, we have *P̃*<sub>*Y*|*X*</sub><sup>(*s*,*v*,*b*)</sup> = *P<sub>Y</sub>*. As a result, the optimal *v*, *d* for (8) satisfy

$$\begin{split} D(P\_{XY} \,\Vert\, P\_{X}\tilde{P}\_{Y|X}^{(s,v,b)}) &\leq D(P\_{XY} \,\Vert\, P\_{X}P\_{Y}) \\ &\leq \sum\_{(x,y)\in\mathcal{X}\times\mathcal{Y}} \frac{\left[P\_{X,Y}(x,y) - P\_{X}(x)P\_{Y}(y)\right]^{2}}{P\_{X}(x)P\_{Y}(y)} \\ &\leq \epsilon^{2}, \end{split} \tag{A34}$$

where to obtain the second inequality we have again exploited the *χ*<sup>2</sup>-divergence as an upper bound of the KL divergence [55], and to obtain the last inequality we have used the definition of *ϵ*-dependency.

As *P<sub>XY</sub>* ∈ relint(𝒫<sup>𝒳×𝒴</sup>), from Lemma A4, there exist *C* > 0 and *ϵ*<sub>1</sub> > 0 such that *D*(*P<sub>X</sub>P<sub>Y</sub>* ‖ *P<sub>X</sub>P̃*<sub>*Y*|*X*</sub><sup>(*s*,*v*,*b*)</sup>) < *Cϵ*<sup>2</sup> for all *ϵ* < *ϵ*<sub>1</sub>. Furthermore, from Lemma A5, for all (*x*, *y*) ∈ 𝒳 × 𝒴 and *ϵ* ∈ (0, *ϵ*<sub>1</sub>), we have

$$C\epsilon^2 \ge P\_X(x)\log\left[P\_Y(y)e^{\tau(x,y)} + (1 - P\_Y(y))e^{-\frac{P\_Y(y)}{1 - P\_Y(y)}\tau(x,y)}\right].\tag{A35}$$

Note that the right-hand side of (A35) satisfies

$$\log \left[ P\_Y(y)e^{\tau(x,y)} + (1 - P\_Y(y))e^{-\frac{P\_Y(y)}{1 - P\_Y(y)}\tau(x,y)} \right] = \frac{P\_Y(y)}{2(1 - P\_Y(y))}\tau^2(x,y) + o(\tau^2(x,y)).$$

Therefore, there exists *δ* > 0 independent of *ϵ*<sub>1</sub>, such that for all |*τ*(*x*, *y*)| ≤ *δ*, we have

$$\log \left[ P\_Y(y) e^{\tau(x, y)} + (1 - P\_Y(y)) e^{-\frac{P\_Y(y)}{1 - P\_Y(y)} \tau(x, y)} \right] > \frac{P\_Y(y)}{2} \tau^2(x, y). \tag{A36}$$

In addition, if |*τ*(*x*, *y*)| > *δ*, we have

$$\begin{split} & \log \left[ P\_{Y}(y)e^{\tau(\mathbf{x},y)} + (1 - P\_{Y}(y))e^{-\frac{P\_{Y}(y)}{1 - P\_{Y}(y)}\tau(\mathbf{x},y)} \right] \\ & \geq \min \left\{ \log \left[ P\_{Y}(y)e^{\delta} + (1 - P\_{Y}(y))e^{-\frac{P\_{Y}(y)}{1 - P\_{Y}(y)}\delta} \right], \log \left[ P\_{Y}(y)e^{-\delta} + (1 - P\_{Y}(y))e^{\frac{P\_{Y}(y)}{1 - P\_{Y}(y)}\delta} \right] \right\} \\ & \geq \frac{P\_{Y}(y)}{2}\delta^{2}, \end{split}$$

where to obtain the second inequality we have exploited the monotonicity of the function *t* ↦ *P<sub>Y</sub>*(*y*)*e<sup>t</sup>* + (1 − *P<sub>Y</sub>*(*y*))*e*<sup>−*P<sub>Y</sub>*(*y*)*t*/(1−*P<sub>Y</sub>*(*y*))</sup>, and to obtain the third inequality we have exploited (A36). As a result, we have

$$\log \left[ P\_Y(y)e^{\tau(\mathbf{x},y)} + (1 - P\_Y(y))e^{-\frac{P\_Y(y)}{1 - P\_Y(y)}\tau(\mathbf{x},y)} \right] > \frac{P\_Y(y)}{2} \cdot \min \{ \delta^2, \tau^2(\mathbf{x}, y) \}. \tag{A37}$$

Hence, (A35) becomes

$$
C\epsilon^2 \ge \frac{P\_X(x)P\_Y(y)}{2} \cdot \min\{\delta^2, \tau^2(x, y)\},
\tag{A38}
$$

from which we can obtain *τ*(*x*, *y*) = *O*(*ϵ*). To see this, let

$$\epsilon\_2 \triangleq \frac{\delta}{\sqrt{2C}} \cdot \min\_{(x, y) \in \mathcal{X} \times \mathcal{Y}} \sqrt{P\_X(x) P\_Y(y)}, \quad \epsilon\_0 \triangleq \min \{ \epsilon\_1, \epsilon\_2 \}.$$

Then, for all *ϵ* < *ϵ*<sub>0</sub>, we have

$$
C \epsilon^2 < \frac{P\_X(x) P\_Y(y)}{2} \cdot \delta^2,
$$

and (A38) implies |*τ*(*x*, *y*)| < *C*′*ϵ* with *C*′ = √(2*C*/(*P<sub>X</sub>*(*x*)*P<sub>Y</sub>*(*y*))).

#### **Appendix C. Proof of Lemma 3**

**Proof.** From Lemma 2, there exists *C*′ > 0 such that for all (*x*, *y*) ∈ 𝒳 × 𝒴, we have

$$|\tilde{v}^{\mathrm{T}}(y)s(x) + \tilde{d}(y)| < C'\epsilon,\tag{A39}$$

which implies

$$|\mu\_s^{\mathrm{T}}\tilde{v}(y) + \tilde{d}(y)| < C\epsilon, \tag{A40}$$

$$|\tilde{v}^{\mathrm{T}}(y)\tilde{s}(x)|<2C\epsilon,\tag{A41}$$

with *C* = max{*C*′, 1}. Here, (A40) follows by averaging (A39) over *P<sub>X</sub>*, and (A41) follows from the triangle inequality applied to the difference of the expressions in (A39) and (A40).

From (A30), we can assume 𝔼<sub>*P<sub>Y</sub>*</sub>[*v*(*Y*)] = 𝔼<sub>*P<sub>Y</sub>*</sub>[*d*(*Y*)] = 0 without loss of generality. Then, (4) can be rewritten as

$$\tilde{P}\_{Y|X}^{(s,v,b)}(y|x) = \frac{P\_Y(y)e^{\tilde{v}^{\mathrm{T}}(y)s(x) + \tilde{d}(y)}}{\sum\_{y' \in \mathcal{Y}} P\_Y(y')e^{\tilde{v}^{\mathrm{T}}(y')s(x) + \tilde{d}(y')}},\tag{A42}$$

and the numerator can be written as

$$\begin{aligned} P\_Y(y)e^{\tilde{v}^{\mathrm{T}}(y)s(x) + \tilde{d}(y)} &= P\_Y(y) \left( 1 + \tilde{v}^{\mathrm{T}}(y)s(x) + \tilde{d}(y) + o(\epsilon) \right) \\ &= P\_Y(y) \left( 1 + \tilde{v}^{\mathrm{T}}(y)s(x) + \tilde{d}(y) \right) + o(\epsilon), \end{aligned}$$

where we have used (A39). Similarly, from

$$\begin{aligned} \sum\_{y' \in \mathcal{Y}} P\_Y(y') e^{\tilde{v}^{\mathrm{T}}(y')s(x) + \tilde{d}(y')} &= \sum\_{y' \in \mathcal{Y}} P\_Y(y') \left( 1 + \tilde{v}^{\mathrm{T}}(y')s(x) + \tilde{d}(y') \right) + o(\epsilon) \\ &= 1 + \mathbb{E}\_{P\_Y} \left[ \tilde{v}^{\mathrm{T}}(Y)s(x) \right] + \mathbb{E}\_{P\_Y} \left[ \tilde{d}(Y) \right] + o(\epsilon) \\ &= 1 + o(\epsilon) \end{aligned}$$

we obtain

$$\frac{1}{\sum\_{y' \in \mathcal{Y}} P\_Y(y')e^{\tilde{v}^{\mathrm{T}}(y')s(x) + \tilde{d}(y')}} = \frac{1}{1 + o(\epsilon)} = 1 + o(\epsilon).$$

As a result, (A42) can be written as

$$\begin{split} \tilde{P}\_{Y|X}^{(s,v,b)}(y|x) &= \left[ P\_Y(y) \left( 1 + \tilde{v}^{\mathrm{T}}(y) s(x) + \tilde{d}(y) \right) + o(\epsilon) \right] \left[ 1 + o(\epsilon) \right] \\ &= P\_Y(y) \left( 1 + \tilde{v}^{\mathrm{T}}(y) s(x) + \tilde{d}(y) \right) + o(\epsilon), \end{split} \tag{A43}$$

which implies *P<sub>X</sub>P̃*<sub>*Y*|*X*</sub><sup>(*s*,*v*,*b*)</sup> ∈ 𝒩<sub>*Cϵ*</sub><sup>𝒳×𝒴</sup>(*P<sub>X</sub>P<sub>Y</sub>*) for sufficiently small *ϵ*. In addition, the local assumption of distributions implies that *P<sub>XY</sub>* ∈ 𝒩<sub>*ϵ*</sub><sup>𝒳×𝒴</sup>(*P<sub>X</sub>P<sub>Y</sub>*) ⊂ 𝒩<sub>*Cϵ*</sub><sup>𝒳×𝒴</sup>(*P<sub>X</sub>P<sub>Y</sub>*). Again, from the local approximation of KL divergence [18]

$$D(P\_1||P\_2) = \frac{1}{2}||\phi\_1 - \phi\_2||^2 + o(\epsilon^2),\tag{A44}$$

we have

$$\begin{split} &D\left(P\_{Y,X} \,\Vert\, P\_{X}\tilde{P}\_{Y|X}^{(s,v,b)}\right) \\ &= \frac{1}{2}\sum\_{x\in\mathcal{X},y\in\mathcal{Y}}\frac{\left[P\_{Y,X}(y,x)-\tilde{P}\_{Y|X}^{(s,v,b)}(y|x)P\_{X}(x)\right]^{2}}{P\_{Y}(y)P\_{X}(x)}+o(\epsilon^{2}) \\ &= \frac{1}{2}\sum\_{x\in\mathcal{X},y\in\mathcal{Y}}\left[\frac{P\_{Y,X}(y,x)}{\sqrt{P\_{Y}(y)P\_{X}(x)}}-\sqrt{P\_{Y}(y)P\_{X}(x)} - \sqrt{P\_{Y}(y)P\_{X}(x)}\left(\tilde{v}^{\mathrm{T}}(y)s(x)+\tilde{d}(y)+o(\epsilon)\right)\right]^{2}+o(\epsilon^{2}) \\ &= \frac{1}{2}\sum\_{x\in\mathcal{X},y\in\mathcal{Y}}\left[\tilde{\mathbf{B}}(y,x)-\sqrt{P\_{Y}(y)P\_{X}(x)}\,\tilde{v}^{\mathrm{T}}(y)\tilde{s}(x) - \sqrt{P\_{Y}(y)P\_{X}(x)}\left(\tilde{d}(y)+\mu\_{s}^{\mathrm{T}}\tilde{v}(y)\right) - \sqrt{P\_{Y}(y)P\_{X}(x)}\,o(\epsilon)\right]^{2}+o(\epsilon^{2}) \\ &\overset{(\ast)}{=} \frac{1}{2}\sum\_{x\in\mathcal{X},y\in\mathcal{Y}}\left[\tilde{\mathbf{B}}(y,x)-\sqrt{P\_{Y}(y)P\_{X}(x)}\,\tilde{v}^{\mathrm{T}}(y)\tilde{s}(x)\right]^{2} + \frac{1}{2}\sum\_{x\in\mathcal{X},y\in\mathcal{Y}}P\_{Y}(y)P\_{X}(x)\left(\tilde{d}(y)+\mu\_{s}^{\mathrm{T}}\tilde{v}(y)\right)^{2}+o(\epsilon^{2}) \\ &= \frac{1}{2}\sum\_{x\in\mathcal{X},y\in\mathcal{Y}}\left[\tilde{\mathbf{B}}(y,x)-\left(\xi^{Y}(y)\right)^{\mathrm{T}}\xi^{X}(x)\right]^{2} + \frac{1}{2}\,\mathbb{E}\_{P\_{Y}}\left[\left(\tilde{d}(Y)+\mu\_{s}^{\mathrm{T}}\tilde{v}(Y)\right)^{2}\right]+o(\epsilon^{2}) \\ &= \frac{1}{2}\left\lVert\tilde{\mathbf{B}}-\boldsymbol{\Xi}^{Y}\left(\boldsymbol{\Xi}^{X}\right)^{\mathrm{T}}\right\rVert\_{\mathrm{F}}^{2} + \frac{1}{2}\,\eta^{(v,b)}(s)+o(\epsilon^{2}), \end{split}$$

where to obtain (∗), we have used (A40) and (A41) together with the fact |**B̃**(*y*, *x*)| < *ϵ*, and that

$$\begin{aligned} \sum\_{x\in\mathcal{X},y\in\mathcal{Y}}\tilde{\mathbf{B}}(y,x)\sqrt{P\_{Y}(y)P\_{X}(x)}\left(\tilde{d}(y)+\mu\_{s}^{\mathrm{T}}\tilde{v}(y)\right)&=0,\\ \sum\_{x\in\mathcal{X},y\in\mathcal{Y}}P\_{Y}(y)P\_{X}(x)\,\tilde{v}^{\mathrm{T}}(y)\tilde{s}(x)\left(\tilde{d}(y)+\mu\_{s}^{\mathrm{T}}\tilde{v}(y)\right)&=0,\end{aligned}$$

since 𝔼[*d̃*(*Y*)] = 0 and 𝔼[*s̃*(*X*)] = 𝔼[*ṽ*(*Y*)] = 0.
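The separation into a Frobenius-norm term and the term *η*<sup>(*v*,*b*)</sup>(*s*) rests on the two cross terms above vanishing, which is an exact algebraic fact once the *o*(*ϵ*) remainders are dropped. The sketch below checks this, assuming *ξ*<sup>*X*</sup>(*x*) = √*P<sub>X</sub>*(*x*) *s̃*(*x*) and *ξ*<sup>*Y*</sup>(*y*) = √*P<sub>Y</sub>*(*y*) *ṽ*(*y*) for the information vectors; the joint distribution and all parameters are randomly generated and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
nx, ny, k = 5, 4, 2
p_xy = rng.dirichlet(np.ones(nx * ny)).reshape(ny, nx)   # joint, indexed (y, x)
p_x, p_y = p_xy.sum(axis=0), p_xy.sum(axis=1)
root = np.sqrt(np.outer(p_y, p_x))
b_tilde = (p_xy - np.outer(p_y, p_x)) / root

s = rng.standard_normal((nx, k))
mu_s = p_x @ s                       # mean of s(X)
s_c = s - mu_s                       # centered feature s~(x)
v_c = rng.standard_normal((ny, k))
v_c -= p_y @ v_c                     # centered weights v~(y)
d_c = rng.standard_normal(ny)
d_c -= p_y @ d_c                     # centered bias term d~(y)

g = d_c + v_c @ mu_s                 # d~(y) + mu_s^T v~(y)

# Sum before step (*), with the o(eps) remainder dropped.
lhs = np.sum((b_tilde - root * (v_c @ s_c.T) - root * g[:, None]) ** 2)

# Frobenius term plus E_{P_Y}[(d~(Y) + mu_s^T v~(Y))^2].
xi_y = np.sqrt(p_y)[:, None] * v_c
xi_x = np.sqrt(p_x)[:, None] * s_c
rhs = np.sum((b_tilde - xi_y @ xi_x.T) ** 2) + p_y @ g**2

print(lhs, rhs)
```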

#### **Appendix D. Proofs of Theorems 2 and 3**

Theorems 2 and 3 can be proved based on Lemma 3.

**Proofs of Theorems 2 and 3.** Note that the value of *d*(·) only affects the second term of the KL divergence; hence, we can always choose *d*(·) such that *d̃*(*y*) + *μ*<sub>*s*</sub><sup>T</sup>*ṽ*(*y*) = 0. Then, the (**Ξ**<sup>*Y*</sup>, **Ξ**<sup>*X*</sup>) pair should be chosen as

$$(\boldsymbol{\Xi}^Y, \boldsymbol{\Xi}^X)^{\ast} = \underset{(\boldsymbol{\Xi}^Y, \boldsymbol{\Xi}^X)}{\arg\min} \left\lVert \tilde{\mathbf{B}} - \boldsymbol{\Xi}^Y (\boldsymbol{\Xi}^X)^{\mathrm{T}} \right\rVert\_{\mathrm{F}}^2. \tag{A45}$$

Set the derivative (we use the denominator-layout notation of matrix calculus where the scalar-by-matrix derivative will have the same dimension as the matrix)

$$\frac{\partial}{\partial \boldsymbol{\Xi}^Y} \left\lVert\tilde{\mathbf{B}} - \boldsymbol{\Xi}^Y (\boldsymbol{\Xi}^X)^{\mathrm{T}}\right\rVert\_{\mathrm{F}}^2 = 2 \left(\boldsymbol{\Xi}^Y (\boldsymbol{\Xi}^X)^{\mathrm{T}} \boldsymbol{\Xi}^X - \tilde{\mathbf{B}} \boldsymbol{\Xi}^X\right) \tag{A46}$$

to zero, and the optimal **Ξ**<sup>*Y*</sup> for fixed **Ξ**<sup>*X*</sup> is (here, we assume the matrix (**Ξ**<sup>*X*</sup>)<sup>T</sup>**Ξ**<sup>*X*</sup> = **Λ**<sub>*s̃*(*X*)</sub> is invertible; for the case where (**Ξ**<sup>*X*</sup>)<sup>T</sup>**Ξ**<sup>*X*</sup> is singular, we can obtain a similar result with the ordinary matrix inverse replaced by the Moore–Penrose inverse)

$$
\boldsymbol{\Xi}^{Y\ast} = \tilde{\mathbf{B}}\boldsymbol{\Xi}^X \left(\left(\boldsymbol{\Xi}^X\right)^{\mathrm{T}} \boldsymbol{\Xi}^X\right)^{-1}.\tag{A47}
$$

As **1**<sup>T</sup>√**P**<sub>*Y*</sub> **B̃** = **0**, we have **1**<sup>T</sup>√**P**<sub>*Y*</sub> **Ξ**<sup>*Y*∗</sup> = **0**, which demonstrates that **Ξ**<sup>*Y*∗</sup> is a valid matrix for a zero-mean feature vector.
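The closed form (A47) is the solution of a linear least-squares problem, which the following sketch verifies on a random example: the gradient (A46) vanishes at the closed-form solution, a generic solver (`numpy.linalg.lstsq`) returns the same matrix, and the result is orthogonal to √*P<sub>Y</sub>*. The joint distribution and the matrix standing in for **Ξ**<sup>*X*</sup> are randomly generated for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
nx, ny, k = 6, 4, 2
p_xy = rng.dirichlet(np.ones(nx * ny)).reshape(ny, nx)   # joint, indexed (y, x)
p_x, p_y = p_xy.sum(axis=0), p_xy.sum(axis=1)
b_tilde = (p_xy - np.outer(p_y, p_x)) / np.sqrt(np.outer(p_y, p_x))

# A fixed Xi^X whose columns are orthogonal to sqrt(P_X), i.e., zero-mean features.
xi_x = rng.standard_normal((nx, k))
xi_x -= np.outer(np.sqrt(p_x), np.sqrt(p_x) @ xi_x)

# (A47): closed-form minimizer for fixed Xi^X.
xi_y_star = b_tilde @ xi_x @ np.linalg.inv(xi_x.T @ xi_x)

# (A46): the gradient evaluated at the minimizer.
grad = 2 * (xi_y_star @ xi_x.T @ xi_x - b_tilde @ xi_x)

# Cross-check: solve min || B~^T - Xi^X (Xi^Y)^T ||_F with a generic solver.
xi_y_lstsq = np.linalg.lstsq(xi_x, b_tilde.T, rcond=None)[0].T

# Zero-mean property of the optimal Xi^Y.
zero_mean = np.sqrt(p_y) @ xi_y_star
print(np.abs(grad).max(), np.abs(zero_mean).max())
```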

To express **Ξ**<sup>*Y*∗</sup> of (A47) in the form of *s* and *v*, we can make use of the correspondence between feature and information vectors. We can show that, for a zero-mean feature function *f*(*X*) with corresponding information vector *φ*, we have the correspondence 𝔼<sub>*P*<sub>*X*|*Y*</sub></sub>[*f*(*X*)|*Y*] ↔ **B̃***φ*. To see this, note that the *y*-th element of the information vector **B̃***φ* is given by

$$\begin{split} \sum\_{x \in \mathcal{X}} \tilde{\mathbf{B}}(y, x) \phi(x) &= \sum\_{x \in \mathcal{X}} \frac{P\_{XY}(x, y) - P\_{X}(x) P\_{Y}(y)}{\sqrt{P\_{X}(x) P\_{Y}(y)}} f(x) \sqrt{P\_{X}(x)} \\ &= \frac{1}{\sqrt{P\_{Y}(y)}} \sum\_{x \in \mathcal{X}} P\_{XY}(x, y) f(x) \\ &= \sqrt{P\_{Y}(y)}\; \mathbb{E}\_{P\_{X|Y}}[f(X) \mid Y = y]. \end{split}$$

Here, the second equality uses 𝔼<sub>*P<sub>X</sub>*</sub>[*f*(*X*)] = 0, and the last expression is the *y*-th element of the information vector of the feature *y* ↦ 𝔼[*f*(*X*)|*Y* = *y*].
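The correspondence between conditional expectations and multiplication by **B̃** can be confirmed numerically. In the sketch below, the information vector of a zero-mean feature *g*(*y*) is taken to be √*P<sub>Y</sub>*(*y*) *g*(*y*), in analogy with *φ*(*x*) = √*P<sub>X</sub>*(*x*) *f*(*x*); the joint distribution and the feature are random and serve only as an example.

```python
import numpy as np

rng = np.random.default_rng(0)
nx, ny = 5, 3
p_xy = rng.dirichlet(np.ones(nx * ny)).reshape(ny, nx)   # joint, indexed (y, x)
p_x, p_y = p_xy.sum(axis=0), p_xy.sum(axis=1)
b_tilde = (p_xy - np.outer(p_y, p_x)) / np.sqrt(np.outer(p_y, p_x))

f = rng.standard_normal(nx)
f -= p_x @ f                       # zero-mean feature of X
phi = np.sqrt(p_x) * f             # its information vector

cond_mean = (p_xy @ f) / p_y       # E[f(X) | Y = y] for every y
lhs = b_tilde @ phi                # y-th element of B~ phi
rhs = np.sqrt(p_y) * cond_mean     # information vector of the conditional mean

print(lhs, rhs)
```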

Using similar methods, we can verify that **Λ**<sub>*s̃*(*X*)</sub> = (**Ξ**<sup>*X*</sup>)<sup>T</sup>**Ξ**<sup>*X*</sup>. As a result, (A47) is equivalent to

$$\tilde{v}^{\ast}(y) = \mathbb{E}\_{P\_{X|Y}} \left[ \boldsymbol{\Lambda}\_{\tilde{s}(X)}^{-1} \tilde{s}(X) \; \middle| \; Y = y \right]. \tag{A48}$$

By a symmetry argument, we can also obtain the first two equations of Theorem 3. To obtain the third equations of these two theorems, we need to minimize *η*<sup>(*v*,*b*)</sup>(*s*) = 𝔼<sub>*P<sub>Y</sub>*</sub>[(*μ*<sub>*s*</sub><sup>T</sup>*ṽ*(*Y*) + *d̃*(*Y*))<sup>2</sup>]. For given *ṽ* and *μ<sub>s</sub>*, the optimal *d̃* is

$$\tilde{d}^{\ast}(y) = -\mu\_s^{\mathrm{T}} \tilde{v}(y),\tag{A49}$$

and the corresponding *η*<sup>(*v*,*b*)</sup>(*s*) = 0.

In addition, for given *d̃* and *ṽ*, we have

$$\begin{split} \eta^{(v,b)}(s) &= \mathbb{E}\_{P\_{Y}} \left[ (\mu\_{s}^{\mathrm{T}} \tilde{v}(Y) + \tilde{d}(Y))^{2} \right] \\ &= \mu\_{s}^{\mathrm{T}} \boldsymbol{\Lambda}\_{\tilde{v}(Y)} \mu\_{s} + 2\mu\_{s}^{\mathrm{T}} \mathbb{E}\_{P\_{Y}} \left[ \tilde{v}(Y) \tilde{d}(Y) \right] + \operatorname{var}(\tilde{d}(Y)). \end{split} \tag{A50}$$

Setting ∂*η*<sup>(*v*,*b*)</sup>(*s*)/∂*μ<sub>s</sub>* = 0, we obtain

$$
\mu\_s^{\ast} = -\boldsymbol{\Lambda}\_{\tilde{v}(Y)}^{-1} \mathbb{E}\_{P\_Y} \left[ \tilde{v}(Y) \tilde{d}(Y) \right]. \tag{A51}
$$

#### **Appendix E. Proof of Theorem 4**

**Proof.** From Lemma 3, choosing the optimal (**Ξ**<sup>*Y*</sup>, **Ξ**<sup>*X*</sup>) is equivalent to solving the matrix factorization problem of **B̃**. Since both **Ξ**<sup>*Y*</sup> and **Ξ**<sup>*X*</sup> have rank no greater than *k*, from the Eckart–Young–Mirsky theorem [56], the optimal choice of **Ξ**<sup>*Y*</sup>(**Ξ**<sup>*X*</sup>)<sup>T</sup> should be the truncated singular value decomposition of **B̃** with the top *k* singular values. As a result, (**Ξ**<sup>*Y*</sup>, **Ξ**<sup>*X*</sup>)<sup>∗</sup> are the left and right singular vectors of **B̃** corresponding to the largest *k* singular values.

The optimality of the bias *d̃*(*y*) = −*μ*<sub>*s*</sub><sup>T</sup>*ṽ*(*y*) has already been shown in Appendix D.
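The role of the Eckart–Young–Mirsky theorem can be illustrated with `numpy.linalg.svd`. In the sketch below, the singular values are absorbed into the left factor, which is one of several equivalent ways to split the rank-*k* truncation between **Ξ**<sup>*Y*</sup> and **Ξ**<sup>*X*</sup>; the joint distribution is random and the comparison against random rank-*k* factorizations is only a sanity check, not a proof.

```python
import numpy as np

rng = np.random.default_rng(0)
nx, ny, k = 6, 5, 2
p_xy = rng.dirichlet(np.ones(nx * ny)).reshape(ny, nx)   # joint, indexed (y, x)
p_x, p_y = p_xy.sum(axis=0), p_xy.sum(axis=1)
b_tilde = (p_xy - np.outer(p_y, p_x)) / np.sqrt(np.outer(p_y, p_x))

u, sing, vt = np.linalg.svd(b_tilde, full_matrices=False)

# One valid factorization of the rank-k truncated SVD of B~.
xi_y = u[:, :k] * sing[:k]   # left singular vectors scaled by singular values
xi_x = vt[:k].T              # right singular vectors

best_err = np.sum((b_tilde - xi_y @ xi_x.T) ** 2)
tail = np.sum(sing[k:] ** 2)  # energy of the discarded singular values

# Random rank-k factorizations should never achieve a smaller error.
trial_errs = [
    np.sum((b_tilde - rng.standard_normal((ny, k)) @ rng.standard_normal((k, nx))) ** 2)
    for _ in range(200)
]
print(best_err, tail, min(trial_errs))
```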

#### **Appendix F. Proof of Theorem 5**

The following lemma is useful to prove Theorem 5.

**Lemma A6** (Pythagorean theorem)**.** *Let* **Ξ**<sup>*X*∗</sup> *be the optimal matrix for given* **Ξ**<sup>*Y*</sup> *as defined in* (13)*. Then,*

$$\left\lVert\tilde{\mathbf{B}} - \boldsymbol{\Xi}^{Y} \left(\boldsymbol{\Xi}^{X}\right)^{\mathrm{T}}\right\rVert\_{\mathrm{F}}^{2} - \left\lVert\tilde{\mathbf{B}} - \boldsymbol{\Xi}^{Y} \left(\boldsymbol{\Xi}^{X\ast}\right)^{\mathrm{T}}\right\rVert\_{\mathrm{F}}^{2} = \left\lVert\boldsymbol{\Xi}^{Y} \left(\boldsymbol{\Xi}^{X\ast}\right)^{\mathrm{T}} - \boldsymbol{\Xi}^{Y} \left(\boldsymbol{\Xi}^{X}\right)^{\mathrm{T}}\right\rVert\_{\mathrm{F}}^{2}.\tag{A52}$$

**Proof of Lemma A6.** Denote by ⟨**U**, **V**⟩ the Frobenius inner product of matrices **U** and **V**, i.e., ⟨**U**, **V**⟩ ≜ tr(**U**<sup>T</sup>**V**), and we have

$$\begin{aligned} \left\langle \tilde{\mathbf{B}} - \boldsymbol{\Xi}^{Y} \left( \boldsymbol{\Xi}^{X\ast} \right)^{\mathrm{T}}, \boldsymbol{\Xi}^{Y} \left( \boldsymbol{\Xi}^{X} \right)^{\mathrm{T}} \right\rangle &= \operatorname{tr} \left( \tilde{\mathbf{B}} \boldsymbol{\Xi}^{X} \left( \boldsymbol{\Xi}^{Y} \right)^{\mathrm{T}} \right) - \operatorname{tr} \left( \boldsymbol{\Xi}^{X\ast} \left( \boldsymbol{\Xi}^{Y} \right)^{\mathrm{T}} \boldsymbol{\Xi}^{Y} \left( \boldsymbol{\Xi}^{X} \right)^{\mathrm{T}} \right) \\ &= \operatorname{tr} \left( \tilde{\mathbf{B}} \boldsymbol{\Xi}^{X} \left( \boldsymbol{\Xi}^{Y} \right)^{\mathrm{T}} \right) - \operatorname{tr} \left( \tilde{\mathbf{B}}^{\mathrm{T}} \boldsymbol{\Xi}^{Y} \left( \boldsymbol{\Xi}^{X} \right)^{\mathrm{T}} \right) \\ &= 0. \end{aligned}$$

As a result, we obtain

$$\begin{split} \left\lVert\tilde{\mathbf{B}} - \boldsymbol{\Xi}^{Y} (\boldsymbol{\Xi}^{X})^{\mathrm{T}} \right\rVert\_{\mathrm{F}}^{2} &= \left\lVert\tilde{\mathbf{B}} - \boldsymbol{\Xi}^{Y} (\boldsymbol{\Xi}^{X\ast})^{\mathrm{T}} + \left(\boldsymbol{\Xi}^{Y} (\boldsymbol{\Xi}^{X\ast})^{\mathrm{T}} - \boldsymbol{\Xi}^{Y} (\boldsymbol{\Xi}^{X})^{\mathrm{T}} \right) \right\rVert\_{\mathrm{F}}^{2} \\ &= \left\lVert\tilde{\mathbf{B}} - \boldsymbol{\Xi}^{Y} (\boldsymbol{\Xi}^{X\ast})^{\mathrm{T}} \right\rVert\_{\mathrm{F}}^{2} + \left\lVert\boldsymbol{\Xi}^{Y} (\boldsymbol{\Xi}^{X\ast})^{\mathrm{T}} - \boldsymbol{\Xi}^{Y} (\boldsymbol{\Xi}^{X})^{\mathrm{T}} \right\rVert\_{\mathrm{F}}^{2} \\ &\quad + 2 \left\langle \tilde{\mathbf{B}} - \boldsymbol{\Xi}^{Y} (\boldsymbol{\Xi}^{X\ast})^{\mathrm{T}}, \boldsymbol{\Xi}^{Y} \left(\left(\boldsymbol{\Xi}^{X\ast} \right)^{\mathrm{T}} - \left(\boldsymbol{\Xi}^{X} \right)^{\mathrm{T}}\right) \right\rangle \\ &= \left\lVert\tilde{\mathbf{B}} - \boldsymbol{\Xi}^{Y} (\boldsymbol{\Xi}^{X\ast})^{\mathrm{T}} \right\rVert\_{\mathrm{F}}^{2} + \left\lVert\boldsymbol{\Xi}^{Y} (\boldsymbol{\Xi}^{X\ast})^{\mathrm{T}} - \boldsymbol{\Xi}^{Y} (\boldsymbol{\Xi}^{X})^{\mathrm{T}} \right\rVert\_{\mathrm{F}}^{2}, \end{split}$$

which finishes the proof.
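The Pythagorean identity (A52) holds for any matrix in place of **B̃** and any competitor **Ξ**<sup>*X*</sup>, as long as **Ξ**<sup>*X*∗</sup> is the least-squares optimum for the given **Ξ**<sup>*Y*</sup>. A minimal numerical check, with random matrices as stand-ins and the closed form **Ξ**<sup>*X*∗</sup> = **B̃**<sup>T</sup>**Ξ**<sup>*Y*</sup>((**Ξ**<sup>*Y*</sup>)<sup>T</sup>**Ξ**<sup>*Y*</sup>)<sup>−1</sup> assumed for the optimum, could look as follows.

```python
import numpy as np

rng = np.random.default_rng(0)
ny, nx, k = 4, 6, 2
b_tilde = rng.standard_normal((ny, nx))   # stand-in for B~; the identity is purely algebraic
xi_y = rng.standard_normal((ny, k))       # fixed Xi^Y
xi_x = rng.standard_normal((nx, k))       # arbitrary competitor Xi^X

# Least-squares optimal Xi^X for the fixed Xi^Y.
xi_x_star = b_tilde.T @ xi_y @ np.linalg.inv(xi_y.T @ xi_y)

def sq_fro(m):
    """Squared Frobenius norm."""
    return float(np.sum(m ** 2))

lhs = sq_fro(b_tilde - xi_y @ xi_x.T) - sq_fro(b_tilde - xi_y @ xi_x_star.T)
rhs = sq_fro(xi_y @ xi_x_star.T - xi_y @ xi_x.T)
print(lhs, rhs)
```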

Proceeding to our proof of Theorem 5, from Lemma A6 we have

$$\begin{split} &\mathcal{L}(s) - \mathcal{L}(s^{\ast}) \\ &= \frac{1}{2} \left[ \left\lVert \tilde{\mathbf{B}} - \boldsymbol{\Xi}^Y \left( \boldsymbol{\Xi}^X \right)^{\mathrm{T}} \right\rVert\_{\mathrm{F}}^2 - \left\lVert \tilde{\mathbf{B}} - \boldsymbol{\Xi}^Y \left( \boldsymbol{\Xi}^{X\ast} \right)^{\mathrm{T}} \right\rVert\_{\mathrm{F}}^2 \right] + \frac{1}{2} \left[ \eta^{(v,b)}(s) - \eta^{(v,b)}(s^{\ast}) \right] + o(\epsilon^2) \\ &= \frac{1}{2} \left\lVert \boldsymbol{\Xi}^Y \left( \boldsymbol{\Xi}^{X\ast} \right)^{\mathrm{T}} - \boldsymbol{\Xi}^Y \left( \boldsymbol{\Xi}^X \right)^{\mathrm{T}} \right\rVert\_{\mathrm{F}}^2 + \frac{1}{2} \kappa^{(v,b)}(s,s^{\ast}) + o(\epsilon^2), \end{split}$$

where *κ*<sup>(*v*,*b*)</sup>(*s*, *s*<sup>∗</sup>) ≜ *η*<sup>(*v*,*b*)</sup>(*s*) − *η*<sup>(*v*,*b*)</sup>(*s*<sup>∗</sup>). We then optimize ‖**Ξ**<sup>*Y*</sup>(**Ξ**<sup>*X*∗</sup>)<sup>T</sup> − **Ξ**<sup>*Y*</sup>(**Ξ**<sup>*X*</sup>)<sup>T</sup>‖<sub>F</sub><sup>2</sup> and *κ*<sup>(*v*,*b*)</sup>(*s*, *s*<sup>∗</sup>) separately.

For the first term, we need to express **Ξ**<sup>*X*</sup> in terms of **W** and **Ξ**<sub>1</sub><sup>*X*</sup>. From (17), we obtain

$$\mathbb{E}[s\_z(X)] = \sigma(c(z)) + o(\epsilon),\tag{A53}$$

$$\tilde{s}\_z(x) = w^{\mathrm{T}}(z)\tilde{t}(x) \cdot \sigma'(c(z)) + o(\epsilon),\tag{A54}$$

which can be expressed in information vectors as

$$
\boldsymbol{\Xi}^{X} = \boldsymbol{\Xi}\_{1}^{X} \mathbf{W}^{\mathsf{T}} \mathbf{J} + o(\epsilon). \tag{A55}
$$

From Theorem 3, we have

$$
\boldsymbol{\Xi}^{X\ast} = \tilde{\mathbf{B}}^{\mathrm{T}} \boldsymbol{\Xi}^Y \left( \left( \boldsymbol{\Xi}^Y \right)^{\mathrm{T}} \boldsymbol{\Xi}^Y \right)^{-1}. \tag{A56}
$$

As a result, we have

$$\begin{split} \big\| \boldsymbol{\Xi}^Y ( \boldsymbol{\Xi}^{X\*} )^\mathrm{T} - \boldsymbol{\Xi}^Y ( \boldsymbol{\Xi}^X )^\mathrm{T} \big\|\_{\mathrm{F}}^2 &= \big\| \big( ( \boldsymbol{\Xi}^Y )^\mathrm{T} \boldsymbol{\Xi}^Y \big)^{1/2} \big( ( \boldsymbol{\Xi}^{X\*} )^\mathrm{T} - ( \boldsymbol{\Xi}^X )^\mathrm{T} \big) \big\|\_{\mathrm{F}}^2 \\ &= \big\| \big( ( \boldsymbol{\Xi}^Y )^\mathrm{T} \boldsymbol{\Xi}^Y \big)^{1/2} \cdot \big( ( \boldsymbol{\Xi}^{X\*} )^\mathrm{T} - \mathbf{J} \mathbf{W} ( \boldsymbol{\Xi}\_1^X )^\mathrm{T} - o(\epsilon) \big) \big\|\_{\mathrm{F}}^2 \\ &= \big\| \big( ( \boldsymbol{\Xi}^Y )^\mathrm{T} \boldsymbol{\Xi}^Y \big)^{1/2} \cdot \big( ( \boldsymbol{\Xi}^{X\*} )^\mathrm{T} - \mathbf{J} \mathbf{W} ( \boldsymbol{\Xi}\_1^X )^\mathrm{T} \big) \big\|\_{\mathrm{F}}^2 + o(\epsilon^2) \\ &= \big\| \big( ( \boldsymbol{\Xi}^Y )^\mathrm{T} \boldsymbol{\Xi}^Y \big)^{1/2} \mathbf{J} \cdot \big( \mathbf{J}^{-1} ( \boldsymbol{\Xi}^{X\*} )^\mathrm{T} - \mathbf{W} ( \boldsymbol{\Xi}\_1^X )^\mathrm{T} \big) \big\|\_{\mathrm{F}}^2 + o(\epsilon^2) \\ &= \big\| \boldsymbol{\Theta} \tilde{\mathbf{B}}\_1 - \boldsymbol{\Theta} \mathbf{W} ( \boldsymbol{\Xi}\_1^X )^\mathrm{T} \big\|\_{\mathrm{F}}^2 + o(\epsilon^2), \end{split} \tag{A57}$$

where the third equality follows from the fact that [cf. (A41)] $\tilde{s}(x) = O(\epsilon)$ and $\tilde{v}(y) = O(1)$, and the last equality follows from the definitions $\tilde{\mathbf{B}}\_1 \triangleq \mathbf{J}^{-1} ( \boldsymbol{\Xi}^{X\*} )^\mathrm{T}$ and $\boldsymbol{\Theta} \triangleq \big( ( \boldsymbol{\Xi}^Y )^\mathrm{T} \boldsymbol{\Xi}^Y \big)^{1/2} \mathbf{J}$. For the second term, from (A50) and (A51), we have

$$\begin{split} \kappa^{(v,b)}(s,s^\*) &= \left[ (\mu\_s - \mu\_{s^\*}) + \mu\_{s^\*} \right]^\mathrm{T} \boldsymbol{\Lambda}\_{\tilde{v}(Y)} \left[ (\mu\_s - \mu\_{s^\*}) + \mu\_{s^\*} \right] \\ &\quad - \mu\_{s^\*}^\mathrm{T} \boldsymbol{\Lambda}\_{\tilde{v}(Y)} \mu\_{s^\*} + 2(\mu\_s - \mu\_{s^\*})^\mathrm{T} \mathbb{E}\_{P\_{Y}} \left[ \tilde{v}(Y) \tilde{d}(Y) \right] \\ &= \left( \mu\_s - \mu\_{s^\*} \right)^\mathrm{T} \boldsymbol{\Lambda}\_{\tilde{v}(Y)} \left( \mu\_s - \mu\_{s^\*} \right) + 2(\mu\_s - \mu\_{s^\*})^\mathrm{T} \left( \boldsymbol{\Lambda}\_{\tilde{v}(Y)} \mu\_{s^\*} + \mathbb{E}\_{P\_{Y}} \left[ \tilde{v}(Y) \tilde{d}(Y) \right] \right) \\ &= \left( \mu\_s - \mu\_{s^\*} \right)^\mathrm{T} \boldsymbol{\Lambda}\_{\tilde{v}(Y)} \left( \mu\_s - \mu\_{s^\*} \right). \end{split} \tag{A58}$$

Combining (A57) and (A58) finishes the proof.

#### **Appendix G. Analyses of Hidden Layer Parameters**

First, from (A53), the bias $c(z)$ of the hidden layer is

$$c(z) = \sigma^{-1}(\mu\_s^\*(z)) + o(\epsilon).$$

(When $\mu\_t \neq 0$, the formula should be modified to $c(z) = \sigma^{-1}(\mu\_s^\*(z)) - \mu\_t^\mathrm{T} w(z) + o(\epsilon)$.)

To obtain $\mu\_s^\*$, let us define $\sigma\_{\min} \triangleq \inf\_x \sigma(x)$ and $\sigma\_{\max} \triangleq \sup\_x \sigma(x)$. Then, the optimal $\mu\_s$ is the solution of

$$\begin{array}{ll} \underset{\mu\_{s}}{\text{minimize}} & \left(\mu\_{s}-\mu\_{s^\*}\right)^{\mathrm{T}}\boldsymbol{\Lambda}\_{\tilde{v}(Y)}\left(\mu\_{s}-\mu\_{s^\*}\right) \\ \text{subject to} & \sigma\_{\min} \le \mu\_{s} \le \sigma\_{\max}. \end{array} \tag{A59}$$

If $\mu\_{s^\*}$ satisfies the constraint of (A59), then it is the optimal solution. Otherwise, some elements of $\mu\_s^\*$ will become either $\sigma\_{\min}$ or $\sigma\_{\max}$, which is known as the saturation phenomenon [21].

To obtain $\mathbf{W}^\*$, let

$$\begin{aligned} \tilde{\mathbf{B}}'\_1 &\stackrel{\triangle}{=} \boldsymbol{\Theta} \tilde{\mathbf{B}}\_1 = \left( \left( \boldsymbol{\Xi}^{Y} \right)^{\mathrm{T}} \boldsymbol{\Xi}^{Y} \right)^{-1/2} \left( \boldsymbol{\Xi}^{Y} \right)^{\mathrm{T}} \tilde{\mathbf{B}}, \\ \mathbf{W}' &\stackrel{\triangle}{=} \boldsymbol{\Theta} \mathbf{W} = \left( \left( \boldsymbol{\Xi}^{Y} \right)^{\mathrm{T}} \boldsymbol{\Xi}^{Y} \right)^{1/2} \mathbf{J} \mathbf{W}. \end{aligned}$$

Then, the optimal $\mathbf{W}'$ is given by

$$\mathbf{W}^{\prime \*} = \underset{\mathbf{W}^{\prime}}{\text{arg min}} \ \|\tilde{\mathbf{B}}\_{1}^{\prime} - \mathbf{W}^{\prime} (\boldsymbol{\Xi}\_{1}^{X})^{\mathrm{T}} \|\_{\mathrm{F}}^{2} = \tilde{\mathbf{B}}\_{1}^{\prime} \boldsymbol{\Xi}\_{1}^{X} \left(\left(\boldsymbol{\Xi}\_{1}^{X}\right)^{\mathrm{T}} \boldsymbol{\Xi}\_{1}^{X}\right)^{-1}. \tag{A60}$$
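As a brief sketch of why the closed form in (A60) holds, assuming that $(\boldsymbol{\Xi}\_1^X)^\mathrm{T} \boldsymbol{\Xi}\_1^X$ is invertible: the objective is a convex quadratic in $\mathbf{W}'$, so setting its gradient to zero gives the usual normal equations of a linear least-squares problem,

$$-2\left(\tilde{\mathbf{B}}\_{1}^{\prime} - \mathbf{W}^{\prime} (\boldsymbol{\Xi}\_{1}^{X})^{\mathrm{T}}\right)\boldsymbol{\Xi}\_{1}^{X} = \mathbf{0} \quad \Longrightarrow \quad \mathbf{W}^{\prime} (\boldsymbol{\Xi}\_{1}^{X})^{\mathrm{T}} \boldsymbol{\Xi}\_{1}^{X} = \tilde{\mathbf{B}}\_{1}^{\prime} \boldsymbol{\Xi}\_{1}^{X},$$

and right-multiplying by $\left((\boldsymbol{\Xi}\_1^X)^\mathrm{T} \boldsymbol{\Xi}\_1^X\right)^{-1}$ recovers the expression for $\mathbf{W}^{\prime \*}$.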

Hence, $\mathbf{W}^\*$ is given by

$$\begin{split} \mathbf{W}^\* &= \boldsymbol{\Theta}^{-1} \mathbf{W}^{\prime \*} = \boldsymbol{\Theta}^{-1} \tilde{\mathbf{B}}\_1^{\prime} \boldsymbol{\Xi}\_1^X \left( (\boldsymbol{\Xi}\_1^X)^\mathrm{T} \boldsymbol{\Xi}\_1^X \right)^{-1} \\ &= \tilde{\mathbf{B}}\_1 \boldsymbol{\Xi}\_1^X \left( (\boldsymbol{\Xi}\_1^X)^\mathrm{T} \boldsymbol{\Xi}\_1^X \right)^{-1} \\ &= \mathbf{J}^{-1} \cdot \left[ \boldsymbol{\Xi}^Y \left( (\boldsymbol{\Xi}^Y)^\mathrm{T} \boldsymbol{\Xi}^Y \right)^{-1} \right]^\mathrm{T} \tilde{\mathbf{B}} \, \boldsymbol{\Xi}\_1^X \left( (\boldsymbol{\Xi}\_1^X)^\mathrm{T} \boldsymbol{\Xi}\_1^X \right)^{-1}, \end{split}$$

where the term $\tilde{\mathbf{B}} \boldsymbol{\Xi}\_1^X \left( (\boldsymbol{\Xi}\_1^X)^\mathrm{T} \boldsymbol{\Xi}\_1^X \right)^{-1}$ corresponds to a feature projection of $\tilde{t}(X)$:

$$\tilde{\mathbf{B}}\,\boldsymbol{\Xi}\_1^X\left(\left(\boldsymbol{\Xi}\_1^X\right)^\mathrm{T}\boldsymbol{\Xi}\_1^X\right)^{-1}\leftrightarrow\,\mathbb{E}\_{P\_{X|Y}}\left[\boldsymbol{\Lambda}\_{\tilde{t}(X)}^{-1}\tilde{t}(X)\,\,\big|\,\,Y\right].\tag{A61}$$

As a consequence, this multi-layer neural network conducts a generalized feature projection between features extracted from different layers. Note that the projected feature $\mathbb{E}\_{P\_{\tilde{t}|Y}}\left[\boldsymbol{\Lambda}\_{\tilde{t}}^{-1}\tilde{t}\,\big|\,Y\right]$ depends only on the distribution $P\_{\tilde{t}|Y}$ and does not depend on the distribution $P\_{X|Y}$. Therefore, the above computations can be accomplished without knowing the hidden random variable $X$ and can be applied to general cases.

#### **References**


## *Article* **Population Risk Improvement with Model Compression: An Information-Theoretic Approach †**

**Yuheng Bu 1,‡, Weihao Gao 2, Shaofeng Zou <sup>3</sup> and Venugopal V. Veeravalli 1,\***


**Abstract:** It has been reported in many recent works on deep model compression that the population risk of a compressed model can be even better than that of the original model. In this paper, an information-theoretic explanation for this population risk improvement phenomenon is provided by jointly studying the decrease in the generalization error and the increase in the empirical risk that results from model compression. It is first shown that model compression reduces an information-theoretic bound on the generalization error, which suggests that model compression can be interpreted as a regularization technique to avoid overfitting. The increase in empirical risk caused by model compression is then characterized using rate distortion theory. These results imply that the overall population risk could be improved by model compression if the decrease in generalization error exceeds the increase in empirical risk. A linear regression example is presented to demonstrate that such a decrease in population risk due to model compression is indeed possible. Our theoretical results further suggest a way to improve a widely used model compression algorithm, i.e., Hessian-weighted *K*-means clustering, by regularizing the distance between the clustering centers. Experiments with neural networks are provided to validate our theoretical assertions.

**Keywords:** empirical risk; generalization error; K-means clustering; model compression; population risk; rate distortion theory; vector quantization

#### **1. Introduction**

Although deep neural networks have achieved remarkable success in various domains [1], e.g., computer vision [2], playing games like Go [3], and autonomous driving [4], the improvement of the performance of deep models often comes with deeper layers and more complex network structures, which usually have a large number of parameters. For example, in the application of image classification, it takes over 200 MB to save the parameters of AlexNet [2] and more than 500 MB for VGG-16 net [5]. Hence, it is difficult to port such large models to resource-limited devices such as mobile devices and embedded systems, due to their limited storage, bandwidth, energy, and computational resources.

**Citation:** Bu, Y.; Gao, W.; Zou, S.; Veeravalli, V.V. Population Risk Improvement with Model Compression: An Information-Theoretic Approach. *Entropy* **2021**, *23*, 1255. https://doi.org/10.3390/e23101255

Academic Editors: Lizhong Zheng and Chao Tian

Received: 13 August 2021 Accepted: 23 September 2021 Published: 27 September 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

For this reason, there has been a flurry of work on compressing deep neural networks (see [6–8] for recent surveys). Existing studies mainly focus on designing compression algorithms to reduce the memory and computational cost, while keeping the same level of population risk. In some recent papers [9–12], aggressive model compression algorithms have been proposed, which require 10% or fewer bits to store the compressed model compared to the storage required by the original model. Surprisingly, it has been observed empirically in these works that the population risk of the compressed model can often be even *better* than that of the original model. This phenomenon is counter-intuitive at first glance, since more compression generally leads to more information loss.

Indeed, a compressed model would usually have a larger empirical risk than the original one, since machine learning methods are usually trained by minimizing the empirical risk. On the other hand, model compression could possibly decrease the generalization error, since it can be interpreted as a regularization technique to avoid overfitting. As the population risk is the sum of the empirical risk and the generalization error, it is possible for the population risk to be reduced by model compression.

#### *1.1. Contributions*

In this paper, we provide an information-theoretic explanation for the population risk improvement with model compression by jointly characterizing the decrease in generalization error and the increase in empirical risk. Specifically, we focus on the case where the model is compressed based on a pre-trained model.

We first prove that model compression leads to a tightening of the information-theoretic generalization error bound in [13], and it can therefore be interpreted as a regularization method to reduce overfitting. Furthermore, by defining a distortion metric based on the difference in the empirical risk between the original model obtained by empirical risk minimization (ERM) and compressed models, we use rate distortion theory to characterize the increase in empirical risk as a function of the number of bits *R* used to describe the model. If the decrease in generalization error exceeds the increase in empirical risk, the population risk can be improved. An empirical illustration of this result for the MNIST dataset is provided in Figure 1, where model compression can lead to population risk improvement (details are given in Section 7). To better demonstrate our theoretical results, we investigate the example of linear regression comprehensively, where we develop explicit bounds on the generalization error and the increase in empirical risk.

**Figure 1.** Population risk of the compressed model $\hat{W}$ and the original model $W$ vs. compression ratio (ratio of the number of bits used for the compressed model to the number of bits used for the original model). The generalization error of $\hat{W}$ decreases and the empirical risk of $\hat{W}$ increases with more compression (smaller compression ratio). The population risk of $\hat{W}$ is less than that of $W$ for compression ratios larger than 6% in this figure. As the compression ratio goes to 100% (no compression), the population risk of $\hat{W}$ will converge to that of the original model $W$.

Our results also suggest a way to improve a compression method based on Hessian-weighted *K*-means clustering [11], in both the scalar and vector cases, by regularizing the distance between the clustering centers. Our experiments with neural networks validate our theoretical assertions and demonstrate the effectiveness of the proposed regularizer.

#### *1.2. Related Works*

There have been many studies on model compression for deep neural networks. The compression could be achieved by varying the training process, e.g., network structure optimization [14], low precision neural networks [15], and neural networks with binary weights [16,17]. Here we mainly discuss compression approaches that are applied to a pre-trained model.

Pruning, quantization, and matrix factorization are the most popular approaches to compressing pre-trained deep neural networks. The study of pruning algorithms for model compression, which remove redundant parameters from neural networks, dates back to the 1980s and 1990s [18–20]. More recently, an iterative pruning and retraining algorithm to further reduce the size of deep models was proposed in [9,21]. The method of network quantization or weight sharing, i.e., employing a clustering algorithm to group the weights in a neural network, and its variants, including vector quantization [22], soft quantization [23,24], fixed point quantization [25], transform quantization [26], and Hessian-weighted quantization [11], have been extensively investigated. Matrix factorization, where a low-rank approximation of the weights in neural networks is used instead of the original weight matrix, has also been widely studied [27–29].

All of the aforementioned works demonstrate the effectiveness of their compression methods via comprehensive numerical experiments. Little research has been done to develop a theoretical understanding of how model compression affects performance. In [30], an information-theoretic view of model compression via rate-distortion theory is provided, with the focus on characterizing the tradeoff between model compression and only the *empirical risk* of the compressed model. In [31–33], using a PAC-Bayesian framework, a non-vacuous generalization error bound for the compressed model is derived based on its smaller model complexity.

In contrast to these works, instead of focusing on minimizing only the empirical risk as in [30], or minimizing only the generalization error as in [33], we use the mutual information based generalization error bound developed in [13,34] jointly with rate distortion theory to connect analyses of generalization error and empirical risk. This way, we are able to characterize the tradeoff between the decrease in generalization error and the increase in empirical risk that results from model compression, and thus provide an understanding as to why model compression can improve the population risk. More importantly, our theoretical studies offer insights on designing practical model compression algorithms.

The rest of the paper is organized as follows. In Section 2, we provide relevant definitions and review relevant results from rate distortion theory. In Section 3, we prove that model compression results in the tightening of an information-theoretic generalization error upper bound. In Section 4, we use rate distortion theory to characterize the tradeoff between the increase in empirical risk and the decrease in generalization error that results from model compression. In Section 5, we quantify this tradeoff for a linear regression model. In Section 6, we discuss how the Hessian-weighted K-means clustering compression approach can be improved by using a regularizer motivated by our theoretical results. In Section 7, we provide some experiments with neural network models to validate our theoretical results and demonstrate the effectiveness of the proposed regularizer.

**Notation 1.** *For a random variable $X$ generated from a distribution $\mu$, we use $\mathbb{E}\_{X \sim \mu}$ to denote the expectation taken over $X$ with distribution $\mu$. We use $I\_d$ to denote the $d$-dimensional identity matrix and $\|A\|$ to denote the spectral norm of a matrix $A$. The cumulant generating function (CGF) of a random variable $X$ is defined as $\Lambda\_X(\lambda) \triangleq \ln \mathbb{E}[e^{\lambda(X - \mathbb{E}X)}]$. All logarithms are the natural ones.*

#### **2. Preliminaries**

#### *2.1. Review of Rate Distortion Theory*

Rate distortion theory, introduced by Shannon [35], is a major branch of information theory that studies the fundamental limits of lossy data compression. It addresses the minimal number of bits per symbol, as measured by the rate *R*, to transmit a random variable *W* such that the receiver can reconstruct *W* without exceeding distortion *D*.

Specifically, let $W^m = \{W\_1, W\_2, \cdots, W\_m\}$ denote a sequence of $m$ i.i.d. random variables $W\_i \in \mathcal{W}$ generated from a source distribution $P\_W$. An encoder $f\_m : \mathcal{W}^m \to \{1, 2, \cdots, M\}$ maps the message $W^m$ into a codeword, and a decoder $g\_m : \{1, 2, \cdots, M\} \to \hat{\mathcal{W}}^m$ reconstructs the message by an estimate $\hat{W}^m$ from the codeword, where $\hat{\mathcal{W}} \subseteq \mathcal{W}$ denotes the range of $\hat{W}$. A distortion metric $d : \mathcal{W} \times \mathcal{W} \to \mathbb{R}^+$ quantifies the difference between the original and reconstructed messages. The distortion between sequences $w^m$ and $\hat{w}^m$ is defined to be

$$d(w^m, \hat{w}^m) \triangleq \frac{1}{m} \sum\_{i=1}^m d(w\_{i}, \hat{w}\_i). \tag{1}$$

A commonly used distortion metric is the square distortion: $d(w, \hat{w}) = (w - \hat{w})^2$.

**Definition 1.** *An $(m, M, D)$-triple is achievable if there exists a (probabilistic) encoder-decoder pair $(f\_m, g\_m)$ such that the alphabet of the codeword has size $M$ and the expected distortion satisfies $\mathbb{E}[d(W^m, g\_m(f\_m(W^m)))] \le D$.*

Now we define the following rate-distortion and distortion-rate function for lossy data compression.

**Definition 2.** *The rate-distortion function and the distortion-rate function are defined as*

$$R(D) \stackrel{\triangle}{=} \lim\_{m \to \infty} \frac{1}{m} \log\_2 M^\*(m, D), \tag{2}$$

$$D(R) \stackrel{\triangle}{=} \lim\_{m \to \infty} D^\*(m, R), \tag{3}$$

*where $M^\*(m, D) \triangleq \min\{M : (m, M, D) \text{ is achievable}\}$ and $D^\*(m, R) \triangleq \min\{D : (m, 2^{mR}, D) \text{ is achievable}\}$.*

The main theorem of rate distortion theory is as follows.

**Lemma 1** ([36])**.** *For an i.i.d. source $W$ with distribution $P\_W$ and distortion function $d(w, \hat{w})$:*

$$R(D) = \min\_{P\_{\hat{W}|W} : \mathbb{E}[d(W, \hat{W})] \le D} I(W; \hat{W}), \tag{4}$$

$$D(R) = \min\_{P\_{\hat{W}|W}: I(W; \hat{W}) \le R} \mathbb{E}[d(W, \hat{W})],\tag{5}$$

*where $I(W; \hat{W}) \triangleq \mathbb{E}\_{W,\hat{W}}\Big[\ln \frac{P\_{W,\hat{W}}}{P\_W P\_{\hat{W}}}\Big]$ denotes the mutual information between $W$ and $\hat{W}$.*

The rate-distortion function quantifies the smallest number of bits required to compress the data given the distortion, and the distortion-rate function quantifies the minimal distortion that can be achieved under the rate constraint.
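As a concrete illustration of Lemma 1, the following minimal sketch evaluates the standard textbook closed forms for a scalar Gaussian source under the square distortion, $R(D) = \max\{0, \frac{1}{2}\ln(\sigma^2/D)\}$ and $D(R) = \sigma^2 e^{-2R}$. This example is not taken from the paper: the function names and the numerical values are assumptions chosen for illustration, and rates are measured in nats.

```python
import math


def gaussian_rate(variance: float, distortion: float) -> float:
    """R(D) in nats for a N(0, variance) source under square distortion."""
    if distortion <= 0:
        raise ValueError("distortion must be positive")
    # No rate is needed once the allowed distortion reaches the source variance.
    return max(0.0, 0.5 * math.log(variance / distortion))


def gaussian_distortion(variance: float, rate: float) -> float:
    """D(R) for the same source, with the rate given in nats."""
    if rate < 0:
        raise ValueError("rate must be non-negative")
    return variance * math.exp(-2.0 * rate)


# Illustrative values (assumptions, not from the paper).
variance = 4.0
rate = gaussian_rate(variance, distortion=1.0)   # 0.5 * ln(4) nats
bits = rate / math.log(2)                        # convert nats to bits
recovered = gaussian_distortion(variance, rate)  # maps the rate back to D
```

The two functions are inverses of each other on the range $0 < D \le \sigma^2$, which mirrors the relationship between (4) and (5).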

#### *2.2. Generalization Error*

Consider an instance space $\mathcal{Z}$, a hypothesis space $\mathcal{W}$, and a non-negative loss function $\ell : \mathcal{W} \times \mathcal{Z} \to \mathbb{R}^+$. A training dataset $S = \{Z\_1, \cdots, Z\_n\}$ consists of $n$ i.i.d. samples $Z\_i \in \mathcal{Z}$ drawn from an unknown distribution $\mu$. The goal of a supervised learning algorithm is to find an output hypothesis $w \in \mathcal{W}$ that minimizes the population risk:

$$L\_{\mu}(w) \triangleq \mathbb{E}\_{Z \sim \mu}[\ell(w, Z)]. \tag{6}$$

In practice, *μ* is unknown, and therefore *Lμ*(*w*) cannot be computed directly. Instead, the empirical risk of *w* on the training dataset *S* is studied, which is defined as

$$L\_S(w) \triangleq \frac{1}{n} \sum\_{i=1}^n \ell(w, Z\_i). \tag{7}$$

A learning algorithm can be characterized by a randomized mapping from the training dataset $S$ to a hypothesis $W$ according to a conditional distribution $P\_{W|S}$. The (expected) generalization error of a supervised learning algorithm is the expected difference between the population risk of the output hypothesis and its empirical risk on the training dataset:

$$\text{gen}(\mu, P\_{W|S}) \triangleq \mathbb{E}\_{W,S} [L\_{\mu}(W) - L\_{S}(W)],\tag{8}$$

where the expectation is taken over the joint distribution $P\_{S,W} = P\_S \otimes P\_{W|S}$. The generalization error is used to measure the extent to which the learning algorithm overfits the training data.

#### **3. Compression Can Improve Generalization**

In this section, we show that lossy compression can lead to a tighter mutual information based generalization error upper bound, which potentially reduces the generalization error of a supervised learning algorithm.

We start from the following lemma which provides an upper bound on the generalization error using the mutual information *I*(*S*; *W*) between training dataset *S* and the output of the learning algorithm *W*.

**Lemma 2** ([13])**.** *Suppose $\ell(w, Z)$ is $\sigma$-sub-Gaussian under $Z \sim \mu$ for all $w \in \mathcal{W}$ (a random variable $X$ is $\sigma$-sub-Gaussian if $\Lambda\_X(\lambda) \le \frac{\sigma^2 \lambda^2}{2}$ for all $\lambda \in \mathbb{R}$), then*

$$|\text{gen}(\mu, P\_{W|S})| \leq \sqrt{\frac{2\sigma^2}{n}I(S;W)}.\tag{9}$$

Compression can be viewed as a post-processing of the output of a learning algorithm. The output model $W$ generated by a learning algorithm can be quantized, pruned, factorized, or even perturbed by noise, which results in a compressed model $\hat{W}$. Assume that the compression algorithm is only based on $W$ and can be described by a conditional distribution $P\_{\hat{W}|W}$. Then the following Markov chain holds: $S \to W \to \hat{W}$. By the data processing inequality,

$$I(S; \hat{W}) \le \min\{I(W; \hat{W}), I(S; W)\}.$$

Thus, we have the following theorem characterizing the generalization error of the compressed model.

**Theorem 1.** *Consider a learning algorithm $P\_{W|S}$, a compression algorithm $P\_{\hat{W}|W}$, and suppose $\ell(\hat{w}, Z)$ is $\sigma$-sub-Gaussian under $Z \sim \mu$ for all $\hat{w} \in \hat{\mathcal{W}}$. Then*

$$|\text{gen}(\mu, P\_{\hat{W}|S})| \le \sqrt{\frac{2\sigma^2}{n} \min\{I(W;\hat{W}), I(S;W)\}}.\tag{10}$$

Note that the generalization error upper bound in Theorem 1 for the compressed model is always no greater than the one in Lemma 2. This allows for the interpretation of compression as a regularization technique to reduce the generalization error.
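The data processing step behind Theorem 1 can be checked numerically on a small discrete Markov chain $S \to W \to \hat{W}$. The sketch below is only an illustration: the three distributions, the sub-Gaussian parameter `sigma`, and the sample size `n` are hypothetical toy values, and the helper names are assumptions rather than anything defined in the paper.

```python
import math


def mutual_information(joint):
    """I(A;B) in nats from a joint pmf given as a nested list joint[a][b]."""
    pa = [sum(row) for row in joint]
    pb = [sum(col) for col in zip(*joint)]
    return sum(
        p * math.log(p / (pa[i] * pb[j]))
        for i, row in enumerate(joint)
        for j, p in enumerate(row)
        if p > 0
    )


def joint_from(prior, channel):
    """Joint pmf of (A, B) from P_A and a row-stochastic P_{B|A}."""
    return [[pa * q for q in row] for pa, row in zip(prior, channel)]


def compose(first, second):
    """Channel A -> C obtained by chaining A -> B and B -> C."""
    return [
        [sum(row[b] * second[b][c] for b in range(len(second)))
         for c in range(len(second[0]))]
        for row in first
    ]


# Hypothetical toy distributions (assumptions for illustration only).
p_s = [0.5, 0.5]                                  # two possible datasets
learn = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]        # P_{W|S}: three models
compress = [[0.7, 0.3], [0.5, 0.5], [0.3, 0.7]]   # P_{W_hat|W}: two models

p_w = [sum(p_s[s] * learn[s][w] for s in range(2)) for w in range(3)]

i_s_w = mutual_information(joint_from(p_s, learn))
i_w_what = mutual_information(joint_from(p_w, compress))
i_s_what = mutual_information(joint_from(p_s, compose(learn, compress)))

sigma, n = 1.0, 100  # hypothetical sub-Gaussian parameter and sample size
bound_original = math.sqrt(2 * sigma**2 / n * i_s_w)                    # (9)
bound_compressed = math.sqrt(2 * sigma**2 / n * min(i_w_what, i_s_w))   # (10)
```

In this toy chain the noisy compression channel carries less information than the learning algorithm itself, so the bound in (10) is strictly smaller than the one in (9).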

#### **4. Generalization Error and Model Distortion**

In this section, we define a distortion metric in model compression that allows us to relate the distortion (the increase in empirical risk) due to compression with the reduction in the generalization error bound discussed in Section 3.

#### *4.1. Distortion Metric in Model Compression*

The expected population risk of a model *W* can be written as

$$\mathbb{E}\_{W}[L\_{\mu}(W)] = \mathbb{E}[L\_{S}(W)] + \text{gen}(\mu, P\_{W|S}),\tag{11}$$

where the first term, which is the expected empirical risk, reflects how well the model *W* fits the training data, while the second term demonstrates how well the model generalizes. In the empirical risk minimization framework, we control both terms by (1) minimizing the empirical risk of *W* directly or using other stochastic optimization algorithms, and (2) using regularization methods to control the generalization error, e.g., early stopping and dropout [1].

Now, consider the expected population risk of the compressed model $\hat{W}$:

$$\begin{split} \mathbb{E}\_{\hat{W}}[L\_{\mu}(\hat{W})] &= \mathbb{E}[L\_{\mu}(\hat{W}) - L\_{S}(\hat{W}) + L\_{S}(\hat{W}) - L\_{S}(W) + L\_{S}(W)] \\ &= \mathbb{E}[L\_{S}(W)] + \text{gen}(\mu, P\_{\hat{W}|S}) + \mathbb{E}[L\_{S}(\hat{W}) - L\_{S}(W)]. \end{split} \tag{12}$$

Compared with (11), we note that the first empirical risk term is independent of the compression algorithm, the second generalization error term can be upper bounded by Theorem 1, and the third term $\mathbb{E}[L\_S(\hat{W}) - L\_S(W)]$ quantifies the increase in the empirical risk if we use the compressed model $\hat{W}$ instead of the original model $W$. We then define the following distortion metric for model compression:

$$d\_S(w, \hat{w}) \stackrel{\triangle}{=} L\_S(\hat{w}) - L\_S(w), \tag{13}$$

which is the difference in the empirical risk between the compressed model $\hat{W}$ and the original model $W$. In general, the function $d\_S(w, \hat{w})$ is not always non-negative. However, for the ERM solution $W$, which is obtained by minimizing the empirical risk $L\_S(W)$, we have $d\_S(w, \hat{w}) \ge 0$, which ensures that $d\_S(w, \hat{w})$ is a valid distortion metric. By Theorem 1, it follows that

$$\mathbb{E}\_{S,W,\hat{W}}[L\_{\mu}(\hat{W}) - L\_{S}(W)] \le \sqrt{\frac{2\sigma^{2}}{n} I(W;\hat{W})} + \mathbb{E}\_{S,W,\hat{W}}[d\_{S}(\hat{W},W)] \triangleq \mathcal{L}\_{S,W}(P\_{\hat{W}|W}), \tag{14}$$

where $\mathcal{L}\_{S,W}(P\_{\hat{W}|W})$ is an upper bound on the expected difference between the population risk of $\hat{W}$ and the empirical risk of the original model $W$ on training dataset $S$. Note that $L\_S(W)$ is independent of the compression algorithm. Therefore, the bound in (14) can be viewed as an upper bound on the population risk of the compressed model $\hat{W}$.

#### *4.2. Population Risk Improvement*

By Lemma 1, the smallest distortion that can be achieved at rate $R$ is $D(R) = \min\_{I(W;\hat{W}) \le R} \mathbb{E}\_{S,W,\hat{W}}[d\_S(\hat{W}, W)]$. Thus, the tightest bound in (14) that can be achieved at rate $R$ is given in the following theorem.

**Theorem 2.** *Suppose the assumptions in Theorem 1 hold, $P\_{W|S}$ minimizes the empirical risk $L\_S(W)$, and $I(W; \hat{W}) = R$, then*

$$\min\_{P\_{\hat{W}|W}: I(W; \hat{W}) = R} \mathbb{E}\_{S, W, \hat{W}} [L\_{\mu}(\hat{W}) - L\_{S}(W)] \le \sqrt{\frac{2\sigma^2}{n} R} + D(R). \tag{15}$$

From the properties of the distortion-rate function [36], we know that $D(R)$ is a decreasing function of $R$. Thus, as $R$ decreases, the first term in (15), which corresponds to the generalization error, decreases, while the second term, which corresponds to the empirical risk, increases. Due to this tradeoff, it may be possible for the bound in (15) to be smaller due to compression, i.e., using a smaller rate $R$. This indicates that the population risk could improve with a compression algorithm that minimizes the upper bound $\mathcal{L}\_{S,W}(P\_{\hat{W}|W})$.

**Remark 1.** *In order to conclude definitively that the population risk can be improved with compression, we need to find a lower bound (as a function of $R$) to match (at least in the order sense) the upper bound in Theorem 2. This appears to be difficult to construct in general. One approach might be to use the same decomposition as in (12) and develop lower bounds for $\min\_{I(W;\hat{W}) = R} \text{gen}(\mu, P\_{\hat{W}|S})$ and $\min\_{I(W;\hat{W}) = R} \mathbb{E}\_{S,W,\hat{W}}[d\_S(\hat{W}, W)]$ independently. However, such an approach runs into the following issues: (1) such a lower bound would be loose since the compression algorithm $P\_{\hat{W}|W}$ that minimizes generalization error, the one that minimizes the distortion, and the one that minimizes the sum of the two can be quite different; and (2) a lower bound for generalization error needs to be developed, which appears to be difficult, with existing literature mainly focusing on lower bounding the excess risk, e.g., [37].*

As will be shown in Section 7, we can actually improve the population risk with a well designed compression algorithm in practical applications.
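To make the tradeoff in Theorem 2 concrete, the sketch below evaluates the right-hand side of (15) over a grid of rates and locates the rate with the smallest bound. The distortion-rate curve $D(R) = D\_0 e^{-2R/d}$ used here is a hypothetical stand-in with an exponential decay, and `sigma`, `n`, `d0`, and `d` are illustrative values rather than quantities from the paper.

```python
import math


def risk_bound(rate, sigma, n, distortion_rate):
    """Right-hand side of (15): sqrt(2 * sigma^2 * R / n) + D(R)."""
    return math.sqrt(2.0 * sigma**2 * rate / n) + distortion_rate(rate)


# Hypothetical distortion-rate curve (an assumption for illustration).
d0, d = 1.0, 10


def toy_distortion_rate(rate):
    return d0 * math.exp(-2.0 * rate / d)


sigma, n = 1.0, 1000
rates = [0.5 * k for k in range(121)]  # R from 0 to 60 nats in steps of 0.5
bounds = [risk_bound(r, sigma, n, toy_distortion_rate) for r in rates]

# Grid search for the rate that minimizes the upper bound.
best_index = min(range(len(rates)), key=bounds.__getitem__)
best_rate, best_bound = rates[best_index], bounds[best_index]
```

With these toy values the minimizer lies strictly inside the grid: compressing all the way to $R = 0$ costs too much distortion, while very large rates are penalized by the generalization term. This only concerns the upper bound; as Remark 1 notes, it does not by itself establish a matching lower bound.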

#### **5. Example: Linear Regression**

In this section, we comprehensively explore the example of linear regression to get a better understanding of the results in Section 4. To this end, we develop explicit upper bounds for the generalization error and the distortion-rate function $D(R)$. All the proofs of the lemmas and theorems are provided in Appendices A–D.

Suppose that the dataset $S = \{Z\_1, \cdots, Z\_n\} = \{(X\_1, Y\_1), \cdots, (X\_n, Y\_n)\}$ is generated from the following linear model with weight vector $w^\* = (w^\*(1), \cdots, w^\*(d)) \in \mathbb{R}^d$,

$$Y\_i = X\_i^\top w^\* + \varepsilon\_{i}, \quad i = 1, \cdots, n,\tag{16}$$

where the $X\_i$'s are i.i.d. $d$-dimensional random vectors with distribution $\mathcal{N}(0, \Sigma\_X)$, and $\varepsilon\_i \sim \mathcal{N}(0, \sigma'^2)$ denotes i.i.d. Gaussian noise. We adopt the mean squared error as the loss function, and the corresponding empirical risk on $S$ is

$$L\_S(w) = \frac{1}{n} \sum\_{i=1}^n (Y\_i - X\_i^\top w)^2 = \frac{1}{n} \|Y - X^\top w\|\_{2}^2, \tag{17}$$

for $w \in \mathcal{W} = \mathbb{R}^d$, where $X \in \mathbb{R}^{d \times n}$ denotes all the input samples, and $Y \in \mathbb{R}^n$ denotes the responses. If $n > d$, the ERM solution is

$$W = (XX^\top)^{-1}XY,\tag{18}$$

which is deterministic given *S*. Its generalization error can be computed exactly as in the following lemma (see Appendix A for detailed proof).

**Lemma 3.** *If n* > *d* + 1*, then*

$$\text{gen}(\mu, P\_{W|S}) = \frac{\sigma'^2 d}{n} \Big(2 + \frac{d+1}{n-d-1}\Big).\tag{19}$$
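As a sanity check on Lemma 3, the following sketch compares the closed form (19) with a Monte Carlo estimate of $\mathbb{E}[L\_\mu(W) - L\_S(W)]$ in the scalar case $d = 1$. The function names, the seed, and all numerical values are assumptions made for illustration; the simulation uses only the model (16) and the ERM solution (18).

```python
import random


def lemma3_gen(noise_var, d, n):
    """Closed form (19) for the generalization error of the ERM solution."""
    return noise_var * d / n * (2.0 + (d + 1) / (n - d - 1))


def simulate_gen(noise_var, input_var, w_star, n, trials, seed=0):
    """Monte Carlo estimate of E[L_mu(W) - L_S(W)] for scalar inputs (d = 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        xs = [rng.gauss(0.0, input_var ** 0.5) for _ in range(n)]
        ys = [x * w_star + rng.gauss(0.0, noise_var ** 0.5) for x in xs]
        # ERM solution (18) in one dimension.
        w = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
        empirical = sum((y - x * w) ** 2 for x, y in zip(xs, ys)) / n
        # For zero-mean inputs, L_mu(w) = noise_var + input_var * (w - w_star)^2.
        population = noise_var + input_var * (w - w_star) ** 2
        total += population - empirical
    return total / trials


# Illustrative values (assumptions, not from the paper).
exact = lemma3_gen(noise_var=1.0, d=1, n=10)
estimate = simulate_gen(noise_var=1.0, input_var=2.0, w_star=0.7, n=10, trials=20000)
```

Note that (19) depends neither on $w^\*$ nor on $\Sigma\_X$, so the choices of `w_star` and `input_var` should not affect the estimate beyond sampling noise.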

#### *5.1. Information-Theoretic Generalization Bounds for Compressed Linear Model*

We note that the mutual information based bound in Lemma 2 is not applicable for this linear regression model, since $W$ is a deterministic function of $S$, and $I(S; W) = \infty$. However, this issue can be resolved if we post-process the ERM solution $W$ by a compression algorithm and upper bound the generalization error by $I(\hat{W}; W)$ as shown in Theorem 1.

Consider a compression algorithm, which maps the original weights $W \in \mathbb{R}^d$ to the compressed model $\hat{W} \in \hat{\mathcal{W}} \subseteq \mathbb{R}^d$. For a fixed and compact $\hat{\mathcal{W}}$, we define

$$C(w^\*) \triangleq \sup\_{\hat{w} \in \hat{\mathcal{W}}} \|\hat{w} - w^\*\|\_{2}^2, \tag{20}$$

which measures the largest distance between the reconstruction $\hat{w}$ and the optimal weights $w^\*$. The following proposition provides an upper bound on the generalization error of the compressed model $\hat{W}$, and the detailed proof is provided in Appendix B.

**Proposition 1.** *Consider the ERM solution $W = (XX^\top)^{-1}XY$, and suppose $\hat{\mathcal{W}}$ is compact, then*

$$\text{gen}(\mu, P\_{\hat{W}|S}) \le 2\sigma\_{\ell}^{\*2} \sqrt{\frac{I(W;\hat{W})}{n}},\tag{21}$$

*where $\sigma\_{\ell}^{\*2} \triangleq C(w^\*) \|\Sigma\_X\| + \sigma'^2$.*

#### *5.2. Distortion-Rate Function for Linear Model*

We now provide an upper bound on the distortion-rate function *D*(*R*) for the linear regression model. Note that $\nabla L_S(W) = 0$, since *W* minimizes the empirical risk. The Hessian matrix of the loss function is

$$H\_S(W) = \frac{1}{n}XX^\top,\tag{22}$$

which is not a function of *W*. Then, the distortion function can be written as:

$$\begin{split} \mathbb{E}_{S,W,\hat{W}}[d_S(\hat{W},W)] &= \mathbb{E}_{S,W,\hat{W}}[L_S(\hat{W}) - L_S(W)] \\ &= \mathbb{E}_{S,W,\hat{W}}\left[(\hat{W} - W)^{\top} \frac{1}{n} XX^{\top} (\hat{W} - W)\right]. \end{split} \tag{23}$$
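The identity inside (23) holds exactly for every dataset, because the cross term vanishes at the ERM solution. The sketch below checks this numerically for one random dataset; the dimensions, the seed, and the perturbation standing in for a compressed model are arbitrary assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 20                                   # illustrative sizes with n > d
X = rng.standard_normal((d, n))                # columns are the samples
Y = X.T @ rng.standard_normal(d) + rng.standard_normal(n)

def empirical_risk(w):
    """L_S(w) = (1/n) * ||Y - X^T w||^2 for the squared loss."""
    return np.mean((Y - X.T @ w) ** 2)

W = np.linalg.solve(X @ X.T, X @ Y)            # ERM solution (18)
W_hat = W + 0.1 * rng.standard_normal(d)       # hypothetical compressed weights

delta = W_hat - W
distortion_direct = empirical_risk(W_hat) - empirical_risk(W)
distortion_quadratic = delta @ (X @ X.T / n) @ delta
gradient_at_erm = X @ (Y - X.T @ W)            # zero up to rounding, so the cross term drops out
```

Both distortion values coincide up to floating-point rounding, which is why no Taylor remainder appears in (23).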

The following proposition characterizes upper bounds for *R*(*D*) and *D*(*R*) for linear regression.

**Proposition 2.** *For the ERM solution* $W = (XX^\top)^{-1}XY$*, we have*

$$R(D) \le \frac{d}{2} \left( \ln \frac{d\sigma'^2}{(n-d-1)D} \right)^+, \quad D \ge 0,\tag{24}$$

$$D(R) \le \frac{d\sigma^{\prime 2}}{n - d - 1} e^{-\frac{2R}{d}}, \quad R \ge 0,\tag{25}$$

*where* $(x)^+ = \max\{0, x\}$*.*

**Proof sketch.** The proof of the upper bound for *R*(*D*) is based on considering a Gaussian random vector which has the same mean and covariance matrix as *W*. In addition, the upper bound is achieved when $W - \hat{W}$ is independent of the dataset *S* with the following conditional distribution,

$$P_{\hat{W}|W} = \mathcal{N}\left((1-\alpha)W + \alpha w^*, (1-\alpha)\frac{D}{d}\Sigma_X^{-1}\right),\tag{26}$$

where $\alpha \triangleq \frac{nD}{d\sigma'^2} \le 1$. Note that this "compression algorithm" requires the knowledge of the optimal weights $w^*$, which are unknown in practice.

The details can be found in Appendix C.
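The two bounds in Proposition 2 are inverses of each other wherever the rate bound is positive. A minimal sketch of (24) and (25), with rates measured in nats to match the natural logarithm; the function and argument names are illustrative:

```python
import math

def rate_upper_bound(D, n, d, noise_var):
    """Upper bound (24) on R(D) in nats; D must be positive and n > d + 1."""
    return max(0.0, d / 2 * math.log(d * noise_var / ((n - d - 1) * D)))

def distortion_upper_bound(R, n, d, noise_var):
    """Upper bound (25) on D(R), with R >= 0 in nats."""
    return d * noise_var / (n - d - 1) * math.exp(-2 * R / d)
```

With the values used in Section 5.3 (*d* = 50, *n* = 80, unit noise variance), `distortion_upper_bound(0.0, 80, 50, 1.0)` evaluates to 50/29, and any distortion above that level gives a rate bound of zero.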

**Remark 2.** *As shown in [38], if* $n > d/\epsilon^2$*,* $\|\frac{1}{n}XX^\top - \Sigma_X\| \le \epsilon$ *holds with high probability. Then, the following lower bound on R*(*D*) *holds if we can approximate* $\frac{1}{n}XX^\top$ *in* (23) *using* $\Sigma_X$*,*

$$R(D) \gtrsim \frac{d}{2} \left( \ln \frac{d\sigma'^2}{(n-d-1)D} \right)^+ - D(P_W \| P_{W_G}), \tag{27}$$

*where* $W_G$ *denotes a Gaussian random vector with the same mean and covariance matrix as W, and* $D(P_W \| P_{W_G})$ *is the Kullback–Leibler divergence between the two distributions. The details can be found in Appendix D.*

Combining Propositions 1 and 2, we have the following result.

**Corollary 1.** *Under the same assumptions as in Proposition 1, we have*

$$\min_{P_{\hat{W}|W}:\, I(W; \hat{W}) = R} \mathbb{E}_{S, W, \hat{W}} [L_{\mu}(\hat{W}) - L_{S}(W)] \le 2\sigma_{\ell}^{*2} \sqrt{\frac{R}{n}} + \frac{d\sigma'^2}{n - d - 1} e^{-\frac{2R}{d}}, \quad R \ge 0. \tag{28}$$

In (28) the first term corresponds to the generalization error, which decreases with compression, and the second term corresponds to the empirical risk, which increases with compression.
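The opposite behavior of the two terms in (28) can be tabulated directly. The sketch below evaluates both terms on a grid of rates (in nats); the value chosen for $\sigma_\ell^{*2}$ is a hypothetical placeholder, since it depends on $C(w^*)$ and $\|\Sigma_X\|$, which are problem-specific.

```python
import numpy as np

def risk_gap_bound_terms(R, n, d, noise_var, sigma_star_sq):
    """Return the two terms on the right-hand side of (28) for rate(s) R >= 0."""
    R = np.asarray(R, dtype=float)
    gen_term = 2 * sigma_star_sq * np.sqrt(R / n)                      # grows with the rate
    dist_term = d * noise_var / (n - d - 1) * np.exp(-2 * R / d)       # shrinks with the rate
    return gen_term, dist_term

rates = np.linspace(0.0, 50.0, 101)
gen_term, dist_term = risk_gap_bound_terms(
    rates, n=80, d=50, noise_var=1.0, sigma_star_sq=2.0)               # sigma_star_sq is assumed
total_bound = gen_term + dist_term
```

Plotting `total_bound` against `rates` gives the shape of the bound for a chosen set of constants.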

#### *5.3. Evaluation and Visualization*

In the following plots, we generate the training dataset *S* using the linear model in (16) by letting *d* = 50, *n* = 80, $\Sigma_X = I_d$ and $\sigma'^2 = 1$. We consider the following two compression algorithms. The first one is the conditional distribution $P_{\hat{W}|W}$ in the proof of achievability (26), which requires the knowledge of $w^*$ and is denoted as "Oracle". The second one is the well-known *K*-means clustering algorithm, where the weights in *W* are grouped into *K* clusters and represented by the cluster centers in the reconstruction $\hat{W}$. By changing the number of clusters *K*, we can control the rate *R*, i.e., $I(W; \hat{W})$. We average the performance and estimate $I(W; \hat{W})$ of these algorithms with 10,000 Monte-Carlo trials in the simulation.

We note that $I(W; \hat{W})$ is equal to the number of bits used in compression only in the asymptotic regime of a large number of samples. In practice, we may have only one sample of the weights *W*, and therefore $I(W; \hat{W})$ simply measures the extent to which compression is performed by the compression algorithm.

In Figure 2a, we plot the generalization error bound in Proposition 1 as a function of the rate *R* and compare the generalization errors of the Oracle and *K*-means algorithms. It can be seen that Proposition 1 provides a valid upper bound for the generalization error, but this bound is tight only when *R* is small. Moreover, both compression algorithms can achieve smaller generalization errors compared to that of the ERM solution *W*, which validates the result in Theorem 1.

Figure 2b plots the upper bound on the distortion-rate function in Theorem 2 and the distortions achieved by the Oracle and *K*-means algorithms. The distortion of the Oracle decreases as we increase the rate *R* and matches the *D*(*R*) function well. However, there is a large gap between the distortion achieved by the *K*-means algorithm and *D*(*R*). One possible explanation is that since $w^*$ is unknown, it is impossible for the *K*-means algorithm to learn the optimal cluster centers with only one sample of *W*. Even if we view $W^{(j)}$, $j = 1, \cdots, d$, as i.i.d. samples from the same distribution, there is still a gap between the distortion achieved by the *K*-means algorithm and the optimal quantization as studied in [39].

We plot the population risks of the ERM solution *W*, the Oracle, and *K*-means algorithms in Figure 2c. It is not surprising that the Oracle algorithm achieves a small population risk, since $\hat{W}$ is a function of $w^*$ and $\hat{W} = w^*$ when *R* = 0. However, it can be seen that the *K*-means algorithm also achieves a smaller population risk than the original model *W*, since the decrease in generalization error exceeds the increase in empirical risk when we use fewer clusters in the *K*-means algorithm, i.e., a smaller rate *R*. We note that the minimal population risk is achieved when *K* = 2, since we initialize $w^*$ so that $w^{*(i)}$, 1 ≤ *i* ≤ *d*, can be well approximated by two cluster centers.

**Figure 2.** Comparison of three different quantities for linear regression as a function of rate *R* in bits. (**a**) Generalization error. (**b**) Distortion. (**c**) Population risk.

#### **6. Clustering Algorithm Minimizing** $\mathcal{L}_{S,W}$

In this section, we propose an improvement of the Hessian-weighted (HW) *K*-means clustering algorithm [11] for model compression by regularizing the distance between the cluster centers, which minimizes the upper bound $\mathcal{L}_{S,W}(P_{\hat{W}|W})$, as suggested by our theoretical results in Section 4.

#### *6.1. Hessian-Weighted K-Means Clustering*

The goal of HW *K*-means is to minimize the distortion on the empirical risk $d_S(\hat{W}, W)$, which has the following Taylor series approximation:

$$d_S(\hat{W}, W) \approx (\hat{W} - W)^\top \nabla L_S(W) + \frac{1}{2} (\hat{W} - W)^\top H_S(W) (\hat{W} - W),\tag{29}$$

where $H_S(W)$ is the Hessian matrix. Assuming that *W* is a local minimum of $L_S(W)$ (ERM solution) and $\nabla L_S(W) \approx 0$, the first term can be ignored. Furthermore, the Hessian matrix $H_S(W)$ can be approximated by a diagonal matrix, which further simplifies the objective to $d_S(\hat{W}, W) \approx \sum_{j=1}^{d} h^{(j)}(W^{(j)} - \hat{W}^{(j)})^2$, where $h^{(j)}$ is the *j*-th diagonal element of the Hessian matrix.

Given network parameters $w = \{w^{(1)}, \cdots, w^{(d)}\}$, the HW *K*-means clustering algorithm [11] partitions them into *K* disjoint clusters, using a set of cluster centers $c = \{c^{(1)}, \cdots, c^{(K)}\}$, and a cluster assignment $C = \{C^{(1)}, \cdots, C^{(K)}\}$, while solving the following optimization problem:

$$\min \sum_{k=1}^{K} \sum_{w^{(j)} \in C^{(k)}} h^{(j)} |w^{(j)} - c^{(k)}|^2. \tag{30}$$
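For a fixed assignment, the objective (30) is minimized by the Hessian-weighted mean of each cluster. The sketch below evaluates the objective and these centers for scalar weights; it assumes every cluster is non-empty and has positive total Hessian weight, and the function names are illustrative.

```python
import numpy as np

def hw_cost(w, h, centers, assign):
    """Objective (30): Hessian-weighted squared distance to the assigned centers."""
    return float(np.sum(h * (w - centers[assign]) ** 2))

def hw_centers(w, h, assign, K):
    """Hessian-weighted mean of each cluster (minimizer of (30) for fixed assignments)."""
    return np.array([
        np.sum(h[assign == k] * w[assign == k]) / np.sum(h[assign == k])
        for k in range(K)
    ])

# Toy example: three weights, two clusters, unequal Hessian entries.
w = np.array([1.0, 2.0, 10.0])
h = np.array([1.0, 3.0, 2.0])
assign = np.array([0, 0, 1])
centers = hw_centers(w, h, assign, K=2)
```

In this toy example the first center is pulled toward the weight with the larger Hessian entry, so the weighted cost is lower than with the plain cluster mean.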

#### *6.2. Diameter Regularization*

In contrast to HW *K*-means which only cares about empirical risk, our goal is to obtain as small a population risk as possible by minimizing the upper bound

$$\mathcal{L}_{S,W}(P_{\hat{W}|W}) = \sqrt{\frac{2\sigma^2}{n}I(W;\hat{W})} + \mathbb{E}[d_S(\hat{W},W)].\tag{31}$$

Here, we let the number of clusters *K* be an input argument of the algorithm, so that $I(W; \hat{W}) \le \log_2 K$, and we want to minimize $\mathcal{L}_{S,W}(P_{\hat{W}|W})$ by carefully designing the reconstructed weights given *K*, i.e., by choosing cluster centers $\{c^{(1)}, \cdots, c^{(K)}\}$. Then, minimizing the sub-Gaussian parameter *σ* is one way to control the generalization error of the compression algorithm. Recall that in Proposition 1, we have

$$\operatorname{gen}(\mu, P_{\hat{W}|S}) \le 2\left(C(w^*)\|\Sigma_X\| + \sigma'^2\right)\sqrt{\frac{I(W;\hat{W})}{n}},\tag{32}$$

where the sub-Gaussian parameter is related to $C(w^*) = \sup_{\hat{w} \in \hat{\mathcal{W}}} \|\hat{w} - w^*\|_2^2$ in linear regression. Note that this quantity can be interpreted as the diameter of the set $\hat{\mathcal{W}}$. Since the ground truth $w^*$ is unknown in practice, we then propose the following diameter regularization by approximating $C(w^*)$ in (32) by

$$\beta \max_{k_1, k_2} |c^{(k_1)} - c^{(k_2)}|^2, \quad \beta \ge 0,\tag{33}$$

where *β* is a parameter that controls the penalty term and can be selected by cross-validation in practice. Our diameter-regularized Hessian-weighted (DRHW) *K*-means algorithm solves the following optimization problem:

$$\min \sum_{k=1}^{K} \sum_{w^{(j)} \in C^{(k)}} h^{(j)} |w^{(j)} - c^{(k)}|^2 + \beta \max_{k_1, k_2} |c^{(k_1)} - c^{(k_2)}|^2. \tag{34}$$

Such an optimization problem can be easily extended to the vector case, which leads to a vector quantization algorithm. Suppose that we group the *d*-dimensional weights $w = \{w^{(1)}, \cdots, w^{(d)}\}$ into $d' = d/m$ vectors of length *m*, i.e., $\{\mathbf{w}^{(1)}, \cdots, \mathbf{w}^{(d')}\}$, $\mathbf{w}^{(j)} \in \mathbb{R}^m$, then our goal is to find cluster centers $\mathbf{c}^{(k)} \in \mathbb{R}^m$ and assignments minimizing the following cost function:

$$\min \sum_{k=1}^{K} \sum_{\mathbf{w}^{(j)} \in C^{(k)}} \left( \mathbf{w}^{(j)} - \mathbf{c}^{(k)} \right)^{\top} H^{(j)} \left( \mathbf{w}^{(j)} - \mathbf{c}^{(k)} \right) + \beta \max_{k_1, k_2} \|\mathbf{c}^{(k_1)} - \mathbf{c}^{(k_2)}\|_2^2, \tag{35}$$

where $H^{(j)}$ is the diagonal Hessian matrix corresponding to the vector $\mathbf{w}^{(j)}$. An iterative algorithm to solve the above optimization problem for vector quantization is provided in Algorithm 1.

The algorithm alternates between minimizing the objective function over the cluster centers and the assignments. In the Assignment step, we first fix the centers and assign each $\mathbf{w}^{(j)}$ to its nearest neighbor. We then fix the assignments and update the centers by the weighted mean of each cluster in the Update step. For the farthest pair of centers, the diameter regularizer pushes them toward each other, so that the output centers have potentially smaller diameters than those of regular *K*-means. We note that the time complexity of the proposed diameter-regularized Hessian-weighted *K*-means algorithm is the same as that of the original *K*-means algorithm.

**Algorithm 1** Diameter-regularized Hessian-weighted *K*-means in the vector case

**Input:** Weight vectors $\{\mathbf{w}^{(1)}, \ldots, \mathbf{w}^{(d')}\}$, Hessian matrices $\{H^{(1)}, \ldots, H^{(d')}\}$, diameter regularizer *β* > 0, number of clusters *K*, iterations *T*

**Initialize** the *K* cluster centers $\{\mathbf{c}_0^{(1)}, \ldots, \mathbf{c}_0^{(K)}\}$ randomly

**for** *t* = 1 to *T* **do**

**Assignment step:** Initialize $C_t^{(k)} = \emptyset$ for all $k \in [K]$.

**for** *j* = 1 to $d'$ **do**

Assign $\mathbf{w}^{(j)}$ to the nearest cluster center, i.e., find $k_t^{(j)} = \arg\min_{k \in [K]} \|\mathbf{w}^{(j)} - \mathbf{c}_{t-1}^{(k)}\|_2^2$ and let

$$C_{t}^{(k_{t}^{(j)})} \leftarrow C_{t}^{(k_{t}^{(j)})} \cup \{\mathbf{w}^{(j)}\} \tag{36}$$

**end for**

**Update step:**

Find the current farthest pair of centers $(k_1, k_2) = \arg\max_{k_1, k_2} \|\mathbf{c}_{t-1}^{(k_1)} - \mathbf{c}_{t-1}^{(k_2)}\|_2^2$. Update $\mathbf{c}_t^{(k_1)}$ and $\mathbf{c}_t^{(k_2)}$ by

$$\mathbf{c}_{t}^{(k_{1})} = \left(\sum_{\mathbf{w}^{(j)} \in C_{t}^{(k_{1})}} H^{(j)} + \beta I_{m}\right)^{-1} \left(\sum_{\mathbf{w}^{(j)} \in C_{t}^{(k_{1})}} H^{(j)} \mathbf{w}^{(j)} + \beta \mathbf{c}_{t}^{(k_{2})}\right)$$

$$\mathbf{c}_{t}^{(k_{2})} = \left(\sum_{\mathbf{w}^{(j)} \in C_{t}^{(k_{2})}} H^{(j)} + \beta I_{m}\right)^{-1} \left(\sum_{\mathbf{w}^{(j)} \in C_{t}^{(k_{2})}} H^{(j)} \mathbf{w}^{(j)} + \beta \mathbf{c}_{t}^{(k_{1})}\right) \tag{37}$$

**for** *k* = 1 to *K*, $k \notin \{k_1, k_2\}$ **do**

Update the cluster centers by

$$\mathbf{c}_{t}^{(k)} = \left(\sum_{\mathbf{w}^{(j)} \in C_{t}^{(k)}} H^{(j)}\right)^{-1} \left(\sum_{\mathbf{w}^{(j)} \in C_{t}^{(k)}} H^{(j)} \mathbf{w}^{(j)}\right) \tag{38}$$

**end for**

**end for**

**Output:** centers $\{\mathbf{c}_T^{(1)}, \ldots, \mathbf{c}_T^{(K)}\}$ and assignments $\{C_T^{(1)}, \ldots, C_T^{(K)}\}$.
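A compact NumPy sketch of Algorithm 1 is given below. It is not the authors' released implementation, and it makes several assumptions that the algorithm statement leaves open: the diagonal Hessians are stored as arrays of their diagonal entries, the initial centers are passed in explicitly rather than drawn at random, the partner center in (37) is taken from the previous iteration, and an empty cluster keeps its previous center.

```python
import numpy as np

def drhw_kmeans(w, h, init_centers, beta=0.0, n_iter=20):
    """Sketch of diameter-regularized Hessian-weighted K-means.

    w: (d', m) weight vectors; h: (d', m) positive diagonal Hessian entries;
    init_centers: (K, m) starting centers; beta: diameter regularizer.
    """
    w = np.asarray(w, dtype=float)
    h = np.asarray(h, dtype=float)
    c = np.array(init_centers, dtype=float)
    K = c.shape[0]
    assign = np.zeros(w.shape[0], dtype=int)
    for _ in range(n_iter):
        prev = c.copy()
        # Assignment step: nearest center in squared Euclidean distance.
        dist = ((w[:, None, :] - prev[None, :, :]) ** 2).sum(axis=2)
        assign = dist.argmin(axis=1)
        # Farthest pair among the previous centers.
        pair = ((prev[:, None, :] - prev[None, :, :]) ** 2).sum(axis=2)
        k1, k2 = np.unravel_index(pair.argmax(), pair.shape)
        # Update step.
        for k in range(K):
            mask = assign == k
            num = (h[mask] * w[mask]).sum(axis=0)
            den = h[mask].sum(axis=0)
            if beta > 0 and k1 != k2 and k in (k1, k2):
                other = k2 if k == k1 else k1
                c[k] = (num + beta * prev[other]) / (den + beta)   # regularized update (37)
            elif mask.any():
                c[k] = num / den                                   # weighted mean (38)
            # Otherwise the cluster is empty and keeps its previous center.
    return c, assign
```

Setting `beta=0.0` recovers the Hessian-weighted *K*-means update, while a positive `beta` pulls the two farthest centers toward each other.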

#### **7. Experiments**

In this section, we provide some real-world experiments to validate our theoretical assertions and the DRHW *K*-means algorithm. (The code for our experiments is available at https://github.com/wgao9/weight-quant (accessed on 13 August 2021).) Our experiments include compression of: (i) a three-layer fully connected network on the MNIST dataset [40]; and (ii) a convolutional neural network with five convolutional layers and three linear layers on the CIFAR10 dataset [41]. (We downloaded the pre-trained model in PyTorch from https://github.com/aaron-xichen/pytorch-playground (accessed on 13 August 2021).)

In Theorem 1, an upper bound on the *expected* generalization error is provided, and therefore we independently train 50 different models (with the same structure but different parameter initializations) using different subsets of training samples, and average the results. We use 10% of the training data to train the model for MNIST and use 20% of the training data to train the model for CIFAR10. For each experiment, we use the same number of clusters for each convolutional layer and fully connected layer.

In the following experiments, we plot the cross entropy loss as a function of compression ratio. Note that the compression ratio can be controlled by changing the number of clusters *K* in the quantization algorithm. To see this, suppose that the neural network has a total of *d* parameters that need to be compressed, and each parameter is of *b* bits. Let $C^{(k)}$ be the set of weights in cluster *k* and let $b_k$ be the number of bits of the codeword assigned to the network parameters in cluster *k* for 1 ≤ *k* ≤ *K*. For a lookup table to decode quantized values, we need *Kb* bits to store all the reconstructed weights, i.e., cluster centers $c = \{c^{(1)}, \cdots, c^{(K)}\}$. Then, the compression ratio is given by

$$\text{Compression Ratio} = \frac{\sum\_{k=1}^{K} |C^{(k)}| b\_k + Kb}{db},\tag{39}$$

where |·| denotes the number of elements in the set. In our experiments, we use a variable-length code such as the Huffman code to compute the compression ratio under different numbers of clusters *K*.
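The compression ratio (39) can be computed from the cluster sizes alone, since a Huffman code built on the cluster frequencies determines the codeword lengths $b_k$. The sketch below uses the standard fact that the total encoded length equals the sum of the merged weights during Huffman construction. The function names, the default of 32 bits per weight, and the one-bit convention for a single cluster are assumptions made for illustration.

```python
import heapq

def huffman_total_bits(cluster_sizes):
    """Sum over clusters of |C^(k)| * b_k for a Huffman code on the cluster frequencies."""
    heap = [s for s in cluster_sizes if s > 0]
    if len(heap) == 1:
        return heap[0]              # assumed convention: one bit per weight for a single cluster
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged             # each merge adds one bit to every weight below it
        heapq.heappush(heap, merged)
    return total

def compression_ratio(cluster_sizes, bits_per_weight=32):
    """Compression ratio (39): (coded bits + lookup table bits) / (original bits)."""
    d = sum(cluster_sizes)
    K = len(cluster_sizes)
    return (huffman_total_bits(cluster_sizes) + K * bits_per_weight) / (d * bits_per_weight)
```

For realistic networks, *d* is much larger than *K*, so the lookup-table term *Kb* becomes negligible and the ratio is dominated by the average codeword length divided by *b*.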

In Figures 3 and 4, we compare the scalar DRHW *K*-means algorithm with the scalar HW *K*-means algorithm for different compression ratios on the MNIST and CIFAR10 datasets. Both figures demonstrate that the compression algorithm increases the empirical risk but decreases the generalization error, and the net effect is that both compressed models have smaller population risks than those of the original models. More importantly, the DRHW *K*-means algorithm produces a compressed model that has a better population risk than that of the HW *K*-means algorithm.

**Figure 3.** Comparison between DRHW *K*-means (*β* = 50) and HW *K*-means (*β* = 0) on MNIST. **Top**: empirical risks. **Bottom**: population risks and generalization errors.

In Figure 5, we compare the population risk of the scalar DRHW *K*-means algorithm and that of the vector DRHW *K*-means algorithm with block length *m* = 2 for different compression ratios on the MNIST dataset. It can be seen from the figure that the improvement from using vector quantization (*m* = 2) is quite modest, which implies that the dependence between the weights $W^{(j)}$ is weak. However, we can still observe the improvement from adding the diameter regularizer in the vector DRHW *K*-means algorithm by comparing the curves with *β* = 50 and *β* = 0.

In Figure 6, we demonstrate how *β* affects the performance of our diameter-regularized Hessian-weighted *K*-means algorithm in the scalar case. It can be seen that as *β* increases, the generalization error decreases and the distortion in empirical risk increases, which validates the idea that the proposed diameter regularizer can be used to reduce the generalization error. The value of *β* that results in the best population risk can therefore be chosen via cross-validation in practice.

**Figure 4.** Comparison between DRHW *K*-means (*β* = 25) and HW *K*-means (*β* = 0) on CIFAR10. **Top**: empirical risks. **Bottom**: population risks and generalization errors.

**Figure 5.** Comparison between scalar DRHW *K*-means (*m* = 1) and vector DRHW *K*-means (*m* = 2) on the MNIST dataset.

**Figure 6.** DRHW *K*-means with different *β* on the MNIST dataset with *K* = 7.

#### **8. Conclusions**

In this paper, we have provided an information-theoretical understanding of how model compression affects the population risk of a compressed model. In particular, our results indicate that model compression may increase the empirical risk but decrease the generalization error. Therefore, it might be possible to achieve a smaller population risk via model compression. Our experiments validate these theoretical findings. Furthermore, we showed how our information-theoretic bound on the population risk can be used to optimize practical compression algorithms.

We note that our results could be applied to improve other compression algorithms, such as pruning and matrix factorization. Moreover, we believe that the information-theoretic analysis adopted here could be generalized to characterize a similar tradeoff between the generalization error and empirical risk in other applications beyond compressing pre-trained models, e.g., distributed optimization [42] and low precision training [15].

**Author Contributions:** Y.B.: theoretical analysis, methodology, conceptualization, writing—original draft. W.G.: software, methodology and visualization. S.Z.: writing—review and editing. V.V.V.: supervision, funding acquisition, and writing—review and editing. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by Army Research Laboratory under Cooperative Agreement W911NF-17-2-0196, through the University of Illinois at Urbana-Champaign.

**Data Availability Statement:** Data and code can be found in Section 7.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A. Proof of Lemma 3**

Let $\widetilde{Z} = (\widetilde{X}, \widetilde{Y})$, with $\widetilde{X} \in \mathbb{R}^d$ and $\widetilde{Y} \in \mathbb{R}$, denote an independent copy of the training sample $Z_i$. Then, it can be shown that

$$\begin{split} \operatorname{gen}(\mu, P_{W|S}) &= \mathbb{E}_{W,S}[L_{\mu}(W) - L_{S}(W)] \\ &= \mathbb{E}_{W,S}\left[\mathbb{E}_{\widetilde{Z}}[(\widetilde{Y} - \widetilde{X}^{\top}W)^{2}] - \frac{1}{n}\|Y - X^{\top}W\|_{2}^{2}\right] \\ &= \mathbb{E}_{S}\left[\mathbb{E}_{\widetilde{Z}}[(\widetilde{Y} - \widetilde{X}^{\top}(XX^{\top})^{-1}XY)^{2}] - \frac{1}{n}\|Y - X^{\top}(XX^{\top})^{-1}XY\|_{2}^{2}\right], \end{split} \tag{A1}$$

where $Y = X^\top w^* + \varepsilon$ and $\widetilde{Y} = \widetilde{X}^\top w^* + \widetilde{\varepsilon}$. Then, we have

$$\begin{split} \operatorname{gen}(\mu, P_{W|S}) &= \mathbb{E}_{\varepsilon, \widetilde{\varepsilon}, X, \widetilde{X}} \left[ (\widetilde{\varepsilon} - \widetilde{X}^{\top} (XX^{\top})^{-1} X \varepsilon)^{2} \right] - \frac{1}{n} \mathbb{E}_{\varepsilon, X} \left[ \| \varepsilon - X^{\top} (XX^{\top})^{-1} X \varepsilon \|_{2}^{2} \right] \\ &= \mathbb{E}_{\varepsilon, X, \widetilde{X}} \left[ \varepsilon^{\top} X^{\top} (XX^{\top})^{-1} \widetilde{X} \widetilde{X}^{\top} (XX^{\top})^{-1} X \varepsilon + \frac{1}{n} \varepsilon^{\top} X^{\top} (XX^{\top})^{-1} X \varepsilon \right] \\ &= \mathbb{E}_{\varepsilon, X} \left[ \text{Tr} (X^{\top} (XX^{\top})^{-1} \Sigma_{X} (XX^{\top})^{-1} X \varepsilon \varepsilon^{\top}) \right] + \frac{\sigma'^2 d}{n} \\ &= \sigma'^2\, \mathbb{E}_{X} \left[ \text{Tr} ((XX^{\top})^{-1} \Sigma_{X}) \right] + \frac{\sigma'^2 d}{n}. \end{split} \tag{A2}$$

Note that the $X_i$'s are i.i.d. samples from $\mathcal{N}(0, \Sigma_X)$, so $(XX^\top)^{-1}$ is distributed according to $\text{Wishart}^{-1}(\Sigma_X^{-1}, n)$, where $\text{Wishart}^{-1}$ denotes the inverse Wishart distribution with *n* degrees of freedom, and $\mathbb{E}[(XX^\top)^{-1}] = \frac{\Sigma_X^{-1}}{n-d-1}$. It then follows that

$$\text{gen}(\mu, P\_{W|S}) = \frac{\sigma'^2}{n - d - 1} \left[ \text{Tr}(\Sigma\_X^{-1} \Sigma\_X) \right] + \frac{\sigma'^2 d}{n} = \frac{\sigma'^2 d}{n} (2 + \frac{d + 1}{n - d - 1}).\tag{A3}$$

#### **Appendix B. Proof of Proposition 1**

For all $\hat{w} \in \hat{\mathcal{W}}$, it can be shown that

$$\ell(\hat{w}, \widetilde{Z}) = (\widetilde{Y} - \widetilde{X}^{\top}\hat{w})^2 = (\widetilde{X}^{\top}(w^* - \hat{w}) + \widetilde{\varepsilon})^2. \tag{A4}$$

Since $\widetilde{X} \sim \mathcal{N}(0, \Sigma_X)$ and $\widetilde{\varepsilon} \sim \mathcal{N}(0, \sigma'^2)$, we have $\ell(\hat{w}, \widetilde{Z}) \sim \sigma_\ell^2 \chi_1^2$, where

$$\sigma_\ell^2 \triangleq (\hat{w} - w^*)^\top \Sigma_X(\hat{w} - w^*) + \sigma'^2,$$

and $\chi_1^2$ denotes the chi-squared distribution with one degree of freedom. Then, the CGF of $\ell(\hat{w}, \widetilde{Z})$ is

$$\Lambda_{\ell(\hat{w},\widetilde{Z})}(\lambda) = -\sigma_{\ell}^2 \lambda - \frac{1}{2} \ln(1 - 2\sigma_{\ell}^2 \lambda), \quad \lambda \in \left(-\infty, \frac{1}{2\sigma_{\ell}^2}\right). \tag{A5}$$

Thus, $\ell(\hat{w}, \widetilde{Z})$ is not sub-Gaussian for all $\lambda \in \mathbb{R}$. However, it can be shown that

$$\Lambda_{\ell(\hat{w},\widetilde{Z})}(\lambda) \le \sigma_\ell^4 \lambda^2, \quad \lambda < 0. \tag{A6}$$
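The one-sided bound (A6) follows from $\ln(1+u) \ge u - u^2/2$ for $u \ge 0$ with $u = -2\sigma_\ell^2\lambda$. It can also be checked numerically, as in the sketch below; the value of $\sigma_\ell^2$ and the grid of negative $\lambda$ values are arbitrary illustrative choices.

```python
import numpy as np

def cgf_centered_scaled_chi2(lam, sigma_sq):
    """CGF (A5) of the centered loss; valid for lam < 1 / (2 * sigma_sq)."""
    return -sigma_sq * lam - 0.5 * np.log1p(-2 * sigma_sq * lam)

sigma_sq = 1.7                        # hypothetical value of sigma_ell^2
lams = -np.logspace(-3, 3, 200)       # negative lambda over six orders of magnitude
lhs = cgf_centered_scaled_chi2(lams, sigma_sq)
rhs = sigma_sq ** 2 * lams ** 2       # right-hand side of (A6)
```

On this grid, `lhs` stays below `rhs` everywhere, and the gap widens as $\lambda$ becomes more negative.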

We need the following lemma, from Theorem 1 of [43], to proceed with our analysis.

**Lemma A1** ([43])**.** *Assume that for all* $\hat{w} \in \hat{\mathcal{W}}$*,* $\Lambda_{\ell(\hat{w},\widetilde{Z})}(\lambda) \le \frac{\sigma^2\lambda^2}{2}$ *for λ* ≤ 0*. Then,*

$$\operatorname{gen}(\mu, P_{\hat{W}|S}) \le \sqrt{\frac{2\sigma^2}{n}I(\hat{W};S)}.\tag{A7}$$

Recall that $C(w^*) = \sup_{\hat{w} \in \hat{\mathcal{W}}} \|\hat{w} - w^*\|_2^2$. We then have the following bound on the CGF of $\ell(\hat{w}, \widetilde{Z})$,

$$\Lambda_{\ell(\hat{w},\widetilde{Z})}(\lambda) \le \lambda^2 \max_{\hat{w} \in \hat{\mathcal{W}}} \sigma_\ell^4 \le \lambda^2 \left(C(w^*) \|\Sigma_X\| + \sigma'^2\right)^2, \quad \lambda < 0. \tag{A8}$$

Applying Lemma A1 and the data processing inequality, we have

$$\operatorname{gen}(\mu, P_{\hat{W}|S}) \le 2\left(C(w^*)\|\Sigma_X\| + \sigma'^2\right)\sqrt{\frac{I(\hat{W};W)}{n}}.\tag{A9}$$

#### **Appendix C. Proof of Proposition 2**

The constraint on the distortion function can be written as follows:

$$D \ge \mathbb{E}_{S, W, \hat{W}} [d_{S}(\hat{W}, W)] = \frac{1}{n} \mathbb{E}_{S, W, \hat{W}} [(\hat{W} - W)^{\top} X X^{\top} (\hat{W} - W)].\tag{A10}$$

It follows from Lemma 1 that

$$R(D) = \min_{P_{\hat{W}|W}} I(\hat{W}; W), \quad \text{s.t.} \quad \mathbb{E}_{S, W, \hat{W}} \left[(\hat{W} - W)^\top \frac{1}{n} XX^\top (\hat{W} - W)\right] \le D. \tag{A11}$$

Note that $\mathbb{E}[W] = w^*$ and $\mathrm{Cov}[W] = \frac{\sigma'^2}{n-d-1}\Sigma_X^{-1}$ since *W* is the ERM solution. In the following proof, we consider a Gaussian random vector $W_G \sim \mathcal{N}\left(w^*, \frac{\sigma'^2}{n-d-1}\Sigma_X^{-1}\right)$ with the same mean and covariance matrix as *W*.

For the upper bound of *R*(*D*), consider the channel $P^*_{\hat{W}|W} = \mathcal{N}\left((1-\alpha)W + \alpha w^*, (1-\alpha)\frac{D}{d}\Sigma_X^{-1}\right)$, where $\alpha = \frac{nD}{d\sigma'^2} \le 1$. It can be verified that this channel satisfies the constraint on the distortion:

$$\begin{split} & \mathbb{E}_{S,W,\hat{W}} [d_S(\hat{W}, W)] \\ &= \alpha^2 \mathbb{E} \left[(W - w^*)^\top \frac{1}{n} XX^\top (W - w^*)\right] + (1 - \alpha) \frac{D}{d} \text{Tr} \left( \mathbb{E} \left[\frac{1}{n} XX^\top\right] \Sigma_{X}^{-1} \right) \\ &= \alpha^2 \mathbb{E} \left[((XX^\top)^{-1} X \varepsilon)^\top \frac{1}{n} XX^\top ((XX^\top)^{-1} X \varepsilon)\right] + (1 - \alpha) D \\ &= \alpha^2 \frac{1}{n} \mathbb{E} [\varepsilon^\top X^\top (XX^\top)^{-1} X \varepsilon] + (1 - \alpha) D \\ &= D. \end{split} \tag{A12}$$

If we let $\xi \sim \mathcal{N}\left(0, (1-\alpha)\frac{D}{d}\Sigma_X^{-1}\right)$, it follows that

$$\begin{split} R(D) &\leq I(W; (1-\alpha)W + \alpha w^* + \xi) \\ &\overset{(a)}{\leq} I(W_G; (1-\alpha)W_G + \xi) \\ &= \frac{d}{2} \ln \left( \frac{d\sigma'^2}{(n-d-1)D} - \frac{n}{n-d-1} + 1 \right) \\ &\leq \frac{d}{2} \left( \ln \frac{d\sigma'^2}{(n-d-1)D} \right)^+, \end{split} \tag{A13}$$

where (a) is due to the fact that the Gaussian distribution maximizes the mutual information in an additive white Gaussian noise channel.

The upper bound on *D*(*R*) follows immediately from the upper bound on *R*(*D*).
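The closed form in the third line of (A13) can be cross-checked against the generic formula for the mutual information of a Gaussian vector channel, $\frac{1}{2}[\ln\det(\Sigma_{\text{out}}) - \ln\det(\Sigma_{\xi})]$. The sketch below does this for one illustrative parameter choice with $\Sigma_X = I_d$; the target distortion *D* = 0.5 is an arbitrary value satisfying $\alpha \le 1$.

```python
import numpy as np

n, d, noise_var, D = 80, 50, 1.0, 0.5          # D is an assumed target distortion
alpha = n * D / (d * noise_var)                # here alpha = 0.8 <= 1
Sigma_X_inv = np.eye(d)                        # assumes Sigma_X = I_d

cov_W = noise_var / (n - d - 1) * Sigma_X_inv  # covariance of W_G
cov_xi = (1 - alpha) * D / d * Sigma_X_inv     # covariance of the added noise xi
cov_out = (1 - alpha) ** 2 * cov_W + cov_xi    # covariance of (1 - alpha) W_G + xi

# Mutual information of the Gaussian channel, in nats.
mi_direct = 0.5 * (np.linalg.slogdet(cov_out)[1] - np.linalg.slogdet(cov_xi)[1])
mi_closed = d / 2 * np.log(d * noise_var / ((n - d - 1) * D) - n / (n - d - 1) + 1)
bound = max(0.0, d / 2 * np.log(d * noise_var / ((n - d - 1) * D)))
```

The direct and closed-form values agree, and both lie below the final bound in (A13) because $n/(n-d-1) \ge 1$.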

#### **Appendix D. Discussion of Remark 2**

Suppose that $\frac{1}{n}XX^\top$ can be approximated by $\Sigma_X$ for large *n* in (A10). It then follows that

$$R(D) = \min_{P_{\hat{W}|W}} I(\hat{W}; W), \quad \text{s.t.} \quad \mathbb{E}_{S, W, \hat{W}} [(\hat{W} - W)^\top \Sigma_X (\hat{W} - W)] \le D. \tag{A14}$$

It can be easily verified that the channel $P^*_{W|\hat{W}} = \mathcal{N}\left(\hat{W}, \frac{D}{d}\Sigma_X^{-1}\right)$ satisfies the distortion constraint. For any $P_{W|\hat{W}}$ such that $\mathbb{E}_{S,W,\hat{W}}[d_S(\hat{W}, W)] \le D$, it follows that

$$\begin{split} I(W;\hat{W}) &= \mathbb{E}_{W,\hat{W}} \left[ \ln \frac{P_{W|\hat{W}}}{P_{W}} \right] \\ &= \mathbb{E}_{W,\hat{W}} \left[ \ln \frac{P_{W|\hat{W}}}{P_{W|\hat{W}}^{*}} \right] + \mathbb{E}_{W,\hat{W}} \left[ \ln \frac{P_{W|\hat{W}}^{*}}{P_{W_G}} \right] - \text{KL}(P_{W} \| P_{W_G}) \\ &\geq \mathbb{E}_{W,\hat{W}} \left[ \ln \frac{P_{W|\hat{W}}^{*}}{P_{W_G}} \right] - \text{KL}(P_{W} \| P_{W_G}), \end{split} \tag{A15}$$

where $\text{KL}(P_W \| P_{W_G})$ is the Kullback–Leibler divergence between the two distributions, and the last step follows from the fact that $\text{KL}(P_{W,\hat{W}} \| P^*_{W,\hat{W}}) \ge 0$. Note that

$$\begin{split} & \mathbb{E}_{W,\hat{W}} \left[ \ln \frac{P_{W|\hat{W}}^{*}}{P_{W_G}} \right] \\ &= \mathbb{E}_{W,\hat{W}} \left[ \frac{(n-d-1)(W-w^{*})^{\top} \Sigma_{X}(W-w^{*})}{2\sigma'^{2}} - \frac{d(\hat{W}-W)^{\top} \Sigma_{X}(\hat{W}-W)}{2D} \right] + \frac{d}{2} \ln \frac{d\sigma'^{2}}{(n-d-1)D} \\ &\overset{(a)}{=} \frac{d}{2} \ln \frac{d\sigma'^{2}}{(n-d-1)D} + \mathbb{E}_{W,\hat{W}} \left[ \frac{d}{2} - \frac{d(\hat{W}-W)^{\top} \Sigma_{X}(\hat{W}-W)}{2D} \right] \\ &\overset{(b)}{\geq} \frac{d}{2} \ln \frac{d\sigma'^{2}}{(n-d-1)D}, \end{split} \tag{A16}$$

where (a) follows from the fact that $\mathbb{E}[W] = w^*$ and $\mathrm{Cov}[W] = \frac{\sigma'^2}{n-d-1}\Sigma_X^{-1}$, and (b) is due to the fact that $P_{\hat{W}|W}$ satisfies the distortion constraint. Thus,

$$R(D) \ge \frac{d}{2} \ln \frac{d\sigma'^2}{(n-d-1)D} - \text{KL}(P_W \| P_{W_G}).\tag{A17}$$

#### **References**


## *Article* **Robust Spike-Based Continual Meta-Learning Improved by Restricted Minimum Error Entropy Criterion**

**Shuangming Yang <sup>1</sup>, Jiangtong Tan <sup>1</sup> and Badong Chen <sup>2,</sup>\***


**Abstract:** The spiking neural network (SNN) is regarded as a promising candidate to deal with the great challenges presented by current machine learning techniques, including the high energy consumption induced by deep neural networks. However, there is still a great gap between SNNs and artificial neural networks in online meta-learning performance. Importantly, existing spike-based online meta-learning models do not target robust learning based on spatio-temporal dynamics and advanced machine learning theory. In this invited article, we propose a novel spike-based framework with minimum error entropy, called MeMEE, which uses entropy theory to establish a gradient-based online meta-learning scheme in a recurrent SNN architecture. We examine its performance on various types of tasks, including autonomous navigation and a working memory test. The experimental results show that the proposed MeMEE model can effectively improve the accuracy and robustness of spike-based meta-learning. More importantly, the proposed MeMEE model emphasizes the application of modern information theoretic learning approaches to state-of-the-art spike-based learning algorithms. Therefore, in this invited article, we provide new perspectives for the further integration of advanced information theory into machine learning to improve the learning performance of SNNs, which could be of great merit to applied developments with spike-based neuromorphic systems.

**Keywords:** spiking neural network; meta-learning; information theoretic learning; minimum error entropy; artificial general intelligence

#### **1. Introduction**

In recent years, deep learning has exceeded human-level performance in various types of individual narrow tasks [1]. However, in comparison with human intelligence, which can learn to learn continually in order to execute unlimited tasks, the current successful deep learning methods still have many drawbacks and limitations. In fact, humans can learn to learn by accumulating knowledge across their lifetime, which is a great challenge for artificial neural networks (ANNs) [2]. From this point of view, continual meta-learning aims at realizing machine intelligence at a higher level by providing machines with the meta-learning capability of learning to learn continually [3].

The human brain can realize meta-learning continually and avoid the catastrophic forgetting problem based on a combination of neural mechanisms [4]. The catastrophic forgetting problem is the critical challenge for developing the capability of continual meta-learning [5]. The human brain has implemented an efficient and scalable mechanism for continual learning based on neuronal activity patterns that represent previous experiences [6]. Neurons communicate with each other and process neural information by using neural spikes, which is one of the most fundamental mechanisms in the brain. Based on this mechanism, the human brain can realize superior performance in different aspects, such as low power consumption and high spatio-temporal processing capability [7]. Therefore, implementing a brain-inspired continual meta-learning algorithm based on spike patterns and the brain's mechanisms is a promising technique.

**Citation:** Yang, S.; Tan, J.; Chen, B. Robust Spike-Based Continual Meta-Learning Improved by Restricted Minimum Error Entropy Criterion. *Entropy* **2022**, *24*, 455. https://doi.org/10.3390/e24040455

Academic Editors: Lizhong Zheng and Chao Tian

Received: 25 February 2022 Accepted: 23 March 2022 Published: 25 March 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

The spiking neural network (SNN) uses a biologically plausible neuron model based on spiking dynamics, while the conventional ANN only uses neurons based on a static rate [8]. SNNs are applied to reproduce the brain's mechanisms and to deal with cognitive tasks [9]. In addition, neuromorphic hardware based on SNNs can realize high performance in artificial intelligence tasks, including low power consumption, high noise tolerance, and low computation latency [10]. Previous neuromorphic hardware studies, such as Tianjic, Loihi, BiCoSS, CerebelluMorphic, and LaCSNN, have demonstrated these advantages on various types of tasks [11–15]. Researchers have proposed SNN models to realize the short-term memory capability in a spike-based framework [16]. However, current SNN models still struggle with continual meta-learning under non-Gaussian noise, and no previous study has solved this problem. Therefore, this is the focus of this study.

Information theoretic learning (ITL) has attracted increasing attention in the field of machine learning in recent years as a means to improve learning robustness and enhance explainability [17–19]. Previously, Chen et al. studied the maximum correntropy and minimum error entropy criteria to improve the robustness of machine learning theory [20–22]. In addition, a series of entropy-based learning algorithms have been presented to improve the robustness of machine learning models, including guided complement entropy and fuzzy entropy [23–25]. Nevertheless, the ITL-based approach has not yet been applied to spike-based continual meta-learning to improve its learning robustness. Therefore, in this invited article, we propose a novel approach to deal with this challenging problem. A novel model is presented, which is called meta-learning with minimum error entropy (MeMEE). We first test the meta-learning capability of the proposed SNN model. Then, we investigate its robust working memory capability in non-Gaussian noise. Finally, the robust transfer learning performance is explored under a non-Gaussian noisy condition. Experimental results strongly suggest that the SNN model with a working memory feature has a robust meta-learning capability in a non-Gaussian noisy environment.

#### **2. Materials and Methods**

#### *2.1. SNN Model*

Previous studies have shown that the firing timing and activity space of dendrites can significantly affect neural function. Excitatory dendrites can excite the membrane to fire, whereas inhibitory dendrites can have the opposite effect [26–29]. Inspired by this morphological structure and function of the neuron, we propose a spiking neuron model with three compartments: a somatic compartment and two dendritic compartments. The model uses distinct dendritic compartments to receive excitatory and inhibitory inputs, while using the dendrites and the soma to receive and send spiking activities, respectively. The membrane potentials of the dendrites and the soma are calculated as follows

$$\begin{cases} \tau\_{m} \dfrac{dU\_{m}(t)}{dt} = -U\_{m}(t) + R\_{m} I\_{m}(t) + g\_{i} (U\_{i}(t) - \theta\_{i}) + g\_{e} (U\_{e}(t) - \theta\_{e}) - \Gamma\_{j}(t) z\_{j}(t) \\ \tau\_{i} \dfrac{dU\_{i}(t)}{dt} = -U\_{i}(t) + R\_{i} I\_{i}(t) \\ \tau\_{e} \dfrac{dU\_{e}(t)}{dt} = -U\_{e}(t) + R\_{e} I\_{e}(t) \end{cases} \tag{1}$$

where $\tau\_v$, with *v* ∈ {*m*, *i*, *e*}, represents the membrane time constant of the corresponding compartment. The variables $U\_m(t)$, $U\_i(t)$, and $U\_e(t)$ represent the somatic, inhibitory dendritic, and excitatory dendritic membrane potentials, respectively. The parameters $\theta\_e$ and $\theta\_i$ represent the reversal membrane potentials of the excitatory and inhibitory dendrites, respectively. $R\_m$, $R\_e$, and $R\_i$ represent the membrane resistances of the soma, excitatory dendrite, and inhibitory dendrite, respectively. The parameters $g\_e$ and $g\_i$ represent the synaptic conductances of the excitatory and inhibitory dendrites, respectively. A neuron emits a spike at time *t* only when it is not in a refractory period. The soma uses a spike adaptation mechanism in which the threshold changes according to the firing pattern of the neuron. The variable $z\_j(t)$ represents the spike train of neuron *j* and assumes values in {0, 1/Δ*t*}. The firing threshold $\Gamma\_j(t)$ of neuron *j* changes with each spike and is defined as

$$
\Gamma\_{j}(t) = \tau\_{j}^{0} + \alpha \cdot \tau\_{j}(t) \tag{2}
$$

where α represents a constant that scales the deviation $\tau\_j(t)$ from the baseline $\tau\_j^0$. The variable $\tau\_j(t)$ can be defined as

$$
\tau\_{j}(t + \Delta t) = \beta\_{j} \tau\_{j}(t) + (1 - \beta\_{j}) z\_{j}(t) \tag{3}
$$

where $\beta\_j = \exp(-\Delta t/\tau\_{a,j})$ and the constant $\tau\_{a,j}$ represents the adaptation time constant. The parameter values of the proposed spiking neuron model are listed in Table 1. The input current of neuron *j* is defined as the weighted sum of the spikes that arrive from external input neurons and from other neurons in the network. Its mathematical formula is as follows

$$\begin{cases} I\_{m}^{j}(t) = \sum\_{i=1}^{n} W\_{ij}\, \chi\_{i}(t - \kappa\_{ij}) + \sum\_{i=1}^{n} W\_{ij}^{rec}\, \varepsilon\_{i}(t - \kappa\_{ij}^{rec}) \\ I\_{i}^{j}(t) = \sum\_{i=1}^{n} W\_{ij}^{i}\, \chi\_{i}(t - \kappa\_{ij}^{i}) + \sum\_{i=1}^{n} W\_{ij}^{irec}\, \varepsilon\_{i}(t - \kappa\_{ij}^{irec}) \\ I\_{e}^{j}(t) = \sum\_{i=1}^{n} W\_{ij}^{e}\, \chi\_{i}(t - \kappa\_{ij}^{e}) + \sum\_{i=1}^{n} W\_{ij}^{erec}\, \varepsilon\_{i}(t - \kappa\_{ij}^{erec}) \end{cases} \tag{4}$$

where $W\_{ij}^{rec}$, $W\_{ij}^{erec}$, and $W\_{ij}^{irec}$ represent the recurrent synaptic weights of the soma, excitatory dendrite, and inhibitory dendrite, respectively. In addition, $W\_{ij}$, $W\_{ij}^{e}$, and $W\_{ij}^{i}$ represent the input synaptic weights of the soma, excitatory dendrite, and inhibitory dendrite, respectively. The constants $\kappa\_{ij}$, $\kappa\_{ij}^{e}$, and $\kappa\_{ij}^{i}$ represent the delays of the input synapses for the soma, excitatory dendrite, and inhibitory dendrite, respectively. The constants $\kappa\_{ij}^{rec}$, $\kappa\_{ij}^{erec}$, and $\kappa\_{ij}^{irec}$ represent the delays of the recurrent synapses for the soma, excitatory dendrite, and inhibitory dendrite, respectively. The spike trains $\chi\_i(t)$ and $\varepsilon\_i(t)$ are modeled as sums of Dirac pulses and represent the spike trains from input neurons and from neurons with recurrent connections, respectively. The dynamics of the proposed spiking neuron model are shown in Figure 1.
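As a rough illustration of Equations (1)–(3), the following sketch integrates the three-compartment neuron with a forward Euler scheme. It is a minimal sketch under stated assumptions, not the implementation used in this study: all parameter values are hypothetical placeholders (the values of Table 1 are not reproduced here), time is measured in milliseconds, the spike condition (the somatic potential reaching Γ<sub>j</sub>(*t*)) is assumed, and the refractory period is omitted.

```python
import math


def simulate_neuron(i_soma, i_inh, i_exc, dt=1.0):
    """Forward-Euler sketch of Equations (1)-(3); all constants are hypothetical."""
    tau_m, tau_i, tau_e = 20.0, 5.0, 5.0   # membrane time constants (ms), assumed
    r_m, r_i, r_e = 1.0, 1.0, 1.0          # membrane resistances, assumed
    g_i, g_e = 0.5, 0.5                    # dendritic conductances, assumed
    theta_i, theta_e = 0.0, 0.0            # reversal potentials, assumed
    tau0, alpha, tau_a = 1.0, 1.8, 200.0   # baseline, scale, adaptation constant
    beta = math.exp(-dt / tau_a)

    u_m = u_i = u_e = 0.0                  # compartment potentials
    tau_j = 0.0                            # deviation term tau_j(t) of Equation (3)
    spikes, thresholds = [], []
    for im, ii, ie in zip(i_soma, i_inh, i_exc):
        gamma = tau0 + alpha * tau_j                 # Equation (2)
        z = 1.0 / dt if u_m >= gamma else 0.0        # assumed spike condition
        # Equation (1): one Euler step per compartment
        du_m = (-u_m + r_m * im + g_i * (u_i - theta_i)
                + g_e * (u_e - theta_e) - gamma * z) / tau_m
        du_i = (-u_i + r_i * ii) / tau_i
        du_e = (-u_e + r_e * ie) / tau_e
        u_m, u_i, u_e = u_m + dt * du_m, u_i + dt * du_i, u_e + dt * du_e
        tau_j = beta * tau_j + (1.0 - beta) * z      # Equation (3)
        spikes.append(z)
        thresholds.append(gamma)
    return spikes, thresholds


steps = 300
spikes, thresholds = simulate_neuron([2.0] * steps, [0.0] * steps, [0.0] * steps)
print(sum(1 for z in spikes if z > 0), max(thresholds))
```

With a constant somatic input, Γ<sub>j</sub>(*t*) rises above its baseline after the first spikes, which is the adaptive behaviour that Equations (2) and (3) describe.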

**Table 1.** Parameter settings of the spiking neuron model.


We integrate the spiking neuron model into an SNN framework and test the accuracy of this new model on different types of learning tasks. The structure of the SNN model is shown in Figure 2. The model is divided into three layers: an input layer, a hidden layer, and an output layer. Depending on the task, we choose different encoding methods for the input layer and decoding methods for the output layer. In Figure 2, the solid blue lines represent feed-forward inhibitory synaptic connections, while the red dashed lines represent lateral inhibitory synaptic connections. The dendrites and soma of different neurons in the hidden layer are connected by lateral inhibitory synapses that are both random and sparse. Information is transmitted from the input layer to the dendrites, and the soma transmits impulse signals to the output layer. The initial network weights in the proposed SNN model are drawn from a Gaussian distribution, $W\_{ij} \sim \frac{w\_0}{\sqrt{n\_{in}}} \mathcal{N}(0, 1)$, where $n\_{in}$ represents the number of input neurons of the weight matrix, $\mathcal{N}(0, 1)$ represents the Gaussian distribution with zero mean and unit variance, and $w\_0 = \Delta t / R\_m$ represents a weight-scaling factor that depends on the time step Δ*t* and the membrane resistance $R\_m$. This scaling factor is important because it initializes the spiking neural network with a practical firing rate, which is needed for efficient training.

**Figure 1.** Dynamics of the proposed spiking neuron. (**a**) The biological structure that inspires the proposed neuron model. (**b**) The adaptive dynamics of the threshold along with the firing events.

We use a deep rewiring algorithm because it maintains the sign of each synapse during the learning process [30]. Hence, this sign is inherited from the initial weights of the network. Consequently, the model needs efficient and reasonable initial weights for both excitatory and inhibitory neurons. To achieve this, we sample the sign $k\_i \in \{-1, 1\}$ of each neuron randomly from a Bernoulli distribution. At the same time, to avoid the problem of exploding gradients, we scale the weights so that the largest eigenvalue is less than 1. A large square matrix is generated, from which the required number of rows is selected with uniform probability. This square matrix is then multiplied by a binary mask, resulting in a sparse matrix, as part of the deep rewiring algorithm mentioned above. This algorithm maintains the level of sparse connectivity in the network by dynamically disconnecting some synapses while reconnecting others. In this algorithm, we set the temperature parameter to 0 and the L1-norm regularization parameter to 0.01.
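The initialization steps described above can be sketched as follows. This is only an illustrative sketch: the function name, the excitatory probability `p_exc`, the connection `density`, and the choice to combine a Gaussian magnitude with a per-neuron sign are assumptions made for the example, and the eigenvalue rescaling and the rewiring itself are omitted.

```python
import math
import random


def init_signed_sparse_weights(n_in, n_out, dt=1.0, r_m=1.0,
                               p_exc=0.8, density=0.2, seed=0):
    """Gaussian magnitudes scaled by w0 / sqrt(n_in), a Bernoulli sign per
    presynaptic neuron, and a random binary mask (all settings hypothetical)."""
    rng = random.Random(seed)
    w0 = dt / r_m                                   # weight-scaling factor
    scale = w0 / math.sqrt(n_in)
    # Bernoulli sign k_i in {-1, 1}: excitatory with probability p_exc
    signs = [1 if rng.random() < p_exc else -1 for _ in range(n_in)]
    weights = []
    for i in range(n_in):
        row = []
        for _ in range(n_out):
            magnitude = abs(rng.gauss(0.0, 1.0)) * scale
            connected = rng.random() < density      # entry of the binary mask
            row.append(signs[i] * magnitude if connected else 0.0)
        weights.append(row)
    return signs, weights


signs, weights = init_signed_sparse_weights(n_in=50, n_out=40)
active = sum(1 for row in weights for w in row if w != 0.0)
print(active, "of", 50 * 40, "synapses are active")
```

Because every non-zero weight carries the sign of its presynaptic neuron, an update rule that preserves signs keeps each neuron purely excitatory or purely inhibitory.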

**Figure 2.** Network architecture for learning and memory integrated with the proposed SAM model. This network architecture is comparable to a 2-layer network of point neurons. The soma and dendrites of different neurons in the hidden layer are connected to lateral inhibitory synapses randomly. The gray circles in the input layer and output layer are not SAM neurons, representing the input spiking neuron and output spiking neuron, respectively. The input and output encodings are determined for different tasks, which will be described in the section of experimental results.

#### *2.2. BPTT Training Algorithm*

In common ANN models, the gradients of the loss function with respect to the weights in the network are obtained using back propagation. Nevertheless, back propagation cannot be directly applied to SNNs because spikes are non-differentiable. In addition, the gradient needs to be propagated through continuous time or, provided that time is discretized, through multiple time steps. To enable the SNN model to learn in the training process, we use a pseudo-derivative technique as shown below

$$\frac{dz\_j(t)}{dv\_j(t)} = k \max\{0, 1 - \left|v\_j(t)\right|\}\tag{5}$$

where *k* = 0.3 (typically less than 1) is a constant that dampens the growth of back-propagated errors through spikes and thereby stabilizes the performance. The variable *zj*(*t*) represents the spike train of neuron *j* that assumes values in {0, 1}. The variable *vj*(*t*) represents the normalized membrane potential, which is defined as follows

$$v\_j(t) = \frac{V\_j(t) - \Gamma\_j(t)}{\Gamma\_j(t)}\tag{6}$$

where $\Gamma\_j$ represents the firing threshold of neuron *j*. To provide the proposed SAM model with the self-learning capability required for reinforcement learning, we use a proximal policy optimization algorithm [31]. This algorithm is easy to implement and allows the model to have self-learning capabilities. The clipped surrogate objective of this algorithm is defined as $O^{PPO}(\vartheta\_{old}, \vartheta, t, k)$. Therefore, the loss function with respect to $\vartheta$ is formulated as

$$L\_P(\vartheta) = -\frac{\sum\_{k < K} \sum\_{t < T} O^{PPO}(\vartheta\_{old}, \vartheta, t, k)}{KT} + \mu\_{f} \frac{1}{n} \sum\_{j} \left\| \frac{\sum\_{k, t} z\_j(t, k) - f^0}{KT} \right\|^2 \tag{7}$$

where $f^0$ represents a target firing rate of 10 Hz and $\mu\_f$ represents a regularization hyperparameter. The variables *t* and *k* index the simulation time step and the episode, respectively. The variable *ϑ* represents the current policy parameter, which is defined in the previous research [31]. In each iteration of training, *K* = 10 episodes of *T* = 2000 time steps are generated with a fixed parameter *ϑold*, which is the vector of policy parameters before the update as expressed in [31]. The loss function $L\_P(\vartheta)$ is minimized by the ADAM optimizer [32].
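The pseudo-derivative of Equations (5) and (6) can be sketched in a few lines. The helper names below are hypothetical, and the sketch covers only the surrogate gradient itself, not the full back propagation through time.

```python
def normalized_potential(v, gamma):
    """Equation (6): membrane potential normalized by Gamma_j(t)."""
    return (v - gamma) / gamma


def pseudo_derivative(v_norm, k=0.3):
    """Equation (5): triangular surrogate for dz/dv, damped by the constant k."""
    return k * max(0.0, 1.0 - abs(v_norm))


# The surrogate peaks at k when V_j(t) equals Gamma_j(t) and vanishes once the
# normalized potential is at least 1 away from zero.
for v in (0.0, 0.5, 1.0, 1.5, 3.0):
    print(v, pseudo_derivative(normalized_potential(v, gamma=1.0)))
```

The triangular shape keeps the gradient non-zero only near the spiking point, while the factor *k* < 1 limits how much the error grows as it is propagated through many spikes.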

#### *2.3. Minimum Error Entropy Criterion (MEEC)*

The minimum error entropy (MEE) criterion minimizes the entropy of the estimation error and thereby decreases the uncertainty in the learning process. For a random variable *e* with probability density function *f*(*e*), Renyi's entropy of order α is defined as

$$H\_{\alpha}(e) \triangleq \frac{1}{1-\alpha} \log \int f^{\alpha}(e)\, de \tag{8}$$

where α is set to 2 in this study, which gives Renyi's quadratic entropy. Kernel density estimation (KDE) is used to estimate the PDF of the error samples, which has three advantages. First, it is a non-parametric approach, which does not require prior knowledge of the error distribution. Second, it does not require the calculation of an integral. Third, it is smooth and differentiable, which is vital for the gradient computation. Considering a set of i.i.d. data $\{e\_i\}\_{i=1}^{N}$ drawn from the distribution, the KDE of the PDF can be formulated as

$$\hat{f}\_E(e) = \frac{1}{N} \sum\_{i=1}^{N} G\_{\Sigma}(e - e\_i) \tag{9}$$

where $G\_{\Sigma}(e - e\_i)$ represents the Gaussian function, which is given by

$$G\_{\Sigma}(e - e\_{i}) = \frac{1}{\sqrt{(2\pi)^{S} \det \Sigma}} \cdot \exp\left(-\frac{1}{2}(e - e\_{i})^{T} \Sigma^{-1}(e - e\_{i})\right) \tag{10}$$

where *N* and Σ represent the number of data points and the kernel parameter, respectively. In this research, Σ represents a diagonal matrix whose *s*-th diagonal element is the variance $\delta\_s^2$ for $e\_s$ in *e*, where *s* = 1, 2, ... , *S*. The kernel parameter is a free parameter. Thus, Renyi's quadratic entropy can be expressed as

$$\begin{split} H\_2(e) &= -\log \int \left( \frac{1}{N} \sum\_{i=1}^{N} G\_{\Sigma}(e - e\_i) \right)^2 de \\ &= -\log \frac{1}{N^2} \int \left( \sum\_{i=1}^{N} \sum\_{j=1}^{N} G\_{\Sigma}(e - e\_i) G\_{\Sigma}(e - e\_j) \right) de \\ &= -\log \frac{1}{N^2} \left( \sum\_{i=1}^{N} \sum\_{j=1}^{N} \int G\_{\Sigma}(e - e\_i) G\_{\Sigma}(e - e\_j)\, de \right) \\ &= -\log \frac{1}{N^2} \left( \sum\_{i=1}^{N} \sum\_{j=1}^{N} G\_{\sqrt{2}\Sigma}(e\_i - e\_j) \right) \\ &= -\log \frac{1}{N^2} \left( \sum\_{i=1}^{N} \sum\_{j=1}^{N} G\_{\Sigma\_2}(e\_i - e\_j) \right) \end{split} \tag{11}$$

Based on Equation (11), we define a function *V*(*e*) to represent the information potential of variable *e*, which is formulated as

$$V(e) = \frac{1}{N^2} \left( \sum\_{i=1}^{N} \sum\_{j=1}^{N} G\_{\Sigma\_2}(e\_i - e\_j) \right) \tag{12}$$

Therefore, minimizing Renyi's entropy *H*2(*e*) is equivalent to maximizing the information potential *V*(*e*), because the log function is monotonically increasing. A Parzen window is used to decrease the computational complexity, and the instantaneous information potential at time step *k* can be formulated as

$$J\_1(e) = \frac{1}{W} \sum\_{i=k-W+1}^{k} G\_{\Sigma\_2}(e\_k - e\_i) \tag{13}$$

where *W* represents the length of the Parzen window. It should be noted that MEE is a local optimization criterion and suffers from the shift-invariance problem: it can determine only the shape of the error PDF, not its location. The function $G\_{\Sigma\_2}(\cdot)$ can be defined as the Gaussian kernel function with bandwidth *σ*

$$G\_{\Sigma\_2}(x) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{x^2}{2\sigma^2}\right) \tag{14}$$
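For scalar errors, the information potential of Equation (12) and the corresponding quadratic entropy can be sketched as follows. This is a minimal sketch: the function names are hypothetical, and `sigma` is taken directly as the bandwidth of the kernel in Equation (14).

```python
import math


def gaussian_kernel(x, sigma):
    """Equation (14): scalar Gaussian kernel with bandwidth sigma."""
    return math.exp(-x * x / (2.0 * sigma * sigma)) / (math.sqrt(2.0 * math.pi) * sigma)


def information_potential(errors, sigma):
    """Equation (12): average kernel value over all pairs of error samples."""
    n = len(errors)
    return sum(gaussian_kernel(ei - ej, sigma) for ei in errors for ej in errors) / (n * n)


def quadratic_entropy(errors, sigma):
    """Renyi's quadratic entropy as the negative log of the information potential."""
    return -math.log(information_potential(errors, sigma))


tight = [0.0, 0.1, -0.1, 0.05]
wide = [0.0, 2.0, -2.0, 4.0]
shifted = [e + 5.0 for e in tight]
print(quadratic_entropy(tight, 1.0), quadratic_entropy(wide, 1.0))
print(quadratic_entropy(shifted, 1.0))  # equals the entropy of `tight` up to rounding
```

Only the pairwise differences enter the estimate, so adding a constant offset to every error leaves the entropy unchanged. This is the shift invariance noted above.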

In order to reduce the computational complexity, a quantization technique is used to realize the quantized MEE (QMEE). Thus, the information potential is expressed as

$$V^{Q}(e) = \frac{1}{N^2} \left( \sum\_{i=1}^{N} \sum\_{j=1}^{N} G\_{\Sigma\_2} \left( e\_i - Q\left[ e\_j \right] \right) \right) = \frac{1}{N^2} \sum\_{i=1}^{N} \sum\_{j=1}^{M} \varphi\_j G\_{\Sigma\_2} \left( e\_i - c\_j \right) \tag{15}$$

where $Q[\cdot]$ represents a quantization operator that maps each error sample in $\{e\_i\}\_{i=1}^{N}$ to one of the code words $\{c\_j\}\_{j=1}^{M}$, resulting in a codebook $C = (c\_1, c\_2, c\_3, \ldots, c\_M)$. $\Phi = (\varphi\_1, \varphi\_2, \ldots, \varphi\_M)$ collects the number of samples quantized to each code word $c\_j$. It should be noted that $\sum\_{j=1}^{M} \varphi\_j = N$. Theoretical proof of the robustness has been presented in [22].
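The quantized information potential of Equation (15) can be sketched for scalar errors as below. The nearest-code-word quantizer and the fixed, user-supplied codebook are assumptions made for the illustration, since the construction of the codebook is not specified here.

```python
import math


def gaussian_kernel(x, sigma):
    """Scalar Gaussian kernel with bandwidth sigma (Equation (14))."""
    return math.exp(-x * x / (2.0 * sigma * sigma)) / (math.sqrt(2.0 * math.pi) * sigma)


def quantize(errors, codebook):
    """Assumed quantizer Q[.]: map each error to its nearest code word and
    count how many samples fall on each code word (the numbers phi_j)."""
    indices = [min(range(len(codebook)), key=lambda j: abs(e - codebook[j]))
               for e in errors]
    counts = [indices.count(j) for j in range(len(codebook))]
    return indices, counts


def quantized_information_potential(errors, codebook, sigma):
    """Right-hand side of Equation (15): N * M kernel evaluations instead of N * N."""
    _, counts = quantize(errors, codebook)
    n = len(errors)
    total = sum(phi * gaussian_kernel(e - c, sigma)
                for e in errors for phi, c in zip(counts, codebook))
    return total / (n * n)


errors = [0.02, -0.01, 0.98, 1.03, 0.0]
print(quantize(errors, [0.0, 1.0])[1])
print(quantized_information_potential(errors, [0.0, 1.0], sigma=0.5))
```

The double sum now runs over *N* samples and *M* code words, which is where the reduction in computational cost comes from when *M* is much smaller than *N*.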

#### *2.4. Restricted MEEC*

In this study, similarity is measured with the inner product, which is generalized from its use for vectors [33]. The inner product similarity between continuous pdfs *fX*(*x*) and *gX*(*x*) can be expressed as

$$
\langle f\_X(x), g\_X(x) \rangle = \int\_X f\_X(x) g\_X(x)\, dx \tag{16}
$$

The desired distribution *ρE*(*e*), which is expressed in [33] in detail, can be defined as follows

$$\rho\_E(e) = \begin{cases} \zeta\_{0}, & e = 0 \\ \zeta\_{-1}, & e = -1 \\ \zeta\_{1}, & e = 1 \\ 0, & \text{otherwise} \end{cases} \tag{17}$$

where *ζ<sup>i</sup>* (*i* = 0, −1, 1) denotes the corresponding density for each peak, which is simplified into a Dirac-δ function.

The maximization of the similarity measure between the error pdf *fE*(*e*) and the desired distribution *ρE*(*e*) can be formulated as

$$\begin{array}{l} \max \langle f\_E(e), \rho\_E(e) \rangle \\ \Leftrightarrow \max \int f\_E(e) \rho\_E(e)\, de \\ \Leftrightarrow \max \zeta\_0 f\_E(0) + \zeta\_{-1} f\_E(-1) + \zeta\_1 f\_E(1) \end{array} \tag{18}$$

Furthermore, the model parameter can be expressed as

$$\begin{aligned} w^{\*} &= \operatorname\*{argmax}\_{w} \; \zeta\_0 f\_E(0) + \zeta\_{-1} f\_E(-1) + \zeta\_1 f\_E(1) \\ &= \operatorname\*{argmax}\_{w} \begin{pmatrix} \zeta\_0 \frac{1}{N} \sum\_{i=1}^N G\_{\Sigma\_2}(0 - e\_i) \\ + \zeta\_{-1} \frac{1}{N} \sum\_{i=1}^N G\_{\Sigma\_2}(-1 - e\_i) \\ + \zeta\_1 \frac{1}{N} \sum\_{i=1}^N G\_{\Sigma\_2}(1 - e\_i) \end{pmatrix} \\ &= \operatorname\*{argmax}\_{w} \frac{1}{N^2} \sum\_{i=1}^N \begin{pmatrix} N \zeta\_0 G\_{\Sigma\_2}(e\_i) \\ + N \zeta\_{-1} G\_{\Sigma\_2}(e\_i + 1) \\ + N \zeta\_1 G\_{\Sigma\_2}(e\_i - 1) \end{pmatrix} \end{aligned} \tag{19}$$

In fact, QMEE concentrates the prediction errors around the code words $\{c\_j\}\_{j=1}^{M}$ to obtain a compact error distribution. Based on the method in [33], QMEE is implemented with a predetermined codebook *C* = (0, −1, 1) to restrict the errors to three positions and avoid the undesirable double-peak learning outcome. Therefore, the restricted MEE (RMEE) criterion can be formulated as

$$V^{R}(e) = \frac{1}{N^2} \sum\_{i=1}^{N} \begin{pmatrix} \varphi\_0 G\_{\Sigma\_2}(e\_i) \\ + \varphi\_{-1} G\_{\Sigma\_2}(e\_i + 1) \\ + \varphi\_1 G\_{\Sigma\_2}(e\_i - 1) \end{pmatrix} \tag{20}$$

where $\Phi = (\varphi\_0, \varphi\_{-1}, \varphi\_1) = (N\zeta\_0, N\zeta\_{-1}, N\zeta\_1)$ represents the number of samples assigned to each code word of *C* = (0, −1, 1). The proposed RMEE algorithm maximizes the inner product similarity between the error pdf $f\_E(e)$ and the optimal three-peak distribution $\rho\_E(e)$. RMEE is a special case of QMEE in which the codebook is predetermined as *C* = (0, −1, 1), so that the learning errors converge to these three locations.
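Equation (20) can be evaluated directly for scalar errors, as in the sketch below. The function names and the example densities `zetas` are hypothetical, and the bandwidth is chosen only for illustration.

```python
import math

CODEBOOK = (0.0, -1.0, 1.0)  # predetermined code words of RMEE


def gaussian_kernel(x, sigma):
    """Scalar Gaussian kernel with bandwidth sigma (Equation (14))."""
    return math.exp(-x * x / (2.0 * sigma * sigma)) / (math.sqrt(2.0 * math.pi) * sigma)


def restricted_information_potential(errors, zetas, sigma):
    """Equation (20) with phi = N * zeta for the code words (0, -1, 1)."""
    n = len(errors)
    phis = [n * z for z in zetas]
    total = sum(phi * gaussian_kernel(e - c, sigma)
                for e in errors for phi, c in zip(phis, CODEBOOK))
    return total / (n * n)


zetas = (0.5, 0.25, 0.25)                      # hypothetical peak densities
on_peaks = [0.0, -1.0, 1.0, 0.0]
off_peaks = [0.5, -0.5, 0.5, -0.5]
print(restricted_information_potential(on_peaks, zetas, sigma=0.2))
print(restricted_information_potential(off_peaks, zetas, sigma=0.2))
```

Errors that sit on the three code words yield a much larger potential than errors that lie between them, so maximizing this quantity pulls the error distribution towards the three peaks.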

In order to optimize Equation (19), the half-quadratic technique is used. A convex function $g(x) = -x\log(-x) + x$, with *x* < 0, is defined, and the information potential can be expressed as

$$V^{R}(e) = \sum\_{i=1}^{N} \begin{pmatrix} \varphi\_{0} \left\{ u\_{i} \frac{e\_{i}^{2}}{2\sigma^{2}} - g\left(u\_{i}\right) \right\} \\ + \varphi\_{-1} \left\{ v\_{i} \frac{\left(e\_{i} + 1\right)^{2}}{2\sigma^{2}} - g\left(v\_{i}\right) \right\} \\ + \varphi\_{1} \left\{ s\_{i} \frac{\left(e\_{i} - 1\right)^{2}}{2\sigma^{2}} - g\left(s\_{i}\right) \right\} \end{pmatrix} \triangleq J\_{R1}(w, u\_{i}, v\_{i}, s\_{i}) \tag{21}$$

In the half-quadratic technique, the auxiliary variables satisfy the following relationship

$$\begin{aligned} u\_i^k &= -\exp\left(-\frac{e\_i^2}{2\sigma^2}\right) < 0\\ v\_i^k &= -\exp\left(-\frac{\left(e\_i + 1\right)^2}{2\sigma^2}\right) < 0\\ s\_i^k &= -\exp\left(-\frac{\left(e\_i - 1\right)^2}{2\sigma^2}\right) < 0\\ (i &= 1, 2, \dots, N). \end{aligned} \tag{22}$$

By attaining the optimal ($u\_i^k$, $v\_i^k$, $s\_i^k$) in the *k*th iteration, and writing each error as $e\_i = t\_i - y\_i$, the information potential can be formulated as

$$V^{R}(e) = \sum\_{i=1}^{N} \begin{pmatrix} \varphi\_{0} u\_{i} \left( t\_{i} - y\_{i} \right)^{2} \\ + \varphi\_{-1} v\_{i} \left( t\_{i} + 1 - y\_{i} \right)^{2} \\ + \varphi\_{1} s\_{i} \left( t\_{i} - 1 - y\_{i} \right)^{2} \end{pmatrix} \triangleq J\_{R2}(w) \tag{23}$$

$J\_{R2}(w)$ can be optimized with gradient-based methods because the objective function is continuous and differentiable. For example, the gradient of $J\_{R2}(w)$ can be expressed as

$$\frac{\partial}{\partial w}J\_{R2}(w) = \sum\_{i=1}^{N} \begin{pmatrix} \varphi\_{0} u\_{i} \frac{\partial (t\_{i} - y\_{i})^{2}}{\partial w} \\ + \varphi\_{-1} v\_{i} \frac{\partial (t\_{i} + 1 - y\_{i})^{2}}{\partial w} \\ + \varphi\_{1} s\_{i} \frac{\partial (t\_{i} - 1 - y\_{i})^{2}}{\partial w} \end{pmatrix} = -2 \sum\_{i=1}^{N} \begin{pmatrix} \varphi\_{0} u\_{i} e\_{i} \\ + \varphi\_{-1} v\_{i} (e\_{i} + 1) \\ + \varphi\_{1} s\_{i} (e\_{i} - 1) \end{pmatrix} x\_{i} y\_{i} (1 - y\_{i}) \tag{24}$$

The detailed algorithm of the HQ-based optimization and its convergence analysis for RMEE are presented in [33].
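The closed-form auxiliary variables in Equation (22) can be checked with a short calculation. The derivation below is a standard verification of the conjugate identity behind the half-quadratic technique, written for a generic argument *x* ≥ 0; it is offered as a worked check and is not taken from [33].

$$\begin{aligned} h(u) &= u x - g(u) = u x + u\log(-u) - u, \qquad u < 0, \\ h'(u) &= x + \log(-u) = 0 \;\Rightarrow\; u^{\*} = -\exp(-x), \\ h''(u) &= \frac{1}{u} < 0, \\ h(u^{\*}) &= -x e^{-x} + x e^{-x} + e^{-x} = \exp(-x). \end{aligned}$$

Hence $\exp(-x) = \max\_{u<0} \{u x - g(u)\}$, and setting $x = e\_i^2/(2\sigma^2)$ gives the maximizer $u\_i = -\exp(-e\_i^2/(2\sigma^2))$, in agreement with Equation (22). Because the magnitude of $u\_i$ decays quickly as the error grows, samples with large errors contribute little to the weighted quadratic form in Equation (23).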

#### **3. Results**

#### *3.1. Proposed Network with RMEE Criterion*

Since MEE is shift-invariant, estimation results based on the MEEC will not always converge to the true value. One remedy is to combine the RMEE criterion with the cross-entropy loss (CEE) to obtain a globally optimal solution. The cross-entropy loss function, also known as log loss, is one of the most commonly used loss functions for back propagation. It increases as the predicted probability deviates from the actual label, and can be described as follows

$$L\_{ce}\left(\hat{y}\_{i}, y\_i\right) = -\sum\_{i} y\_i \log\left(\hat{y}\_i\right) \tag{25}$$

In this paper, the label $l^n$ of each image is used, which is set to 1 only for images that belong to the same class as the image presented during testing, and to 0 otherwise. The cross-entropy formula can be expressed as

$$J\_2 = \sum\_{n=1}^{5} -l^n \log \sigma \left( y^{20+20 \cdot n} \right) - (1 - l^n) \log \left( 1 - \sigma \left( y^{20+20 \cdot n} \right) \right) \tag{26}$$

where the output of the SNN model is only counted after all images are fully rendered. Therefore, for the novel criterion, the performance index can be formulated as

$$J\_k(e) = \mu \left[ \sum\_{i=1}^{N} \begin{pmatrix} \varphi\_0 u\_i (t\_i - y\_i)^2 \\ + \varphi\_{-1} v\_i (t\_i + 1 - y\_i)^2 \\ + \varphi\_1 s\_i (t\_i - 1 - y\_i)^2 \end{pmatrix} \right] + (1 - \mu) \left[ \sum\_{n=1}^5 \begin{pmatrix} -l^n \log \sigma \left( y^{20 + 20 \cdot n} \right) \\ - (1 - l^n) \log \left( 1 - \sigma \left( y^{20 + 20 \cdot n} \right) \right) \end{pmatrix} \right] \tag{27}$$

where *μ* represents a weighting constant. In the supervised learning tasks, the loss consists only of the cross-entropy and RMEE terms, as described in Equation (27).
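The weighting in Equation (27) can be sketched as below. The function names and the default value of `mu` are hypothetical, the RMEE term is passed in as a precomputed number, and the second logarithm of Equation (26) is assumed to act on 1 − σ(*y*).

```python
import math


def sigmoid(x):
    """Logistic function sigma(x)."""
    return 1.0 / (1.0 + math.exp(-x))


def cross_entropy_term(labels, readouts):
    """Equation (26): binary cross-entropy summed over the read-out values."""
    return sum(-l * math.log(sigmoid(y)) - (1 - l) * math.log(1.0 - sigmoid(y))
               for l, y in zip(labels, readouts))


def combined_index(rmee_term, labels, readouts, mu=0.5):
    """Equation (27): mu * RMEE term + (1 - mu) * cross-entropy term."""
    return mu * rmee_term + (1.0 - mu) * cross_entropy_term(labels, readouts)


labels = [1, 0, 0, 1, 0]
readouts = [2.0, -1.5, -0.5, 1.0, -3.0]
print(cross_entropy_term(labels, readouts))
print(combined_index(-0.8, labels, readouts, mu=0.3))
```

Setting `mu` to 0 recovers the plain cross-entropy loss, while `mu` equal to 1 keeps only the RMEE term.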

#### *3.2. Autonomous Navigation*

We first apply the proposed SNN model to an agent navigation task, which requires the network to have reinforcement learning capabilities. The agent needs to learn to find objects in a 2D area and eventually be able to navigate to objects at random locations in the area. This task is related to the well-known Morris water maze paradigm from neuroscience, which is designed to study learning in the brain [34]. In this task, a virtual agent is simulated as a point in the 2D simulation arena and is controlled by the proposed SNN model. At the beginning of an episode, the position of the agent is set randomly with a uniform probability over the whole arena. At each time step, the agent selects an action in the form of a velocity vector with a small Euclidean norm. It receives a reward value of 1 after reaching the destination.

In the navigation task, the information *s*(*t*) about the current environment state and the reward score *r*(*t*) are received as input data by neurons in the input layer at each time step. The position coordinates are encoded by the input neurons through Gaussian population rate encoding: each neuron in the input layer is assigned a coordinate value and fires with the rate $r\_{\max} \exp(-100(\xi\_i - \xi)^2)$, where $\xi\_i$ and $\xi$ represent the preferred coordinate value of neuron *i* and the actual coordinate value, respectively, and $r\_{\max}$ is set to 500 Hz. Moreover, the instantaneous reward *r*(*t*) is encoded by two groups of input neurons. In the first group, the neurons generate spikes in sync when a positive reward is received, while in the second group, the neurons generate spikes as long as the proposed SNN model receives a negative reward. The output of the network is represented by five readout neurons in the output layer with membrane potentials $\lambda\_i(t)$. The action vector $\zeta(t) = (\zeta\_x(t), \zeta\_y(t))^{T}$ determines the movement of the agent in the navigation task. It is sampled from a Gaussian distribution with means $\mu\_x = \tanh(\lambda\_1(t))$ and $\mu\_y = \tanh(\lambda\_2(t))$ and variances $\Phi\_x = \sigma(\lambda\_3(t))$ and $\Phi\_y = \sigma(\lambda\_4(t))$. Finally, the output of the last readout neuron $\lambda\_5$ is used to predict the value function $\mu\_\theta(t)$, which estimates the expected discounted sum of future rewards $\Omega(t) = \sum\_{t' > t} \gamma^{t' - t} \omega(t')$, where $\omega(t')$ represents the reward at time *t'* and *γ* represents the discount factor, whose value is usually 0.99.
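The input encoding, the action sampling, and the discounted return described above can be sketched as follows. The helper names, the evenly spaced preferred coordinates, and the example read-out values are assumptions made for the illustration.

```python
import math
import random


def population_rates(xi, preferred, r_max=500.0):
    """Gaussian population rate code: rate of each input neuron for coordinate xi."""
    return [r_max * math.exp(-100.0 * (p - xi) ** 2) for p in preferred]


def sigmoid(x):
    """Logistic function sigma(x)."""
    return 1.0 / (1.0 + math.exp(-x))


def sample_action(readouts, rng):
    """Draw (zeta_x, zeta_y) from Gaussians set by the first four read-outs."""
    mu_x, mu_y = math.tanh(readouts[0]), math.tanh(readouts[1])
    var_x, var_y = sigmoid(readouts[2]), sigmoid(readouts[3])
    return rng.gauss(mu_x, math.sqrt(var_x)), rng.gauss(mu_y, math.sqrt(var_y))


def discounted_return(rewards, t, gamma=0.99):
    """Sum over t' > t of gamma^(t' - t) * reward(t')."""
    return sum(gamma ** (tp - t) * rewards[tp] for tp in range(t + 1, len(rewards)))


preferred = [i / 10.0 for i in range(11)]       # hypothetical preferred coordinates
print(population_rates(0.5, preferred))
print(sample_action([0.2, -0.1, 0.0, 0.0, 0.3], random.Random(0)))
print(discounted_return([0.0, 0.0, 1.0], t=0))
```

Only neurons whose preferred coordinate lies close to the actual coordinate fire at a rate near `r_max`, so the position is represented by which neurons are active rather than by a single analogue value.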

After the meta-learning process, the agent based on the proposed SNN model learns to learn to navigate towards the correct destination. The overall training procedure for reward learning is described in Algorithm 1. We add other loss functions to support the reinforcement learning framework, while keeping the loss function consistent with Equation (26). Figure 3 shows the destination reached number (DRN), i.e., the number of times the destination is successfully reached, per learning iteration. Each iteration contains a batch of ten episodes, and network weights are updated during the navigation task. In each episode, the model is expected to explore until it reaches and stores the destination location, and then to use this prior knowledge to find the shortest path to the destination. This reveals that the proposed SNN model has meta-learning capability in the autonomous navigation task.

**Figure 3.** Navigation performance of the proposed model with different settings.

**Algorithm 1** Training process in the reward learning process

**Input:** number of full episodes **K**, timesteps **T**, fixed parameters *θ*<sub>old</sub>, target firing rate *f*<sub>0</sub>, regularization hyper-parameters *μ<sub>v</sub>*, *μ<sub>e</sub>*, *μ*<sub>firing</sub>, bandwidth *σ*, predicted value function **V**<sub>θ</sub>(*t*, *k*) and sum of future rewards **R**(*t*, *k*)

**Output:** total loss *L*θ.

1. Parameter setting: *f*<sub>0</sub>, *μ<sub>v</sub>*, *μ<sub>e</sub>*, *μ*<sub>firing</sub>, and *σ*.
2. **for** n **in** batch size N:
3. Set *e<sub>n</sub>* = **R**(*t*, *k*) − **V**<sub>θ</sub>(*t*, *k*)
4. **if** number of iteration is 0:
5. (*ϕ*<sub>0</sub>, *ϕ*<sub>−1</sub>, *ϕ*<sub>1</sub>) = (*N*, 0, 0)
6. **else**: (*ϕ*<sub>0</sub>, *ϕ*<sub>−1</sub>, *ϕ*<sub>1</sub>) = (#{*e<sub>n</sub>* ∈ (−0.5, 0.5)}, #{*e<sub>n</sub>* ∈ (−1, −0.5)}, #{*e<sub>n</sub>* ∈ (0.5, 1)}), where #{·} indicates counting the samples that satisfy the condition
7. (*u<sub>n</sub>*, *v<sub>n</sub>*, *s<sub>n</sub>*) = (−exp(−*e<sub>n</sub>*<sup>2</sup>/(2*σ*<sup>2</sup>)), −exp(−(*e<sub>n</sub>* + 1)<sup>2</sup>/(2*σ*<sup>2</sup>)), −exp(−(*e<sub>n</sub>* − 1)<sup>2</sup>/(2*σ*<sup>2</sup>)))
8. *L*<sup>RMEE</sup><sub>*n*</sub> = *ϕ*<sub>0</sub>*u<sub>n</sub>e<sub>n</sub>*<sup>2</sup> + *ϕ*<sub>−1</sub>*v<sub>n</sub>*(*e<sub>n</sub>* + 1)<sup>2</sup> + *ϕ*<sub>1</sub>*s<sub>n</sub>*(*e<sub>n</sub>* − 1)<sup>2</sup>
9. **end for**
10. **for** k **in** K:
11. **for** t **in** T: *L*<sup>PPO</sup><sub>(*t*,*k*)</sub> = *O*<sub>PPO</sub>(*θ*<sub>old</sub>, *θ*, *t*, *k*)
12. **end for**
13. **end for**
14. Calculate the total loss: *L*(*e*) = *L<sub>p</sub>*(*e*) + *J<sub>k</sub>*(*e*)
15. **return** *L*(*e*)
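Steps 3 to 8 of Algorithm 1 can be sketched in a few lines of NumPy. This is only one reading of the listing, under stated assumptions: the errors *e<sub>n</sub>* are supplied as a precomputed array, the counts *ϕ* are taken over the whole batch, and the function name `rmee_loss` and the flag `first_iteration` are hypothetical. The PPO term and the regularizers in steps 10 to 14 are not included.

```python
import numpy as np

def rmee_loss(errors, sigma=1.0, first_iteration=False):
    """Per-sample RMEE-style loss following steps 3-8 of Algorithm 1.

    errors: array of e_n = R(t, k) - V_theta(t, k) for a batch (assumed given).
    sigma: kernel bandwidth.
    first_iteration: if True, use (phi_0, phi_-1, phi_1) = (N, 0, 0).
    """
    e = np.asarray(errors, dtype=float)
    n = e.size
    if first_iteration:
        phi_0, phi_m1, phi_p1 = n, 0, 0
    else:
        # Count the errors that fall into each open interval.
        phi_0 = np.sum((e > -0.5) & (e < 0.5))
        phi_m1 = np.sum((e > -1.0) & (e < -0.5))
        phi_p1 = np.sum((e > 0.5) & (e < 1.0))
    # Negative Gaussian kernels centred at 0, -1, and +1.
    u = -np.exp(-e**2 / (2 * sigma**2))
    v = -np.exp(-(e + 1) ** 2 / (2 * sigma**2))
    s = -np.exp(-(e - 1) ** 2 / (2 * sigma**2))
    return phi_0 * u * e**2 + phi_m1 * v * (e + 1) ** 2 + phi_p1 * s * (e - 1) ** 2
```

In an actual training loop, these per-sample values would be combined with the PPO objective and the regularization terms, as in step 14.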

*3.3. Working Memory Performance on Store–Recall Task with Non-Gaussian Noise*

To further demonstrate the robust working memory capability of the proposed SNN model, we apply the model to a store–recall task with non-Gaussian noise. The detailed settings of the store–recall task have been presented previously in [35]. The SNN model receives a sequence of frames that are represented by ten spike trains over a period of time. Inputs #1 and #2 are represented by the spiking activities of input neurons #1 to #10 and #11 to #20, respectively. As shown in Figure 4, neurons #21 to #30 and #31 to #40 receive the random store and recall commands, respectively. The store command directs attention to a specific frame of the input data flow; this frame is then reproduced when the recall command is received. Figure 4 shows one test example of the spiking activities after working memory training, together with the dynamic threshold, which changes along with the learning procedure. This reveals that the proposed SNN model exhibits working memory and performs the store–recall task successfully. Since working memory is a vital feature of, and the foundation for, meta-learning, this also suggests that the MeMEE model can perform meta-learning tasks robustly based on its working memory mechanisms.

**Figure 4.** Working memory capability of the proposed SNN model after training.

*3.4. Meta-Learning Performance on Sequential MNIST Data Set with Non-Gaussian Noise*

We further demonstrate the meta-learning capability of the proposed SNN model in a transfer learning task based on the sequential MNIST (sMNIST) data set. We divide the sMNIST data set into two parts. The first part includes 30,000 images for digits '0', '1', '2', '3', and '4', and the second part includes 30,000 patterns for digits '5', '6', '7', '8', and '9'. In the first phase, the first part is employed to train the SNN model, and the second part is then used for training. In the second phase, 10% salt and pepper noise is added to the testing data set as the non-Gaussian noise for the performance evaluation. Figure 5 shows the performance of the MeMEE model and compares it with counterpart models, including the recurrent SNN (RSNN) and the conventional LIF-based SNN model without the RMEE criterion. The proposed model outperforms the other solutions, for the following reasons. Firstly, the proposed model has meta-learning capability, so it exhibits transfer learning, and its transfer learning performance is accordingly superior to that of the RSNN model in both accuracy and convergence speed. Secondly, because the RMEE criterion is used as the loss function, its robustness to non-Gaussian noise is superior to that of the model without the RMEE criterion in terms of learning accuracy. The result suggests that the MeMEE model with the RMEE criterion has a more robust meta-learning capability in learning sequential spatio-temporal patterns.

**Figure 5.** Meta-learning capability of the proposed MeMEE model on sequential MNIST data set.

#### *3.5. Effects of Loss Parameters on Learning Performance*

In this study, we further investigate how each loss function affects the learning performance of the proposed MeMEE model. We use the sMNIST data set to evaluate and quantify the learning accuracy as the loss parameter changes. To demonstrate the learning robustness of the proposed MeMEE model, salt and pepper noise is added to the sMNIST data set at levels ranging from 3.19% to 19.13%. The parameter *μ* is varied from 0.3 to 1.0. As shown in Figure 6, *μ* values of 0.7, 0.8, and 0.9 yield higher learning accuracy on sequential visual recognition. This reveals that the RMEE criterion enhances the robustness of the proposed MeMEE model relative to the model without the RMEE criterion, i.e., *μ* = 1. Since the model without the RMEE criterion reaches only 83.6% accuracy with 3.19% non-Gaussian noise, the RMEE criterion improves the learning accuracy of the proposed MeMEE model under non-Gaussian salt and pepper noise.

**Figure 6.** Effects of loss parameters on the learning performance of sequential classification.

#### **4. Discussion**

This paper presents an information theoretic learning framework for robust spike-driven continual meta-learning. Different from the previous SNN learning research, we first introduce the RMEE criterion to develop and improve the spike-based learning framework, which is highly general and can also provide a series of theoretic insights. Moreover, the information theoretic framework allows us to obtain a direct understanding and better interpretation of the robust learning solutions of SNN models, compared with some previous studies focusing on improving the learning robustness of SNNs [36].

As a first step in establishing a rigorous framework for SNN continual meta-learning with RMEE, the presented research can be extended in both theoretical and practical aspects. From the theoretical point of view, one extension is to use the information potential to train the presented SNN model. For example, as shown in [37], Chen et al. presented a survival information potential algorithm for adaptive system training. This does not require computing of the kernel function and has good robustness performance accordingly. The other extension is to apply the proposed framework in other spike-based learning paradigms, including few-shot learning, multitask learning, and unsupervised learning [38].

From a practical point of view, the model is expected to be implemented on neuromorphic platforms to realize low-power and real-time systems for various types of applications. The state-of-the-art digital neuromorphic systems include Loihi [12], Tianjic [11], BiCoSS [13], CerebelluMorphic [14], LaCSNN [15], TrueNorth [39], and SpiNNaker [40]. By implementing embedded neuromorphic systems, it can be applied in different fields such as edge computing devices, brain–machine integration systems, and intelligent systems [41–43].

#### **5. Conclusions**

In this invited paper, we first presented an ITL-based scheme for robust spike-based continual meta-learning, which is improved by the RMEE criterion. A gradient descent learning principle is presented in a recurrent SNN architecture. Several tasks are realized to demonstrate the learning performance of the proposed MeMEE model, including autonomous navigation, robust working memory in the store–recall task and robust meta-learning capability for the sMNIST data set. In the first autonomous navigation task, the SNN model learns to find the correct destination by continual meta-learning from the task reward and punishment. This demonstrates that the MeMEE model based on the proposed RMEE criterion realizes the meta-learning capability for navigation and outperforms the conventional RSNN model. In the second task, the proposed MeMEE model improves the working memory performance by recalling the stored noisy patterns. In the third task, the proposed MeMEE model with RMEE criterion can enhance the robustness in the meta-learning task for noisy sMNIST images. This invited paper provides a novel insight into the improvement of the spike-based machine learning performance based on information theoretic learning strategy, which is critical for the further research of artificial general intelligence. In addition, it can be implemented by the low-power neuromorphic system, which can be applied in edge computing of internet of things (IoT) and unmanned systems.

**Author Contributions:** S.Y. and B.C. contributed to the conceptualization, methodology, and writing of this paper. J.T. helped to conduct the experiment. All authors have read and agreed to the published version of the manuscript.

**Funding:** This study was funded partly by the National Natural Science Foundation of China with grant numbers (Grant No. 62006170, No. 62088102, No. U21A20485) and partly by China Postdoctoral Science Foundation (Grant Nos. 2020M680885, 2021T140510).

**Acknowledgments:** We would like to thank the editor and reviewer for their comments on the manuscript.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

#### **References**


## *Article* **Probabilistic Deterministic Finite Automata and Recurrent Networks, Revisited**

**Sarah E. Marzen 1,\*,† and James P. Crutchfield 2,\*,†**


† These authors contributed equally to this work.

**Abstract:** Reservoir computers (RCs) and recurrent neural networks (RNNs) can mimic any finite-state automaton in theory, and some workers demonstrated that this can hold in practice. We test the capability of generalized linear models, RCs, and Long Short-Term Memory (LSTM) RNN architectures to predict the stochastic processes generated by a large suite of probabilistic deterministic finite-state automata (PDFA) in the small-data limit according to two metrics: predictive accuracy and distance to a predictive rate-distortion curve. The latter provides a sense of whether or not the RNN is a lossy predictive feature extractor in the information-theoretic sense. PDFAs provide an excellent performance benchmark in that they can be systematically enumerated, the randomness and correlation structure of their generated processes are exactly known, and their optimal memory-limited predictors are easily computed. With less data than is needed to make a good prediction, LSTMs surprisingly lose at predictive accuracy, but win at lossy predictive feature extraction. These results highlight the utility of causal states in understanding the capabilities of RNNs to predict.

**Keywords:** time series prediction; finite state machines; hidden Markov models; recurrent neural networks; reservoir computers; long short-term memory

#### **1. Introduction**

Many real-world tasks rely on prediction. Given past stock prices, traders try to predict if a stock price will go up or down, adjusting investment strategies accordingly. Given past weather, farmers endeavor to predict future temperatures, rainfall, and humidity, adapting crop and pesticide choices. Manufacturers try to predict which goods will appeal most to consumers, adjusting raw materials purchases. Self-driving cars must predict the motion of other objects on and off the road. Furthermore, when it comes to biology, evidence suggests that organisms endeavor to predict their environment as a key survival strategy [1–3]. One simple metric often used to evaluate the quality of our predictive algorithms is simply the accuracy of our predictions—how well we can predict what will happen next given what has happened previously.

However, we also care about the cost of formulating and communicating a prediction of the next symbol in some sequence of symbols, either to another person or from one part of an organism to another. Costs of formulation might include the time, memory, and/or energy taken to compute a prediction. Once the prediction is made, it is often communicated to some other downstream region that will use the prediction to take an action. This communication requires some amount of channel capacity, and channel capacity can be energetically expensive. All other concerns equal, one is inclined to employ a predictor with a lower transmission rate [4].

Here, we focus solely on communication, ignoring costs in formulating the prediction. As such, note that transmission rate is unrelated to sample complexity or time complexity. Rather, we allow for an unbounded number of samples in testing (thus avoiding the question of generalization error) and an unbounded time to train and compute predictions, and merely ask: what channel capacity do we need to faithfully communicate the predictions?

**Citation:** Marzen, S.E.; Crutchfield, J.P. Probabilistic Deterministic Finite Automata and Recurrent Networks, Revisited. *Entropy* **2022**, *24*, 90. https://doi.org/10.3390/e24010090

Academic Editors: Lizhong Zheng and Chao Tian

Received: 26 November 2021 Accepted: 30 December 2021 Published: 6 January 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Simultaneously optimizing the objectives—high predictive accuracy and low code rate—leads to *predictive rate-distortion* [5–7]. The predictive rate-distortion curve separates combinations of achievable rates and distortions from unachievable rates and distortions. The closer a lossy predictive compressor is to the curve, the better. This diagnosis has been used, for example, to suggest that salamander retinal ganglion cells are near-optimal lossy predictors of visual input [8].

Surprisingly, we do not yet know how well recurrent neural networks perform relative to the predictive rate-distortion curve, though rate-distortion curves have been used to explain and calibrate the performance of artificial feedforward neural networks [9,10]. Note that recurrent neural networks allow us to store information, in principle, about semi-infinite pasts, while feedforward neural networks only allow for storage of finite pasts. The following calibrates the performance of various predictors (generalized linear models, reservoir computers, and recurrent neural networks) using the predictive rate-distortion curve. We stimulate predictors with output of probabilistic deterministic finite automata (PDFA), also called unifilar hidden Markov models in information theory [11]. The PDFAs used in the following are simple, in that their statistical complexity [12] and excess entropy [13,14] are finite and relatively small. The following explores PDFAs since optimal predictors of the time series they generate are easily computed [12], and the tradeoffs between code rate and predictive accuracy (encapsulated by the predictive rate-distortion function) are easily computed as well [7].

This work builds on seminal results establishing that both reservoir computers (RCs) [15,16] and recurrent neural networks (RNNs) [17] can reproduce any dynamical system, when given a sufficient number of nodes. Further work gave example RNNs that faithfully reproduce finite state automata, to the point that RNN nodes mimicked the automata states [18], and established bounds on the required RNN complexity [19]. One would conjecture, then, that Long Short-Term Memory (LSTM) architectures—an easily-trainable RNN variety [20,21]—should easily learn to predict the outputs of PDFAs. The further question we ask is: do these models not only predict, but predict *efficiently*?

We use predictive rate-distortion curves to calibrate the performance of three time series predictors: generalized linear models (GLMs) [22], RCs [15,16], and LSTMs [20]. Unsurprisingly, LSTMs are generally more efficient than reservoirs, which are generally more efficient than GLMs. Perhaps unsurprisingly, LSTMs are less accurate than both methods, seemingly due to overfitting. Surprisingly, despite the simplicity of the generated stochastic time series, we find that all tested prediction methods can fail to attain maximal predictive accuracy (measured by the probability of being correct) by as much as 50% and often need higher rates than necessary to attain maximal predictive performance. However, existing methods for inferring PDFAs [23] can correctly infer the PDFA and generate the optimal predictor with orders-of-magnitude less data. This leads us to conclude that prediction algorithms that first infer *causal states* [6,23–25] can surpass trained RNNs if the time series in question has (approximately) finite causal states, sometimes also called *predictive state representations* [26].

In Section 2, we describe how rate-distortion functions can provide a benchmark for prediction algorithms. In Section 3, we describe PDFAs, GLMs, RCs, and LSTMs. In Section 4, we describe our results. Section 5 summarizes our conclusions.

#### **2. Rate-Distortion Benchmarks for Prediction Algorithms**

Typically, when one talks about recurrent neural networks, one considers a setup as in Figure 1 (top). Input is sent to the network, which updates its state based on both the input and its previous state. The network's state is then used to make a prediction. The only metric that characterizes the final performance of the network, post-training, is the prediction accuracy—how well it predicts future symbols given past symbols.

**Figure 1.** At (**top**), a typical setup for a recurrent neural network (or any other predictor): input is sent to the recurrent neural network, which makes a prediction about future inputs. At (**bottom**), our setup for a recurrent neural network in which predictions must be made and the prediction must be communicated losslessly through the channel.

We now augment that setup slightly. Consider a channel over which the prediction must be communicated, as in Figure 1 (bottom). Now there are two metrics that characterize the network's performance, post-training: the predictive accuracy and the required channel capacity. In the particular setup of Figure 1 (bottom), the required channel capacity must be at least the entropy of the predictions [4]. If one is allowed longer blocklengths, meaning that one can communicate several predictions at once using the channel, the required channel capacity somewhat diminishes.

One can now trace out a plane of the two metrics, prediction accuracy and channel capacity, and ask which combinations of the two are achievable. The curve that separates the achievable combinations from the unachievable combinations is called the predictive rate-accuracy curve, very closely related to the predictive rate-distortion curve. See Figure 2.

Let *R* be the random variable representing our representation of the past that we use to predict the future, and *r* be its realization. When the accuracy is the conditional mutual information $I[\overrightarrow{X}; \overleftarrow{X} | R]$, the predictive rate-accuracy function is exactly the predictive information curve [5,6]. Finding representations that lie on the information curve motivates slow feature analysis [27], recovers canonical correlation analysis [28], and identifies the minimal sufficient statistics of prediction, namely the causal states [5]. Predictive information curves have even been used to evaluate the predictive efficiency of salamander retinal neural spiking patterns [8].

Here, however, we work only with binary processes, and we adopt the stance that predictive accuracy could be taken to be the probability that one's prediction is correct. Accordingly, we force our representation *r* ∈ {0, 1} to be a prediction, and calculate *accuracy* via:

$$a(r\_t, x\_{t+1}) = \delta\_{r\_t, x\_{t+1}},$$

which implies:

$$E[a] = \sum\_{\overleftarrow{x}\_t} p(\overleftarrow{x}\_t) \sum\_{r\_t = x\_{t+1}} p(r\_t | \overleftarrow{x}\_t)\, p(x\_{t+1} | \overleftarrow{x}\_t) .$$

The choice of distortion or accuracy measure is an important one, and determined by one's particular application.

**Figure 2.** A sample predictive rate-accuracy curve, which is dependent not on how we process the time series but only on intrinsic properties of the time series. It is quite possible, and typical, to have zero rate and a nonzero predictive accuracy, and so the meeting of the *x*-axis and *y*-axis is not at the origin. The rate can run between zero and one bit for the binary-valued time series we study here. The starred point, which encodes the rate and accuracy of a minimal optimal predictor, has a rate of the single-symbol Shannon entropy of the time series and a predictive accuracy that depends in a complicated way on the specific time series. (Note the slight difference between this communication setup and that of standard predictive rate-distortion.) It is possible to have rates larger than the rate of the starred point, up to and including one bit.

There is another way to understand predictive rate-accuracy curves. With an eye to making contact with nonpredictive rate-distortion theory, we summarize the setup of predictive rate-accuracy as follows. Semi-infinite pasts are drawn independently from the same process-dependent distribution and sent to an encoder, which then produces a prediction or a probability distribution over possible predictions. A predictive distortion measures how much the estimated predictions differ from correct predictions. Distortion is often taken, for example, to be the Kullback-Leibler divergence between the true distribution $p(\overrightarrow{x} | \overleftarrow{x})$ over futures $\overrightarrow{x}$ conditioned on the past $\overleftarrow{x}$ and the distribution $p(\overrightarrow{x} | r)$ over futures conditioned on our *representation r* [29]. A predictive accuracy might then be some maximal achievable accuracy minus the predictive distortion. The predictive rate-accuracy curve $\mathcal{R}(A)$, the minimal necessary rate at a given expected accuracy, separates the plane of rates and predictive distortions into regions of achievable and unachievable combinations. A slight variant of the rate-distortion theorem gives:

$$\mathcal{R}(A) = \min\_{p(r|\overleftarrow{x}): E[a] \geq A} I[\overleftarrow{X}; \mathcal{R}]\ ,\tag{1}$$

where *I*[·; ·] is the mutual information.

#### **3. Background**

In what follows, we review time-series generation and the widely-used prediction methods we compare. We first discuss PDFAs and then prediction methods.

#### *3.1. PDFAs and Predictive Rate-Distortion*

We focus on minimal PDFAs: for a given stochastic process, the one with the smallest number of states. A PDFA consists of a set $\mathcal{S}$ of states *σ* ∈ $\mathcal{S}$, a set $\mathcal{A}$ of emission symbols, and transition probabilities *p*(*σ*<sub>*t*+1</sub>, *x<sub>t</sub>*|*σ<sub>t</sub>*), where *σ<sub>t</sub>*, *σ*<sub>*t*+1</sub> ∈ $\mathcal{S}$ and *x<sub>t</sub>* ∈ $\mathcal{A}$. The "deterministic" descriptor comes from the fact that *p*(*σ*<sub>*t*+1</sub>|*x<sub>t</sub>*, *σ<sub>t</sub>*) has support on only one state. (This is "determinism" in the sense of formal language theory [30], in which an automaton deterministically *recognizes* a string, not in the sense of nonstochastic. It was originally called *unifilarity* in the information theoretic analysis of hidden Markov chains [11]. Thus, PDFAs are also known as *unifilar hidden Markov models* [12].)

Here, we concern ourselves with minimal and binary-alphabet ($\mathcal{A}$ = {0, 1}) PDFAs. In dynamical systems theory, minimal unifilar HMMs (minimal PDFAs) are called *ε-machines* and their states *σ* are called *causal states*. Due to the automaton's determinism, one can uniquely determine the state from the past symbols with probability 1. Each state is therefore a cluster of pasts that have the same conditional probability distribution over futures. As a result, all that one needs to know to optimally predict the future is given by the causal state [12].

For example, the simple two-state PDFA shown in Figure 3 generates the Even Process: only an even number of 1's are seen between two successive 0's. This leads to a simple prediction algorithm: find the parity of the number of 1's since the last 0; if even, we are in state *A*, so predict 0 and 1 with equal probability; if odd, we are in state *B*, so predict 1. There is only one past for which our prediction algorithm yields no fruit: given the past of all 1s a single state is never identified. One only knows that the machine is in either state *A* or *B* and the best prediction is a mixture of what the states indicate. Even though that past occurs with probability 0, it causes the Even Process to be an infinite-order Markov Process [31]. See Ref. [32] for a measure-theoretic treatment.
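The parity rule just described can be made concrete with a short sketch that generates the Even Process and recovers the causal state from a finite past. It assumes that state *A* emits 0 and 1 with equal probability (consistent with the prediction rule above) and that the process starts in state *A*; the function names are hypothetical and chosen only for this example.

```python
import numpy as np

def generate_even_process(n, p_zero=0.5, seed=0):
    """Emit n symbols from the two-state Even Process PDFA, starting in state A."""
    rng = np.random.default_rng(seed)
    state, symbols = "A", []
    for _ in range(n):
        if state == "A":
            if rng.random() < p_zero:
                symbols.append(0)      # A --0--> A
            else:
                symbols.append(1)      # A --1--> B
                state = "B"
        else:
            symbols.append(1)          # B --1--> A (the only allowed transition)
            state = "A"
    return symbols

def causal_state(past):
    """Return 'A' or 'B' from the parity of 1s since the last 0, or None if no 0 was seen."""
    if 0 not in past:
        return None                    # the all-1s past never identifies a single state
    ones = 0
    for symbol in reversed(past):
        if symbol == 0:
            break
        ones += 1
    return "A" if ones % 2 == 0 else "B"

def predict_next(state):
    """In state B the next symbol is surely 1; in state A both symbols tie, so 1 is returned."""
    return 1
```

Because both symbols are equally likely in state *A*, a most-likely-symbol predictor may return either one there; the sketch returns 1 so that the prediction is also correct in state *B*.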

**Figure 3.** Minimal two-state PDFA that generates the Even Process, so-called since there is always an even number of 1s between 0s. Arrows indicate allowed transitions, while transition labels *p*|*s* indicate the transition (and so too emission) probabilities *p* ∈ [0, 1] for the symbol *s* ∈ $\mathcal{A}$. Given a current state and next symbol, one knows the next state; this is the deterministic or unifilar property of this PDFA.

Causal states and ε-machines can be inferred from data in a variety of ways [6,23,25,33]. The causal states are uniquely useful for calculating predictive rate-distortion curves. Under weak assumptions, the predictive rate-accuracy function of Section 2 becomes:

$$\mathcal{R}(A) = \min\_{p(r|\sigma): E\left[d\right] \ge A} I\left[\mathcal{S}; \mathcal{R}\right],$$

with:

$$E[d] = \sum\_{\sigma\_t} p(\sigma\_t) \sum\_{x\_{t+1} = r\_t} p(r\_t | \sigma\_t)\, p(x\_{t+1} | \sigma\_t) .$$

See Ref. [7] for the proof. With this substitution of a finite object ($\mathcal{S}$) for an infinite one ($\overleftarrow{X}$), the Blahut-Arimoto algorithm can be used to accurately calculate the predictive rate-accuracy function, in that the algorithm provably converges to the optimal *p*(*r*|*σ*) [34]. The same cannot be said of the predictive information curve [7], for which the algorithm converges to a local optimum of the objective function, but may not converge to a global optimum.
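To make the role of the Blahut-Arimoto algorithm concrete, here is a minimal sketch of the standard Lagrangian iteration applied to causal states. It assumes the stationary state distribution *p*(*σ*) and the next-symbol probabilities *p*(*x*|*σ*) are already known, uses the distortion 1 − *p*(*x* = *r*|*σ*) (the probability that prediction *r* is wrong in state *σ*), and sweeps a trade-off parameter `beta` that is introduced here only for illustration. It is not the authors' code.

```python
import numpy as np

def blahut_arimoto(p_state, p_next, beta, n_iter=500):
    """One point on a predictive rate-accuracy curve via Blahut-Arimoto.

    p_state: stationary probabilities of the causal states, shape (n_states,).
    p_next:  p(x | sigma), shape (n_states, n_symbols); predictions r range over symbols.
    beta:    Lagrange multiplier trading rate against accuracy (assumed parameter).
    Returns (rate in bits, expected accuracy).
    """
    p_state = np.asarray(p_state, dtype=float)
    p_next = np.asarray(p_next, dtype=float)
    distortion = 1.0 - p_next                       # d(sigma, r) = P(x != r | sigma)
    q = np.full(p_next.shape[1], 1.0 / p_next.shape[1])  # marginal over predictions
    for _ in range(n_iter):
        encoder = q[None, :] * np.exp(-beta * distortion)   # p(r | sigma), unnormalized
        encoder /= encoder.sum(axis=1, keepdims=True)
        q = p_state @ encoder
    joint = p_state[:, None] * encoder
    accuracy = float(np.sum(joint * p_next))
    # Mutual information I[S; R]; entries with zero probability contribute zero.
    ratio = np.divide(encoder, q[None, :], out=np.ones_like(encoder), where=encoder > 0)
    rate = float(np.sum(joint * np.log2(ratio)))
    return rate, accuracy
```

Sweeping `beta` from zero upward traces the curve from the zero-rate point towards higher accuracies.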

In practice, we always augment the predictive rate-accuracy function with the rate and accuracy of the optimal predictor, which is (as described earlier) straightforwardly derived from the ε-machine. Simply put, we infer the causal state *σ<sub>t</sub>* from past data and predict the next symbol to be arg max<sub>*x*<sub>*t*+1</sub></sub> *p*(*x*<sub>*t*+1</sub>|*σ<sub>t</sub>*).

The following tests the various time series predictors on all of the (uniformly sampled) binary-alphabet ε-machine topologies [35] with randomly-chosen emission probabilities. Due to the super-exponential explosion of the set of topological ε-machines with number of states, we only look at binary-alphabet machines with four or fewer (causal) states. (There are 1338 unique topologies for four states, but over 10<sup>6</sup> for six states.) The analysis discards any ε-machine with zero-rate optimal predictor, which can arise depending on the emission probabilities.

#### *3.2. Time Series Methods*

We focus on three methods for time series prediction: generalized linear models (GLM), reservoir computers (RCs), and LSTMs.

The GLM we use predicts *x<sub>t</sub>* from a linear combination of the last *k* symbols *x*<sub>*t*−*k*</sub>, *x*<sub>*t*−*k*+1</sub>, ..., *x*<sub>*t*−1</sub>. More precisely, a GLM models the probability of *x<sub>t</sub>* being a 0 via:

$$p\_{GLM}(\mathbf{x}\_t = 0 | \mathbf{x}\_{t-k}, \dots, \mathbf{x}\_{t-1}) = \frac{e^{w\_k \mathbf{x}\_{t-k} + \dots + w\_1 \mathbf{x}\_{t-1} + w\_0}}{1 + e^{w\_k \mathbf{x}\_{t-k} + \dots + w\_1 \mathbf{x}\_{t-1} + w\_0}}.\tag{2}$$

The model's estimate of the probability of *xt* = 1 follows:

$$p\_{GLM}(\mathbf{x}\_t = 1 | \mathbf{x}\_{t-k}, \dots, \mathbf{x}\_{t-1}) = \frac{1}{1 + e^{w\_k x\_{t-k} + \dots + w\_1 x\_{t-1} + w\_0}} \,. \tag{3}$$

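We use Scikit-learn logistic regression to find the best weights *w*<sub>0</sub>, *w*<sub>1</sub>, ..., *w<sub>k</sub>*. Predictions are then made via arg max<sub>*x<sub>t</sub>*</sub> *p*<sub>GLM</sub>(*x<sub>t</sub>*|*x*<sub>*t*−*k*</sub>, ..., *x*<sub>*t*−1</sub>).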
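A minimal sketch of such a GLM predictor with Scikit-learn follows. The helper names, the default regularization of `LogisticRegression`, and the half/half train/test split are assumptions for illustration. Note that Scikit-learn parametrizes the probability of class 1 rather than class 0 as in Equation (2), which only flips the sign of the fitted weights.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def lagged_features(x, k):
    """Build rows (x[t-k], ..., x[t-1]) with target x[t] for every valid t."""
    x = np.asarray(x)
    features = np.stack([x[i:len(x) - k + i] for i in range(k)], axis=1)
    targets = x[k:]
    return features, targets

def fit_and_score_glm(x, k):
    """Train on the first half of the lagged data and return accuracy on the second half."""
    features, targets = lagged_features(x, k)
    half = len(targets) // 2
    model = LogisticRegression().fit(features[:half], targets[:half])
    # model.predict returns the most probable symbol, i.e., the arg max rule.
    predictions = model.predict(features[half:])
    return float(np.mean(predictions == targets[half:]))
```

The window length `k` is the only structural choice: symbols older than *k* steps cannot influence the prediction.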

The RC is more powerful in that it uses logistic regression with features that contain information about symbols arbitrarily far into the past. We employ a *tanh* activation function, so that the reservoir's state advances via:

$$h\_{t+1} = \tanh(W h\_t + v x\_t + b) \tag{4}$$

and initialize *W*, *v*, *b* with i.i.d. normally distributed elements. The matrix *W* is then scaled so that it is near the "edge of chaos" [36–39], where RCs are conjectured to have maximal memory [40,41]. We then use logistic regression with *ht* as features to predict *xt*:

$$\begin{aligned} p\_{reservoir}(\mathbf{x}\_t = 0 | h\_t) &= \frac{e^{w^\top h\_t + w\_0}}{1 + e^{w^\top h\_t + w\_0}}, \\ p\_{reservoir}(\mathbf{x}\_t = 1 | h\_t) &= \frac{1}{1 + e^{w^\top h\_t + w\_0}}. \end{aligned}$$

It is straightforward to devise a weight matrix *W* and bias *b* so that *p*<sub>reservoir</sub>(*x<sub>t</sub>*|*h<sub>t</sub>*) attains the restricted linear form of *p*<sub>GLM</sub> of Equations (2) and (3). That is, RCs are more powerful than GLMs, as they use nonlinear functions of semi-infinite pasts for their summary statistics. We use Scikit-learn logistic regression to find the best weights *w*<sub>0</sub> and *w*. Note that the weights *W*, *v*, and *b* are not learned, but held constant; we only train *w* and *w*<sub>0</sub>. Predictions are made via arg max<sub>*x<sub>t</sub>*</sub> *p*<sub>reservoir</sub>(*x<sub>t</sub>*|*h<sub>t</sub>*).
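The fixed, untrained part of the reservoir can be sketched as follows. The text does not state how *W* is scaled, so this example assumes one common choice, rescaling *W* to a spectral radius slightly below one; the reservoir size, the value 0.95, and the function name are likewise assumptions. The returned states *h<sub>t</sub>* would then serve as features for a logistic regression readout.

```python
import numpy as np

def run_reservoir(x, n_nodes=50, spectral_radius=0.95, seed=0):
    """Drive a tanh reservoir with the symbol sequence x and collect its states.

    Returns (states, W), where states[t] is h_t, a summary of x[0], ..., x[t-1]
    that can be used as the feature vector for predicting x[t].
    """
    rng = np.random.default_rng(seed)
    # i.i.d. normal initialization; these weights are held constant, never trained.
    W = rng.standard_normal((n_nodes, n_nodes))
    v = rng.standard_normal(n_nodes)
    b = rng.standard_normal(n_nodes)
    # Assumed scaling rule: set the largest eigenvalue magnitude to spectral_radius.
    W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))
    h = np.zeros(n_nodes)
    states = np.empty((len(x), n_nodes))
    for t, x_t in enumerate(x):
        states[t] = h
        h = np.tanh(W @ h + v * x_t + b)   # reservoir update of Equation (4)
    return states, W
```

Only the readout weights are fit afterwards, which keeps training as cheap as for the GLM while giving the features access to arbitrarily old symbols.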

Finally, we analyze the LSTM's predictive capabilities. LSTMs are no more powerful than vanilla RNNs; e.g., those as in Equation (4). However, they are far more trainable in that it is possible to achieve good results without extensive hyperparameter tuning [21]. An LSTM has several hidden states *f<sub>t</sub>*, *i<sub>t</sub>*, *o<sub>t</sub>*, *c<sub>t</sub>*, and *h<sub>t</sub>* that update via the following:

$$\begin{aligned} f\_t &= \sigma\_g \left( W\_f \mathbf{x}\_t + U\_f h\_{t-1} + b\_f \right) \\ i\_t &= \sigma\_g \left( W\_i \mathbf{x}\_t + U\_i h\_{t-1} + b\_i \right) \\ o\_t &= \sigma\_g \left( W\_o \mathbf{x}\_t + U\_o h\_{t-1} + b\_o \right) \\ c\_t &= f\_t \odot c\_{t-1} + i\_t \odot \sigma\_c \left( W\_c \mathbf{x}\_t + U\_c h\_{t-1} + b\_c \right) \\ h\_t &= o\_t \odot c\_{t-1} \end{aligned}$$

where *σ<sub>g</sub>* is the sigmoid function and *σ<sub>c</sub>* is the hyperbolic tangent. The variable *ct* is updated linearly, therefore avoiding issues with vanishing gradients [42]. Meanwhile, the gating function *ft* allows us to forget the past selectively. We then predict the probability of *xt* given the past using:

$$\begin{aligned} p\_{LSTM}(\mathbf{x}\_t = 0 | h\_t) &= \frac{e^{w^\top h\_t + w\_0}}{1 + e^{w^\top h\_t + w\_0}}, \\ p\_{LSTM}(\mathbf{x}\_t = 1 | h\_t) &= \frac{1}{1 + e^{w^\top h\_t + w\_0}}. \end{aligned} \tag{5}$$

Weights *w* and *w*<sub>0</sub> are learned while we estimate parameters *Wf*, *Uf*, *bf*, *Wi*, *Ui*, *bi*, *Wo*, *Uo*, *bo*, *Wc*, *Uc*, and *bc* to maximize the log-likelihood. Predictions are made via arg max*xt pLSTM*(*xt*|*ht*).

Predictive accuracy is calculated by comparing the predictions to the actual values of the next symbol and counting the frequency of correct predictions. The code rate is calculated via the prediction entropy [4].

#### **4. Results**

An aim here is to systematically analyze the predictive accuracy (the probability of correctly guessing the next symbol) and the code rate of our three time series predictors on a large swath of PDFAs in the small-data limit, in which only 5000 samples are shown to each predictor. To implement this, we ran through the topological ε-machine library of Ref. [35], which consists of binary-alphabet PDFAs with four or fewer states, with randomly chosen emission probabilities; transition probabilities were drawn from a uniform distribution. For each PDFA, we generated a length-5000 time series. The first half was presented to a predictor and used to train its weights. We then evaluated each time series predictor based on its predictions for the second half of the time series. Predictive accuracy and code rate were calculated and compared to the predictive rate-distortion function. Predictive accuracy was calculated as the probability of having a correct prediction; code rate was calculated empirically as the single-symbol entropy of the predictions [14].
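The two evaluation quantities described above can be sketched as follows. This is a hedged illustration rather than the authors' code: the function names are hypothetical, and the entropy is the simple plug-in (empirical frequency) estimate, reported in nats.

```python
import math
from collections import Counter


def predictive_accuracy(predictions, targets):
    """Fraction of time steps at which the predicted symbol equals the actual one."""
    correct = sum(p == x for p, x in zip(predictions, targets))
    return correct / len(targets)


def code_rate(predictions):
    """Single-symbol entropy (in nats) of the empirical distribution of predictions."""
    n = len(predictions)
    counts = Counter(predictions)
    return -sum((c / n) * math.log(c / n) for c in counts.values())
```

A predictor that always emits the same symbol has zero code rate under this estimate, whereas one that emits 0 and 1 equally often has a rate of log 2 nats.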

Note that Bayesian structural inference (BSI) provides a useful comparison [23]. In BSI, we compute the maximum a posteriori (MAP) estimate of the PDFA generating an observed time series, and use this MAP estimate to build an optimal predictor of the process. BSI can correctly infer the PDFA essentially 100% of the time with orders-of-magnitude less data than used to monitor the three prediction methods tested here. Hence, it achieves optimal predictive accuracy with minimal rate. Our aim is to test the ability of GLMs, RCs, and RNNs to equal BSI's previously-published performance.

The time series predictors used have hyperparameters. A variety of orders (*k*'s) were used for the GLMs, and reservoirs and LSTMs of different sizes (number of nodes) were tested. Learning rate and optimizer type, including gradient descent and Adam [43], were also varied for the LSTM, with little effect on results. Regularization was necessary and was applied in both *L*<sub>1</sub> and *L*<sub>2</sub> forms to all three predictors. As is typical, a validation set was used to select the strength and type of regularization, and results were reported on a separate test set. In total, 5000 steps of the time series were simulated, which was small enough to test how these machine learning methods responded to too little data, but large enough that the methods could have picked up on patterns.

#### *4.1. The Difference between Theory and Practice: The Even and Neven Process*

We first analyze two easily described PDFAs, deriving RNNs that correctly infer causal states and, therefore, that match the optimal predictor, the ε-machine. We then compare the trained GLMs, RCs, and LSTMs to the easily inferred optimal predictors. In theory, RCs and LSTMs should be able to mimic the derived RNNs, in that it is possible to find weights of an RC and LSTM that yield nodes that mimic the causal states of the PDFA. In practice, surprisingly, RCs and LSTMs have some difficulty.

First, we analyze the Even Process shown in Figure 3. The optimal prediction algorithm is easily seen by inspection of Figure 3. When we determine the machine is in state *A*, we predict a 0 or a 1 with equal probability; if it is in state *B*, we predict a 1. We determine whether it is in state *A* or *B* by the parity of the number of 1s since the last 0. If odd, it is in state *B*; if even, it is in state *A*. The inferred state is easily encoded by the following RNN:

$$h\_{t+1} = \mathbf{x}\_t (1 - h\_t) \,. \tag{6}$$

If *xt* is 0, the hidden state of the RNN "resets" to 0; i.e., state *A*. If *xt* = 1, then the hidden state updates by flipping from 0 to 1 or vice versa, mimicking the transitions from *A* to *B* and back. One can show that a one-node LSTM hidden state *ht* can, with proper weight choices, mimic the hidden state of Equation (6). With the correct hidden state inferred, it is straightforward to find *w* and *w*<sub>0</sub> such that Equation (5) yields optimal (and correct) predictions.
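To make the update of Equation (6) concrete, the sketch below iterates it over a symbol sequence and reads off the state-dependent probability of a 1. The function names are hypothetical, and starting in state *A* (hidden state 0) is an assumption; after the first observed 0 the hidden state is correct regardless of the starting value.

```python
def even_hidden_states(xs, h0=0):
    """Iterate Eq. (6), h_{t+1} = x_t * (1 - h_t).

    Hidden state 0 encodes state A and 1 encodes state B. Returns the list of
    hidden states held *before* each symbol is read, plus the final state.
    """
    h = h0
    states = []
    for x in xs:
        states.append(h)
        h = x * (1 - h)
    return states, h


def prob_next_is_one(h):
    """Probability of a 1 given the inferred state, as described in the text:
    equal odds in state A (h = 0), certainty in state B (h = 1)."""
    return 1.0 if h == 1 else 0.5
```

For example, feeding the sequence 0, 1, 1, 0 gives the hidden states 0, 0, 1, 0 before each symbol: the state flips on every 1 and resets on a 0.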

As one might then expect, and as Figure 4 confirms, LSTMs tend to have rates that are close to the optimal (maximal) rate and predictive accuracies that are only slightly below the optimal predictive accuracy. RCs and GLMs tend to have higher rates and lower predictive accuracies, but they are still within ∼13% of optimal. We can see this qualitatively just by examining the predictive rate-accuracy curve in Figure 4: the closer that a point is to the curve, the more efficiently that predictor predicts. Among the points on the curve, potentially the most desirable point is the one at the highest achievable accuracy, at the top right. The points from the LSTMs tend to be closer to the curve and closer to the point at the top right, followed by RCs, and followed by GLMs. Interestingly, the points from all processes lie on a one-dimensional curve, speaking to some hidden simplicity in the relationship between rate and accuracy that likely holds only for binary-valued processes.

**Figure 4.** Predictive rate–accuracy curve for the Even Process in Figure 3, along with empirical predictive accuracies and rates of GLMs, RCs, and LSTMs of various sizes: orders range from 1–10 for GLMs, number of nodes range from 1–61 for RCs, and number of nodes range from 1–121 for LSTMs. Despite the Even Process' simplicity, there is a noticeable difference between the predictors' performances and between their performances and the optimal achievable performance.

As one might also expect, LSTMs and RCs with additional nodes and GLMs with higher orders (higher *k*) have higher predictive accuracies than LSTMs and RCs with fewer nodes and GLMs with lower orders. However, viewed another way, given the simplicity of the stimulus (indeed, given that a one-node LSTM can, in theory, learn the Even Process), the gap from the predictors' rates and accuracies to the optimal combinations of rate and accuracy is surprising. It is also surprising that none of the three predictors' rates fall below the maximal optimal rate.

Figure 5 introduces a similarly simple three-state PDFA. If a 1 is observed after a 0, we are certain the machine is in state *B*; after state *B*, we know it will transition to state *A*; and then the parity of 0s following transition to state *A* tells us if it is in state *A* (even) or state *C* (odd). This PDFA is a combination of a Noisy Period-2 Process (between states *A* and *B*) and an Even Process (between states *A* and *C*).

**Figure 5.** Predictive rate-accuracy curve for the Neven Process (PDFA shown at left), along with empirical predictive accuracies and rates of GLMs, RCs, and LSTMs of various sizes: orders range from 1–10 for GLMs, number of nodes range from 1–61 for RCs, and number of nodes range from 1–121 for LSTMs. Despite the Neven Process' simplicity, there is a noticeable gap between the predictors' performance and the optimal performance achievable.

Given the Neven Process's simplicity, it is unsurprising that we can concoct an RNN that can infer the internal state. Let *ht* = (*ht*,*A*, *ht*,*B*, *ht*,*C*) be the hidden state that is (1, 0, 0) if the internal state is *A*, (0, 1, 0) if the internal state is *B*, and (0, 0, 1) if the internal state is *C*. By inspection, we have:

$$\begin{aligned} h\_{t+1,A} &= 1 - h\_{t,A} \\ h\_{t+1,B} &= \mathbf{x}\_t h\_{t,A} \\ h\_{t+1,C} &= (1 - \mathbf{x}\_t) h\_{t,A} \end{aligned}$$

One can straightforwardly find weights that lead to *pLSTM*(*xt*+1|*ht*) accurately reflecting the transmission (emission) probabilities. In other words, in theory a three-node RNN (and an equivalent three-node LSTM) can learn to predict the Neven process optimally.
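The three-node update given above for the Neven Process can be checked directly. The snippet is a small sketch with hypothetical names; it only implements the stated hidden-state recursion on one-hot vectors ordered as (*A*, *B*, *C*), not the emission probabilities.

```python
STATE_A = (1, 0, 0)
STATE_B = (0, 1, 0)
STATE_C = (0, 0, 1)


def neven_step(h, x):
    """One step of the three-node update for the one-hot state h = (h_A, h_B, h_C).

    From A the machine moves to B on a 1 and to C on a 0; from B or C it
    returns to A whatever the symbol, as implied by the update equations.
    """
    h_a, h_b, h_c = h
    return (1 - h_a, x * h_a, (1 - x) * h_a)


def neven_hidden_states(xs, h0=STATE_A):
    """Return the hidden state held before each symbol (start in A by assumption)."""
    h = h0
    states = []
    for x in xs:
        states.append(h)
        h = neven_step(h, x)
    return states
```

Because the update maps one-hot vectors to one-hot vectors, the hidden state always names exactly one of the three internal states.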

However, the Neven Process' simplicity is belied by the gap between the predictors' accuracy and rate and the predictive rate-accuracy curve. In Figure 5, the point at zero rate implies that the predictor is spitting out the same symbol, regardless of input. The worst predictive accuracy falls short of the optimal by ∼15%, and none of the GLMs, RCs, or LSTMs get closer than ∼97% to optimal. Furthermore, almost all the rates surpass the maximal optimal predictor rate.

#### *4.2. Comparing GLMs, RCs, and LSTMs*

We now analyze the combined results obtained over all minimal PDFAs up to four states using two metrics. (Again, recall that there are 1338 unique machine topologies.) To compare across PDFAs, we first normalize the rate and accuracy by the rate and accuracy of the optimal predictor. Then, we find the distance from the predictor's rate and accuracy to the predictive rate-accuracy curve, which is similar in spirit to the metric of Ref. [44] and to the spirit of Ref. [8]. Note that this metric would have been markedly harder to estimate had we used nondeterministic probabilistic finite automata; that is, those without determinism (unifiliarity) in their transition structure [7].

Figure 6 showcases a histogram of the normalized distance to the predictive rate-accuracy curve, ignoring PDFAs for which the maximal optimal rate is 0 nats. The normalized distance for all three predictor types tends to be quite small, but even so, we can see differences in the three predictor types. LSTMs tend to have smaller normalized distances than RCs, and RCs tend to have smaller normalized distances to the predictive rate-accuracy curve than GLMs. In fact, LSTMs seem to be uniformly better lossy predictive feature extractors. Trained LSTMs on average have 0.8% normalized distance; RCs on average have 2.0% normalized distance; and GLMs on average have 4.5% normalized distance. When looking only at optimized LSTMs, RCs, and GLMs (meaning that the number of nodes or the order is chosen to minimize normalized predictive distortion), a few PDFAs still have high normalized predictive distortions of 4.6% for LSTMs, 9.7% for RCs, and 27.3% for GLMs.

**Figure 6.** (**Left**) Histogram of normalized predictive distortions for LSTMs (blue), RCs (orange), and GLMs (green) using 798 distinct PDFAs. While LSTMs tend to have far higher predictive accuracies, they also have a much larger probability than reservoirs or GLMs do of having noticeable inaccuracies. Some recorded normalized predictive distortions were negative, indicating the effects of finite sample size. (**Right**) Histogram of normalized distances to the predictive rate-accuracy curve for LSTMs (blue), RCs (orange), and GLMs (green) using 798 distinct PDFAs. It is apparent that LSTMs are closer to the predictive rate-accuracy curves than reservoirs and GLMs.

The same trend holds for the percentage difference between the predictive accuracy and the maximal predictive accuracy, which we call the *normalized predictive distortion*, with a crucial modification. Trained LSTMs on average have 21.5% normalized predictive distortion; RCs on average have 1.8% normalized predictive distortion; and GLMs on average have 4.2% normalized predictive distortion. When looking only at optimized LSTMs, RCs, and GLMs (meaning that the number of nodes or the order is chosen to minimize normalized predictive distortion), a few PDFAs still have high normalized predictive distortions of 50% for LSTMs, 13.5% for RCs, and 25.5% for GLMs. However, perhaps the most interesting aspect of Figure 6 is that, surprisingly, LSTMs are far more likely than reservoirs or GLMs to have large normalized predictive distortions.

Unsurprisingly, increasing the GLM order and the number of nodes of the RCs and LSTMs tends to increase predictive accuracy and decrease the normalized distance.

Our final aim is to understand the PDFA characteristics that cause them to be harder to predict accurately and/or efficiently. We have two suspects, which are the most natural measures of process "complexity". The first is the generated process' entropy rate *hμ*, the entropy of the next symbol conditioned on all previous symbols, which quantifies the intrinsic randomness of the stimulus. The second is the generated process' statistical complexity *Cμ*, the entropy of the causal states, which quantifies the intrinsic memory in the stimulus. The more random a stimulus, the harder it would be to predict; imagine having to find the optimal predictor for a biased coin whose bias is quite close to 1/2. The more memory in a stimulus, the more nodes in a network or the higher the order of the GLM required, it would seem. We performed a multivariate linear regression, trying to use *hμ* and *Cμ* to predict the minimal normalized predictive distortion and minimal normalized distance. We find a small and positive correlation for LSTMs, reservoirs, and GLMs for predicting minimal deviations in accuracy from perfection, with an *R*<sup>2</sup> of 0.189, 0.134, and 0.132, respectively. For all three types of prediction algorithms, statistical complexity *Cμ* is positively correlated with deviations in accuracy. Entropy rate is positively correlated with deviations in accuracy for GLMs and reservoirs but, surprisingly, not LSTMs. Interestingly, the performance of GLMs and RCs is impacted by increased randomness and increased memory in the stimulus, while the LSTMs' accuracy has little correlation with entropy rate and statistical complexity.

For the most part, we find that all three prediction methods (GLMs, RCs, and LSTMs) tend to learn to predict the PDFA outputs near-optimally, in that prediction accuracies differ from the optimal prediction accuracy by an average of roughly 5%. LSTMs outperform RCs, which outperform GLMs. However, we discovered simple PDFAs that cause the best LSTM to fail by as much as 5%, the best RC to fail by as much as 10%, and the best GLM to fail by as much as 27%.

Since none of the RNNs achieved perfect prediction accuracy, but the BSI method did [23], we conclude that existing methods for inferring causal states [6,23,25,33] are useful, despite the historically dominant reliance on RNNs. For example, as previously mentioned, Bayesian structural inference infers the correct PDFAs almost 100% of the time, leading to essentially zero prediction error, on training sets that are orders of magnitude smaller than those used here [23].

#### **5. Conclusions**

We have known for a long time that reservoirs and RNNs can reproduce any dynamical system [15–17], and we have explicit examples of RNNs learning to infer the hidden states of a PDFA when shown the PDFA's output [18]. We revisited these examples to better understand if the finding of Ref. [18] is typical. How often do RNNs and RCs learn efficient and accurate predictors of PDFAs, especially given that BSI can yield an optimal predictor with orders-of-magnitude less training data?

We conducted a rather comprehensive search, analyzing 798 randomly generated PDFAs with four or fewer states. For each PDFA, we trained GLMs, RCs, and RNNs of varying orders or varying numbers of nodes. Larger orders and larger numbers of nodes led to more accurate and more efficient predictors. On average, the various time series predictors have ∼5% predictive distortion. In other words, we are apparently sometimes better at classifying MNIST digits (MNIST is a database of handwritten digits) than at predicting the output of a simple PDFA. Again, existing algorithms [23] can optimally predict the output of the PDFAs considered here with orders-of-magnitude less training data. These findings lead us to conclude that algorithms that explicitly focus on inference of causal states [6,23–25] have a place in the currently RNN-dominated field of time series prediction.

More importantly, in this small data limit, overfitting is an issue for LSTMs but not RCs or GLMs. However, LSTMs are somehow excellent lossy predictive feature extractors nonetheless. The mechanism behind this is a subject for future research.

Perhaps most importantly, the predictive rate-accuracy framework that we introduce here or similar such frameworks could be useful for calibrating the performance of time series predictors. We have added a cost that comparatively little research has focused on: that of communicating the prediction. Implicitly, we are arguing that predictors which do not have maximal predictive accuracy but do have small communication costs might be useful nonetheless.

**Author Contributions:** S.E.M. and J.P.C. conceptualized the article and wrote the article, and S.E.M. performed the experiments. All authors have read and agreed to the published version of the manuscript.

**Funding:** This material is based upon work supported by, or in part by, the Air Force Office of Scientific Research under award number FA9550-19-1-0411 and the U. S. Army Research Laboratory and the U. S. Army Research Office under grants W911NF-18-1-0028 and W911NF-21-1-0048.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Available upon reasonable request from the authors.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **Summarizing Finite Mixture Model with Overlapping Quantification**

**Shunki Kyoya** *∗* **and Kenji Yamanishi**

Graduate School of Information Science and Technology, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan; yamanishi@mist.i.u-tokyo.ac.jp

**\*** Correspondence: kyoya.shunki@plus-zero.co.jp

**Abstract:** Finite mixture models are widely used for modeling and clustering data. When they are used for clustering, they are often interpreted by regarding each component as one cluster. However, this assumption may be invalid when the components overlap. It leads to the issue of analyzing such overlaps to correctly understand the models. The primary purpose of this paper is to establish a theoretical framework for interpreting the overlapping mixture models by estimating how they overlap, using measures of information such as entropy and mutual information. This is achieved by merging components to regard multiple components as one cluster and summarizing the merging results. First, we propose three conditions that any merging criterion should satisfy. Then, we investigate whether several existing merging criteria satisfy the conditions and modify them to fulfill more conditions. Second, we propose a novel concept named clustering summarization to evaluate the merging results. In it, we can quantify how overlapped and biased the clusters are, using mutual information-based criteria. Using artificial and real datasets, we empirically demonstrate that our methods of modifying criteria and summarizing results are effective for understanding the cluster structures. We therefore give a new view of interpretability/explainability for model-based clustering.

**Citation:** Kyoya, S.; Yamanishi, K. Summarizing Finite Mixture Model with Overlapping Quantification. *Entropy* **2021**, *23*, 1503. https://doi.org/10.3390/e23111503

Academic Editor: Pasi Fränti

Received: 28 September 2021 Accepted: 8 November 2021 Published: 13 November 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

**Keywords:** model-based clustering; merging mixture components; component overlap; interpretability

#### **1. Introduction**

#### *1.1. Motivation*

Finite mixture models are widely used for modeling data and finding latent clusters (see McLachlan and Peel [1] and Fraley and Raftery [2] for overviews and references). When they are used for clustering, they are typically interpreted by regarding each component as a single cluster. However, the one-to-one correspondence between the clusters and mixture components does not hold when the components overlap. This is because the clustering structure then becomes more ambiguous and complex. Let us illustrate this using a Gaussian mixture model estimated for the Wisconsin breast cancer dataset in Figure 1 (details of the dataset and estimation are discussed in Section 8.2). A number of the components overlap with one another, which makes it difficult to estimate the shape of distribution or number of clusters. Therefore, we need an analysis of the overlaps to correctly interpret the models.

**Figure 1.** Estimated Gaussian components for the Wisconsin breast cancer dataset [3].

We address this issue from two aspects. In the first aspect, we consider merging mixture components to regard several components as one cluster. We repeatedly select the most overlapping pair of components and merge them. In this procedure, how the degree of overlap is measured is important. A number of criteria for measuring cluster overlaps have been proposed [4–6], but they have not yet been compared theoretically. We give a theoretical framework for comparing merging criteria by defining three essential conditions that any method for merging clusters should satisfy. The more conditions a method satisfies, the better it is. From this viewpoint, we evaluate the existing criteria (entropy (Ent) [4], directly estimated misclassification probability (DEMP) [5], and mixture complexity (MC) [7]). We also modify these existing criteria so that they can satisfy more essential conditions.

In the second aspect, we consider how to summarize the merging results quantitatively. After merging mixture components, we obtain two types of clustering structures: those among the upper-components and those among sub-components within each upper-component, as illustrated in Figure 2. These structures might be still ambiguous because the upper-components are determined to be the different clusters, but they may overlap; the sub-components are determined to belong to the same cluster, but they may be scattered in the cluster. Therefore, we need to evaluate the degree to which the upper- and sub-components are discriminated as different clusters. We realize this using the notions of *mixture complexity* (MC) [7] and *normalized mixture complexity* (NMC). They give real-valued quantification of the number of effective clusters and the degree of their separation, respectively. We therefore develop a novel method for cluster summarization.

Our hypotheses in this paper are summarized as follows:


We empirically verify them by experiments, using artificial and real datasets.

**Figure 2.** Upper-components and sub-components.

#### *1.2. Significance and Novelty of This Paper*

The significance and novelty of this paper are summarized below.

#### 1.2.1. Proposal of Theoretical Framework for Evaluating Merging Criteria

We give a theoretical framework for evaluating merging methods by defining the *essential conditions*. They are necessary conditions that any merging criterion should satisfy: (1) the criterion should take the best value when the components are entirely overlapped, (2) it should take the worst value when the components are entirely separated, and (3) it should be invariant with respect to the scale of the weights. We empirically confirm that the more essential conditions any merging method satisfies, the better the clustering structure obtained in terms of larger interdistances and smaller intradistances.

#### 1.2.2. Proposal of Quantitative Clustering Summarization

We propose a method for quantitatively summarizing clustering results based on MC and NMC. MC is an extended concept of the number of clusters into a real number from the viewpoint of information theory [7]. It quantifies the diversity among the components, considering their overlap and weight bias. NMC is defined by normalizing MC to remove the effects of weight bias. It quantifies the degree of the scatter of the components based only on their overlap. Furthermore, MC and NMC have desirable properties for clustering summarization: they are scale invariant and can quantify overlaps among more than two components. We empirically demonstrate that our MC-based method effectively summarizes the clustering structures. We therefore give a novel quantification of clustering structures.

#### **2. Related Work on Finite Mixture Models and Model-Based Clustering**

In this section, we present related work on finite mixture models and model-based clustering in four parts: roles of overlap, model, optimization, and visualization. The overlap has a particular impact on the construction of models.

#### *2.1. Roles of Overlap*

There has been widespread discussion about the roles of overlap in finite mixture models. One view is that overlap emerges so that the mixture can represent various distributions. While this flexibility is beneficial for modeling the data, various issues arise in applying such models to clustering. For example, McLachlan and Peel [1] pointed out that some skew clusters required more than one Gaussian component to be represented. Moreover, Biernacki et al. [8] pointed out that the number of mixture components selected for estimating densities was typically more than that of clusters because of overlapping. Model selection methods based on clustering (complete) likelihood, such as the integrated complete likelihood (ICL) [8], the normalized maximum likelihood (NML) [9,10], and the decomposed normalized maximum likelihood (DNML) [11,12], have been proposed to obtain less-overlapping mixtures so that one component corresponds to one cluster. However, they have problems in that they need to define the shape of the clusters in advance. This leads to a trade-off between shape flexibility and component overlap in model-based clustering.

Others argue that overlap indicates that data points belong to more than one cluster. For example, in clustering documents by their topics, a document may have several topics. Such issues have been widely discussed in the field of overlapping clustering. For example, Banerjee et al. [13] extended the mixture model to allow the data to belong to multiple clusters based on membership matrices. Fu and Banerjee [14] considered the product of cluster distributions to represent multiple memberships of the data. Xu et al. [15] proposed methods for describing more complex memberships by calculating correlation weights between the data and the cluster. While these methods allow complex relationships between the data and the clusters, cluster shapes become simple.

The overlap is also used for measuring the complexity of clustering structures in the concept of MC [7]. It is a non-integer valued quantity, which implies the uncertainty of determining the number of clusters. MC was introduced in the scenario of change detection in [7]. This paper gives a new application scenario of MC in the context of quantifying clustering structures. Moreover, this paper also newly introduces NMC as a variant of MC, which turns out to be most effective in this context.

#### *2.2. Model*

We discuss the issue of constructing models achieving both flexible cluster shapes and interpretability. Allowing each cluster to have complex shapes is a solution to tackle this. For example, mixtures of non-normal distributions have been proposed for this purpose, as reviewed by Lee and McLachlan [16]. Modeling each cluster as a finite mixture model, called the mixture of mixture model or multi-layer mixture model, has been considered in this regard. Various methods have been proposed to estimate such mixture models based on maximum likelihood estimation [17,18] and Bayesian estimation [19,20]. However, additional parameters are required for assigning sub-components to upper-clusters in many cases because changes of assignment do not change the overall distribution. Merging mixture components [4–6] is an alternative way of composing mixture models using single-layer estimation. In this approach, the criteria to measure the degree of component overlap have to be identified. Although various concepts have been developed to measure the degree of overlap, such as entropy [5], misclassification rate [4,6], and unimodality [4], they have not been satisfactorily compared yet.

#### *2.3. Optimization*

Merging components has also been discussed in the scenario of optimizing parameters in the mixture models. Ueda et al. [21] proposed splitting and merging mixture components to obtain better estimations, and Minagawa et al. [22] revised their methods to search the models with higher likelihoods. Zhao et al. [23] considered randomly swapping the mixture components during optimization, which allows a more flexible search than splitting and merging components. Because these methods aim only to optimize the models, there remains the problem of interpreting them.

We also refer to agglomerative hierarchical clustering as a similar approach to merging components. Our methods are similar to the Bayesian hierarchical clustering methods [24,25] in that the number of merges is decided automatically. However, our approaches can not only create clusters, but also evaluate their shape and closeness under the assumption that the mixture models are given.

#### *2.4. Visualization*

Methods of interpreting clustering structures have been studied along with visualization methods. Visualizing the values of criteria with a dendrogram is useful for understanding cluster structures among sub-components [6]. Class-preserved projections [26] and parametric embedding [27] were proposed for visualizing structures among upper-clusters by reducing data dimension. We present a method to interpret both structures uniformly based on the MC and NMC.

#### **3. Merging Mixture Components**

We assume that data *x<sup>N</sup>* = *x*<sub>1</sub>, ... , *x<sub>N</sub>* and a finite mixture model are given. The probability distribution of the model *f* is written as follows:

$$f(\mathbf{x}) = \sum\_{k=1}^{K} \rho\_k \mathbf{g}\_k(\mathbf{x}),$$

where *K* denotes the number of components, *ρ*<sub>1</sub>, ... , *ρ<sub>K</sub>* denote the mixture proportions of the components, which sum to one, and *g*(*x*|*θ*<sub>1</sub>), ... , *g*(*x*|*θ<sub>K</sub>*) denote the probability distributions of the components, abbreviated as *g<sub>k</sub>*(*x*). We assume that the data *x<sup>N</sup>* are independently sampled from *f*. The random variable *X* following *f* is called an *observed variable*, because it can be observed as a data point. We also define the *latent variable* *Z* ∈ 𝒵 := {1, ... , *K*} as the index of the component from which *X* originated. The pair (*X*, *Z*) is called a *complete variable*. The distribution of the latent variable *P*(*Z*) and the conditional distribution of the observed variable *P*(*X*|*Z*) can be given by the following:

$$P(Z=k) = \rho\_{k\prime} \quad P(X|Z=k) = \mathcal{g}\_k(X).$$

If *f* is not known, we replace *f* with its estimate *f̂* under the assumption that *f̂* is so close to *f* that *x<sup>N</sup>* can be approximately regarded as samples from *f̂*.

We discuss identifying cluster structures in *x<sup>N</sup>* and *f* by merging mixture components as described below. First, we define a criterion function Crit : 𝒵 × 𝒵 → ℝ, which measures the degree of overlap or closeness between two components. For simplicity, we change the sign of the original definitions as needed so that Crit takes smaller values the closer the components are. Then, we choose the two closest components, i.e., the pair that minimizes the criterion, and merge them. By repeating the merging process several times, we finally obtain clusters. We show the pseudo-code and computational complexity of this procedure in Appendix A.
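The procedure can be illustrated with a minimal sketch. It is not the pseudo-code of Appendix A: the names `merge_components` and `toy_crit`, the list-of-rows representation of the posteriors, and the toy criterion itself are assumptions made for illustration. The merged posterior is taken as the sum of the two posteriors, which is the update rule given in Section 4.

```python
from itertools import combinations

def merge_components(gamma, crit, n_merges):
    """Greedily merge components (illustrative sketch, hypothetical names).

    gamma    : K rows, gamma[k][n] = posterior of component k at data point n
    crit     : function (row_i, row_j) -> float; smaller means closer
    n_merges : number of merge steps to perform
    Returns the merged posterior rows and the sets of original indices.
    """
    rows = [list(row) for row in gamma]          # copy, the input is not modified
    members = [{k} for k in range(len(rows))]
    for _ in range(n_merges):
        if len(rows) < 2:
            break
        # Choose the pair (i < j) that minimizes the criterion.
        i, j = min(combinations(range(len(rows)), 2),
                   key=lambda pair: crit(rows[pair[0]], rows[pair[1]]))
        # The posterior of the merged component is the sum of the two posteriors.
        rows[i] = [a + b for a, b in zip(rows[i], rows[j])]
        members[i] |= members[j]
        del rows[j], members[j]
    return rows, members

def toy_crit(row_i, row_j):
    """Toy criterion (an assumption, not from the paper): negative posterior overlap."""
    return -sum(a * b for a, b in zip(row_i, row_j))
```

Any of the criteria introduced in Section 5 could be passed as `crit`, provided it is expressed as a function of the two posterior rows with smaller values meaning closer components.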

#### **4. Essential Conditions**

In this section, we propose three *essential conditions* that the criteria should satisfy, so that the criteria can be compared in terms of the conditions. To establish the conditions, we restrict the criteria to those that can be calculated from the posterior probability of the latent variables {*γ<sub>k</sub>*(*x<sub>n</sub>*)}<sub>*k*,*n*</sub> defined as follows:

$$\gamma\_k(\mathbf{x}\_n) := P(Z = k | X = \mathbf{x}\_n) = \frac{\rho\_k g\_k(\mathbf{x}\_n)}{f(\mathbf{x}\_n)},$$

where *k* is the index of the component. After merging the components *i* and *j*, the posterior probability can be easily updated as follows:

$$\gamma\_{i\cup j}(\mathbf{x}\_n) := P(Z \in \{i, j\} | X = \mathbf{x}\_n) = \gamma\_i(\mathbf{x}\_n) + \gamma\_j(\mathbf{x}\_n).$$
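As a small illustration of the two formulas above, the following sketch computes the posteriors for a one-dimensional Gaussian mixture and the posterior of a merged pair. The one-dimensional setting and the function names (`gaussian_pdf`, `posteriors`, `merged_posterior`) are assumptions made for brevity; the paper itself works with general component distributions.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a one-dimensional Gaussian."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def posteriors(xs, weights, means, stds):
    """Return gamma[k][n] = rho_k g_k(x_n) / f(x_n)."""
    K, N = len(weights), len(xs)
    joint = [[weights[k] * gaussian_pdf(x, means[k], stds[k]) for x in xs]
             for k in range(K)]
    # f(x_n) is the sum of the joint terms over all components.
    f = [sum(joint[k][n] for k in range(K)) for n in range(N)]
    return [[joint[k][n] / f[n] for n in range(N)] for k in range(K)]

def merged_posterior(gamma, i, j):
    """Posterior of the merged component: gamma_i + gamma_j at every point."""
    return [a + b for a, b in zip(gamma[i], gamma[j])]
```

The sketch does not guard against numerical underflow of `f[n]` for points far from every component; a log-domain computation would be the usual remedy.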

Note that some other merging methods re-estimate the distribution of the merged components as a single component [4]. We do not consider these in this study because they forgo the benefit that the merged components can have complex shapes.

For later use, we define Best(Crit) and Worst(Crit) as the best and worst values that the criteria can take:

$$\begin{aligned} \text{Best}(\text{Crit}) &:= \min \text{Crit}(i, j) \quad \text{w.r.t.} \left\{ \gamma\_{k, n} \right\}\_{k, n}, \\ \text{Worst}(\text{Crit}) &:= \max \text{Crit}(i, j) \quad \text{w.r.t.} \left\{ \gamma\_{k, n} \right\}\_{k, n}, \end{aligned}$$

where {*γ<sub>k,n</sub>*}<sub>*k*,*n*</sub> is a set of *K* × *N* real values in [0, 1] that satisfies ∑<sub>*k*</sub> *γ<sub>k,n</sub>* = 1 for all *n*.

We now formulate the three conditions. They provide natural and minimal requirements on the behavior of the criteria in the extreme cases where the components entirely overlap or are entirely separated, and on the scale invariance of the criteria. The conditions for the moderate cases in which the components partially overlap should be investigated in further studies.

First, we define the condition that a criterion should take the best value when the two components entirely overlap. It is formally defined as follows.

**Definition 1.** *If a criterion satisfies that*

$$(\forall n, \ g\_i(\mathbf{x}\_n) = g\_j(\mathbf{x}\_n)) \Rightarrow \text{Crit}(i, j) = \text{Best}(\text{Crit}),$$

*then, we say that it satisfies the condition BO (*best in entirely overlap*).*

Next, we define the condition that the criterion should take the worst value when the two components are entirely separated.

**Definition 2.** *We consider that the sequence of the models* {*f<sub>t</sub>* = ∑<sub>*k*</sub> *ρ<sub>k,t</sub>* *g<sub>k,t</sub>*}<sup>∞</sup><sub>*t*=1</sub> *satisfies the following:*

$$\forall n, \; g\_{i,t}(\mathbf{x}\_n) g\_{j,t}(\mathbf{x}\_n) \to 0 \tag{1}$$

*as t* → ∞*. We define* Crit<sub>*t*</sub>(*i*, *j*) *as the criterion value based on f<sub>t</sub>. Then, if* (1) *implies that*

$$\lim\_{t \to \infty} \text{Crit}\_t(i, j) = \text{Worst}(\text{Crit}),$$

*we say that it satisfies the condition WS (*worst in entirely separate*).*

Note that this definition is written using limits to cover the case where the component distributions have support on the entire space, as the Gaussian distributions do.

Finally, we define the condition that the value of the criterion should be invariant with the scale of mixture proportions.

**Definition 3.** *We consider that the components i and j are isolated from the other components, i.e., the sequence of the models* {*f<sub>t</sub>* = ∑<sub>*k*</sub> *ρ<sub>k,t</sub>* *g<sub>k,t</sub>*}<sup>∞</sup><sub>*t*=1</sub> *satisfies the following:*

$$(g\_{i,t}(\mathbf{x}\_n) + g\_{j,t}(\mathbf{x}\_n)) g\_{k,t}(\mathbf{x}\_n) \to 0$$

*for all k* ≠ *i*, *j and n as t* → ∞*. In addition, we consider another sequence of mixture models* {*f̄<sub>t</sub>* = ∑<sub>*k*</sub> *ρ̄<sub>k,t</sub>* *g<sub>k,t</sub>*}<sup>∞</sup><sub>*t*=1</sub> *with different scales on the mixture proportions of the components i and j, i.e., ρ̄<sub>k,t</sub>* = *aρ<sub>k,t</sub>* (*k* = *i*, *j*) *holds for some a* > 0*. We denote the criterion value based on f<sub>t</sub> by* Crit<sub>*t*</sub>(*i*, *j*) *and the criterion value based on f̄<sub>t</sub> by the same symbol with an overline. Then, we say that the criterion satisfies the condition SI (*scale invariance*) if for any a, the following holds:*

$$\lim\_{t \to \infty} \text{Crit}\_t(i, j) = \lim\_{t \to \infty} \overline{\text{Crit}}\_t(i, j).$$

#### **5. Modifying Merging Methods**

In this section, we introduce the existing merging criteria and propose new criteria by modifying them so that they can satisfy more essential conditions.

#### *5.1. Entropy-Based Criterion*

First, we introduce the *entropy-based criterion* (Ent) proposed by Baudry et al. [5]. It selects the components that reduce the entropy of the latent variable the most. This criterion, denoted as CritEnt, is formulated as follows:

$$-\mathrm{Crit}\_{\mathrm{Ent}}(i,j) := \sum\_{n=1}^{N} \left( \Psi(\gamma\_i(\mathbf{x}\_n)) + \Psi\left(\gamma\_j(\mathbf{x}\_n)\right) - \Psi\left(\gamma\_{i \cup j}(\mathbf{x}\_n)\right) \right),$$

where Ψ(*x*) := −*x* log *x*.

However, it violates the conditions BO and SI. Therefore, we propose to modify it in two regards. First, we correct the scale of the weights to make CritEnt satisfy SI. We propose a new criterion CritNEnt1 defined as follows:

$$-\text{Crit}\_{\text{NEnt1}}(i,j) := \frac{-\text{Crit}\_{\text{Ent}}(i,j)}{N(\tilde{\rho}\_i + \tilde{\rho}\_j)},$$

where *ρ̃<sub>k</sub>* := ∑<sub>*n*</sub> *γ<sub>k</sub>*(*x<sub>n</sub>*)/*N*. This satisfies the condition SI.

Next, we propose removing the effects of the weight biases to make CritNEnt1 satisfy BO. We further introduce a new criterion CritNEnt2 defined as follows:

$$\begin{aligned} \text{Crit}\_{\text{NEnt2}}(i,j) &:= \frac{\text{Crit}\_{\text{NEnt1}}(i,j)}{\tilde{H}\_{i,j}(Z)},\\ \tilde{H}\_{i,j}(Z) &:= \sum\_{k \in \{i,j\}} \Psi\left(\frac{\tilde{\rho}\_k}{\tilde{\rho}\_i + \tilde{\rho}\_j}\right). \end{aligned}$$

This satisfies all conditions: BO, WS, and SI.
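The three entropy-based criteria can be sketched directly from their definitions. The function names below are hypothetical, and each criterion is written as a function of the two posterior rows *γ<sub>i</sub>*(*x<sub>n</sub>*) and *γ<sub>j</sub>*(*x<sub>n</sub>*); the convention Ψ(0) = 0 is assumed.

```python
import math

def psi(x):
    """Psi(x) = -x log x, with Psi(0) taken to be 0."""
    return -x * math.log(x) if x > 0 else 0.0

def crit_ent(gi, gj):
    """Crit_Ent: minus the entropy reduction obtained by merging i and j."""
    return -sum(psi(a) + psi(b) - psi(a + b) for a, b in zip(gi, gj))

def crit_nent1(gi, gj):
    """Crit_NEnt1: Crit_Ent divided by N (rho_i + rho_j) = sum_n (gamma_i + gamma_j)."""
    return crit_ent(gi, gj) / (sum(gi) + sum(gj))

def pair_entropy(gi, gj):
    """Entropy of the relative weights of components i and j."""
    si, sj = sum(gi), sum(gj)
    return psi(si / (si + sj)) + psi(sj / (si + sj))

def crit_nent2(gi, gj):
    """Crit_NEnt2: Crit_NEnt1 divided by the pair entropy."""
    return crit_nent1(gi, gj) / pair_entropy(gi, gj)
```

In this sketch, two identical components give `crit_nent2 == -1` regardless of how much posterior mass the other components take, whereas `crit_ent` scales with that mass, which illustrates why the normalization is needed for SI.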

#### *5.2. Directly Estimated Misclassification Probabilities*

Second, we introduce the criterion named directly estimated misclassification probabilities (DEMP) [4]. It selects the components with the highest misclassification probabilities. The criterion is formulated as follows:

$$-\text{Crit}\_{\text{DEMP}}(i,j) := \max\left\{ \overline{\mathcal{M}}\_{j,i}, \overline{\mathcal{M}}\_{i,j} \right\},$$

where

$$\begin{aligned} \overline{\mathcal{M}}\_{j,i} := \tilde{P}(\hat{z}(X) = j | Z = i) &:= \frac{\sum\_{n} \gamma\_{i}(\mathbf{x}\_{n}) \mathbf{1}(\hat{z}(\mathbf{x}\_{n}) = j)}{N \tilde{\rho}\_{i}}, \\ \hat{z}(\mathbf{x}) &:= \underset{k = 1, \ldots, K}{\arg\max}\, \gamma\_{k}(\mathbf{x}). \end{aligned}$$

However, this violates the condition BO when *ẑ*(*x<sub>n</sub>*) is neither *i* nor *j* for some *n*. Therefore, we modify it by restricting the choice of the latent variable to component *i* or *j*. We define *ẑ<sub>i,j</sub>*(*x*) as follows:

$$\hat{z}\_{i,j}(\mathbf{x}) := \underset{k=i,j}{\arg\max}\, \gamma\_k(\mathbf{x}),$$

and define CritDEMP2 by replacing *ẑ*(*x*) with *ẑ<sub>i,j</sub>*(*x*) in the definition of CritDEMP. The resulting criterion satisfies all essential conditions.
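The difference between DEMP and DEMP2 can be seen in a short sketch. The function name `crit_demp` and its `restrict` flag are hypothetical; ties in the arg max are broken in favor of the first candidate, which is an implementation choice not specified in the text.

```python
def crit_demp(gamma, i, j, restrict=False):
    """Crit_DEMP(i, j), or Crit_DEMP2(i, j) when restrict=True.

    gamma[k][n] is the posterior of component k at data point n.
    The value is negated so that smaller means closer.
    """
    K, N = len(gamma), len(gamma[0])
    # DEMP2 restricts the hard assignment to the two components under comparison.
    candidates = (i, j) if restrict else range(K)
    z_hat = [max(candidates, key=lambda k: gamma[k][n]) for n in range(N)]

    def misclassification(src, dst):
        # Estimated probability that a point from component src is assigned to dst.
        mass = sum(gamma[src][n] for n in range(N) if z_hat[n] == dst)
        return mass / sum(gamma[src])

    return -max(misclassification(i, j), misclassification(j, i))
```

For example, when components 0 and 1 coincide but a third component has a larger posterior everywhere, the unrestricted assignment never selects 0 or 1, so DEMP reports no overlap, while DEMP2 reports full overlap.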

#### *5.3. Mixture Complexity*

Finally, we propose a new criterion based on mixture complexity (MC) [7]. MC extends (the logarithm of) the number of clusters to a real value that accounts for the overlap and weight bias among the components. It is defined based on information theory and formulated as follows:

$$\text{MC}\left(\{\gamma\_k(\mathbf{x}\_n)\}\_{k,n};\left\{w\_{n}\right\}\_{n}\right) := \sum\_{k=1}^{K} \Psi(\tilde{\rho}\_k) - \sum\_{n=1}^{N} \frac{w\_n}{W} \sum\_{k=1}^{K} \Psi\left(\gamma\_k(\mathbf{x}\_n)\right),$$

where {*w<sub>n</sub>*}<sub>*n*</sub> denotes the weights of the data *x<sup>N</sup>*, *W* := ∑<sub>*n*</sub> *w<sub>n</sub>* denotes their sum, and *ρ̃<sub>k</sub>* is redefined as *ρ̃<sub>k</sub>* := ∑<sub>*n*</sub> *w<sub>n</sub>γ<sub>k</sub>*(*x<sub>n</sub>*)/*W*. Examples of MC for mixtures of two components are shown in Figure 3. In these examples, the exponential of the MC takes values between 1 and 2, according to the uncertainty in the number of clusters induced by the overlap or weight bias between the components.

**Figure 3.** Examples of MC for mixtures of two components. Images are obtained from [7].

We first propose a new merging criterion CritMC to select the components whose MCs are the smallest. It is defined as follows:

$$\text{Crit}\_{\text{MC}}(i,j) := \text{MC}\left(\left\{\frac{\gamma\_k(\mathbf{x}\_n)}{\gamma\_{i \cup j}(\mathbf{x}\_n)}\right\}\_{k \in \{i,j\}, n}; \left\{\gamma\_{i \cup j}(\mathbf{x}\_n)\right\}\_n\right).$$

However, this does not satisfy the condition WS because of the effects of the weight biases. Therefore, we modify it by removing the biases to propose a new criterion, which we call the *normalized mixture complexity* (NMC) CritNMC. The criterion is defined as follows:

$$\text{Crit}\_{\text{NMC}}(i, j) := \frac{\text{Crit}\_{\text{MC}}(i, j)}{\tilde{H}\_{i, j}(Z)}.$$

It satisfies all conditions BO, WS, and SI. Note that it is equivalent to CritNEnt2 because CritNMC = 1 + CritNEnt2.
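A minimal sketch of MC and the two MC-based criteria follows. The function names are hypothetical, Ψ(0) = 0 is assumed, and data points at which both posteriors are zero are skipped because they carry zero weight.

```python
import math

def psi(x):
    """Psi(x) = -x log x, with Psi(0) taken to be 0."""
    return -x * math.log(x) if x > 0 else 0.0

def mixture_complexity(gamma, w):
    """MC({gamma_k(x_n)}; {w_n}) for posteriors gamma[k][n] and weights w[n]."""
    W = sum(w)
    rho = [sum(wn * g for wn, g in zip(w, row)) / W for row in gamma]
    conditional = sum(wn / W * sum(psi(row[n]) for row in gamma)
                      for n, wn in enumerate(w))
    return sum(psi(r) for r in rho) - conditional

def crit_mc(gi, gj):
    """Crit_MC(i, j): MC of the pair, weighted by the merged posterior."""
    w = [a + b for a, b in zip(gi, gj)]
    keep = [n for n, wn in enumerate(w) if wn > 0]   # zero-weight points contribute nothing
    relative = [[g[n] / w[n] for n in keep] for g in (gi, gj)]
    return mixture_complexity(relative, [w[n] for n in keep])

def crit_nmc(gi, gj):
    """Crit_NMC(i, j): Crit_MC divided by the entropy of the relative weights."""
    si, sj = sum(gi), sum(gj)
    h = psi(si / (si + sj)) + psi(sj / (si + sj))
    return crit_mc(gi, gj) / h
```

For two entirely separated components with relative weights 0.75 and 0.25, `crit_mc` returns the entropy of (0.75, 0.25), which is below log 2, while `crit_nmc` returns 1. This is the weight-bias effect that the normalization removes.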

We summarize the relationships between the criteria and the essential conditions in Table 1. The modified criteria satisfy more of the conditions than the original ones.

**Table 1.** Summary of the relationships between the criteria and the essential conditions. Check marks are attached to the conditions that are satisfied.


#### **6. Stopping Condition**

We also propose a new stopping condition based on NMC. First, we calculate the NMC for the (unmerged) mixture model *f* defined as follows:

$$\text{NMC}\_0 := \frac{\text{MC}\left(\{\gamma\_k(\mathbf{x}\_n)\}\_{k,n}; \{1\}\_n\right)}{\widehat{H}(Z)}.$$

Since it represents the average degree of separation among the components of *f*, it can be used as the stopping condition for merging. Before merging components *i* and *j*, we compare CritNMC(*i*, *j*) to NMC<sub>0</sub>. If CritNMC(*i*, *j*) ≥ NMC<sub>0</sub>, then the merging algorithm halts without merging components *i* and *j*. Otherwise, the algorithm merges components *i* and *j* and continues.

Note that this stopping condition can also be applied when a criterion other than CritNMC is used. In this case, we use that criterion to search for the two closest components and use NMC to decide whether to merge them.
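The stopping rule can be sketched as follows. The denominator of NMC<sub>0</sub> is assumed here to be the entropy of the estimated mixture proportions, ∑<sub>*k*</sub> Ψ(*ρ̃<sub>k</sub>*), by analogy with the pairwise definition in Section 5; the function names are hypothetical.

```python
import math

def psi(x):
    """Psi(x) = -x log x, with Psi(0) taken to be 0."""
    return -x * math.log(x) if x > 0 else 0.0

def nmc0(gamma):
    """NMC of the unmerged model with unit data weights.

    Assumption: the denominator is sum_k Psi(rho_k), with rho_k = sum_n gamma[k][n] / N.
    """
    N = len(gamma[0])
    rho = [sum(row) / N for row in gamma]
    h = sum(psi(r) for r in rho)
    conditional = sum(psi(g) for row in gamma for g in row) / N
    return (h - conditional) / h

def should_merge(crit_nmc_ij, nmc0_value):
    """Merge only while the pair's NMC is strictly below NMC_0."""
    return crit_nmc_ij < nmc0_value
```

When the closest pair is found with a different criterion, only the argument `crit_nmc_ij` needs to be computed for that pair before calling `should_merge`.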

#### **7. Clustering Summarization**

In this section, we propose methods to quantitatively explain the merging results, using the MC and NMC.

We consider that a mixture model with *K* components is merged into *L* upper-components. We define the sets *I*<sub>1</sub>, ... , *I<sub>L</sub>*, which partition {1, ... , *K*}, as the sets of the indices of the components contained in each upper-component. Then, the MC and NMC among the upper-components, denoted as MC(up) and NMC(up), respectively, can be calculated as follows:

$$\begin{aligned} \text{MC}(\text{up}) &:= \text{MC}\left(\left\{\sum\_{k \in I\_l} \gamma\_k(\mathbf{x}\_n)\right\}\_{l,n}; \{1\}\_n\right), \\ \text{NMC}(\text{up}) &:= \frac{\text{MC}(\text{up})}{\sum\_l \Psi(\tilde{\pi}\_l)}, \end{aligned}$$

where *π̃<sub>l</sub>* denotes the weight of the upper-component *l* calculated as follows:

$$\tilde{\pi}\_l := \frac{1}{N} \sum\_{n} \sum\_{k \in I\_l} \gamma\_k(\mathbf{x}\_n) = \sum\_{k \in I\_l} \tilde{\rho}\_k.$$

For each *l*, the MC and NMC in the sub-components within the upper-component *l*, written as MC(*l*) and NMC(*l*), respectively, can be calculated as follows:

$$\begin{split} \text{MC}(l) &:= \text{MC}\left(\left\{\frac{\gamma\_{k}(\mathbf{x}\_{n})}{\sum\_{k' \in I\_{l}} \gamma\_{k'}(\mathbf{x}\_{n})}\right\}\_{k \in I\_{l}, n}; \left\{\sum\_{k' \in I\_{l}} \gamma\_{k'}(\mathbf{x}\_{n})\right\}\_{n}\right), \\ \text{NMC}(l) &:= \frac{\text{MC}(l)}{\sum\_{k \in I\_{l}} \Psi\left(\tilde{\rho}\_{l}^{(k)}\right)}, \end{split}$$

where *ρ̃*<sub>*l*</sub><sup>(*k*)</sup> denotes the relative weight of the sub-component *k* ∈ *I<sub>l</sub>* calculated as *ρ̃*<sub>*l*</sub><sup>(*k*)</sup> := *ρ̃<sub>k</sub>* / ∑<sub>*k′*∈*I<sub>l</sub>*</sub> *ρ̃<sub>k′</sub>*. NMC is undefined if the denominator is 0.

MC and NMC quantify, in different ways, the degree to which the components can be regarded as clusters: larger values indicate that the components more clearly appear as distinct clusters. MC quantifies this by measuring (the logarithm of) the number of clusters continuously, considering the ambiguity induced by the overlap and weight bias among the components. It takes a value between 0 and the logarithm of the number of components. In contrast, NMC measures the scattering of the components based only on their overlap. It takes a value between 0 and 1. They also have the desirable properties that they are scale invariant and can quantify overlaps among more than two components.

Therefore, we propose the summarization of clustering structures by listing MC(up), NMC(up), component weights, MC(*l*), and NMC(*l*) in a table, which we call the *clustering summarization*. The clustering summarization is useful for evaluating the confidence level of the clustering results.
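A clustering summarization can be assembled from the definitions above, as in the following sketch. The function names and the dictionary layout of the result are assumptions; NMC is returned as `None` where its denominator is 0, matching the remark that it is undefined in that case.

```python
import math

def psi(x):
    """Psi(x) = -x log x, with Psi(0) taken to be 0."""
    return -x * math.log(x) if x > 0 else 0.0

def mc(gamma, w):
    """MC({gamma_k(x_n)}; {w_n}) for posteriors gamma[k][n] and weights w[n]."""
    W = sum(w)
    rho = [sum(wn * g for wn, g in zip(w, row)) / W for row in gamma]
    conditional = sum(wn / W * sum(psi(row[n]) for row in gamma)
                      for n, wn in enumerate(w))
    return sum(psi(r) for r in rho) - conditional

def summarize(gamma, groups):
    """gamma[k][n]: posteriors; groups: index lists I_1, ..., I_L partitioning the components."""
    N = len(gamma[0])
    upper = [[sum(gamma[k][n] for k in g) for n in range(N)] for g in groups]
    mc_up = mc(upper, [1.0] * N)
    pi = [sum(row) / N for row in upper]
    h_up = sum(psi(p) for p in pi)
    result = {"MC(up)": mc_up,
              "NMC(up)": mc_up / h_up if h_up > 0 else None,
              "clusters": []}
    for g, u, p in zip(groups, upper, pi):
        keep = [n for n in range(N) if u[n] > 0]      # zero-weight points contribute nothing
        relative = [[gamma[k][n] / u[n] for n in keep] for k in g]
        mc_l = mc(relative, [u[n] for n in keep])
        total = sum(sum(gamma[k]) for k in g)
        h_l = sum(psi(sum(gamma[k]) / total) for k in g)
        result["clusters"].append({"weight": p, "MC": mc_l,
                                   "NMC": mc_l / h_l if h_l > 0 else None})
    return result
```

Taking `math.exp` of the returned MC values gives the effective numbers of clusters discussed in the text.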

We show an example of the clustering summarization using the mixture model illustrated in Figure 4. In this example, there are four Gaussian components as illustrated in Figure 4a, and two merged clusters on the left and right sides as illustrated in Figure 4b–d. The clustering summarization is presented in Table 2. For the upper-components, the exponential of MC is almost two, and the NMC is almost one. This indicates that the two upper-components can clearly be regarded as different clusters. Within both upper-components, the exponential of MC is larger than one. This indicates that they have more complex shapes than a single component. Moreover, the structure within Component 1 is more complex than that within Component 2, because its MC and NMC are larger.

**Figure 4.** Example of the merged mixture model. Recoverable panel labels: (**c**) Cluster 1, (**d**) Cluster 2. Images are obtained from [7].


**Table 2.** Example of a clustering summarization.

#### **8. Experiments**

In this section, we present the experimental results to demonstrate the effectiveness of merging the mixture components and modifying the criteria.

#### *8.1. Analysis of Artificial Dataset*

To reveal the differences among the criteria, we conducted experiments with artificially generated Gaussian mixture models. First, we randomly created a two-dimensional Gaussian mixture model *f* = ∑<sup>*K*</sup><sub>*k*=1</sub> *ρ<sub>k</sub>* 𝒩(*x*; *μ<sub>k</sub>*, Σ<sub>*k*</sub>) as follows:

$$\begin{aligned} K &:= 50, \\ (\rho\_1, \dots, \rho\_K) &\sim \text{Dir}(1, \dots, 1), \\ \mu\_{1}, \dots, \mu\_K &\overset{\text{i.i.d.}}{\sim} \mathcal{N}\left(\mu; [0, 0], 3^2 \times I\_2\right), \\ a\_{1}, b\_{1}, \dots, a\_{K}, b\_K &\overset{\text{i.i.d.}}{\sim} \mathcal{U}[0.5, 1.5], \\ \Sigma\_k &:= [[a\_k, 0], [0, b\_k]] \quad (k = 1, \dots, K), \end{aligned}$$

where Dir(*α*, ... , *α*) denotes the Dirichlet distribution, and *U*[*m*, *M*] denotes the uniform distribution from *m* to *M*. Then, we sampled 5000 points *x*<sup>5000</sup> from *f* , and ran the merging algorithms without stopping conditions. The algorithms were evaluated using the (maximum) intra-cluster distance Dintra and (minimum) inter-cluster distance Dinter defined as follows:
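The data-generating process can be sketched with the standard library alone. The function names and the seed are assumptions; the Dirichlet draw uses the standard construction of normalizing independent Gamma(1, 1) variates, and *a<sub>k</sub>*, *b<sub>k</sub>* are treated as variances because they are the diagonal entries of Σ<sub>*k*</sub>.

```python
import random

def random_gmm(K, rng):
    """Draw the parameters of a two-dimensional Gaussian mixture as specified above."""
    # Dir(1, ..., 1): normalize independent Gamma(1, 1) draws.
    g = [rng.gammavariate(1.0, 1.0) for _ in range(K)]
    total = sum(g)
    rho = [v / total for v in g]
    # Means: N([0, 0], 3^2 I_2), i.e., standard deviation 3 per coordinate.
    mu = [(rng.gauss(0.0, 3.0), rng.gauss(0.0, 3.0)) for _ in range(K)]
    # Diagonal covariance entries a_k, b_k ~ U[0.5, 1.5].
    var = [(rng.uniform(0.5, 1.5), rng.uniform(0.5, 1.5)) for _ in range(K)]
    return rho, mu, var

def sample_points(rho, mu, var, n, rng):
    """Sample n points: pick a component by its weight, then draw from its Gaussian."""
    ks = rng.choices(range(len(rho)), weights=rho, k=n)
    return [(rng.gauss(mu[k][0], var[k][0] ** 0.5),
             rng.gauss(mu[k][1], var[k][1] ** 0.5)) for k in ks]

rng = random.Random(0)                     # fixed seed, chosen arbitrarily
rho, mu, var = random_gmm(50, rng)
xs = sample_points(rho, mu, var, 5000, rng)
```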

$$\begin{aligned} \mathcal{D}\_{\text{intra}} &:= \max\_{k=1,\dots,K} \frac{\sum\_{n} \gamma\_{k}(\mathbf{x}\_{n}) \left\| \mathbf{x}\_{n} - \tilde{\mu}\_{k} \right\|^{2}}{\sum\_{n'} \gamma\_{k}(\mathbf{x}\_{n'})}, \\ \mathcal{D}\_{\text{inter}} &:= \min\_{1 \le i < j \le K} \left\| \tilde{\mu}\_{i} - \tilde{\mu}\_{j} \right\|^{2}, \end{aligned}$$

where *μ̃*<sub>1</sub>, ... , *μ̃<sub>K</sub>* denote the centers of the components defined as

$$
\tilde{\mu}\_k := \frac{\sum\_{n} \gamma\_k(\mathbf{x}\_{n}) \mathbf{x}\_{n}}{\sum\_{n'} \gamma\_k(\mathbf{x}\_{n'})}.
$$

The clustering structure is considered *better* when Dintra is smaller and Dinter is larger. Both distances are measured for several values of *K* and compared among the algorithms with different criteria. Although we might obtain better results for these metrics by using them as merging criteria, in a way similar to hierarchical clustering [28,29], we used them only for comparison rather than optimizing them.
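The two evaluation distances can be computed from the (merged) posteriors as in this sketch; the function names are hypothetical, and points are represented as tuples of coordinates.

```python
def centers(gamma, xs):
    """Posterior-weighted center of each (merged) component."""
    dim = len(xs[0])
    out = []
    for row in gamma:
        s = sum(row)
        out.append(tuple(sum(g * x[d] for g, x in zip(row, xs)) / s
                         for d in range(dim)))
    return out

def squared_distance(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def d_intra(gamma, xs):
    """Maximum over clusters of the posterior-weighted mean squared distance to the center."""
    mu = centers(gamma, xs)
    return max(sum(g * squared_distance(x, m) for g, x in zip(row, xs)) / sum(row)
               for row, m in zip(gamma, mu))

def d_inter(gamma, xs):
    """Minimum squared distance between the centers of two distinct clusters."""
    mu = centers(gamma, xs)
    return min(squared_distance(mu[i], mu[j])
               for i in range(len(mu)) for j in range(i + 1, len(mu)))
```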

The experiments were performed 100 times by randomly generating *f* and the data. Accordingly, the ranking of the criteria was calculated for each distance. Table 3 presents the average rank of each criterion. As seen from the table, the modifications of the criteria improved the rank. In addition, DEMP2 and NMC, satisfying all essential conditions, were always in the top three. These results indicate the effectiveness of the essential conditions.


**Table 3.** Average ranks of the criteria. For each *K*, the best rank is denoted in boldface.

To further investigate the relationships between the essential conditions and the resulting cluster structures, Figure 5 illustrates, for one trial, the cluster whose intra-cluster distance was the largest. For the criterion Ent, one cluster continued to grow. This is because Ent lacks the condition SI and therefore favors larger clusters. For the criterion NEnt1, the growth of the larger clusters was mitigated by adding the condition SI to Ent. Nevertheless, the intra-cluster distances were still large because NEnt1 lacked the condition BO: it tended to merge larger and more distant components rather than smaller and closer ones, which created unnecessarily large clusters. The criterion NMC remedied this disadvantage by adding the condition BO to NEnt1. For the criterion MC, distant components were merged because the condition WS was not satisfied. NMC overcame this by adding the condition WS to MC. The differences between DEMP and DEMP2 were unclear in Figure 5c,d, and both criteria elucidated the cluster structure well because they satisfied relatively many conditions. We conclude that the essential conditions are effective for obtaining better cluster structures.

**Figure 5.** Scatter plots for the cluster with *K* = 20 whose intra-cluster distance is the largest. The color intensity corresponds to the posterior probabilities. The numbers in parentheses show Dintra. Recoverable panel labels: (**d**) DEMP2 (3.45), (**e**) MC (3.92), (**f**) NMC (2.79).

#### *8.2. Analysis of Real Dataset*

We discuss the results of applying the merging algorithms and clustering summarization to eight types of real datasets with true cluster labels. The details of the datasets and processing are described in Appendix B.

#### 8.2.1. Evaluation of Clustering Using True Labels

First, we compared the clustering performance of the merging algorithms by measuring the similarity between the estimated and true cluster labels. Formally, given the dataset {*x<sub>n</sub>*}<sub>*n*</sub> and the true labels {*z*<sup>⋆</sup><sub>*n*</sub>}<sub>*n*</sub>, we first estimated the clustering structures using {*x<sub>n</sub>*}<sub>*n*</sub> without seeing {*z*<sup>⋆</sup><sub>*n*</sub>}<sub>*n*</sub>, and obtained the estimated labels {*ẑ<sub>n</sub>*}<sub>*n*</sub>. We define *K*<sup>⋆</sup> and *K̂* as the numbers of true and estimated clusters, respectively. Then, we evaluated the similarity between {*z*<sup>⋆</sup><sub>*n*</sub>}<sub>*n*</sub> and {*ẑ<sub>n</sub>*}<sub>*n*</sub> using the adjusted Rand index (ARI) [30] and F-measure. ARI takes values between -1 and 1, and F-measure takes values between 0 and 1. Larger values correspond to better clustering. Both indices can be applied when the numbers of true and estimated clusters differ.
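For reference, the ARI can be computed from the contingency counts of the two labelings with the standard pair-counting formula. The sketch below is a generic implementation, not the evaluation code used in the experiments; the function name and the handling of the degenerate case are assumptions.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(true_labels, pred_labels):
    """Adjusted Rand index between two labelings of the same data points."""
    n = len(true_labels)
    # Pairs of points that share a cluster in both labelings.
    sum_ij = sum(comb(c, 2) for c in Counter(zip(true_labels, pred_labels)).values())
    sum_a = sum(comb(c, 2) for c in Counter(true_labels).values())
    sum_b = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        return 1.0   # degenerate case, e.g., both labelings use a single cluster
    return (sum_ij - expected) / (max_index - expected)
```

The index depends only on which points are grouped together, so it is unaffected by how the cluster labels are named and works when the two labelings have different numbers of clusters.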

To run the merging algorithms, the mixture models should be estimated first. In our experiments, we estimated them by the variational Bayes Gaussian mixture model with *K* = 20 [31] implemented in the Scikit-learn package [32]; we adopted this, as it exhibited good performance in our experiments. We used the prior distributions of the mixture proportions as the Dirichlet distributions with *α* = 0.1, and we set the other parameters for prior distributions as the default values in the package. For each dataset, we fitted the algorithm ten times with different initializations and used the best one.

We compared the merging algorithms with three types of model-based clustering algorithms based on the Gaussian mixture model, which are summarized in Table 4. First, we estimated the number of components using BIC [33]. It selects a suitable model for describing the densities, and the mixture components tend to overlap. Nevertheless, it has been widely used for clustering by regarding each component as a cluster. Second, we estimated the number of clusters using DNML [11,12]. It selects a model whose components can be regarded as clusters by considering the description length of the latent and observed variables. Finally, we estimated the clusters as the mixture of Gaussian mixture models implemented by Malsiner-Walli et al. [20]. By fixing two integers *K* and *L*, *K* Gaussian mixture models were estimated with *L* components each. The number of clusters was automatically adjusted by shrinking the redundant clusters. As in the original paper, we set *K* = 30, *L* = 15 (and some specific parameters in the paper) for the DLB dataset and *K* = 10, *L* = 5 for the other datasets.

**Table 4.** Overview of the comparison methods.


We estimated the models ten times and compared the average scores among the methods. The average numbers of clusters are listed in Table 5, and the F-measure and ARI are listed in Tables 6 and 7, respectively.

For the merging algorithms, two results are reported: the clustering that achieved the best score and the clustering obtained by the stopping heuristic proposed in Section 6. The best scores of the merging algorithms exceeded those of all other methods for six out of eight datasets. In particular, the merging methods satisfying many essential conditions, such as DEMP, DEMP2, and NMC, obtained high scores with a smaller number of clusters. Therefore, the merging algorithms that satisfy more essential conditions are effective for elucidating the clustering structures. Moreover, the scores with the NMC-based stopping condition exceeded those of all other methods for four out of eight datasets.

**Table 5.** Estimated number of clusters. Merge (best F-measure) is the number of clusters when F-measure is highest. Merge (best ARI) is the number of clusters when ARI is highest. Merge (NMC) is the number of clusters obtained by the NMC-based stopping condition.




**Table 6.** F-measure for the real datasets. For each merging algorithm, scores that exceed all comparison methods are denoted in boldface.



**Table 7.** ARI for the real datasets. For each merging algorithm, scores that exceed all comparison methods are denoted in boldface.

To further investigate the relationships between the performances of the algorithms and the shapes of the datasets, we estimated the proportion of outliers based on the *k*-nearest neighbor distances. We calculated the ratio of the 5-nearest neighbor distance D<sub>nn</sub><sup>(5)</sup>(*x<sub>n</sub>*) to its average (1/*N*) ∑<sub>*n′*</sub> D<sub>nn</sub><sup>(5)</sup>(*x<sub>n′</sub>*) for each data point, and we plotted the proportions of points for which the ratio exceeded 2.0, 3.0, 4.0, and 5.0 in Figure 6. As seen from the figure, the datasets where the merging methods did not work well, such as AIS, DLB, WSC, and YST, contained relatively many outliers. This is reasonable because the merging algorithms do not aim to merge distant clusters. We can conclude that the merging methods are particularly effective when the datasets have fewer outliers or when we want to find the aggregated clusters.
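The outlier proportion described above can be sketched with a brute-force nearest-neighbor search. The function names are hypothetical, the Euclidean distance is assumed, and the quadratic-time search is only meant for small illustrative inputs.

```python
import math

def knn_distances(points, k):
    """Distance from each point to its k-th nearest neighbor (brute force)."""
    out = []
    for i, p in enumerate(points):
        d = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        out.append(d[k - 1])
    return out

def outlier_proportion(points, k=5, threshold=2.0):
    """Proportion of points whose k-NN distance exceeds threshold times the average."""
    d = knn_distances(points, k)
    mean = sum(d) / len(d)
    return sum(1 for v in d if v / mean > threshold) / len(d)
```

Calling `outlier_proportion` with thresholds 2.0, 3.0, 4.0, and 5.0 gives the four proportions plotted per dataset.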

**Figure 6.** The proportions of the data *x<sub>n</sub>* that satisfy D<sub>nn</sub><sup>(5)</sup>(*x<sub>n</sub>*) / [(1/*N*) ∑<sub>*n′*</sub> D<sub>nn</sub><sup>(5)</sup>(*x<sub>n′</sub>*)] > 2.0, 3.0, 4.0, 5.0.

#### 8.2.2. Results of Clustering Summarization

Next, we analyzed the results of the merging methods using the clustering summarization proposed in Section 7. As examples, we show one result obtained using the NMC criterion and the NMC-based stopping condition for the Flea beetles and Wisconsin breast cancer datasets. The clustering results are summarized in Tables 8 and 9, respectively. For the upper-components in the Flea beetles dataset, the exponential of MC(up) was close to 3.0, and NMC(up) was close to 1.0; we see that the effective number of clusters was around three, and the clusters were well-separated. Components 2 and 3 were unmerged, and the exponential of MC and the NMC of Component 1 were close to 1.0 and 0.0, respectively. This indicates that each cluster can be represented by almost a single Gaussian distribution. Furthermore, the exponential of MC and the NMC of the upper-components in the Wisconsin breast cancer dataset were 1.66 and 0.763, respectively, which suggests that the two clusters partially overlapped. For Components 1 and 2, the NMCs were relatively large. This shows that partially separated sub-components are needed to describe each component. The MC of Component 2 was smaller than that of Component 1. Thus, Component 2 is expected to have a simpler shape than Component 1; however, it seemed to contain small sub-components that might be outliers because its NMC was larger. Plots of the predicted clusters are illustrated for the Flea beetles and Wisconsin breast cancer datasets in Figures 7 and 8, respectively. We observe that the predictions described above match the actual plots. Therefore, we can reveal significant information about the clustering structures by observing the clustering summarizations.


**Table 8.** Clustering summarization for the Flea beetles dataset.

**Table 9.** Clustering summarization for the Wisconsin breast cancer dataset.


**Figure 7.** Predicted cluster labels for the Flea beetles dataset.

**Figure 8.** Predicted cluster labels for the Wisconsin breast cancer dataset.

8.2.3. Relationships between Clustering Summarization and Clustering Quality

Finally, we confirmed that MC and NMC in the sub-components were also related to the quality of classification. To confirm this, we conducted the additional experiments discussed below. First, we ran the merging algorithms until *K* = 1 without the stopping conditions. Then, for every merged cluster created at *K* = *K*start, ... , 1, we counted the number of data points classified into it. We define *N*<sub>*C*</sub><sup>(*k*)</sup> as the number of points with true label *k* classified into the merged cluster *C*. Then, we evaluated the quality of the cluster *C* using the entropy calculated as follows:

$$H\_{C} = -\sum\_{k=1}^{K^\star} \frac{N\_{C}^{(k)}}{\sum\_{k'} N\_{C}^{(k')}} \log \frac{N\_{C}^{(k)}}{\sum\_{k'} N\_{C}^{(k')}},$$

where the clusters *C* with ∑<sub>*k*</sub> *N*<sub>*C*</sub><sup>(*k*)</sup> = 0 were omitted. This takes values between 0 and log *K*<sup>⋆</sup>. Smaller values are preferred, because *H<sub>C</sub>* becomes small when most of the points within the cluster share the same true label. We calculated the MC/NMC and *H<sub>C</sub>* within the clusters for all datasets and merging algorithms, and we plotted the relationships between them in Figure 9. Note that the unmerged clusters were omitted because the NMC could not be defined. From the figure, it is evident that both MC and NMC had positive correlations with *H<sub>C</sub>*. The correlation coefficients were 0.794 and 0.637 for MC and NMC, respectively. This observation is useful in applications. If the obtained cluster has smaller MC and NMC, then it is more likely to contain only one group. Otherwise, we need to assume that it contains more than one group. Therefore, we conclude that MC and NMC indicate the confidence level of the cluster structures.
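The entropy *H<sub>C</sub>* of one merged cluster can be computed from the true labels of the points assigned to it, as in this sketch. The function name is hypothetical; which points count as classified into *C* is left to the caller.

```python
import math
from collections import Counter

def label_entropy(true_labels_in_cluster):
    """Entropy H_C of the true labels of the points classified into one cluster."""
    counts = Counter(true_labels_in_cluster)
    total = sum(counts.values())
    # Labels that do not occur in the cluster contribute 0 and are simply absent here.
    return -sum(c / total * math.log(c / total) for c in counts.values())
```

A cluster containing a single true label gives 0, and a cluster split evenly between two labels gives log 2.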

**Figure 9.** Scatter plots of the MC/NMC and the entropy of the true cluster label.

#### **9. Discussion**

To improve the interpretability of the mixture models with overlap, we have established novel methodologies to merge the components and summarize the results.

For merging mixture components, we proposed essential conditions that the merging criteria should satisfy. Although there have been studies creating some rules in the clustering approach [34,35], they have not been applied to clustering by merging components. The proposed essential conditions for merging criteria contributed to comparing and modifying existing criteria. The limitation of our conditions is that they only provide the necessary conditions for extreme cases, where the components are entirely overlapped or separated. The conditions for the moderate cases in which the components partially overlap should be investigated in further studies.

We also proposed a novel methodology to interpret the merging results based on clustering summarization. While previous studies [6,26,27] have focused on interpreting the structures among sub-components or upper-clusters only, our methods can quantify both structures uniformly based on the MC and NMC. They represented the overview of the structures in the mixture models by evaluating how much the components were distinguished based on the degree of overlap and weight bias.

We verified the effectiveness of our methods using artificial and real datasets. In the artificial data experiments, we confirmed that the intra- and inter-cluster distances improved with the modification of the criteria. Further, by observing the clusters with maximum intra-cluster distance, we found that the essential conditions helped to prevent the clusters from merging distant components or growing too much. In the real data experiments, we confirmed that the best scores of the proposed methods were better than those of the comparison methods for many datasets, and the scores obtained using the stopping condition were also better for the datasets containing relatively fewer outliers. In addition, we confirmed that the clustering summary was helpful for interpreting the merging results. It was related to the shape of the clusters, weight biases, and the existence of outliers. Further, we found that the MC and NMC within the components were also related to the quality of the classification. Therefore, the clustering summary also represented the confidence level of the cluster structures.

#### **10. Conclusions**

We have established the framework of theoretically interpreting overlapping mixture models by merging the components and summarizing merging results. First, we proposed three essential conditions for evaluating cluster-merging methods. They declared necessary properties that the merging criterion should satisfy. In this framework, we considered Ent, DEMP, and MC and their modifications to investigate whether they satisfied the essential conditions. The stopping condition based on NMC was also proposed.

Moreover, we proposed the clustering summarization based on MC and NMC. They quantify how overlapped the clusters are, how biased the clustering structure is, and how scattered the components are in a respective cluster. We can conduct this analysis from higher level clusters to lower level components to give a comprehensive survey of the global clustering structure. We then quantitatively explained the shape of the clusters, weight biases, and existence of the outliers.

In the experiments, we empirically demonstrated that the modification of the merging criteria improved the ability to find better clustering structures. We also investigated the merging order for each criterion and found that the essential conditions were helpful to prevent the clusters from merging distant components or growing too much. Further, we confirmed, using the real dataset, that the clustering summary revealed varied information in the clustering structure, such as the shape of the clusters, weight biases, the existence of the outliers, and even the confidence level of the cluster structures. We believe that this methodology gives a new view of the interpretability/explainability for model-based clustering.

We have studied how to interpret the overlapping mixture models after they were estimated. It remains for future study to apply merging criteria even in the phase of estimating mixture models.

**Author Contributions:** Conceptualization, S.K. and K.Y.; methodology, S.K. and K.Y.; software, S.K.; validation, S.K. and K.Y.; formal analysis, S.K.; investigation, S.K. and K.Y.; resources, S.K.; data curation, S.K.; writing—original draft preparation, S.K. and K.Y.; writing—review and editing, S.K. and K.Y.; visualization, S.K.; supervision, K.Y.; project administration, K.Y.; funding acquisition, K.Y. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was partially supported by JST KAKENHI JP19H01114 and JST-AIP JPMJCR19U4.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** All of the datasets used in this paper can be obtained in the manner described in the README file at https://github.com/ShunkiKyoya/summarize\_cluster\_overlap. Moreover, all of the experimental results can be reproduced by executing the .ipynb files.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A. Details of the Merging Algorithm**

We show the pseudo-code and computational complexity of the merging algorithm. First, the pseudo-code of merging mixture components is shown in Algorithm A1.

#### **Algorithm A1** Merging mixture components

**Require:** data $x^N$, finite mixture model $f$, criterion function Crit.


7: **end while**

8: **return** The current components.

Next, we discuss the computational complexity of this algorithm given $x^N$ and $f$. First, the cost of calculating $\{\gamma_k(x_n)\}_{k,n}$ can be written as $O(T_{\mathrm{dist}} N K)$, where $T_{\mathrm{dist}}$ is the cost of calculating $f(x)$ for a point. To merge components, $\{\mathrm{Crit}(i, j)\}_{i,j}$ and $\{\gamma_k(x_n)\}_{k,n}$ must be updated at most $(K - 1)$ times. The costs of updating $\{\mathrm{Crit}(i, j)\}_{i,j}$ and $\{\gamma_k(x_n)\}_{k,n}$ are $O(T_{\mathrm{crit}} K^2)$ and $O(N)$, respectively, where $T_{\mathrm{crit}}$ is the cost of calculating $\mathrm{Crit}(i, j)$ for a pair of components. Overall, we need $O(K(T_{\mathrm{dist}} + T_{\mathrm{crit}} K^2 + N))$ to complete the algorithm.

For the criteria referred to in this section, the computational complexity $T_{\mathrm{crit}}$ is $O(N)$ for Ent, NEnt1, DEMP2, MC, and NMC (NEnt2), and $O(NK)$ for DEMP.

#### **Appendix B. Details of the Datasets in the Real Data Experiment**

The datasets used in the real data experiment are summarized in Table A1. We show the detail and preprocessing of them below. All variables in the datasets are normalized after they are selected.


**Table A1.** Summary of the real dataset, where *N* denotes the number of points, *d* denotes the number of features, and *K* denotes the number of true clusters.

The AIS dataset [36] consists of the physical measurements of athletes who trained at the Australian Institute of Sport. Two cluster labels are male and female. As did Lee and McLachlan [16] and Malsiner-Walli et al. [20], we use three variables: BMI, LBM, and body fat percentage (BFat).

The Flea beetles dataset [37] consists of two physical measurements (width and angle) of flea beetles. Three cluster labels are the different species, named Concinna, Heikertingeri, and Heptapotamica.

The Crabs dataset [38] describes five morphological measurements (frontal lobe size, rear width, carapace length, carapace width, and body depth) of 200 crabs. Four cluster labels are formed by combining two color forms and two sexes (male and female).

The DLBCL dataset [39] contains fluorescent intensities of multiple conjugated antibodies (markers) on the cells derived from the lymph nodes of patients diagnosed with DLBCL (diffuse large B-cell lymphoma). As did Lee and McLachlan [40] and Malsiner-Walli et al. [20], we consider four labels corresponding to the cell populations.

The Ecoli dataset [41,42] contains cellular localization sites of proteins. We consider five variables named mcg, gvh, aac, alm1, and alm2. Binary attributes are omitted here. For the labels, we consider five localization sites named cp, im, imU, om, and pp. The other localization sites are omitted because few data points are assigned to them.

The Yeast dataset [41,42] also describes cellular localization sites of proteins. As did Franczac et al. [43] and Malsiner-Walli et al. [20], we select three variables and two cluster labels from the dataset. For the variables, we consider three attributes of proteins, named mcg, alm, and vac. For the labels, we consider two localization sites named CYT and ME3.

The Seeds dataset [44] consists of the seven geometric parameters of grains: area, perimeter, compactness, length of kernel, width of the kernel, asymmetry coefficient, and length of kernel groove. Three cluster labels are kernels belonging to different varieties of wheat: Kama, Rosa, and Canadian.

The Wisconsin breast cancer dataset [3] describes characteristics of the cell nuclei in the images of breast masses. Two cluster labels are benign and malignant. As did Fraley and Raftery [2] and Malsiner-Walli et al. [20], we select three variables: extreme area, extreme smoothness, and mean texture.

#### **References**


## *Article* **Information-Corrected Estimation: A Generalization Error Reducing Parameter Estimation Method**

**Matthew Dixon 1,† and Tyler Ward 2,\*,†**


**Abstract:** Modern computational models in supervised machine learning are often highly parameterized universal approximators. As such, the value of the parameters is unimportant, and only the out-of-sample performance is considered. On the other hand, much of the literature on model estimation assumes that the parameters themselves have intrinsic value, and thus is concerned with the bias and variance of parameter estimates, which may not have any simple relationship to out-of-sample model performance. Therefore, within supervised machine learning, heavy use is made of ridge regression (i.e., L2 regularization), which requires the estimation of hyperparameters and can be rendered ineffective by certain model parameterizations. We introduce an objective function, which we refer to as Information-Corrected Estimation (ICE), that reduces KL divergence-based generalization error for supervised machine learning. ICE attempts to directly maximize a corrected likelihood function as an estimator of the KL divergence. Such an approach is proven, theoretically, to be effective for a wide class of models, with only mild regularity restrictions. Under finite sample sizes, this corrected estimation procedure is shown experimentally to lead to a significant reduction in generalization error compared to maximum likelihood estimation and L2 regularization.

**Keywords:** generalization error; overfitting; information criteria; entropy

#### **1. Introduction**

Kullback and Leibler [1] showed that minimizing a divergence $\rho_{KL}(f, g_\theta)$ between the truth, $f$, and a parametric model density, $g_\theta$, is necessary and sufficient for making accurate predictions about data using the model defined by $\theta$. Recent work [2] on Berk–Nash equilibria has shown the central role that KL divergence plays in game theoretic choice models such as multi-armed bandits and stochastic multi-party games. KL divergence thus plays a leading role in machine learning and neuroscience, with several inferential approaches developed in the information theory literature. Such approaches for minimizing KL divergence employ a range of methods, including data partitioning, Bayesian indirect inference and M-estimation [3–5]. These approaches are quite distinct from the standard penalized loss minimization framework and, as such, are non-trivial to combine with supervised learning methods such as neural networks.

It is well known that maximum likelihood estimation (MLE) introduces an asymptotic bias in the KL divergence minimizer which is problematic for both model estimation and model selection. For many models, where the parameters $\theta$ are themselves important, this may be investigated as parameter bias and parameter variance. However, for models common in modern machine learning, the parameters themselves do not have any easily interpreted meaning. For these models, the parameters themselves are irrelevant and only the accuracy (in terms of KL divergence) of the model predictions matters. Within the information theory literature, this has often been referred to simply as bias (e.g., $b(G)$ from [6]). To distinguish it from parameter bias, one might refer to it as "prediction bias" or "generalization error". Generalization error is the more common terminology (see, for example, Equation 1.1.6 [7]) and will be used here.

**Citation:** Dixon, M.; Ward, T. Information-Corrected Estimation: A Generalization Error Reducing Parameter Estimation Method. *Entropy* **2021**, *23*, 1419. https://doi.org/10.3390/e23111419

Academic Editors: Lizhong Zheng and Chao Tian

Received: 2 October 2021 Accepted: 25 October 2021 Published: 28 October 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Before the widespread use of machine learning, most models had interpretable parameters, and thus there is a large literature focused on reducing parameter bias. For instance, the jackknife [8] (leave-one-out cross-validation) estimator is an early example. More relevant to this paper is the approach of Firth [9] and later Kosmidis [10,11]. More recently, Pagui, Salvan, and Sartori [12] proposed a parameter bias reducing estimation methodology. An extensive review of the literature around this point can be found in [13]. Unfortunately, these approaches do not consider the impact on KL divergence-based generalization error and thus are not applicable to the field of machine learning where the parameters themselves are devoid of meaning. Heskes [14] shows that classifiers do have a notion of bias-variance decomposition for generalization error, but it is not computable from parameter bias and parameter variance. Therefore, parameter bias reducing formulations are not useful within machine learning unless it can be shown that they also reduce generalization error.

In fact, to situate the approach taken in this paper to generalization error, we recall much earlier and seminal work at the intersection of statistics and information theory. Akaike [15], and later Takeuchi [16], proposed information criteria (AIC and TIC, respectively) for model selection designed explicitly to reduce generalization error. Konishi and Kitagawa [6] extended the approach of Takeuchi to cases where MLE was not used to fit the underlying model, but still restricted themselves to the question of model selection. Stone [17] proved that Akaike's Information Criterion (AIC) is asymptotically equivalent to jackknifing when the estimator is finite. Takeuchi himself showed that TIC is an extension of AIC with fewer restrictions, and thus it too is equivalent to jackknifing whenever AIC would be valid.

For highly parameterized models, as are common in machine learning, model selection such as this is of limited utility. The parameter count may necessarily be very large, and thus none of the models fit using MLE may be acceptable. Then, merely choosing among them is unlikely to produce acceptable results. Within this field, typically $L_2$ or similar regularization is used to reduce generalization error. See Section 11.5.2 [18] for a typical example. For a more recent innovation, refer to [19]. Note that regularization schemes such as this often increase parameter bias while decreasing generalization error. Golub, Heath, and Wahba [20] showed that $L_2$ regularization is asymptotically equivalent to cross-validation for linear models, subject to certain assumptions. For nonlinear models, it has long been known that $L_2$ regularization is not always valid, and it is trivial to construct example models (see Section 4.1 for one such example) where this approach is always harmful in expectation.

Therefore, it is important to develop a method to reduce generalization error in model estimation analogous to the way that $L_2$ regularization would commonly be used for a highly parameterized model, but having applicability for a wider family of models, especially those for which $L_2$ regularization is not applicable. It is not the goal of this paper to perform a wide survey of generalization error reducing approaches, but we will rather propose an additional approach, investigate its properties, and show that it has superior performance when compared against $L_2$ regularization, which is currently the dominant generalization error reducing estimation procedure within the field of machine learning.

To this end, this paper introduces a generalization error reducing estimation approach referred to as Information Corrected Estimation (ICE). This estimator is proven to have a generalization error of only $O(n^{-3/2})$ instead of $O(n^{-1})$ as is the case for MLE, and is shown to be valid within a neighborhood around the MLE parameter estimate. Optimizing over this ICE objective function instead of the negative log likelihood thus produces parameters with superior out-of-sample performance.

Takeuchi's TIC and Firth's approach have never seen widespread use due to the computational and numerical issues that arise from the computation of this adjustment [21], and the ICE estimator in its raw form would have similar problems. Therefore, this paper also proposes an efficient approximation of this correction term, and shows through numerical experiments that the approximation is effective at improving model performance across a range of models.

#### **2. Preliminaries**

Let us assume that we have data $x_n := \{x_1, \dots, x_n\}$ generated from an unknown joint density function $f(x)$ of $X_n := \{X_1, \dots, X_n\}$. Where necessary, we define $X'_n$ to denote a second sample drawn from $f(x)$, independent of $X_n$, and $x'_n$ is the observed realization of $X'_n$. We consider a model $\mathcal{M}_p$ given by a parametric family of densities $\mathcal{M}_p := \{g(\cdot|\theta) \mid \theta \in \Theta \subseteq \mathbb{R}^p\}$, for some compact Euclidean parameter space $\Theta$, which is misspecified and hence excludes the truth $f$. Henceforth, the distribution over $x$ identified by $\theta$ may be referred to as $g_\theta(x) := g(x|\theta)$ where it is notationally convenient to do so.

Suppose that $\theta_0$ is the quasi-true parameter of model $\mathcal{M}_p$, and $\hat{\theta}(X_n)$ is the random variable representing the MLE of $\theta_0$ fit on a dataset, $x_n$. The negative log-likelihood of $X_n$ under the distribution $g_\theta$ is

$$-\ell(\theta, X_n) := -\frac{1}{n} \sum_{i=1}^{n} \log g_\theta(x_i),\tag{1}$$

where $-\ell(\theta, X_n)$ is written including a factor of $\frac{1}{n}$ to make the expectation of this quantity $O(1)$ and asymptotically independent of $n$. Similarly, the minus sign is incorporated because $-\ell(\theta, X_n)$ is a strictly non-negative quantity if $g_\theta(x_i)$ is a probability. The MLE, $\hat{\theta}(X_n)$, minimizes the negative log likelihood of the data set with respect to the model:

$$\hat{\theta}(x_n) := \operatorname*{argmin}_{\theta} [-\ell(\theta, x_n)].\tag{2}$$

The expectation of $-\ell(\theta, X_n)$ is the cross entropy between $f$ and $g_\theta$:

$$-\mathcal{L}(\theta) := \mathbb{E}_{X_n}[-\ell(\theta, X_n)].\tag{3}$$

Here, the expectation is a function only of $\theta$ and of the distribution $f$ that generated the data $X_n$. As a function of the distribution $f$, this value is $O(1)$, but could be large for poorly conditioned $f$. The quasi-true parameter $\theta_0$ is

$$\theta_0 := \operatorname*{argmin}_{\theta} [-\mathcal{L}(\theta)].\tag{4}$$

#### *Generalization Error in KL Divergence Based Loss Functions*

Kullback and Leibler [1] viewed "information" as discriminating the sample data drawn from one distribution against another, and defined the KL-divergence $\rho_{KL}$ between distributions in terms of the ability to make predictions about one by knowing the other. Here,

$$\rho_{KL}(f, g_\theta) = \int \log\left[\frac{f(x)}{g_\theta(x)}\right] f(x)\, dx.\tag{5}$$

This value is in general unknowable, but given a sample $X_n$ from $f$, $-\ell(\theta, X_n)$ will converge asymptotically to $\rho_{KL}(f, g_\theta)$ plus an additive constant that depends only on $f$. The convergence relies on White's regularity conditions [22].
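The additive constant can be made explicit with a short worked equation. The following is a sketch that assumes the observations are identically distributed with marginal density $f$, so that the expectation in Equation (3) reduces to a single integral; the symbol $H(f)$ for the differential entropy of $f$ is introduced here for illustration and does not appear in the original text.

$$-\mathcal{L}(\theta) = -\int f(x) \log g_\theta(x)\, dx = \underbrace{\int \log\left[\frac{f(x)}{g_\theta(x)}\right] f(x)\, dx}_{\rho_{KL}(f,\, g_\theta)} \; \underbrace{-\int f(x) \log f(x)\, dx}_{H(f)}.$$

Because $H(f)$ does not depend on $\theta$, minimizing the cross entropy $-\mathcal{L}(\theta)$ over $\theta$ is the same as minimizing $\rho_{KL}(f, g_\theta)$, which is why $\theta_0$ in Equation (4) can be read as the KL-divergence minimizer within the model family.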

A well known result by Stone [17] shows that the MLE is a biased estimator of the minimum KL-divergence:

$$\mathbb{E}_{X_n}[-\ell(\hat{\theta}(X_n), X_n)] < \mathbb{E}_{X_n}[-\ell(\theta_0, X_n)],\tag{6}$$

because it is evaluated on the data $X_n$ which was used to fit $\hat{\theta}$. Cross-validation was developed as a model selection technique to select a model from a group that actually minimizes $\mathbb{E}_{X_n}[\rho_{KL}(g_{\theta_0}, g_{\hat{\theta}(X_n)})]$ and not merely $\mathbb{E}_{X_n}[-\ell(\hat{\theta}(X_n), X_n)]$ in the limit of large $n$. Takeuchi [16] and Akaike [15] explicitly modeled this bias (generalization error) of an estimation procedure $\theta(X_n)$ as

$$b := \mathbb{E}_{X_n} \left[ \ell(\theta(X_n), X_n) - \mathbb{E}_{X'_n} \left[ \ell(\theta(X_n), X'_n) \right] \right]. \tag{7}$$

Our goal is to obtain an estimate, $b^*$, of the generalization error $b$ without using the MLE. We will then add this term to the objective function to develop the estimator $\theta^*(X_n)$ so as to cancel the lower order terms of the generalization error. This estimator will then minimize $\mathbb{E}_{X_n}[\rho_{KL}(g_{\theta_0}, g_{\theta^*(X_n)})]$ more effectively than MLE, and potentially would in turn produce improved predictions from the model fitted over finite training sets.

**Remark 1.** *We note that under MLE, $b = O(\frac{1}{n})$ [16]. Equivalently, one could say that a particular realization of the generalization error $\ell(\theta(X_n), X_n) - \mathbb{E}_{X'_n}[\ell(\theta(X_n), X'_n)]$ is itself $O_p(\frac{1}{n})$. Here, $O_p(\frac{1}{n})$ is used to indicate that the quantity is a random variable with finite variance, whose mean is $O(\frac{1}{n})$.*
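To build intuition for Equation (7) and Remark 1, the bias $b$ can be approximated by simulation in a case where everything is available in closed form. The sketch below is illustrative only and is not the authors' code: it assumes a correctly specified Gaussian model with $p = 2$ parameters, and the function name `optimism_gaussian_mle` and its arguments are hypothetical. The inner expectation over $X'_n$ is evaluated analytically rather than by drawing a second sample.

```python
import math
import random


def optimism_gaussian_mle(n, reps, mu0=0.2, sigma0=0.2, seed=0):
    """Monte Carlo estimate of b in Equation (7) for a Gaussian model fit by MLE.

    Assumption for illustration: the data really are N(mu0, sigma0^2), so the
    expected out-of-sample log-likelihood has a closed form.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        y = [rng.gauss(mu0, sigma0) for _ in range(n)]
        mu_hat = sum(y) / n
        var_hat = sum((v - mu_hat) ** 2 for v in y) / n  # MLE variance (divides by n)

        # Average in-sample log-likelihood, evaluated at the MLE.
        in_sample = -0.5 * math.log(2 * math.pi * var_hat) - 0.5

        # Expected log-likelihood of a fresh observation from the truth
        # under the fitted parameters (closed form for Gaussians).
        out_sample = (-0.5 * math.log(2 * math.pi * var_hat)
                      - (sigma0 ** 2 + (mu0 - mu_hat) ** 2) / (2 * var_hat))

        total += in_sample - out_sample
    return total / reps


if __name__ == "__main__":
    for n in (16, 64, 256):
        # The product n * b should stay roughly constant if b = O(1/n).
        print(n, n * optimism_gaussian_mle(n, reps=5000))
```

In this well-specified setting the estimate is positive and shrinks roughly in proportion to $1/n$, which is the behavior that the correction term in the next section is designed to cancel.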

#### **3. Information Corrected Estimation (ICE)**

We propose the following penalized likelihood function:

**Definition 1** (ICE Objective)**.**

$$-\ell^*(\theta) = -\ell(\theta) + \frac{1}{n} \mathrm{tr}(I_\theta J_\theta^{-1}),\tag{8}$$

*where $J_\theta$ is the negative expected Hessian*

$$J_\theta := -\mathbb{E}_{X}[\partial_\theta^{2} \log g(X|\theta)] = -\int f(x)\, \partial_\theta^{2} \log g(x|\theta)\, dx,\tag{9}$$

*and $I_\theta$ is the Fisher Information matrix*

$$I_\theta := \mathbb{E}_{X}[\partial_\theta \log g(X|\theta)\, \partial_\theta \log g(X|\theta)^{T}], \tag{10}$$

*with $\hat{I}_\theta$, $\hat{J}_\theta$ being their estimates over the data. Let $\theta^*$ denote the minimizer of* (8)*.*

The trace term in Equation (8) will be familiar from Takeuchi [16]. However, Takeuchi showed only that this was the leading order of the bias for the MLE estimate $\hat{\theta}$, and therefore the proof found there is not sufficient to justify a new estimator that will itself be the target of optimization, and is required to be valid away from $\hat{\theta}$. As in Takeuchi, because $I$ and $J$ are unknowable, we will substitute their approximations computed from the training data, $\hat{I}_\theta$ and $\hat{J}_\theta$, during the actual computation of this objective. The numerical impact of this approximation will be examined in Section 4.2.1.

**Remark 2.** *Though AIC was developed before TIC, it is easily reproduced as a special case of TIC. Subject to certain conditions (guaranteed by the requirements of [15]), at least in expectation, $I_{\hat{\theta}} = J_{\hat{\theta}}$. Thus, the quantity within the TIC trace term, $I_{\hat{\theta}} J_{\hat{\theta}}^{-1}$, is the identity matrix. Therefore, its trace is equal to $p$, the parameter count of the model, recovering AIC. TIC itself can be derived using a proof that is similar to, though somewhat simpler than, the one we include in (A2), of which Takeuchi's proof is a special case that is valid only at the MLE estimate $\hat{\theta}$.*
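As a concrete illustration of Definition 1 and Remark 2, the following sketch evaluates the ICE objective for a univariate Gaussian model with $\theta = (\mu, \sigma)$, using the sample averages $\hat{I}_\theta$ and $\hat{J}_\theta$ in place of $I_\theta$ and $J_\theta$. It is a minimal example under stated assumptions, not the authors' implementation: the function names `ice_terms` and `ice_objective` are hypothetical, and the score and Hessian are hand-derived for this one model.

```python
import math
import random


def ice_terms(y, mu, sigma):
    """Return (negative log-likelihood, correction term) from Equation (8)
    for the Gaussian model g(y | mu, sigma), using empirical I-hat and J-hat."""
    n = len(y)
    nll = 0.0
    i11 = i12 = i22 = 0.0  # I-hat: mean outer product of the score
    j11 = j12 = j22 = 0.0  # J-hat: negative mean Hessian of log g
    for v in y:
        r = v - mu
        nll += 0.5 * math.log(2 * math.pi * sigma ** 2) + r ** 2 / (2 * sigma ** 2)
        s_mu = r / sigma ** 2                      # d/d mu    of log g
        s_sigma = r ** 2 / sigma ** 3 - 1 / sigma  # d/d sigma of log g
        i11 += s_mu * s_mu
        i12 += s_mu * s_sigma
        i22 += s_sigma * s_sigma
        j11 += 1 / sigma ** 2
        j12 += 2 * r / sigma ** 3
        j22 += 3 * r ** 2 / sigma ** 4 - 1 / sigma ** 2
    nll, i11, i12, i22 = nll / n, i11 / n, i12 / n, i22 / n
    j11, j12, j22 = j11 / n, j12 / n, j22 / n

    # tr(I J^{-1}) for symmetric 2x2 matrices, using the explicit inverse of J.
    # This assumes J-hat is non-singular, which holds near the MLE.
    det = j11 * j22 - j12 ** 2
    trace = (i11 * j22 - 2 * i12 * j12 + i22 * j11) / det
    return nll, trace / n


def ice_objective(y, mu, sigma):
    """ICE objective: negative log-likelihood plus (1/n) tr(I-hat J-hat^{-1})."""
    nll, penalty = ice_terms(y, mu, sigma)
    return nll + penalty


if __name__ == "__main__":
    rng = random.Random(1)
    y = [rng.gauss(0.2, 0.2) for _ in range(5000)]
    mu_hat = sum(y) / len(y)
    sigma_hat = math.sqrt(sum((v - mu_hat) ** 2 for v in y) / len(y))
    nll, penalty = ice_terms(y, mu_hat, sigma_hat)
    # For well-specified Gaussian data, n * penalty should be close to p = 2.
    print(nll, len(y) * penalty)
```

When the data are in fact Gaussian, the trace evaluated at the MLE is close to the parameter count $p = 2$, which is the AIC special case described in Remark 2. Away from the MLE, or under misspecification, the trace generally differs from $p$.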

We also define $\hat{J}^*$ to be the negative Hessian of $-\ell^*(\theta)$ rather than $-\ell(\theta)$, and similarly for $\hat{I}^*$, with expectations written as $J^*$ and $I^*$. Analogously, $-\mathcal{L}^*(\theta)$ is the expectation of $-\ell^*(\theta)$ and $\theta^*$ is the minimizer of $-\ell^*(\theta)$, while $\theta_0^*$ is the minimizer of $-\mathcal{L}^*(\theta)$.

We refer to the estimation of $\theta^*$ by minimization of this corrected likelihood function as Information-Corrected Estimation (ICE). As the terminology suggests, we depart from the corrective approach used in Information Criterion by directly minimizing the bias corrected likelihood function. Note that unlike $L_2$ regularization, the correction term is parameter-free and thus would not require cross-validation to estimate a hyperparameter such as the $\lambda$ used by $L_2$.

General properties of this estimator are proved, and a set of regularity conditions are provided such that the estimator is asymptotically normal, and produces a bias that is $O_p(n^{-3/2})$ instead of the usual $O_p(n^{-1})$. Though this adds only a half-order to the bias correction, for most problems with reasonably large $n$, any increase in order is likely to greatly reduce bias. Experimental results demonstrate superior properties of ICE for linear models compared to MLE with and without $L_2$ regularization.

**Remark 3.** *For models satisfying White's regularity conditions (see [22]), it is known that $J_{\theta_0}$ is positive definite (thus non-singular) and continuous, and also that $I_{\theta_0}$ is continuous with respect to $\theta$. Therefore, $\frac{1}{n} \mathrm{tr}(I_{\theta_0} J_{\theta_0}^{-1})$ would always be well defined in an open region around $\theta_0$. Similarly, the solution $\theta^*$ would be expected to have the same properties, and hence (for large enough $n$) the estimate $\frac{1}{n} \mathrm{tr}(\hat{I}_{\theta^*} \hat{J}_{\theta^*}^{-1})$ would be well defined when computed using the estimates $\hat{I}_{\theta^*}$ and $\hat{J}_{\theta^*}$.*

**Remark 4.** *N.B.: Though $-\ell^*(\theta)$ is an estimator of $-\mathcal{L}(\theta)$ accurate to within $O(n^{-3/2})$, that does not mean that $-\mathcal{L}(\theta^*)$ is reduced by any particular amount relative to $-\mathcal{L}(\hat{\theta})$. We expect that using this corrected objective will always (if it can be calculated accurately) generate some improvement by virtue of more accurately representing the true performance of the model out of sample, but there is no proof that this level of improvement has any particular form or asymptotic behavior.*

Our approach preserves the linear complexity of training with respect to $n$. However, the computation of $\hat{J}_{\theta^*}^{-1}$ at each iteration of the numerical solver requires the inversion of a symmetric positive definite matrix with a complexity of $O(p^3)$. Hence the approach is not suitable for high dimensional datasets without adjustment. See Section 5 for optimized approximations that are viable for larger parameter counts. Further exploration of large models based on this approach is beyond the scope of the present work.
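The cubic cost can be seen directly in a small sketch of the trace computation. The code below is one possible way to evaluate $\mathrm{tr}(\hat{I} \hat{J}^{-1})$, assuming $\hat{J}$ is symmetric positive definite; it uses a Cholesky factorization followed by triangular solves instead of forming the inverse explicitly. This is a swapped-in illustrative technique, not necessarily what the authors' R code does, and the function names are hypothetical.

```python
import math


def cholesky(a):
    """Lower-triangular factor L with a = L L^T; a must be symmetric positive definite.
    The triple loop (i, j, inner sum over k) is the O(p^3) step."""
    p = len(a)
    low = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                d = a[i][i] - s
                if d <= 0.0:
                    raise ValueError("matrix is not positive definite")
                low[i][j] = math.sqrt(d)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def solve_with_factor(low, b):
    """Solve (L L^T) x = b by forward then backward substitution, O(p^2) per call."""
    p = len(low)
    y = [0.0] * p
    for i in range(p):
        y[i] = (b[i] - sum(low[i][k] * y[k] for k in range(i))) / low[i][i]
    x = [0.0] * p
    for i in reversed(range(p)):
        x[i] = (y[i] - sum(low[k][i] * x[k] for k in range(i + 1, p))) / low[i][i]
    return x


def trace_i_j_inverse(i_mat, j_mat):
    """tr(I J^{-1}) = tr(J^{-1} I): sum the diagonal of the solution of J X = I."""
    p = len(j_mat)
    low = cholesky(j_mat)
    total = 0.0
    for c in range(p):
        column = [i_mat[r][c] for r in range(p)]
        total += solve_with_factor(low, column)[c]
    return total


if __name__ == "__main__":
    j_hat = [[4.0, 2.0], [2.0, 3.0]]
    i_hat = [[1.0, 0.0], [0.0, 1.0]]
    print(trace_i_j_inverse(i_hat, j_hat))
```

Both the factorization and the $p$ triangular solves scale as $O(p^3)$ overall, so avoiding the explicit inverse improves numerical behavior but does not change the order of the cost noted above.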

**Remark 5.** *It is clear from inspection that if $-\ell(\theta)$ is strictly convex, then so too is $-\ell^*(\theta)$ for large enough $n$.*

We first provide a proof of asymptotic convergence of $\theta^*$ under certain regularity conditions. With this convergence result in place, we then show that minimizing (8) leads to an $O(n^{-3/2})$ bias term, an improvement over the $O(\frac{1}{n})$ term produced by MLE.

*Local Behavior of the ICE Objective*

Suppose the following conditions hold:


Then for sufficiently large $n$ there exists a compact subset $U \subset \Theta$ containing $\theta_0$, $\hat{\theta}$, such that:


Items (1–3) follow from Lemma A1 (see Appendix A.2). These are additional regularity conditions that are prerequisites for later theorems.

Item (4) follows from Theorem A1 in Appendix A.3. This states that the estimate $\theta^*$ is asymptotically normal in a way that is analogous to classical asymptotic normality results for MLE. It is only true almost surely because results (1–3) upon which it relies are only true almost surely.

Item (5) follows from Theorem A2 in Appendix A.4. This item establishes the superior accuracy of the ICE objective compared to the MLE objective function in predicting out of sample errors. Like item (4) this is only true almost surely because intermediate results on which it relies are only true almost surely.

The reduction in generalization error seen arises from the optimization over the superior ICE objective function, analogous to the way that $L_2$ regularization is used for this purpose.

**Remark 6.** *The regularity conditions described here are only slightly more strict than the conditions described by White [22]. In particular, models having three continuous derivatives as required by White, but not five as needed here, are thought to be very rare. Requirement (2) is just the definition of $\theta_0$, which White labels differently, and requirement (3) excludes a pathological corner case, the further study of which is beyond the scope of this paper.*

**Remark 7.** *Note that as $-\ell(\theta, x_n)$ is convex in the neighborhood of $\theta_0$, so too is $-\ell^*(\theta)$ for large enough $n$ because $-\ell^*(\theta) \to -\ell(\theta)$. Thus it can be concluded that the local behavior of $-\ell^*$ in the neighborhood of $\theta_0$ is not appreciably worse than the behavior of $-\ell$ if the problem is not too ill conditioned.*

#### **4. Direct Computation Results**

The following experiments have been designed to compare MLE, MLE with $L_2$ regularization, and ICE for regression. Each experiment involves simulation of training and test sets and is implemented in R. See the attached code to run each experiment.

Each of these experiments has been performed using the raw formula for ICE provided in Equation (8) with minimal adjustments. All gradients are computed using R's default finite difference approach. This means that for a model with $p$ parameters, the objective function is dominated by the inversion of $J$, which costs $O(p^3)$ time and $O(p^2)$ space. The use of finite difference gradients further increases the time complexity to $O(p^4)$, compounding the problem. This approach is therefore viable for small models with few parameters, but not realistic for larger models. Optimizations to overcome this limitation will be considered in Section 5. The use of finite difference derivatives was not found to produce appreciable numerical differences in the final output, so analytic derivatives were not used for this analysis.
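The jump from $O(p^3)$ to $O(p^4)$ follows from how many times a finite difference gradient calls the objective. The sketch below uses a generic forward difference scheme as an illustration; it is written in Python rather than R, the function name `forward_difference_gradient` and the step size `h` are assumptions, and it is not meant to reproduce the exact scheme used by R's optimizers.

```python
def forward_difference_gradient(f, theta, h=1e-6):
    """Approximate the gradient of f at theta with forward differences.

    Uses one baseline evaluation plus one evaluation per parameter,
    i.e. p + 1 calls to f for a p-dimensional theta.
    """
    base = f(theta)
    grad = []
    for k in range(len(theta)):
        shifted = list(theta)
        shifted[k] += h  # perturb only the k-th coordinate
        grad.append((f(shifted) - base) / h)
    return grad


if __name__ == "__main__":
    # Toy objective standing in for the ICE objective (hypothetical example).
    def toy_objective(theta):
        return theta[0] ** 2 + 3.0 * theta[1]

    print(forward_difference_gradient(toy_objective, [1.0, 2.0]))
```

Each gradient therefore costs on the order of $p$ objective evaluations. When a single evaluation already costs $O(p^3)$ because of the matrix inversion, one gradient costs $O(p^4)$, which is the scaling described above.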

The code and results for this section are provided in [23]. Throughout this section, the following estimators will be compared.


#### *4.1. Gaussian Error Model*

We begin by considering the simplest case of univariate linear regression with Gaussian residuals. The advantage of this simple model is that the exact form of the correction term can be derived analytically, which aids in building intuition about its behavior. For such a toy model, $y \sim N(\mu, \sigma^2)$ and, for simplicity, the following example will consider $\mu$ to be a constant, but it is equally applicable if $\mu = \mu(x)$. Consider the parameters of the model to be $\theta := (\mu, \sigma)$, with their optimal values being $\theta_0 := (\mu_0, \sigma_0)$. The probability density function is

$$g(y, \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(y - \mu)^2}{2\sigma^2}}.\tag{11}$$
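For reference, the ingredients of the correction term for this model can be written out explicitly. The following is a worked sketch derived here from Equations (9)–(11), not taken from the original text; it assumes the data are truly Gaussian with parameters $\theta_0$, so that $\mathbb{E}[y - \mu] = \mu_0 - \mu$ and $\mathbb{E}[(y - \mu)^2] = \sigma_0^2 + (\mu_0 - \mu)^2$. The score components are

$$\partial_\mu \log g = \frac{y - \mu}{\sigma^2}, \qquad \partial_\sigma \log g = \frac{(y - \mu)^2}{\sigma^3} - \frac{1}{\sigma},$$

and the negative expected Hessian is

$$J_\theta = \begin{pmatrix} \dfrac{1}{\sigma^2} & \dfrac{2(\mu_0 - \mu)}{\sigma^3} \\[2ex] \dfrac{2(\mu_0 - \mu)}{\sigma^3} & \dfrac{3\left(\sigma_0^2 + (\mu_0 - \mu)^2\right)}{\sigma^4} - \dfrac{1}{\sigma^2} \end{pmatrix}.$$

At $\theta = \theta_0$ this reduces to $J_{\theta_0} = I_{\theta_0} = \mathrm{diag}\left(\frac{1}{\sigma_0^2}, \frac{2}{\sigma_0^2}\right)$, so that $\frac{1}{n}\mathrm{tr}(I_{\theta_0} J_{\theta_0}^{-1}) = \frac{2}{n}$, matching the parameter count $p = 2$. Away from $\theta_0$ the two matrices differ and the trace is no longer constant, which is what allows the correction to influence the fitted parameters.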

It is known a priori that $L_2$ regularization cannot improve this model, as if $\mu_0 \neq 0$, any decrease in the magnitude of $\mu$ is likely to be systematically harmful. Similarly, a decrease in $\sigma$ below $\hat{\sigma}$ results in a decrease in model distribution entropy, and hence would generally make overfitting worse and would generate a correspondingly higher KL-divergence than the MLE estimate. Consequently, we would expect any $\lambda$ computed through cross-validation to be statistically indistinguishable from zero, and $L_2$ regularization to be generally harmful whenever $\lambda \neq 0$.

#### Generalization Error Analysis

The Gaussian model described was generated with *μ*<sub>0</sub> = 0.2, *σ*<sub>0</sub> = 0.2, and *dy* = 0.001. For each of *n* ∈ {16, 32, 64, 128, 256, 512, 1024}, 500 independent simulations of the data *y*<sub>1</sub>, ..., *y<sub>n</sub>* were performed, and then the parameters were fit from that data. In each simulation, *θ* was computed using MLE, MLE with *L*<sup>2</sup> regularization, and ICE. The *λ* parameter for *L*<sup>2</sup> regularization was computed using 2-way cross-validation on the available data, and as expected, none of the computed values of *λ* were statistically different from zero.

For each estimate of *θ*, the KL-divergence *ρ<sub>KL</sub>*(*f*, *g<sub>θ</sub>*) was computed (using the known value of *θ*<sub>0</sub>), and the results were compared. The ICE parameter estimation method showed statistically significant improvement over MLE at the 5-sigma level out to *n* = 64, and was improved by just under 1-sigma at *n* = 1024.

The KL-divergence results graphed against *n* on a log-log scale are shown in Figure 1. Every value of *n* is normalized by the average KL divergence of the MLE methodology to improve legibility. The *L*<sup>2</sup> series is statistically indistinguishable from the MLE series at 2 standard deviations beyond *n* = 32, and the two are not materially different for any *n*. The ICE series is at least 4.5 standard deviations below the MLE series until *n* = 1024.

**Figure 1.** A comparison of the KL-divergence (y-axis) of various estimation methods against the number of training samples *n*. Each KL divergence value was divided by the average KL divergence of the MLE estimate for that value of *n*. The ICE and *L*<sup>2</sup> series are shown with 2 standard deviation error bars.

**Remark 8.** *In addition to the series shown in Figure 1, a series was computed using the true value of J, estimated from a much larger sample n = 1024 from the underlying distribution, and this series was indistinguishable from the series computed using Ĵ for every n, and thus it was not graphed. This validates Takeuchi's approach of approximating J with Ĵ in this instance.*

As expected, the difference in *μ* between ICE and MLE is not statistically significant (at three standard deviations) for any *n*, but the ICE computed value of *σ* (shown in Figure 2) is considerably larger than the MLE estimate, especially for small values of *n*. This explains the greatly reduced KL-divergence noted in Figure 1.

**Figure 2.** The error in the estimated *σ̂<sub>ICE</sub>* and the Z-score of the estimate against the number of training samples *n*.

Note that the difference in estimated *σ* is always statistically significant when compared to the MLE value. This is because both MLE and ICE are fit on the same data, so ICE would always have a larger *σ* than MLE regardless of the actual data chosen from the distribution *f*. This is the cause of the large z-scores shown, always exceeding 200. We know from elementary statistics that correlation between the mean and standard deviation causes the MLE estimate of *σ̂* to be systematically low by a factor of (*n* − 1)/*n*. Indeed, the ICE estimate of *σ*<sup>∗</sup> is closely tracking *σ*<sub>0</sub>, whereas *σ̂* is closely tracking *σ*<sub>0</sub>(*n* − 1)/*n*, as expected. This is one example where reducing generalization error also reduces parameter bias as a side effect.
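The small-sample deflation of the Gaussian MLE scale estimate can be reproduced with a short simulation. The sketch below is illustrative only and is not the code of [23]; the function name, seed, and replication count are assumptions. It checks the textbook identity for the variance estimate, E[*σ̂*<sup>2</sup>] = *σ*<sub>0</sub><sup>2</sup>(*n* − 1)/*n*, using the values *μ*<sub>0</sub> = *σ*<sub>0</sub> = 0.2 from this section.

```python
import random


def mean_mle_variance_ratio(n, mu0=0.2, sigma0=0.2, replications=20000, seed=0):
    """Average of (MLE variance estimate / true variance) over samples of size n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(replications):
        sample = [rng.gauss(mu0, sigma0) for _ in range(n)]
        mean = sum(sample) / n
        # The Gaussian MLE of the variance divides by n, not n - 1.
        variance_mle = sum((y - mean) ** 2 for y in sample) / n
        total += variance_mle / sigma0 ** 2
    return total / replications


ratio = mean_mle_variance_ratio(16)
expected = (16 - 1) / 16  # textbook factor for the variance estimate
```

For *n* = 16 the simulated ratio should sit close to 15/16, up to Monte Carlo noise.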

#### *4.2. Friedman's Test Case*

We now extend the example from Section 4.1 to the case where *μ* is no longer a constant. For this example, we chose a standard regression test set, which is nonlinear in the features, based on Section 4.3 of [24]:

$$y\_i = \mu\_\theta(x\_i) + \varepsilon\_i, \ \varepsilon \sim N(0, \sigma^2), \tag{12}$$

where the Friedman model is

$$
\mu\_{\boldsymbol{\theta}}(\mathbf{x}\_{i}) = \theta\_{0}\sin(\pi\mathbf{x}\_{(i,0)}\mathbf{x}\_{(i,1)}) + \theta\_{1}(\mathbf{x}\_{(i,2)} - \theta\_{2})^{2} + \theta\_{3}\mathbf{x}\_{(i,3)} + \theta\_{4}\mathbf{x}\_{(i,4)}.\tag{13}
$$

The random features, *X<sub>j</sub>*, are i.i.d. uniform random and the parameter values are fixed. The true parameter set, *θ*<sub>0</sub> = (10.0, 20.0, 0.5, 10.0, 5.0, 1.0), reserves the last parameter (1.0) for the value of *σ*.
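The data-generating process of Equations (12) and (13) can be sketched as follows. This is a minimal illustration, not the code of [23]: the function names and the seed are hypothetical, and the features are assumed to be uniform on [0, 1], since the text only states that they are uniform.

```python
import math
import random

THETA_0 = (10.0, 20.0, 0.5, 10.0, 5.0, 1.0)  # last entry is sigma


def friedman_mean(x, theta):
    """Mean function of Equation (13) for one feature vector x of length 5."""
    return (theta[0] * math.sin(math.pi * x[0] * x[1])
            + theta[1] * (x[2] - theta[2]) ** 2
            + theta[3] * x[3]
            + theta[4] * x[4])


def simulate_friedman(n, theta=THETA_0, seed=0):
    """Draw n (x, y) pairs following Equations (12) and (13).

    Assumption: features are uniform on [0, 1].
    """
    rng = random.Random(seed)
    sigma = theta[5]
    data = []
    for _ in range(n):
        x = [rng.random() for _ in range(5)]
        y = friedman_mean(x, theta) + rng.gauss(0.0, sigma)
        data.append((x, y))
    return data


training_set = simulate_friedman(16)
```

Keeping *σ* as the last entry of the parameter tuple mirrors the convention of *θ*<sub>0</sub> above, so that the noise level is fitted together with the mean parameters.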

Note that here *σ* must be treated as an unknown parameter. To do otherwise implies that the modeler knows the amount of noise expected in the data. In the case of a known noise term, overfitting is impossible since overfitting arises when a model reduces the projected noise below its actual value, which can never arise when the noise level is known.

The model probability density *g*(*x*, *y*|*θ*) of *y* is given by

$$g(\mathbf{x}\_i, \mathbf{y}\_i|\boldsymbol{\theta}) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(\boldsymbol{\mu}\_i - \mathbf{y}\_i)^2}{2\sigma^2}}.\tag{14}$$

Recall that in Section 4.1, the value of *μ* was considered to be a constant. This example is a natural extension of Section 4.1, and was chosen due to the well-explored difficulty of Friedman's problem.

We simulate 500 batches of equally sized training sets of length *n* ∈ {16, 32, 64, 128, 256, 512, 1024}. The test set is always of length 1024 to ensure accuracy for the smaller values of *n*. The starting point of the optimization is generated by adding a random perturbation, *δ<sub>θ</sub>* ∼ *N*(0, 0.1), to each parameter. As before, the KL-divergence is computed between the distribution represented by the parameters and the true distribution, and these values are compared between estimation methods.

For each test sample, the KL divergence is computed using numerical integration with a *dy* increment of 0.01 over the interval containing *μ* ± 10*σ* for both the true and model distributions. The computed probabilities are verified to numerically sum to unity within an error of ±10<sup>−3</sup>.
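The numerical integration described above can be sketched for a single test point as follows. This is an illustrative Riemann sum under the stated grid settings, not the code of [23]; the function names and the example parameter values are assumptions.

```python
import math


def gaussian_pdf(y, mu, sigma):
    """Gaussian density with mean mu and standard deviation sigma."""
    return math.exp(-((y - mu) ** 2) / (2.0 * sigma ** 2)) / math.sqrt(2.0 * math.pi * sigma ** 2)


def numerical_kl(mu_true, sigma_true, mu_model, sigma_model, dy=0.01):
    """Riemann-sum estimate of KL(true || model) on a grid covering mu +/- 10 sigma of both."""
    lower = min(mu_true - 10 * sigma_true, mu_model - 10 * sigma_model)
    upper = max(mu_true + 10 * sigma_true, mu_model + 10 * sigma_model)
    steps = int(round((upper - lower) / dy))
    kl = 0.0
    mass_true = 0.0
    mass_model = 0.0
    for k in range(steps + 1):
        y = lower + k * dy
        f = gaussian_pdf(y, mu_true, sigma_true)
        g = gaussian_pdf(y, mu_model, sigma_model)
        mass_true += f * dy    # should sum to roughly one
        mass_model += g * dy   # should sum to roughly one
        if f > 0.0 and g > 0.0:
            kl += f * math.log(f / g) * dy
    return kl, mass_true, mass_model


# Hypothetical test point: true mean 14.0 with sigma 1.0, fitted mean 14.3 with sigma 0.9.
kl, mass_true, mass_model = numerical_kl(14.0, 1.0, 14.3, 0.9)
```

The two returned masses correspond to the sum-to-unity check mentioned in the text, and the result can be compared against the closed-form Gaussian KL-divergence as a sanity check.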

In each simulation, *θ* is computed using MLE, MLE with *L*<sup>2</sup> regularization, and ICE. The *λ* parameter for *L*<sup>2</sup> regularization is computed using 4-way cross validation on each batch of the training data.

As shown in Figure 3 and Table 1, *L*<sup>2</sup> is not effective for any value of *n*, and is completely inactivated for *n* > 32. Where regularization is used (i.e., *λ* ≠ 0), it generally underperforms MLE. ICE is effective across the entire data range, outperforming MLE for every *n*, and always by a statistically significant margin of at least 5 sigma.

**Figure 3.** Comparison of the KL-divergence, averaged across 500 replications, of estimation methods against the number of training samples *n*. Each KL divergence value was divided by the average KL divergence of the MLE estimate for that value of *n*. The ICE and *L*<sup>2</sup> series are shown with 2 standard deviation error bars.

**Table 1.** Comparison of the average KL divergence across 500 replications for several model estimators given a fitting set size of *n*. For estimators other than *θ̂*, the values in parentheses denote the t-statistic of the difference between this estimator and *θ̂*, with negative values indicating that the listed estimator has a lower KL divergence.


#### 4.2.1. Impact of *Ĵ* Approximation

It was noted previously that Takeuchi used *Ĵ* (and likewise, *Î*) in place of the true value of *J*, and we do so here as well. Though there is no realistic way to avoid this approximation in the real world, and the optimized approach discussed in Section 5 has an entirely different set of approximations, the impact of this approximation will be briefly characterized here.

In Table 2, we revisit Table 1, but now drop the *L*<sup>2</sup> regularization column, and add a new column where the ICE objective is allowed to use a much better approximated value of *J*, in this case approximated from 1024 independently drawn samples regardless of *n*.

**Table 2.** Comparison of the average KL divergence across 500 replications for ICE estimators with and without approximation of *J* given a fitting set size of *n*. For estimators other than *θ̂*, the values in parentheses denote the t-statistic of the difference between this estimator and *θ̂*, with negative values indicating that the listed estimator has a lower KL divergence.


As can be seen from Table 2, using the true value of *J* is at most marginally helpful. In fact, for most values of *n* it displays slightly better average results, but a slightly higher standard deviation of those results, and thus reduced T-statistics. Thus, we conclude that Takeuchi's approximation, replacing *J* with *Ĵ*, is reasonable. The same conclusion was reached in Section 4.1; see the remark there. We note also that the ICE estimator using *Ĵ* exhibits substantially better performance for very low sample sizes, but further investigation of this phenomenon is beyond the scope of the current paper.

In Table 3, we show the average matrix norms of *J*, *Ĵ*, and also of the diagonal of *Ĵ*, referred to as the matrix *D*. The matrix *D* will be examined further in Section 5, and is included here for completeness. We also show the norms of several matrix differences.

We note that the ICE objective values themselves exhibit much lower variation than the matrix norms shown in Table 3. In particular, though the matrix *D* is not actually converging to *J* as *n* increases, we see from the correction term it generates that this difference does not appear to have a material impact for larger *n*. We thus conclude that the major eigenvectors of (*Ĵ* − *J*) and (*D* − *J*) are very nearly orthogonal to the gradient vectors used to construct *Î* for large *n*.


**Table 3.** Mean matrix norms of *J*, its approximations, and differences from these approximations across 500 replications.

It is not clear from examining the trace terms in Table 3 that *D* is a worse approximation of *J* than *Ĵ* is, even for small *n* where the impact of the ICE approach is most significant. A more complete investigation of the spectrum of these matrices is beyond the scope of the present work.

#### *4.3. Multivariate Logistic Regression*

The previous experiment is based on a well-known test case. In this second experiment, we assess the general performance of ICE under (i) varying dimensionality of the true data distribution, (ii) increasing misspecification, and (iii) increasing training set sizes. To achieve this goal, we generate a more exhaustive set of data from a more complex data generation process.

#### 4.3.1. Data Generation Process

The synthetic data are designed to exhibit a number of characteristics needed to broadly evaluate the efficacy of ICE. First, the regressors should be sufficiently correlated so as to ensure that model selection is representative of typical datasets. However, we avoid multi-collinearity by ensuring the smallest eigenvalue is above a certain threshold. We additionally control the condition number of the covariance matrix Σ by randomly generating a symmetric positive definite covariance matrix Σ ∈ ℝ<sup>*p*×*p*</sup> using the eigen-decomposition

$$
\Sigma = U D U^T,\tag{15}
$$

where *U* is an orthogonal random matrix with elements *U<sub>ij</sub>* ∼ *N*(0, 1) and *D* is a diagonal matrix of positive eigenvalues. The eigenvalues are uniformly distributed over the interval [*a*, *b*] so that the condition number of Σ is *b*/*a* and the eigenvalues are kept distinct. Here, *a* is chosen to be 1 × 10<sup>−4</sup> and *b* is chosen to be 0.1.

Using a Cholesky decomposition Σ = ΓΓ<sup>*T*</sup> and the random mean vector *μ* ∼ *N*(0, 1), we generate correlated Gaussian vectors of dimension *p* with the properties

$$X\_i = \mu + \Gamma\_{ij} Z\_{j}, \quad Z\_{j} \sim N(0, 1), \ \forall j \in 1, \dots, p. \tag{16}$$
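The construction in Equations (15) and (16) can be sketched in plain Python as follows. This is a sketch under stated assumptions, not the code of [23]: the orthogonal matrix is obtained here by Gram-Schmidt orthonormalization of a Gaussian random matrix, and the eigenvalues are drawn at random from [*a*, *b*], so the condition number is at most *b*/*a*. All function names and seeds are hypothetical.

```python
import math
import random


def random_orthogonal(p, rng):
    """Orthonormalize Gaussian random columns with Gram-Schmidt."""
    columns = []
    while len(columns) < p:
        v = [rng.gauss(0.0, 1.0) for _ in range(p)]
        for u in columns:
            dot = sum(a * b for a, b in zip(u, v))
            v = [b - dot * a for a, b in zip(u, v)]
        norm = math.sqrt(sum(b * b for b in v))
        if norm > 1e-8:  # resample in the unlikely event of a degenerate draw
            columns.append([b / norm for b in v])
    # columns[j][i] is entry (i, j) of U
    return [[columns[j][i] for j in range(p)] for i in range(p)]


def random_covariance(p, a=1e-4, b=0.1, seed=0):
    """Sigma = U D U^T with eigenvalues drawn uniformly from [a, b]."""
    rng = random.Random(seed)
    U = random_orthogonal(p, rng)
    eigenvalues = [rng.uniform(a, b) for _ in range(p)]
    return [[sum(U[i][k] * eigenvalues[k] * U[j][k] for k in range(p))
             for j in range(p)] for i in range(p)]


def cholesky(S):
    """Lower-triangular Gamma with Gamma Gamma^T = S (S symmetric positive definite)."""
    p = len(S)
    G = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            partial = sum(G[i][k] * G[j][k] for k in range(j))
            if i == j:
                G[i][j] = math.sqrt(S[i][i] - partial)
            else:
                G[i][j] = (S[i][j] - partial) / G[j][j]
    return G


def sample_features(mu, G, rng):
    """One draw X = mu + Gamma Z with Z_j ~ N(0, 1), as in Equation (16)."""
    p = len(mu)
    z = [rng.gauss(0.0, 1.0) for _ in range(p)]
    return [mu[i] + sum(G[i][j] * z[j] for j in range(i + 1)) for i in range(p)]


p = 5
Sigma = random_covariance(p)
Gamma = cholesky(Sigma)
rng = random.Random(1)
mu = [rng.gauss(0.0, 1.0) for _ in range(p)]
x = sample_features(mu, Gamma, rng)
```

Bounding the eigenvalues below by *a* is what keeps the Cholesky factorization well defined, which is the practical reason for controlling the smallest eigenvalue.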

The data (*x<sub>n</sub>*, *y<sub>n</sub>*) are generated under a logistic regression

$$p(y=1|\mathbf{x}, \theta\_0) = f(\mathbf{x}|\theta\_0) = \frac{1}{1 + e^{-\mathbf{x}\theta\_0}}.\tag{17}$$

A key challenge in assessing the efficacy of bias reduction is to avoid generating excessively low entropy distributions. In such cases, bias reduction will have marginal effect as the parameters are all nearly zero. To avoid such scenarios, the intercept parameter of the true model is adjusted a posteriori until the following conditions are met:

$$\begin{array}{ll} 1. & c < \mathbb{E}\_{\mathbf{Z}}[p(\mathbf{Y} = \mathbf{1} | \mathbf{X}, \boldsymbol{\theta}\_{0})] < d \\ 2. & -\mathcal{L}(\boldsymbol{\theta}\_{0}) > \epsilon \end{array}$$

where *c* = 0.35, *d* = 0.65, and *ε* = 0.2. If these conditions cannot be met, then the replication is discarded.
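The text does not specify how the intercept is adjusted. One possible approach, shown purely as an assumption, is a bisection on the intercept that targets a mean class probability of 0.5, followed by a check of both conditions. Here the second condition is interpreted as the expected negative log-likelihood at the true parameters, estimated by Monte Carlo; the simulated `scores` and all function names are hypothetical.

```python
import math
import random


def sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def mean_probability(intercept, scores):
    """Monte Carlo estimate of E[p(Y = 1 | X, theta_0)]."""
    return sum(sigmoid(intercept + s) for s in scores) / len(scores)


def mean_entropy(intercept, scores):
    """Monte Carlo estimate of the expected negative log-likelihood at theta_0."""
    total = 0.0
    for s in scores:
        q = sigmoid(intercept + s)
        if 0.0 < q < 1.0:
            total -= q * math.log(q) + (1.0 - q) * math.log(1.0 - q)
    return total / len(scores)


def adjust_intercept(scores, target=0.5, lo=-50.0, hi=50.0, iterations=60):
    """Bisection on the intercept; mean_probability is increasing in the intercept."""
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if mean_probability(mid, scores) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


rng = random.Random(0)
# Hypothetical linear scores x . theta_0 (without intercept) for 2000 simulated rows.
scores = [rng.gauss(1.5, 1.0) for _ in range(2000)]
intercept = adjust_intercept(scores)
c, d, eps = 0.35, 0.65, 0.2
accepted = (c < mean_probability(intercept, scores) < d
            and mean_entropy(intercept, scores) > eps)
```

A replication for which `accepted` is false would be discarded, matching the rule stated above.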

#### 4.3.2. Model Performance Comparison

As in prior sections, KL divergence is computed between the estimated model and the true model for each of the estimation methods. The T-statistics of the difference with the corresponding MLE KL divergence are computed, with negative T-statistics showing that an approach is performing better than the MLE approach. For *L*<sup>2</sup> regularization in this section, the value of *λ* is computed via cross-validation, using two folds, on the provided fitting set.
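The exact form of the T-statistic is not spelled out in the text. A minimal sketch, assuming a paired statistic computed over replications (an assumption, as are the function name and the made-up input values), is:

```python
import math


def paired_t_statistic(kl_method, kl_mle):
    """t-statistic of the mean paired difference kl_method - kl_mle."""
    diffs = [a - b for a, b in zip(kl_method, kl_mle)]
    n = len(diffs)
    mean = sum(diffs) / n
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(variance / n)


# Made-up divergences for four replications, for illustration only.
kl_mle = [0.10, 0.12, 0.08, 0.11]
kl_ice = [0.09, 0.10, 0.08, 0.09]
t_stat = paired_t_statistic(kl_ice, kl_mle)
```

A negative value indicates that the first method has the lower average KL divergence, which is the sign convention used in the tables.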

Table 4 compares the KL divergences *ρ<sub>KL</sub>* from the true distribution to the model distributions produced using various estimation approaches applied to misspecified data. Here *m* denotes the number of regressors that are not predictive, i.e., *θ*<sub>0</sub> contains *m* zeros. The experiment is replicated 300 times using the data generation process described above and the test set is fixed at 100,000 observations.

**Table 4.** Comparison of the KL divergence for the different estimation approaches applied to misspecified data. The values in parentheses denote the t-statistic relative to MLE. For *p* = {5, 10, 20} there are *m* = {2, 4, 8} non-explanatory variables added.


We observe that the t-statistic for *θ*<sup>∗</sup> is most significant for relatively small sample sizes, particularly *n* = 500. For these small sizes, the improvement over MLE is greater, though noisier. There is uniform decay in improvement over *θ̂* as *n* grows, until for *p* = 10 and *p* = 20 the largest sizes are no longer statistically significant. This is expected, as both the MLE and ICE estimates are converging towards the true value of *θ*<sub>0</sub>, and for large enough sample sizes the ICE correction would be dominated by numerical error, particularly the ill conditioning of *J*.

The *L*<sup>2</sup> estimate improves for small values of *p*, but then becomes progressively worse for large values of *p*. We observe that for dimensionality above *p* = 5, the *L*<sup>2</sup> regularization described here is no longer effective in reducing the KL-divergence. For low values of *p* the value of *θx* has comparatively low variance, and thus the logistic function is reasonably locally approximated as linear. For higher *p* this approximation is less realistic and the performance of *L*<sup>2</sup> regularization degrades.

For the ICE estimates, larger values of *p* show fluctuations that are often not statistically significant. It is apparent that larger *p* is increasing the variance of the ICE divergences, probably due to numerical errors and ill conditioning. Larger values of *n* reduce the absolute size of the divergence improvement whereas larger values of *p* seem to increase it.

Note that though the t-statistics are degrading for large *n*, the absolute magnitude of the differences is asymptotically small. For these sizes, the results are insignificant, but more importantly, immaterial.

#### 4.3.3. Convergence Analysis for Large *n*

For 10 randomly chosen example problems, under which the model coefficients are now fixed, the convergence behavior for large *n*, the training set size, is explored. Note that the test set remains fixed at 100,000 observations for each problem. Table 5 compares the KL divergence (averaged over all 10 problems) under MLE (*θ̂*), *L*<sup>2</sup> regularization, and ICE for progressively larger sample sizes. The divergences *ρ<sub>KL</sub>*(*f*, *g<sub>θ̂</sub>*) and *ρ<sub>KL</sub>*(*f*, *g*<sub>*θ*<sup>∗</sup><sub>*ICE*</sub></sub>) converge to zero as *n* → ∞, as does *ρ<sub>KL</sub>*(*f*, *g*<sub>*θ*<sup>∗</sup><sub>*L*2</sub></sub>).

**Table 5.** Comparison of the KL divergence under the MLE *θ̂*, *L*<sup>2</sup> regularization and ICE regularization *θ*<sup>∗</sup><sub>*ICE*</sub> against a large sample size for the case when *p* = 10 and *m* = 4.


Generally the *θ*<sup>∗</sup><sub>*ICE*</sub> estimates are seen to converge slightly faster than the *θ̂* estimates. The regularization in *θ*<sup>∗</sup><sub>*L*2</sub> is observed to be beneficial for very small sample sizes, but then becomes marginally detrimental for large *n*.

#### **5. Optimized Computation Results**

For any model satisfying White's regularity criteria, it is known that the matrix *J* is positive definite near the MLE optimum *θ̂*. This implies that the diagonal elements of *J* are strictly positive, and indeed, considering just its diagonal elements *D*, it is known that *tr*(*ID*<sup>−1</sup>) > 0. The value of *tr*(*ID*<sup>−1</sup>) differs most strongly from *tr*(*IJ*<sup>−1</sup>) for models with strong regressor interactions. Therefore, using finite difference gradients, consider the following approximations for the ICE objective function:


Clearly, we expect that *θ*<sup>∗</sup><sub>4</sub> above is the least accurate approximation, and items *θ*<sup>∗</sup><sub>2</sub> and *θ*<sup>∗</sup><sub>3</sub> have varying levels of accuracy depending on the problem at hand. The cost comparison of these approaches is shown in Table 6.

**Table 6.** The asymptotic computational cost (per iteration) of various proposed approximations as a function of parameter count *p*. Cost is amortized when *J<sub>θ</sub>* = *J<sub>θ̂</sub>*, assuming that *n* ≈ *p*. Note that a typical model will cost *O*(*p*) in time and space for both the objective function and its gradients.


**Remark 9.** *When computing gradients for use in a solver, often approximation error will have only a marginal impact on the final result, though it may increase the number of iterations needed for convergence. Broyden's method [25] is a typical example of this approach in action. Efficient approximations of* [*∂<sub>θ</sub>Ĵ*] *might similarly have only a minor effect on accuracy and iteration count. The construction of approximate analytical derivatives is beyond the scope of the present work.*

These approximations were computed and compared for the Friedman problem (see Section 4.2), and the results are shown in Table 7 below.

**Table 7.** Comparison of the average KL divergence across 200 replications for MLE and several variants of ICE given a fitting set size of *n*. For estimators other than *θ̂*, the values in parentheses denote the t-statistic of the difference between this estimator and *θ̂*, with negative values indicating that the listed estimator has a lower KL divergence.


From Table 7, it is apparent that approach *θ*<sup>∗</sup><sub>4</sub>, taking *J* = I (the identity matrix), is not effective. This is not surprising, as the actual *J* matrix has dramatic differences in scale between regressors. Approximation *θ*<sup>∗</sup><sub>3</sub>, taking *J* = *D*, is accurate enough that it cannot be statistically distinguished from the direct computation of ICE by the test above. Approximation *θ*<sup>∗</sup><sub>2</sub> tends to underperform approximation *θ*<sup>∗</sup><sub>3</sub>.

Therefore, we propose taking *J* = *D* as a more numerically stable approximation of the ICE objective.
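The effect of replacing *J* by its diagonal *D* in the trace term can be illustrated on a two-parameter toy problem. The matrices below are made up for illustration and the function names are hypothetical; the sketch only shows that *tr*(*ID*<sup>−1</sup>) requires no matrix inversion and that it departs from *tr*(*IJ*<sup>−1</sup>) as the off-diagonal interaction grows.

```python
def trace_with_diagonal(I, J):
    """tr(I D^{-1}) where D is the diagonal of J; costs O(p)."""
    return sum(I[k][k] / J[k][k] for k in range(len(J)))


def trace_with_full_inverse_2x2(I, J):
    """tr(I J^{-1}) for 2 x 2 matrices, using the explicit inverse."""
    det = J[0][0] * J[1][1] - J[0][1] * J[1][0]
    J_inv = [[J[1][1] / det, -J[0][1] / det],
             [-J[1][0] / det, J[0][0] / det]]
    return sum(I[i][k] * J_inv[k][i] for i in range(2) for k in range(2))


# Made-up matrices, for illustration only.
I_hat = [[2.0, 0.3], [0.3, 1.0]]
J_weak = [[4.0, 0.1], [0.1, 2.0]]    # weak interaction between parameters
J_strong = [[4.0, 2.5], [2.5, 2.0]]  # strong interaction between parameters

gap_weak = abs(trace_with_diagonal(I_hat, J_weak)
               - trace_with_full_inverse_2x2(I_hat, J_weak))
gap_strong = abs(trace_with_diagonal(I_hat, J_strong)
                 - trace_with_full_inverse_2x2(I_hat, J_strong))
```

In this toy case the gap is small for the weakly coupled matrix and large for the strongly coupled one, in line with the remark about regressor interactions at the start of this section.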

#### **6. Conclusions**

Takeuchi [16] is believed to be the first to have proposed using an objective function similar to ICE in order to reduce generalization error, though it was applied via model selection. Firth [9] introduced a similar term to reduce parameter bias in model fitting, as opposed to model selection, though he derived it only for exponential model families and did not consider its effect on generalization error. It is not known why this approach did not find widespread use, but one may infer that the *O*(*p*<sup>4</sup>) computational cost and instability were enough to keep it from wider adoption.

In this paper, we reintroduce the objective function of [16] and provide a more general proof of its widespread applicability. We then show that efficient implementations costing only *O*(*p*) are possible. Under finite sample sizes, this bias correction term is shown experimentally in several models to lead to significant reduction in bias compared to maximum likelihood estimation with and without *L*<sup>2</sup> regularization. ICE offers many advantages over *L*<sup>2</sup> penalized maximum likelihood estimation: (i) it is suitable for most nonlinear models; (ii) it is provably asymptotically convergent; and (iii) it does not rely on any parameters which would need to be provided by the operator or deduced through cross-validation.

**Author Contributions:** Conceptualization, M.D. and T.W.; methodology, M.D. and T.W.; software, T.W.; validation, M.D. and T.W.; formal analysis, M.D. and T.W.; writing—original draft preparation, M.D. and T.W.; writing—review and editing, M.D. and T.W. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Data and code are available at https://doi.org/10.6084/m9.figshare.14312852.v1.

**Conflicts of Interest:** The authors declare no conflicts of interest.

#### **Appendix A. Proofs**

*Appendix A.1. White's Regularity Conditions*

**Definition A1** (White's regularity conditions)**.** *White [22] provides the following regularity conditions:*


*Appendix A.2. Proof of Finite Variance*

**Lemma A1** (Finite variance)**.** *Suppose the following conditions hold:*


*Then, for sufficiently large n there exists a compact subset U* ⊂ Θ *containing θ*<sub>0</sub>, *θ̂*, *such that*


**Proof.** Assumptions (4) and (5) establish existence of some set, *S*, containing *θ*<sub>0</sub> such that ℒ(*θ*) is bounded on *S*, and its estimate, ℓ(*θ*, *X<sub>n</sub>*), has finite variance. Therefore, −ℓ(*θ*, *x<sub>n</sub>*) is also bounded on *S* almost surely. Similarly for the first 5 derivatives. White's criteria imply that *J*<sub>*θ*<sub>0</sub></sub> is positive definite on an open set around *θ*<sub>0</sub>, and thus one can form a compact set *U* ⊂ *S* containing an open set around *θ*<sub>0</sub> on which the minimum eigenvalue of *J<sub>θ</sub>* is bounded away from 0.

Note that *J<sub>θ</sub>* is three times differentiable on *U* by Assumption 4, as is *Ĵ<sub>θ</sub>*, as established above. Then *Ĵ*<sub>*θ*</sub><sup>−1</sup> is also positive definite and bounded on *U*. It can be shown to also have three derivatives by using the well-known matrix relation

$$
\partial\_{\theta}A^{-1} = -A^{-1}(\partial\_{\theta}A)A^{-1}.\tag{A1}
$$
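Relation (A1) is the standard result obtained by differentiating the identity *AA*<sup>−1</sup> = I, assuming *A* is invertible and differentiable in *θ*:

$$A A^{-1} = \mathbb{I} \ \Longrightarrow \ (\partial\_{\theta}A)A^{-1} + A\,\partial\_{\theta}A^{-1} = 0 \ \Longrightarrow \ \partial\_{\theta}A^{-1} = -A^{-1}(\partial\_{\theta}A)A^{-1}.$$

Repeated application expresses higher derivatives of *A*<sup>−1</sup> through derivatives of *A* and powers of *A*<sup>−1</sup>, which is why boundedness of *A*<sup>−1</sup> and of the derivatives of *A* carries over.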

It follows that *Ĵ*<sub>*θ*</sub><sup>−1</sup> is also positive definite, nonsingular, and bounded on *U*. Similarly for *Î<sub>θ</sub>*, and thus *tr*(*Î<sub>θ</sub>Ĵ*<sub>*θ*</sub><sup>−1</sup>) is bounded with finite variance on *U*. It also has three bounded derivatives with finite variance.

Therefore, −ℓ<sup>∗</sup>(*θ*, *X<sub>n</sub>*) → −ℓ(*θ*, *X<sub>n</sub>*), and *θ*<sup>∗</sup> → *θ*<sub>0</sub> as *n* → ∞, with the convergence being in probability. This means that *U* contains *θ*<sub>0</sub>, *θ̂*, and *θ*<sup>∗</sup> almost surely for large enough *n*. Similarly, on *U* we have three continuous, bounded derivatives ∂<sup>*k*</sup><sub>*θ*</sub>ℓ<sup>∗</sup>(*θ*, *x<sub>n</sub>*) almost surely.

#### *Appendix A.3. Proof of Asymptotic Normality*

**Theorem A1** (Asymptotic Normality)**.** *Provided the conditions hold in Lemma A1, namely, that*


*Then*

$$\sqrt{n}(\boldsymbol{\theta}^\*-\boldsymbol{\theta}\_0^\*) \to N(0, (J\_{\boldsymbol{\theta}\_0^\*}^\*)^{-1} I\_{\boldsymbol{\theta}\_0^\*}^\* (J\_{\boldsymbol{\theta}\_0^\*}^\*)^{-1}) .$$

**Proof.** As the first derivatives of ℓ<sup>∗</sup> are continuous, the mean value theorem may be applied:

$$
\partial\_{\theta} \ell^\*(\theta\_0^\*) = \partial\_{\theta} \ell^\*(\theta^\*) + (\theta^\* - \theta\_0^\*) \hat{J}\_{\bar{\theta}}^\* = (\theta^\* - \theta\_0^\*) \hat{J}\_{\bar{\theta}}^\*. \tag{A2}
$$

Here *θ̄* is between *θ*<sup>∗</sup> and *θ*<sup>∗</sup><sub>0</sub>. Under the assumptions of Lemma A1, and given its finite variance, *Ĵ<sub>θ̄</sub>* is almost surely (in the large *n* limit) positive definite, and thus invertible, as both *θ*<sup>∗</sup> and *θ*<sup>∗</sup><sub>0</sub> are in *U*, and *θ̄* is between them. Therefore,

$$(\theta^\* - \theta\_0^\*) = (\hat{J}\_{\bar{\theta}}^\*)^{-1} \partial\_\theta \ell^\*(\theta\_0^\*). \tag{A3}$$

Applying the mean value theorem a second time gives

$$
\hat{J}\_{\bar{\theta}}^\* = \hat{J}\_{\theta\_0^\*}^\* + (\bar{\theta} - \theta\_0^\*) \hat{J}\_{\theta\_1}^\*, \tag{A4}
$$

with *θ*<sub>1</sub> between *θ̄* and *θ*<sup>∗</sup><sub>0</sub>. If the order of (*θ*<sup>∗</sup> − *θ*<sup>∗</sup><sub>0</sub>) = *O<sub>p</sub>*(*δ*), where *δ* := *n*<sup>−1/2</sup>, then

$$
\hat{J}^\*\_{\bar{\theta}} = \hat{J}^\*\_{\theta^\*\_0} + O\_{p}(\delta). \tag{A5}
$$

As all of the *Ĵ*<sup>∗</sup> are bounded away from zero in probability, we have

$$(\hat{J}\_{\bar{\theta}}^\*)^{-1} = (\hat{J}\_{\theta\_0^\*}^\*)^{-1} + O\_p(\delta),\tag{A6}$$

with the equality holding in probability. In the large *n* limit, *δ* → 0, and thus

$$(\theta^\* - \theta\_0^\*) = (\hat{J}\_{\theta\_0^\*}^\*)^{-1} \partial\_\theta \ell^\*(\theta\_0^\*). \tag{A7}$$

As ∂<sub>*θ*</sub>ℓ<sup>∗</sup>(*θ*<sup>∗</sup><sub>0</sub>) is the sum of *n* independent vectors, it is asymptotically normally distributed by the central limit theorem, and its mean is 0 by the definition of *θ*<sup>∗</sup><sub>0</sub>. Its variance is therefore

$$\mathbb{V}[\partial\_\theta \ell^\*(\theta\_0^\*)] = \mathbb{E}[\partial\_\theta \ell^\*(\theta\_0^\*)(\partial\_\theta \ell^\*(\theta\_0^\*))^T] = \frac{1}{n} \hat{I}\_{\theta\_0^\*}.$$

Substituting this into Equation (A7) yields

$$\sqrt{n}(\boldsymbol{\theta}^\* - \boldsymbol{\theta}\_0^\*) = N(0, (\hat{J}\_{\boldsymbol{\theta}\_0^\*}^\*)^{-1} \hat{I}\_{\boldsymbol{\theta}\_0^\*}^\* (\hat{J}\_{\boldsymbol{\theta}\_0^\*}^\*)^{-1}),\tag{A8}$$

establishing the result.

*Appendix A.4. Proof of Prediction Bias Order under ICE*

**Theorem A2** (Prediction Bias Estimation under ICE)**.** *By minimizing* −ℓ<sup>∗</sup> *instead of* −ℓ*, the first order terms of the prediction bias are cancelled, leaving an O<sub>p</sub>*(*n*<sup>−3/2</sup>) *residual term*

$$-\mathcal{L}(\hat{\theta}^\*) = -\ell(\theta^\*(X\_n), X\_n) + O\_p(n^{-3/2}).\tag{A9}$$

**Proof.** Note that in Takeuchi's proof [16], the use of *θ*ˆ was prescribed. Therefore, this approach could be used only for model selection, and not for model fitting. Now consider the bias under the ICE estimator *θ*∗.

$$\begin{split} b(\boldsymbol{\theta}^\*(\boldsymbol{X}\_n), \boldsymbol{X}\_n) &= \quad \mathbb{E}\_{\boldsymbol{X}\_n} [\log g(\boldsymbol{X}\_n | \boldsymbol{\theta}^\*(\boldsymbol{X}\_n)) - \log g(\boldsymbol{X}\_n | \boldsymbol{\theta}\_0)] \\ &+ \quad \mathbb{E}\_{\boldsymbol{X}\_n} [\log g(\boldsymbol{X}\_n | \boldsymbol{\theta}\_0) - n \mathbb{E}\_{\boldsymbol{Z}\_n} [\log g(\boldsymbol{Z}\_n | \boldsymbol{\theta}\_0)]] \\ &+ \quad n \mathbb{E}\_{\boldsymbol{X}\_n} [\mathbb{E}\_{\boldsymbol{Z}\_n} [\log g(\boldsymbol{Z}\_n | \boldsymbol{\theta}\_0)] - \mathbb{E}\_{\boldsymbol{Z}\_n} [\log g(\boldsymbol{Z}\_n | \boldsymbol{\theta}^\*(\boldsymbol{X}\_n))]]. \end{split}$$

As the second term is zero, this can be simplified to

$$\begin{array}{rcl} b(\boldsymbol{\theta}^\*(\mathbf{X}\_{n}), \mathbf{X}\_{n}) &=& -n \mathbb{E}\_{\mathbf{X}\_{n}} [\ell(\boldsymbol{\theta}^\*(\mathbf{X}\_{n}), \mathbf{X}\_{n}) - \ell(\boldsymbol{\theta}\_{0}, \mathbf{X}\_{n})] \\ &- & n \mathbb{E}\_{\mathbf{X}\_{n}} [\mathcal{L}(\boldsymbol{\theta}\_0) - \mathcal{L}(\boldsymbol{\theta}^\*(\mathbf{X}\_{n}))]. \end{array}$$

Define *δ* = 1/√*n*, and then recall from White [22] that (*θ̂* − *θ*<sub>0</sub>) is *O<sub>p</sub>*(*δ*). Similarly, recall from Theorem A1 that (*θ*<sup>∗</sup> − *θ*<sup>∗</sup><sub>0</sub>) is also *O<sub>p</sub>*(*δ*). As constructed, the error term *b*(*θ*<sup>∗</sup>(*X<sub>n</sub>*), *X<sub>n</sub>*) is *O<sub>p</sub>*(1) (actually *O*(1) as it is an expectation), and terms of order *O<sub>p</sub>*(*δ*) and higher will be dropped. Therefore, as in Takeuchi's derivation [16], the Taylor expansions below will be truncated at second order in *δ*, dropping terms of order *O<sub>p</sub>*(*δ*<sup>3</sup>) or higher. Additionally, we will occasionally drop indications of *X<sub>n</sub>* where the meaning is clear and it greatly simplifies the notation.

With that truncation, recalling that *θ*<sub>0</sub> is a minimum of ℒ(*θ*) and thus has zero gradient:

$$n\mathbb{E}\_{X\_n}\left[\mathcal{L}(\boldsymbol{\theta}\_0) - \mathcal{L}(\boldsymbol{\theta}^\*(X\_n))\right] = \frac{n}{2}\mathbb{E}\_{X\_n}\left[\left(\boldsymbol{\theta}^\* - \boldsymbol{\theta}\_0\right)^T J\_{\boldsymbol{\theta}\_0} \left(\boldsymbol{\theta}^\* - \boldsymbol{\theta}\_0\right)\right] + O(\delta). \tag{A10}$$

Recall Theorem A1, and recall that for a quadratic form:

$$\mathbb{E}[\varepsilon^T A \varepsilon] = \text{tr}(A\Sigma) + \mu^T A \mu, \qquad \mathbb{E}[\varepsilon] = \mu, \ \mathbb{V}[\varepsilon] = \Sigma. \tag{A11}$$

Therefore,

$$\begin{array}{rcl} \frac{n}{2} \mathbb{E}\_{X\_{n}} [(\boldsymbol{\theta}^{\*} - \boldsymbol{\theta}\_{0})^{T} J\_{\boldsymbol{\theta}\_{0}} (\boldsymbol{\theta}^{\*} - \boldsymbol{\theta}\_{0})] &=& \frac{1}{2} tr (J\_{\boldsymbol{\theta}\_{0}} [(J\_{\boldsymbol{\theta}\_{0}^{\*}}^{\*})^{-1} I\_{\boldsymbol{\theta}\_{0}^{\*}}^{\*} (J\_{\boldsymbol{\theta}\_{0}^{\*}}^{\*})^{-1}]) \\ &+& \frac{n}{2} (\boldsymbol{\theta}\_{0}^{\*} - \boldsymbol{\theta}\_{0})^{T} J\_{\boldsymbol{\theta}\_{0}} (\boldsymbol{\theta}\_{0}^{\*} - \boldsymbol{\theta}\_{0}) \\ &+& O(\delta) .\end{array}$$

Now note that the second term on the right is a constant, and therefore would take no part in any optimization. Therefore, it can be safely ignored and

$$n\mathbb{E}\_{\mathbf{X}\_{n}}[\mathcal{L}(\boldsymbol{\theta}\_{0}) - \mathcal{L}(\boldsymbol{\theta}^{\*}(\mathbf{X}\_{n}))] = \frac{1}{2}tr(J\_{\boldsymbol{\theta}\_{0}}[(J\_{\boldsymbol{\theta}\_{0}^{\*}}^{\*})^{-1}I\_{\boldsymbol{\theta}\_{0}^{\*}}^{\*}(J\_{\boldsymbol{\theta}\_{0}^{\*}}^{\*})^{-1}]) + O(\delta).\tag{A12}$$
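The quadratic-form identity (A11) used in this step can be sanity-checked exactly on a small discrete distribution. The sketch below is illustrative: the points and the matrix `A` are made up, and the function name is hypothetical. Both sides are computed by enumeration over equally likely points, so no sampling error is involved.

```python
def quadratic_form_check(points, A):
    """Compare E[e^T A e] with tr(A Sigma) + mu^T A mu for a uniform discrete law."""
    n = len(points)
    p = len(A)
    mu = [sum(pt[i] for pt in points) / n for i in range(p)]
    # Population covariance (divide by n), matching V[e] for this discrete law.
    Sigma = [[sum((pt[i] - mu[i]) * (pt[j] - mu[j]) for pt in points) / n
              for j in range(p)] for i in range(p)]
    lhs = sum(sum(pt[i] * A[i][j] * pt[j] for i in range(p) for j in range(p))
              for pt in points) / n
    trace_term = sum(A[i][j] * Sigma[j][i] for i in range(p) for j in range(p))
    mean_term = sum(mu[i] * A[i][j] * mu[j] for i in range(p) for j in range(p))
    return lhs, trace_term + mean_term


# Made-up support points and symmetric matrix, for illustration only.
points = [(1.0, 2.0), (0.0, -1.0), (3.0, 0.5), (-2.0, 1.5)]
A = [[2.0, 0.5], [0.5, 1.0]]
lhs, rhs = quadratic_form_check(points, A)
```

The two returned values agree up to floating-point rounding, since the identity follows from E[*εε<sup>T</sup>*] = Σ + *μμ<sup>T</sup>* and the linearity of the trace.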

Addressing the first term, again taking a Taylor expansion, we find that

$$\ell(\boldsymbol{\theta}\_{0}) = \ell(\boldsymbol{\theta}^{\*}) + (\boldsymbol{\theta}\_{0} - \boldsymbol{\theta}^{\*})^{T} \partial\_{\boldsymbol{\theta}} \ell(\boldsymbol{\theta}^{\*}) + \frac{1}{2} (\boldsymbol{\theta}\_{0} - \boldsymbol{\theta}^{\*})^{T} \partial\_{\boldsymbol{\theta}}^{2} \ell(\boldsymbol{\theta}^{\*}) (\boldsymbol{\theta}\_{0} - \boldsymbol{\theta}^{\*}) + O\_{p}(\boldsymbol{\delta}^{3}).\tag{A13}$$

Therefore, recalling that *nO<sub>p</sub>*(*δ*<sup>3</sup>) = *O<sub>p</sub>*(*δ*) gives

$$\begin{split} n\mathbb{E}\_{\mathbf{X}\_{n}}[\ell(\boldsymbol{\theta}^{\*}(\mathbf{X}\_{n}),\mathbf{X}\_{n})-\ell(\boldsymbol{\theta}\_{0},\mathbf{X}\_{n})] &= \frac{1}{2}tr(J\_{\boldsymbol{\theta}^{\*}}[(J\_{\boldsymbol{\theta}^{\*}\_{0}}^{\*})^{-1}I\_{\boldsymbol{\theta}^{\*}\_{0}}^{\*}(J\_{\boldsymbol{\theta}^{\*}\_{0}}^{\*})^{-1}]) \\ &\quad + \frac{n}{2}(\boldsymbol{\theta}^{\*}\_{0}-\boldsymbol{\theta}\_{0})^{T}J\_{\boldsymbol{\theta}^{\*}}(\boldsymbol{\theta}^{\*}\_{0}-\boldsymbol{\theta}\_{0}) \\ &\quad + n\mathbb{E}\_{\mathbf{X}\_{n}}[(\boldsymbol{\theta}^{\*}-\boldsymbol{\theta}\_{0})^{T}\partial\_{\boldsymbol{\theta}}\ell(\boldsymbol{\theta}^{\*})] \\ &\quad + O\_{p}(\delta). \end{split}$$

Now examine the last term:

$$\begin{split} n\mathbb{E}\_{\mathbf{X}\_{n}}[(\boldsymbol{\theta}^{\*}-\boldsymbol{\theta}\_{0})^{T}\partial\_{\boldsymbol{\theta}}\ell(\boldsymbol{\theta}^{\*})] &= n\mathbb{E}\_{\mathbf{X}\_{n}}[(\boldsymbol{\theta}^{\*}-\boldsymbol{\theta}\_{0}^{\*})^{T}\partial\_{\boldsymbol{\theta}}\ell(\boldsymbol{\theta}^{\*})] \\ &\quad + n\mathbb{E}\_{\mathbf{X}\_{n}}[(\boldsymbol{\theta}\_{0}^{\*}-\boldsymbol{\theta}\_{0})^{T}\partial\_{\boldsymbol{\theta}}\ell(\boldsymbol{\theta}^{\*})]. \end{split}$$

However, (*θ*<sub>0</sub><sup>∗</sup> − *θ*<sub>0</sub>) does not depend on *X<sub>n</sub>*, so it can be pulled out of the expectation. Substituting a first-order Taylor expansion then gives

$$\begin{split} \mathbb{E}\_{\mathbf{X}\_{n}}[(\boldsymbol{\theta}\_{0}^{\*}-\boldsymbol{\theta}\_{0})^{\top}\partial\_{\theta}\ell(\boldsymbol{\theta}^{\*})] &= (\boldsymbol{\theta}\_{0}^{\*}-\boldsymbol{\theta}\_{0})^{\top}\mathbb{E}\_{\mathbf{X}\_{n}}[\partial\_{\theta}\ell(\boldsymbol{\theta}^{\*})] \\ &= (\boldsymbol{\theta}\_{0}^{\*}-\boldsymbol{\theta}\_{0})^{\top}\mathbb{E}\_{\mathbf{X}\_{n}}[\partial\_{\theta}\ell(\boldsymbol{\theta}\_{0}^{\*})+(\boldsymbol{\theta}^{\*}-\boldsymbol{\theta}\_{0}^{\*})\partial\_{\theta}^{2}\ell(\boldsymbol{\theta}\_{0}^{\*})+O\_{p}(\delta^{2})] \\ &= (\boldsymbol{\theta}\_{0}^{\*}-\boldsymbol{\theta}\_{0})^{\top}\partial\_{\theta}\mathcal{L}(\boldsymbol{\theta}\_{0}^{\*}) \\ &\quad + (\boldsymbol{\theta}\_{0}^{\*}-\boldsymbol{\theta}\_{0})^{\top}\mathbb{E}\_{\mathbf{X}\_{n}}\left[(\boldsymbol{\theta}^{\*}-\boldsymbol{\theta}\_{0}^{\*})\right]\partial\_{\theta}^{2}\mathcal{L}(\boldsymbol{\theta}\_{0}^{\*})+O(\delta^{3}) \\ &= (\boldsymbol{\theta}\_{0}^{\*}-\boldsymbol{\theta}\_{0})^{\top}\partial\_{\theta}\mathcal{L}(\boldsymbol{\theta}\_{0}^{\*})+O(\delta^{3}), \end{split}$$

with the last equality following from the fact that 𝔼<sub>*X<sub>n</sub>*</sub>[(*θ*<sup>∗</sup> − *θ*<sub>0</sub><sup>∗</sup>)] = 0 and the substitution of (*θ*<sub>0</sub><sup>∗</sup> − *θ*<sub>0</sub>)<sup>*T*</sup>*O*(*δ*<sup>2</sup>) = *O*(*δ*<sup>3</sup>). This term is therefore a constant, up to *O*(*δ*<sup>3</sup>), and takes no part in the optimization of *θ*; thus, it can be dropped from further consideration. Therefore, *n*𝔼<sub>*X<sub>n</sub>*</sub>[(*θ*<sub>0</sub><sup>∗</sup> − *θ*<sub>0</sub>)<sup>*T*</sup>∂<sub>*θ*</sub>ℓ(*θ*<sup>∗</sup>)] = *O*(*δ*), and

$$\begin{array}{rcl} n \mathbb{E}\_{\mathbf{X}\_n} [ (\boldsymbol{\theta}^\* - \boldsymbol{\theta}\_0)^T \partial\_{\boldsymbol{\theta}} \ell (\boldsymbol{\theta}^\*) ] &=& n \mathbb{E}\_{\mathbf{X}\_n} [ (\boldsymbol{\theta}^\* - \boldsymbol{\theta}\_0^\*)^T \partial\_{\boldsymbol{\theta}} \ell (\boldsymbol{\theta}^\*) ] \\ &+& O(\delta). \end{array}$$

Recombining these terms yields

$$\begin{split} n\mathbb{E}\_{\mathbf{X}\_{n}}\left[\ell\left(\boldsymbol{\theta}^{\*}(\mathbf{X}\_{n}),\mathbf{X}\_{n}\right)-\ell\left(\boldsymbol{\theta}\_{0},\mathbf{X}\_{n}\right)\right] &= \frac{1}{2}tr\big(J\_{\boldsymbol{\theta}^{\*}}\big[\left(J\_{\boldsymbol{\theta}^{\*}}^{\*}\right)^{-1}I\_{\boldsymbol{\theta}^{\*}}^{\*}\left(J\_{\boldsymbol{\theta}^{\*}}^{\*}\right)^{-1}\big]\big) \\ &\quad + \frac{n}{2}\big(\boldsymbol{\theta}\_{0}^{\*}-\boldsymbol{\theta}\_{0}\big)^{T}J\_{\boldsymbol{\theta}^{\*}}\left(\boldsymbol{\theta}\_{0}^{\*}-\boldsymbol{\theta}\_{0}\right) \\ &\quad + n\mathbb{E}\_{\mathbf{X}\_{n}}\big[\big(\boldsymbol{\theta}^{\*}-\boldsymbol{\theta}\_{0}^{\*}\big)^{T}\partial\_{\boldsymbol{\theta}}\ell\big(\boldsymbol{\theta}^{\*}\big)\big] \\ &\quad + O(\delta). \end{split}$$

Thus the bias (neglecting the constant terms) is then

$$\begin{split} b(\boldsymbol{\theta}^\*(\boldsymbol{X}\_{n}), \boldsymbol{X}\_{n}) &= - \frac{1}{2} tr(J\_{\boldsymbol{\theta}^\*} [(J\_{\boldsymbol{\theta}^\*}^\*)^{-1} I\_{\boldsymbol{\theta}^\*}^\* (J\_{\boldsymbol{\theta}^\*}^\*)^{-1}]) \\ &\quad - \frac{1}{2} tr(J\_{\boldsymbol{\theta}^\*} [(J\_{\boldsymbol{\theta}\_0^\*}^\*)^{-1} I\_{\boldsymbol{\theta}\_0^\*}^\* (J\_{\boldsymbol{\theta}\_0^\*}^\*)^{-1}]) \\ &\quad - \frac{n}{2} (\boldsymbol{\theta}\_0^\* - \boldsymbol{\theta}\_0)^T J\_{\boldsymbol{\theta}^\*} (\boldsymbol{\theta}\_0^\* - \boldsymbol{\theta}\_0) \\ &\quad - n \mathbb{E}\_{\boldsymbol{X}\_{n}} [(\boldsymbol{\theta}^\* - \boldsymbol{\theta}\_0^\*)^T \partial\_{\boldsymbol{\theta}} \ell (\boldsymbol{\theta}^\*)] \\ &\quad + O(\delta). \end{split}$$

Because ℓ<sup>∗</sup>(*θ*) = ℓ(*θ*) + *O<sub>p</sub>*(*δ*<sup>2</sup>), it follows that *J*<sup>∗</sup><sub>*θ*</sub> = *J<sub>θ</sub>* + *O<sub>p</sub>*(*δ*<sup>2</sup>) and thus *J*<sub>*θ*<sup>∗</sup></sub>(*J*<sup>∗</sup><sub>*θ*<sup>∗</sup></sub>)<sup>−1</sup> = *I* + *O*(*δ*<sup>2</sup>), where *I* without a subscript denotes the identity matrix. The same holds for *J*<sub>*θ*<sub>0</sub></sub>. In addition, *I*<sub>*θ*<sup>∗</sup></sub> = *I*<sub>*θ*<sub>0</sub><sup>∗</sup></sub> + *O*(*δ*); thus, the two trace terms can be simplified and combined:

$$\begin{aligned} b(\boldsymbol{\theta}^\*(\boldsymbol{X}\_n), \boldsymbol{X}\_n) &= -\operatorname{tr}(I\_{\boldsymbol{\theta}^\*} J\_{\boldsymbol{\theta}^\*}^{-1}) \\ &\quad - \frac{n}{2} (\boldsymbol{\theta}\_0^\* - \boldsymbol{\theta}\_0)^T J\_{\boldsymbol{\theta}^\*} (\boldsymbol{\theta}\_0^\* - \boldsymbol{\theta}\_0) \\ &\quad - n \mathbb{E}\_{\boldsymbol{X}\_n} [(\boldsymbol{\theta}^\* - \boldsymbol{\theta}\_0^\*)^T \partial\_{\boldsymbol{\theta}} \ell(\boldsymbol{\theta}^\*)] \\ &\quad + O(\delta) .\end{aligned}$$

As *θ*<sub>0</sub><sup>∗</sup> is a minimum of ℒ<sup>∗</sup>, begin by Taylor expanding the derivatives:

$$
\partial\_{\theta} \mathcal{L}(\theta\_0^\*) = \partial\_{\theta} \mathcal{L}(\theta\_0) - (\theta\_0^\* - \theta\_0) J\_{\theta\_0} = -(\theta\_0^\* - \theta\_0) J\_{\theta\_0}.\tag{A14}
$$

Moreover, it follows from the definition of ℒ<sup>∗</sup> that

$$
\partial\_{\theta} \mathcal{L}^\*(\theta\_0^\*) = 0 = \partial\_{\theta} \mathcal{L}(\theta\_0^\*) + \frac{1}{n} \partial\_{\theta} tr(IJ^{-1}). \tag{A15}
$$

Then, after Taylor expanding ℒ(*θ*<sub>0</sub><sup>∗</sup>) around *θ*<sub>0</sub>, it is seen that

$$(\theta\_0^\* - \theta\_0) = \frac{1}{n} J\_{\theta\_0}^{-1} \partial\_\theta tr(IJ^{-1}).\tag{A16}$$

Noting that *J*<sub>*θ*</sub><sup>−1</sup> = *O*(1) and *tr*(*IJ*<sup>−1</sup>) = *O*(1), it holds that (*θ*<sub>0</sub><sup>∗</sup> − *θ*<sub>0</sub>) = *O*(*δ*<sup>2</sup>). Therefore, (*n*/2)(*θ*<sub>0</sub><sup>∗</sup> − *θ*<sub>0</sub>)<sup>*T*</sup>*J*<sub>*θ*<sup>∗</sup></sub>(*θ*<sub>0</sub><sup>∗</sup> − *θ*<sub>0</sub>) = *O*(*δ*<sup>2</sup>), and can be neglected. Then, the bias becomes

$$\begin{aligned} b(\boldsymbol{\theta}^\*(\boldsymbol{X}\_{n}), \boldsymbol{X}\_{n}) &= -\operatorname{tr}(I\_{\boldsymbol{\theta}^\*} J\_{\boldsymbol{\theta}^\*}^{-1}) \\ &\quad - n \mathbb{E}\_{\boldsymbol{X}\_{n}} \big[ (\boldsymbol{\theta}^\* - \boldsymbol{\theta}\_0^\*)^T \partial\_{\boldsymbol{\theta}} \ell(\boldsymbol{\theta}^\*) \big] \\ &\quad + O(\delta) .\end{aligned}$$

However, for the same reason, ∂<sub>*θ*</sub>ℓ(*θ*<sup>∗</sup>) = *O*(*δ*<sup>2</sup>), so the last term *n*𝔼<sub>*X<sub>n</sub>*</sub>[(*θ*<sup>∗</sup> − *θ*<sub>0</sub><sup>∗</sup>)<sup>*T*</sup>∂<sub>*θ*</sub>ℓ(*θ*<sup>∗</sup>)] = *O*(*δ*), and it too can be absorbed into the residual.

Therefore,

$$\begin{aligned} b(\theta^\*(X\_n), X\_n) &= -tr(I\_{\theta^\*} J\_{\theta^\*}^{-1}) \\ &\quad + O(\delta), \end{aligned}$$

and

$$-\mathcal{L}(\hat{\boldsymbol{\theta}}) = -\ell(\boldsymbol{\theta}^\*(\boldsymbol{Z}\_{n}), \boldsymbol{Z}\_{n}) + \frac{1}{n}\mathrm{tr}(I\_{\boldsymbol{\theta}}J\_{\boldsymbol{\theta}}^{-1}) + O\_{p}(\delta^3) + C,\tag{A17}$$

where the constant *C* is composed of the neglected constant terms from earlier stages:

$$\begin{aligned} C &= -\frac{1}{2} (\boldsymbol{\theta}\_0^\* - \boldsymbol{\theta}\_0)^T J\_{\boldsymbol{\theta}\_0} (\boldsymbol{\theta}\_0^\* - \boldsymbol{\theta}\_0) \\ &\quad - (\boldsymbol{\theta}\_0^\* - \boldsymbol{\theta}\_0)^T \partial\_{\boldsymbol{\theta}} \mathcal{L} (\boldsymbol{\theta}\_0^\*). \end{aligned}$$

However, the last term is *O*(*δ*<sup>3</sup>), and the first is *O*(*δ*<sup>2</sup>), so this may be approximated as

$$\mathcal{C} = -\frac{1}{2} (\boldsymbol{\theta}\_0^\* - \boldsymbol{\theta}\_0)^T J\_{\boldsymbol{\theta}\_0} (\boldsymbol{\theta}\_0^\* - \boldsymbol{\theta}\_0). \tag{A18}$$

Recalling again that (*θ*<sub>0</sub><sup>∗</sup> − *θ*<sub>0</sub>) = *O*(*δ*<sup>2</sup>), it is clear that *C* = *O*(*δ*<sup>4</sup>), and may thus be absorbed into the *O*(*δ*<sup>3</sup>) residual term. Therefore,

$$-\mathcal{L}(\hat{\boldsymbol{\theta}}) = -\ell(\boldsymbol{\theta}^\*(\boldsymbol{Z}\_{n}), \boldsymbol{Z}\_{n}) + \frac{1}{n} \mathrm{tr}(I\_{\boldsymbol{\theta}} J\_{\boldsymbol{\theta}}^{-1}) + O(\delta^3). \tag{A19}$$

Comparing this to the form of ℓ<sup>∗</sup>(*θ*):

$$-\mathcal{L}(\hat{\boldsymbol{\theta}}) = -\ell^\*(\boldsymbol{\theta}^\*(\boldsymbol{Z}\_{n}), \boldsymbol{Z}\_{n}) + O\_{p}(\delta^3). \tag{A20}$$

Thus, by minimizing −ℓ<sup>∗</sup> instead of −ℓ, the first-order terms of the prediction bias are canceled and, in expectation, a more accurate model is produced.


## *Article* **On Supervised Classification of Feature Vectors with Independent and Non-Identically Distributed Elements**

**Farzad Shahrivari and Nikola Zlatanov \***

Electrical and Computer Systems Engineering, Monash University, Alliance Ln, Clayton, VIC 3168, Australia; Farzad.shahrivari@monash.edu

**\*** Correspondence: Nikola.zlatanov@monash.edu

**Abstract:** In this paper, we investigate the problem of classifying feature vectors with mutually independent but non-identically distributed elements that take values from a finite alphabet set. First, we show the importance of this problem. Next, we propose a classifier and derive an analytical upper bound on its error probability. We show that the error probability tends to zero as the length of the feature vectors grows, even when there is only one training feature vector per label available. Thereby, we show that for this important problem at least one asymptotically optimal classifier exists. Finally, we provide numerical examples where we show that the proposed classifier outperforms conventional classification algorithms when the amount of training data is small and the length of the feature vectors is sufficiently high.

**Keywords:** supervised classification; independent and non-identically distributed features; analytical error probability

#### **1. Introduction**

#### *1.1. Background*

Supervised classification is a machine learning technique that maps an input feature vector to an output label based on a set of correctly labeled training data. There is no single learning algorithm that works best on all supervised learning problems, as shown by the no free lunch theorem in [1]. As a result, there are many algorithms proposed in the literature whose performance depends on the underlying problem and the amount of training data available. The most widely used algorithms in the literature are decision trees [2,3], Support Vector Machines (SVM) [4,5], Rule-Based Systems [6], naive Bayes classifiers [7], k-nearest neighbors (KNN) [8], logistic regressions, and neural networks [9,10].

#### *1.2. Motivation*

In the following, we discuss the motivation for this work.

#### 1.2.1. Lack of Tight Upper Bounds on the Performance of Classifiers

In general, there are no tight upper bounds on the performance of the classifiers used in practice. Many of the previous works only provide experimental performance results. However, this approach has drawbacks. For example, one has to rely on the trial-and-error approach in order to develop a good classifier for a given problem, which impacts the reliability. Next, the algorithms whose performance has been verified only experimentally may work for a given problem, but may fail to work when applied to a similar problem. Finally, experimental results do not provide intuition into the underlying problem, whereas the analytical results provide the understanding of the underlying problem and the corresponding solutions.

Motivated by this, in this paper, we aim to investigate classifiers with analytical upper bounds on their performance.

**Citation:** Shahrivari, F.; Zlatanov, N. On Supervised Classification of Feature Vectors with Independent and Non-Identically Distributed Elements. *Entropy* **2021**, *23*, 1045. https://doi.org/10.3390/e23081045

Academic Editors: Lizhong Zheng and Chao Tian

Received: 26 July 2021 Accepted: 10 August 2021 Published: 13 August 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

#### 1.2.2. Independent and Non-Identically Distributed Features

In general, we can categorize the statistical properties of the feature vectors, which are the input to the classifier, into three types. To this end, let *Y<sup>n</sup>*(*X*) = [*Y*<sub>1</sub>(*X*), *Y*<sub>2</sub>(*X*), ..., *Y<sub>n</sub>*(*X*)] denote the input feature vector to the supervised classifier, where *n* is the length of the feature vector and *X* is the label to which the feature vector *Y<sup>n</sup>*(*X*) belongs. Then, we can distinguish the following three types of feature vectors depending on the statistics of the elements in the feature vector *Y<sup>n</sup>*(*X*).

The first type of feature vector is when the elements of *Y<sup>n</sup>*(*X*) are independent and identically distributed (i.i.d.). This is the simplest features model, but also the least applicable in practice. This model is identical to hypothesis testing, which has been well investigated in the literature [11–13]. As a result, tight upper bounds on the performance of supervised learning algorithms for this type of feature vector are available in the hypothesis testing literature. For instance, the authors in [11] showed that the posterior entropy and the maximum a posteriori error probability decay to zero with the length of the feature vector at the identical exponential rate, where the maximum achievable exponent is the minimum Chernoff information. In [12], the authors determine the requirements for the length of the vector *Y<sup>n</sup>*(*X*) and the number of labels *m* in order to achieve vanishing exponential error probability in testing *m* hypotheses that minimizes the rejection zone. In [13], the authors provide an upper bound and a lower bound on the error probability of Bayesian *m*-ary hypothesis testing in terms of conditional entropy.

The second type of feature vectors is when the elements of *Y<sup>n</sup>*(*X*) are mutually dependent and non-identically distributed (d.non-i.d.). This type of features model is the most general model and the most applicable in practice. However, it is also the most difficult to tackle analytically. As a result, supervised learning algorithms proposed for this features model lack tight analytical upper bounds on their performance [14–23]. This is because there are no frameworks that produce closed-form results when deriving statistics of vectors with d.non-i.d. elements when the underlying distributions are unknown. How, then, can we analytically investigate classifiers for practical scenarios when the feature vectors have d.non-i.d. elements? A possible approach leads us to the third type of feature vectors, explained in the following.

The third type of feature vectors is when the elements of *Y<sup>n</sup>*(*X*) are mutually independent but non-identically distributed (i.non-i.d.). This features model is much simpler than the d.non-i.d. features model and, more importantly, it is analytically tractable, as we show in this paper. Furthermore, this features model is applicable in practice. Specifically, there exists a class of algorithms, known as Independent Component Analysis (ICA), that transform vectors with d.non-i.d. elements into vectors with i.non-i.d. elements with a zero or a negligible loss of information [24–28]. The origins of ICA can be traced back to Barlow [29], who argued that a good representation of binary data can be achieved by an invertible transformation that transforms vectors with d.non-i.d. elements into vectors with i.non-i.d. elements. Finding such a transformation with no prior information about the distribution of the data has been considered an open problem until recently [28]. Specifically, the authors in [28] show that this hard problem can be accurately solved with a branch and bound search tree algorithm, or tightly approximated with a series of linear problems. Thereby, the authors in [28] provide the first efficient set of solutions to Barlow's problem. So far, the complexity of the fastest such algorithm is *O*(*n* × 2<sup>*n*</sup>) [28]. Nevertheless, since there exist such invertible transformations (i.e., no loss of information) which can transform vectors with d.non-i.d. elements into vectors with i.non-i.d. elements, we can tackle the features model comprised of d.non-i.d. elements by first transforming it (without loss of information) into the features model comprised of i.non-i.d. elements and then tackling the i.non-i.d. features model.

Motivated by this, in this paper, we investigate supervised classification of feature vectors with i.non-i.d. elements.

#### 1.2.3. Small Training Set

The main factor that impacts the accuracy of supervised classification is the amount of training data. In fact, most supervised algorithms are able to learn only if there is a very large set of training data available [30]. The main reason for this is the curse of dimensionality [31,32], which states that "the higher the dimensionality of the feature vectors, the more training data are needed for the supervised classifier" [33]. For example, supervised classification methods such as random forest [34,35] and KNN [36] suffer from the curse of dimensionality. However, having large training data sets is not always possible in practice. As a result, designing a supervised classification algorithm that exhibits good performance even when the training data set is extremely small is important.

Motivated by this, in this paper, we investigate supervised classifiers for the case when *t* training feature vectors per label are available, where *t* = 1, 2, ...

#### *1.3. Contributions*

In this paper, we propose an algorithm for supervised classification of feature vectors with i.non-i.d. elements when the number of training feature vectors per label is *t*, where *t* = 1, 2, ... Next, we derive an upper bound on the error probability of the proposed classifier for uniformly distributed labels and prove that the error probability exponentially decays to zero when the length of the feature vector, *n*, grows, even when only one training vector per label is available, i.e., when *t* = 1. Hence, the proposed classification algorithm provides an asymptotically optimal performance even when the number of training vectors per label is extremely small. We compare the performance of the proposed classifier with that of the naive Bayes classifier and the KNN algorithm. Our numerical results show that the proposed classifier significantly outperforms the naive Bayes classifier and the KNN algorithm when the number of training feature vectors per label is small and the length of the feature vectors *n* is sufficiently high.

The proposed algorithm is a form of the nearest neighbor classification algorithm, where the nearest neighbor is searched in the domain of empirical distributions. As a result, we refer to the algorithm as *the nearest empirical distribution*. The nearest empirical distribution algorithm is not new and, to the best of our knowledge, it was first proposed in [37] for the case when the elements of *Y<sup>n</sup>*(*X*) are i.i.d., i.e., for the equivalent problem of hypothesis testing. However, in this paper, we propose the nearest empirical distribution algorithm for the case when the elements of *Y<sup>n</sup>*(*X*) are i.non-i.d., which is much more complex than the problem of hypothesis testing where the elements of *Y<sup>n</sup>*(*X*) are i.i.d.

To the best of our knowledge, this is the first paper that investigates the important problem of classifying feature vectors with i.non-i.d. elements and provides an upper bound on its error probability. The novelty of this paper is not with the classifier itself, but rather in showing the importance of the problem of classifying feature vectors with i.non-i.d. elements and in showing analytically that at least one classifier with an asymptotically optimal error probability exists when at least one training feature vector per label is available.

The remainder of this paper is structured as follows. In Section 2, we formulate the considered classification problem. In Section 3, we provide our classifier and derive an upper bound on its error probability. In Section 4, we provide numerical examples of the performance of the proposed classifier. Finally, Section 5 concludes the paper.

#### **2. Problem Formulation**

The machine learning model is comprised of a label *X*, a feature vector *Y<sup>n</sup>*(*X*) = [*Y*<sub>1</sub>(*X*), *Y*<sub>2</sub>(*X*), ..., *Y<sub>n</sub>*(*X*)] of length *n* mapped to the label *X*, and a learned label *X̂*, as shown in Figure 1. In this paper, we adopt the information-theoretic style of notation, whereby random variables are denoted by capital letters and their realizations by small letters. The feature vector *Y<sup>n</sup>*(*X*) is the input to the machine learning algorithm whose aim is to detect the label *X* from the observed feature vector *Y<sup>n</sup>*(*X*). The performance of the machine learning algorithm is measured by the error probability P<sub>e</sub> = Pr{*X* ≠ *X̂*}.

**Figure 1.** A typical structural modelling of the classification learning problem.

We adopt the modeling in [38–40] and represent the dependency between the label *X* and the feature vector *Y<sup>n</sup>*(*X*) via a joint probability distribution *p*<sub>*X*,*Y<sup>n</sup>*</sub>(*x*, *y<sup>n</sup>*). Now, in order to gain a better understanding of the problem, we include the joint probability distribution *p*<sub>*X*,*Y<sup>n</sup>*</sub>(*x*, *y<sup>n</sup>*) into the model in Figure 1. To this end, since *p*<sub>*X*,*Y<sup>n</sup>*</sub>(*x*, *y<sup>n</sup>*) = *p*<sub>*Y<sup>n</sup>*|*X*</sub>(*y<sup>n</sup>*|*x*)*p<sub>X</sub>*(*x*) holds, instead of *p*<sub>*X*,*Y<sup>n</sup>*</sub>(*x*, *y<sup>n</sup>*), we can include the conditional probability distribution *p*<sub>*Y<sup>n</sup>*|*X*</sub>(*y<sup>n</sup>*|*x*) and the probability distribution *p<sub>X</sub>*(*x*) into the model in Figure 1, and thereby obtain the model in Figure 2.

**Figure 2.** An alternative modeling of the classification learning problem.

Now, the classification learning model in Figure 2 is a system comprised of a label generating source *X* according to the distribution *p<sub>X</sub>*(*x*), a feature vector generator modelled by the conditional probability distribution *p*<sub>*Y<sup>n</sup>*|*X*</sub>(*y<sup>n</sup>*|*x*), a feature vector *Y<sup>n</sup>*, a classifier that aims to detect *X* from the observed feature vector *Y<sup>n</sup>*, and the detected label *X̂*. Note that the system model in Figure 2 can be seen equivalently as a communication system comprised of a source *X*, a channel with input *X* and output *Y<sup>n</sup>*, and a decoder (i.e., detector) that aims to detect *X* from *Y<sup>n</sup>*. The notation used in this paper, letter *X* for labels and letter *Y* for features, is based on the notation used in information theory for modelling communication systems. In the classification model shown in Figure 2, we assume that the label *X* can take values from the set 𝒳, according to *p<sub>X</sub>*(*x*) = 1/|𝒳|, where |·| denotes the cardinality of a set. Next, we assume that the *i*-th element of the feature vector *Y<sup>n</sup>*, *Y<sub>i</sub>*, for *i* = 1, 2, ..., *n*, takes values from the set 𝒴 = {*y*<sub>1</sub>, *y*<sub>2</sub>, ..., *y*<sub>|𝒴|</sub>}, according to the conditional probability distribution *p*<sub>*Y<sub>i</sub>*|*X*</sub>(*y<sub>i</sub>*|*x*).

Moreover, we assume that the elements of the feature vector *Y<sup>n</sup>* are i.non-i.d. As a result, the feature vector *Y<sup>n</sup>* takes values from the set 𝒴<sup>*n*</sup> according to the conditional probability distribution *p*<sub>*Y<sup>n</sup>*|*X*</sub>(*y<sup>n</sup>*|*x*) given by

$$p\_{Y^n|X}(y^n|x) = p\_{Y\_1, Y\_2, \dots, Y\_n|X}(y\_1, y\_2, \dots, y\_n|x) \stackrel{(a)}{=} \prod\_{i=1}^n p\_{Y\_i|X}(y\_i|x) \stackrel{(b)}{=} \prod\_{i=1}^n p\_i(y\_i|x),\tag{1}$$

where (*a*) comes from the fact that the elements of the feature vector *Y<sup>n</sup>* are mutually independent and (*b*) is for the sake of notational simplicity, where *p<sub>i</sub>* is used instead of *p*<sub>*Y<sub>i</sub>*|*X*</sub>. As a result of (1), the considered classification model in Figure 2 can be represented equivalently as in Figure 3.
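To make the factorization in (1) concrete, the following minimal sketch evaluates the likelihood of a short feature vector under position-dependent distributions. The alphabet, the per-position probabilities, and the function name `log_likelihood` are illustrative assumptions and do not come from the paper.

```python
import math


def log_likelihood(y, p_given_x):
    """Return log p(y^n | x) = sum_i log p_i(y_i | x), following (1).

    y          -- sequence of observed symbols, one per position
    p_given_x  -- list of dicts, p_given_x[i][symbol] = p_i(symbol | x)
    """
    return sum(math.log(p_i[y_i]) for y_i, p_i in zip(y, p_given_x))


# Hypothetical per-position distributions for a single label x over the
# alphabet {"a", "b"}. Each position has its own distribution
# (non-identically distributed), and positions are independent.
p_given_x = [
    {"a": 0.9, "b": 0.1},  # p_1(. | x)
    {"a": 0.5, "b": 0.5},  # p_2(. | x)
    {"a": 0.2, "b": 0.8},  # p_3(. | x)
]

y = ("a", "b", "b")
likelihood = math.exp(log_likelihood(y, p_given_x))  # equals 0.9 * 0.5 * 0.8
```

Working in the log domain is a design choice made here to avoid numerical underflow when *n* is large; it is not part of the model itself.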

**Figure 3.** An alternative modelling of the classification learning problem when the elements of *Yn*(*X*) are mutually independent but non-identically distributed (i.non-i.d.).

Next, we assume that *p<sub>i</sub>*(*y<sub>i</sub>*|*x*), ∀*i*, and thereby *p*<sub>*Y<sup>n</sup>*|*X*</sub>(*y<sup>n</sup>*|*x*), are unknown to the classifier. Instead, the classifier knows 𝒳 and 𝒴, and for each *x<sub>i</sub>* ∈ 𝒳, where *i* = 1, 2, ..., |𝒳|, it has access to a finite set of *t* correctly labelled input–output pairs (*x<sub>i</sub>*, *ŷ*<sup>*n*</sup><sub>*i*<sub>1</sub></sub>), (*x<sub>i</sub>*, *ŷ*<sup>*n*</sup><sub>*i*<sub>2</sub></sub>), ..., (*x<sub>i</sub>*, *ŷ*<sup>*n*</sup><sub>*i<sub>t</sub>*</sub>), denoted by 𝒯<sub>*i*</sub>, referred to as the training set for label *x<sub>i</sub>*.

Finally, we assume that the following holds

$$\sum\_{l=1}^{n} p\_l(y|\mathbf{x}\_i) \neq \sum\_{l=1}^{n} p\_l(y|\mathbf{x}\_j), \text{ for } y \in \mathcal{Y} \text{ and } i \neq j. \tag{2}$$

The condition in (2) means that the distribution of the feature vectors *Y<sup>n</sup>*(*X*) for label *X* = *i* is not a perturbation of the distribution of the feature vectors *Y<sup>n</sup>*(*X*) for label *X* = *j*. As a result, the proposed classifier only applies to the subset of data vectors with i.non-i.d. elements that satisfy (2).
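When the per-position distributions are available (for instance in a simulation; they are unknown to the classifier itself), condition (2) can be checked directly. The sketch below assumes one particular reading of (2), namely that for every pair of distinct labels the position-summed probabilities differ for at least one symbol *y*. The function names, the labels, and the numerical values are hypothetical.

```python
def position_sums(p_given_x, alphabet):
    """Return [sum_l p_l(y | x) for y in alphabet] for a single label x."""
    return [sum(p_l[y] for p_l in p_given_x) for y in alphabet]


def satisfies_condition_2(models, alphabet, tol=1e-12):
    """Check the assumed reading of (2) for every pair of distinct labels.

    models -- dict mapping a label to its list of per-position distributions
    tol    -- numerical tolerance below which two sums are treated as equal
    """
    labels = list(models)
    sums = {x: position_sums(models[x], alphabet) for x in labels}
    for a in range(len(labels)):
        for b in range(a + 1, len(labels)):
            s_a, s_b = sums[labels[a]], sums[labels[b]]
            # The pair violates (2) if the sums coincide for every symbol.
            if all(abs(u - v) <= tol for u, v in zip(s_a, s_b)):
                return False
    return True


alphabet = ["a", "b"]
x1 = [{"a": 0.9, "b": 0.1}, {"a": 0.1, "b": 0.9}]
x2 = [{"a": 0.1, "b": 0.9}, {"a": 0.9, "b": 0.1}]  # same distributions, reordered
x3 = [{"a": 0.9, "b": 0.1}, {"a": 0.7, "b": 0.3}]

reordered_ok = satisfies_condition_2({"x1": x1, "x2": x2}, alphabet)
distinct_ok = satisfies_condition_2({"x1": x1, "x3": x3}, alphabet)
```

In this example the first pair fails the check, because reordering the positions leaves the sums unchanged, while the second pair passes.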

For the classification system model defined above and illustrated in Figure 3, we wish to propose a classifier that exhibits an asymptotically optimal error probability P<sub>e</sub> = Pr{*X* ≠ *X̂*} with respect to the length *n* of *Y<sup>n</sup>*, for any *t* ≥ 1, i.e., for any *t* ≥ 1, P<sub>e</sub> → 0 as *n* → ∞. Moreover, we wish to obtain an analytical upper bound on the error probability of the proposed classifier for a given *t* and *n*.

#### **3. The Proposed Classifier and Its Performance**

In this section, we propose our classifier, derive an analytical upper bound on its error probability, and prove that the classifier exhibits an asymptotically optimal performance when the length *n* of the feature vector *Y<sup>n</sup>* satisfies *n* → ∞. This is conducted in the following.

For a given vector **v**<sup>*n*</sup> = (*v*<sub>1</sub>, *v*<sub>2</sub>, ..., *v<sub>n</sub>*), let the Minkowski distance *r* be defined as

$$\left\|\mathbf{v}^n\right\|\_{r} = \left(\sum\_{i=1}^{n} |v\_i|^r\right)^{1/r}.\tag{3}$$

Moreover, for a given feature vector *y<sup>k</sup>* = (*y*<sub>1</sub>, *y*<sub>2</sub>, ..., *y<sub>k</sub>*), let ℐ[*y<sup>k</sup>* = *y*] be a function defined as

$$\mathcal{I}[y^k = y] = \sum\_{i=1}^k \mathcal{I}[y\_i = y], \tag{4}$$

where ℐ[*y<sub>i</sub>* = *y*] is an indicator function assuming the value 1 if *y<sub>i</sub>* = *y* and 0 otherwise. Hence, ℐ[*y<sup>k</sup>* = *y*] counts the number of elements in *y<sup>k</sup>* that have the value *y*.
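The two definitions in (3) and (4) translate into a few lines of code. The sketch below is a plain illustration: the function names are hypothetical, and absolute values of the entries are taken in the norm, as is standard for the Minkowski distance.

```python
def minkowski_norm(v, r):
    """Return (sum_i |v_i|^r)^(1/r) for r >= 1, following (3)."""
    if r < 1:
        raise ValueError("r must be at least 1")
    return sum(abs(v_i) ** r for v_i in v) ** (1.0 / r)


def count_symbol(y_vec, y):
    """Return the number of positions of y_vec equal to y, following (4)."""
    return sum(1 for y_i in y_vec if y_i == y)


norm_l2 = minkowski_norm([3, -4], 2)     # Euclidean length of (3, -4)
norm_l1 = minkowski_norm([1, -2, 3], 1)  # sum of absolute values
count_a = count_symbol("abca", "a")      # occurrences of "a" in the sequence
```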

#### *3.1. The Proposed Classifier*

Let *ŷ*<sub>*i*</sub><sup>*nt*</sup> be a vector obtained by concatenating all training feature vectors for the input label *x<sub>i</sub>* as

$$
\hat{y}\_{i}^{nt} = \left(\hat{y}\_{i\_1}^{n}, \hat{y}\_{i\_2}^{n}, \dots, \hat{y}\_{i\_t}^{n}\right). \tag{5}
$$

Let *P*<sub>*ŷ*<sub>*i*</sub><sup>*nt*</sup></sub> be the empirical probability distribution of the concatenated training feature vector for label *x<sub>i</sub>*, *ŷ*<sub>*i*</sub><sup>*nt*</sup>, given by

$$P\_{\hat{y}\_i^{nt}} = \left[ \frac{\mathcal{I}\left[\hat{y}\_i^{nt} = y\_1\right]}{nt}, \frac{\mathcal{I}\left[\hat{y}\_i^{nt} = y\_2\right]}{nt}, \dots, \frac{\mathcal{I}\left[\hat{y}\_i^{nt} = y\_{|\mathcal{Y}|}\right]}{nt} \right]. \tag{6}$$

Let *y<sup>n</sup>* be the observed feature vector at the classifier whose label it wants to detect and let *P*<sub>*y<sup>n</sup>*</sub> denote the empirical probability distribution of *y<sup>n</sup>*, given by

$$P\_{y^n} = \left[\frac{\mathcal{I}\left[y^n = y\_1\right]}{n}, \frac{\mathcal{I}\left[y^n = y\_2\right]}{n}, \dots, \frac{\mathcal{I}\left[y^n = y\_{|\mathcal{Y}|}\right]}{n}\right].\tag{7}$$

Using the above notations, we propose the following classifier.

**Proposition 1.** *For the considered system model, we propose a classifier with the following classification rule*

$$\hat{x} = x\_{i}, \text{ where } i = \arg\min\_{i} \left\| P\_{y^n} - P\_{\hat{y}\_i^{nt}} \right\|\_{r}, \tag{8}$$

*where r* ≥ 1 *and ties are resolved by assigning the label among the ties uniformly at random. (For example, if* ‖*P*<sub>*y<sup>n</sup>*</sub> − *P*<sub>*ŷ*<sub>*i*</sub><sup>*nt*</sup></sub>‖<sub>*r*</sub> = ‖*P*<sub>*y<sup>n</sup>*</sub> − *P*<sub>*ŷ*<sub>*j*</sub><sup>*nt*</sup></sub>‖<sub>*r*</sub> *holds for i* ≠ *j, we set x̂* = *x<sub>i</sub> or x̂* = *x<sub>j</sub> uniformly at random.)*

As seen from (8), the proposed classifier assigns the label *x<sub>i</sub>* if the empirical probability distribution of the concatenated training feature vector mapped to label *x<sub>i</sub>*, *P*<sub>*ŷ*<sub>*i*</sub><sup>*nt*</sup></sub>, is the closest, in terms of Minkowski distance *r*, to the empirical probability distribution of the observed feature vector, *P*<sub>*y<sup>n</sup>*</sub>. In that sense, the proposed classifier can be considered as the nearest empirical distribution classifier.
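The classification rule in (8), together with the empirical distributions in (6) and (7), can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the toy training data, and the use of Python's `random.choice` for tie-breaking are assumptions made for the example.

```python
import random


def empirical_distribution(y_vec, alphabet):
    """Return the relative frequency of each alphabet symbol in y_vec, as in (6)-(7)."""
    n = len(y_vec)
    return [sum(1 for s in y_vec if s == a) / n for a in alphabet]


def minkowski_distance(p, q, r):
    """Return the Minkowski distance of order r between two equal-length vectors."""
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1.0 / r)


def classify(y_obs, training, alphabet, r=2, rng=random):
    """Nearest empirical distribution rule, following (8).

    y_obs    -- observed feature vector (sequence of symbols)
    training -- dict mapping each label to its list of t training vectors
    rng      -- object with a choice() method, used only to break ties
    """
    p_obs = empirical_distribution(y_obs, alphabet)
    distances = {}
    for label, vectors in training.items():
        # Concatenate the t training vectors of this label, as in (5).
        concatenated = [s for vec in vectors for s in vec]
        p_train = empirical_distribution(concatenated, alphabet)
        distances[label] = minkowski_distance(p_obs, p_train, r)
    best = min(distances.values())
    # Ties (detected by exact equality of the computed distances) are
    # resolved uniformly at random among the tied labels.
    tied = [label for label, d in distances.items() if d == best]
    return rng.choice(tied)


# Hypothetical toy data: two labels over the alphabet {"a", "b"}.
alphabet = "ab"
training = {"x1": ["aaab", "aaba"], "x2": ["abbb"]}

label_1 = classify("aaba", training, alphabet)
label_2 = classify("bbab", training, alphabet)
```

Note that only symbol frequencies enter the rule; the order of the elements within a feature vector plays no role.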

#### *3.2. Upper Bound on the Error Probability*

The following theorem establishes an upper bound on the error probability of the proposed classifier.

**Theorem 1.** *Let* $\bar{P}\_j$*, for* $j = 1, 2, \dots, |\mathcal{X}|$*, be a vector defined as*

$$\bar{P}\_j = \left[ \bar{p}(y\_1|x\_j), \bar{p}(y\_2|x\_j), \dots, \bar{p}(y\_{|\mathcal{Y}|}|x\_j) \right], \tag{9}$$

*where* $\bar{p}(y|x\_j)$ *is given by*

$$\bar{p}(y|x\_j) = \frac{1}{n} \sum\_{k=1}^{n} p\_k(y|x\_j). \tag{10}$$

*Then, for a given* $r \geq 1$*, the error probability of the proposed classifier is upper bounded by*

$$\mathbb{P}\_{\mathrm{e}} \le 2|\mathcal{Y}| \mathrm{e}^{-2n\epsilon^2} + 2|\mathcal{Y}| \mathrm{e}^{-2nt^{1/3}\epsilon^2}, \tag{11}$$

*where* $\epsilon$ *is given by*

$$\epsilon = \min\_{\substack{i,j\\i\neq j}} \frac{\left\|P\_{\hat{y}\_i^{nt}} - \bar{P}\_j\right\|\_r}{(2 + t^{-1/3})|\mathcal{Y}|^{1/r}}. \tag{12}$$

**Proof of Theorem 1.** Without loss of generality, we assume that $x\_1$ is the input to $p\_{Y^n|X}(y^n|x)$ and that $y^n$ is observed.

Let $\mathcal{A}\_k^{\epsilon}$, for $1 \leq k \leq |\mathcal{Y}|$, be a set defined as

$$\mathcal{A}\_k^{\epsilon} = \left\{ y^n : \left| \frac{\mathcal{Z}[y^n = y\_k]}{n} - \bar{p}(y\_k|x\_1) \right| \le \epsilon \right\}. \tag{13}$$

Furthermore, let $\mathcal{B}\_k^{\epsilon}$, for $1 \leq k \leq |\mathcal{Y}|$, be a set defined as

$$\mathcal{B}\_k^{\epsilon} = \left\{ \hat{y}^{nt} : \left| \frac{\mathcal{Z}\left[\hat{y}^{nt} = y\_k\right]}{nt} - \bar{p}(y\_k|x\_1) \right| \le \frac{\epsilon}{\sqrt[3]{t}} \right\}. \tag{14}$$

Let $\mathcal{A}^{\epsilon} = \bigcap\_{k=1}^{|\mathcal{Y}|} \mathcal{A}\_k^{\epsilon}$ and $\mathcal{B}^{\epsilon} = \bigcap\_{k=1}^{|\mathcal{Y}|} \mathcal{B}\_k^{\epsilon}$. Now, for any $y^n \in \mathcal{A}^{\epsilon}$, we have

$$\left(\sum\_{k=1}^{|\mathcal{Y}|} \left|\frac{\mathcal{Z}[y^n = y\_k]}{n} - \bar{p}(y\_k|x\_1)\right|^r\right)^{1/r} \stackrel{(a)}{\leq} \left(\sum\_{k=1}^{|\mathcal{Y}|} \epsilon^r\right)^{1/r}, \tag{15}$$

where (*a*) follows from (13). Moreover, for $\hat{y}\_1^{nt} \in \mathcal{B}^{\epsilon}$, we have

$$\left(\sum\_{k=1}^{|\mathcal{Y}|} \left| \frac{\mathcal{Z}[\hat{y}\_1^{nt} = y\_k]}{nt} - \bar{p}(y\_k|x\_1) \right|^r \right)^{1/r} \stackrel{(a)}{\leq} \left(\sum\_{k=1}^{|\mathcal{Y}|} \left(\frac{\epsilon}{\sqrt[3]{t}}\right)^r \right)^{1/r}, \tag{16}$$

where (*a*) follows from (14). Next, we have the following upper bound

$$\begin{split} & \left( \sum\_{k=1}^{|\mathcal{Y}|} \left| \frac{\mathcal{Z}[y^n = y\_k]}{n} - \frac{\mathcal{Z}[\hat{y}\_1^{nt} = y\_k]}{nt} \right|^r \right)^{1/r} \\ &= \left( \sum\_{k=1}^{|\mathcal{Y}|} \left| \frac{\mathcal{Z}[y^n = y\_k]}{n} - \bar{p}(y\_k|x\_1) - \left( \frac{\mathcal{Z}[\hat{y}\_1^{nt} = y\_k]}{nt} - \bar{p}(y\_k|x\_1) \right) \right|^r \right)^{1/r} \\ &\overset{(a)}{\leq} \left( \sum\_{k=1}^{|\mathcal{Y}|} \left| \frac{\mathcal{Z}[y^n = y\_k]}{n} - \bar{p}(y\_k|x\_1) \right|^r \right)^{1/r} + \left( \sum\_{k=1}^{|\mathcal{Y}|} \left| \frac{\mathcal{Z}[\hat{y}\_1^{nt} = y\_k]}{nt} - \bar{p}(y\_k|x\_1) \right|^r \right)^{1/r}, \end{split} \tag{17}$$

where (*a*) follows from the Minkowski inequality. Combining (15)–(17), we obtain

$$\left(\sum\_{k=1}^{|\mathcal{Y}|} \left| \frac{\mathcal{Z}[y^n = y\_k]}{n} - \frac{\mathcal{Z}[\hat{y}\_1^{nt} = y\_k]}{nt} \right|^r \right)^{1/r} \le |\mathcal{Y}|^{1/r} \epsilon + |\mathcal{Y}|^{1/r} \frac{\epsilon}{\sqrt[3]{t}}. \tag{18}$$

Hence, the Minkowski distance between the empirical probability distribution of the observed vector $y^n$ and the empirical probability distribution of the concatenated training vector for label $x\_1$ is upper bounded by the right-hand side of (18). We now derive a lower bound for $\hat{y}\_i^{nt}$, where $i \neq 1$. For any $x\_i$ such that $i \neq 1$, we have

$$\begin{aligned} &\left(\sum\_{k=1}^{|\mathcal{Y}|} \left|\frac{\mathcal{Z}[y^n = y\_k]}{n} - \frac{\mathcal{Z}[\hat{y}\_i^{nt} = y\_k]}{nt}\right|^r\right)^{1/r} + \left(\sum\_{k=1}^{|\mathcal{Y}|} \epsilon^r\right)^{1/r} \\ &\overset{(a)}{\geq} \left(\sum\_{k=1}^{|\mathcal{Y}|} \left|\frac{\mathcal{Z}[y^n = y\_k]}{n} - \frac{\mathcal{Z}[\hat{y}\_i^{nt} = y\_k]}{nt}\right|^r\right)^{1/r} + \left(\sum\_{k=1}^{|\mathcal{Y}|} \left|\frac{\mathcal{Z}[y^n = y\_k]}{n} - \bar{p}(y\_k|x\_1)\right|^r\right)^{1/r} \\ &\overset{(b)}{\geq} \left(\sum\_{k=1}^{|\mathcal{Y}|} \left|\frac{\mathcal{Z}[\hat{y}\_i^{nt} = y\_k]}{nt} - \bar{p}(y\_k|x\_1)\right|^r\right)^{1/r}, \end{aligned} \tag{19}$$

where (*a*) follows from (15) and (*b*) is again due to the Minkowski inequality. The expression in (19) can be written equivalently as

$$\begin{aligned} &\left(\sum\_{k=1}^{|\mathcal{Y}|} \left| \frac{\mathcal{Z}[y^n = y\_k]}{n} - \frac{\mathcal{Z}[\hat{y}\_i^{nt} = y\_k]}{nt} \right|^r \right)^{1/r} \\ &\geq \left(\sum\_{k=1}^{|\mathcal{Y}|} \left| \frac{\mathcal{Z}[\hat{y}\_i^{nt} = y\_k]}{nt} - \bar{p}(y\_k|x\_1) \right|^r \right)^{1/r} - |\mathcal{Y}|^{1/r}\epsilon, \end{aligned} \tag{20}$$

where $i \neq 1$. Now, using the definitions of $P\_{\hat{y}\_i^{nt}}$ and $\bar{P}\_1$ given by (6) and (9), respectively, we can replace the first term on the right-hand side of (20) by $\left\|P\_{\hat{y}\_i^{nt}} - \bar{P}\_1\right\|\_r$, and thereby, for any $i \neq 1$, we have

$$\left(\sum\_{k=1}^{|\mathcal{Y}|} \left| \frac{\mathcal{Z}[y^n = y\_k]}{n} - \frac{\mathcal{Z}[\hat{y}\_i^{nt} = y\_k]}{nt} \right|^r \right)^{1/r} \ge \left\| P\_{\hat{y}\_i^{nt}} - \bar{P}\_1 \right\|\_r - |\mathcal{Y}|^{1/r} \epsilon. \tag{21}$$

The expression in (21) represents a lower bound on the Minkowski distance of order $r$ between the empirical probability distribution of the observed vector $y^n$ and the empirical probability distribution of the concatenated training vector for any label $x\_i$, where $i \neq 1$.

Using the bounds in (18) and (21), we now relate the left-hand sides of (18) and (21). As long as the following inequality holds for each $i \neq 1$,

$$|\mathcal{Y}|^{1/r}\epsilon\left(1+\frac{1}{\sqrt[3]{t}}\right) < \left\|P\_{\hat{y}\_i^{nt}} - \bar{P}\_1\right\|\_r - |\mathcal{Y}|^{1/r}\epsilon, \tag{22}$$

which is equivalent to the following for $i \neq 1$

$$\epsilon < \frac{\left\|P\_{\hat{y}\_i^{nt}} - \bar{P}\_1\right\|\_r}{(2 + t^{-1/3})|\mathcal{Y}|^{1/r}}, \tag{23}$$

we have

$$\begin{split} \left(\sum\_{k=1}^{|\mathcal{Y}|} \left|\frac{\mathcal{Z}[y^n = y\_k]}{n} - \frac{\mathcal{Z}[\hat{y}\_1^{nt} = y\_k]}{nt}\right|^r\right)^{1/r} &\stackrel{(a)}{\leq} |\mathcal{Y}|^{1/r}\epsilon\left(1 + \frac{1}{\sqrt[3]{t}}\right) \\ &\stackrel{(b)}{<} \left\|P\_{\hat{y}\_i^{nt}} - \bar{P}\_1\right\|\_r - |\mathcal{Y}|^{1/r}\epsilon \\ &\stackrel{(c)}{\leq} \left(\sum\_{k=1}^{|\mathcal{Y}|} \left|\frac{\mathcal{Z}[y^n = y\_k]}{n} - \frac{\mathcal{Z}[\hat{y}\_i^{nt} = y\_k]}{nt}\right|^r\right)^{1/r}, \end{split} \tag{24}$$

where (*a*), (*b*), and (*c*) follow from (18), (22), and (21), respectively. Thereby, from (24), we have the following for $i \neq 1$

$$\left(\sum\_{k=1}^{|\mathcal{Y}|} \left| \frac{\mathcal{Z}[y^n = y\_k]}{n} - \frac{\mathcal{Z}[\hat{y}\_1^{nt} = y\_k]}{nt} \right|^r \right)^{1/r} < \left(\sum\_{k=1}^{|\mathcal{Y}|} \left| \frac{\mathcal{Z}[y^n = y\_k]}{n} - \frac{\mathcal{Z}[\hat{y}\_i^{nt} = y\_k]}{nt} \right|^r \right)^{1/r}. \tag{25}$$

Note that the left- and right-hand sides of (25) are the Minkowski norms of order $r$ of the vectors

$$\mathbf{v}\_1 = \left[\frac{\mathcal{Z}\left[y^n = y\_1\right]}{n} - \frac{\mathcal{Z}\left[\hat{y}\_1^{nt} = y\_1\right]}{nt}, \dots, \frac{\mathcal{Z}\left[y^n = y\_{|\mathcal{Y}|}\right]}{n} - \frac{\mathcal{Z}\left[\hat{y}\_1^{nt} = y\_{|\mathcal{Y}|}\right]}{nt}\right], \tag{26}$$

and

$$\mathbf{v}\_2 = \left[\frac{\mathcal{Z}\left[y^n = y\_1\right]}{n} - \frac{\mathcal{Z}\left[\hat{y}\_i^{nt} = y\_1\right]}{nt}, \dots, \frac{\mathcal{Z}\left[y^n = y\_{|\mathcal{Y}|}\right]}{n} - \frac{\mathcal{Z}\left[\hat{y}\_i^{nt} = y\_{|\mathcal{Y}|}\right]}{nt}\right], \tag{27}$$

respectively. Now, by the definitions of $P\_{y^n}$ and $P\_{\hat{y}\_i^{nt}}$ given by (7) and (6), respectively, the vectors in (26) and (27) are equal to $P\_{y^n} - P\_{\hat{y}\_1^{nt}}$ and $P\_{y^n} - P\_{\hat{y}\_i^{nt}}$, respectively. Therefore, (25) can be written equivalently as

$$\left\|P\_{y^n} - P\_{\hat{y}\_1^{nt}}\right\|\_r < \left\|P\_{y^n} - P\_{\hat{y}\_i^{nt}}\right\|\_r. \tag{28}$$

Now, let us highlight what we have obtained. If $\epsilon$ satisfies (23) for all $i \neq 1$, and if $y^n \in \mathcal{A}^{\epsilon}$ and $\hat{y}\_1^{nt} \in \mathcal{B}^{\epsilon}$ for that $\epsilon$, then (28) holds for all $i \neq 1$, and thereby our classifier will detect that $x\_1$ is the correct label. Using this, we can upper bound the error probability as

$$\begin{split} \mathbb{P}\_{\mathrm{e}} &= 1 - \Pr\{\hat{x} = x\_1\} \\ &\leq 1 - \Pr\left\{ \left(y^n \in \mathcal{A}^{\epsilon}\right) \cap \left(\hat{y}\_1^{nt} \in \mathcal{B}^{\epsilon}\right) \,\middle|\, \epsilon \in \mathcal{S} \right\}, \end{split} \tag{29}$$

where $\mathcal{S}$ is a set defined as

$$\mathcal{S} = \left\{ \epsilon : \epsilon \le \min\_{i \ne 1} \frac{\left\|P\_{\hat{y}\_i^{nt}} - \bar{P}\_1\right\|\_r}{(2 + t^{-1/3})|\mathcal{Y}|^{1/r}} \right\}. \tag{30}$$

In the following, we bound the expression in (29). The right-hand side of (29) can be upper bounded as

$$\begin{split} 1 - \Pr\left\{ \left( y^n \in \mathcal{A}^{\epsilon} \right) \cap \left( \hat{y}\_1^{nt} \in \mathcal{B}^{\epsilon} \right) \,\middle|\, \epsilon \in \mathcal{S} \right\} &= \Pr\left\{ \left( y^n \notin \mathcal{A}^{\epsilon} \right) \cup \left( \hat{y}\_1^{nt} \notin \mathcal{B}^{\epsilon} \right) \,\middle|\, \epsilon \in \mathcal{S} \right\} \\ &\stackrel{(a)}{\leq} \Pr\left\{ y^n \notin \mathcal{A}^{\epsilon} \,\middle|\, \epsilon \in \mathcal{S} \right\} + \Pr\left\{ \hat{y}\_1^{nt} \notin \mathcal{B}^{\epsilon} \,\middle|\, \epsilon \in \mathcal{S} \right\}, \end{split} \tag{31}$$

where (*a*) follows from Boole's inequality. Now, note that we have the following upper bound for the first expression in the right-hand side of (31)

$$\begin{split} \Pr\left\{y^n \notin \mathcal{A}^{\epsilon} \,\middle|\, \epsilon \in \mathcal{S}\right\} &= \Pr\left\{y^n \notin \bigcap\_{k=1}^{|\mathcal{Y}|} \mathcal{A}\_k^{\epsilon} \,\middle|\, \epsilon \in \mathcal{S}\right\} \\ &= \Pr\left\{y^n \in \bigcup\_{k=1}^{|\mathcal{Y}|} \overline{\mathcal{A}}\_k^{\epsilon} \,\middle|\, \epsilon \in \mathcal{S}\right\} \\ &\overset{(a)}{\leq} \sum\_{k=1}^{|\mathcal{Y}|} \Pr\left\{y^n \in \overline{\mathcal{A}}\_k^{\epsilon} \,\middle|\, \epsilon \in \mathcal{S}\right\} \\ &= \sum\_{k=1}^{|\mathcal{Y}|} \Pr\left\{\left|\frac{\mathcal{Z}[y^n = y\_k]}{n} - \bar{p}(y\_k|x\_1)\right| > \epsilon \,\middle|\, \epsilon \in \mathcal{S}\right\} \\ &= \sum\_{k=1}^{|\mathcal{Y}|} \Pr\left\{\left|\sum\_{j=1}^{n} \frac{\mathcal{Z}[y\_j = y\_k]}{n} - \bar{p}(y\_k|x\_1)\right| > \epsilon \,\middle|\, \epsilon \in \mathcal{S}\right\}, \end{split} \tag{32}$$

where $\overline{\mathcal{A}}\_k^{\epsilon}$ is the complement of $\mathcal{A}\_k^{\epsilon}$ and (*a*) follows from Boole's inequality. Note that $\mathcal{Z}[y\_1 = y\_k], \mathcal{Z}[y\_2 = y\_k], \dots, \mathcal{Z}[y\_n = y\_k]$ in (32) are $n$ independent Bernoulli random variables with probabilities of success $p\_1(y\_k|x\_1), p\_2(y\_k|x\_1), \dots, p\_n(y\_k|x\_1)$, respectively. Let $\mathcal{W}[y\_k]$ be a binomial random variable with parameters $\left(n, \bar{p}(y\_k|x\_1)\right)$. We proceed with the proof by introducing the following well-known theorem of Hoeffding from [41].

**Theorem 2** (Hoeffding [41])**.** *Assume that* $Z\_1, Z\_2, \dots, Z\_n$ *are* $n$ *independent Bernoulli random variables with probabilities of success* $p\_1, p\_2, \dots, p\_n$*, respectively. Next, let* $Z$ *be defined as* $Z = Z\_1 + Z\_2 + \dots + Z\_n$ *and let* $\bar{p}$ *be defined as* $\bar{p} = (p\_1 + p\_2 + \dots + p\_n)/n$*. Let* $W$ *be a binomial random variable with parameters* $(n, \bar{p})$*. Then, for given* $a$ *and* $b$ *such that* $0 \leq a \leq n\bar{p} \leq b \leq n$ *holds, we have*

$$\Pr\{a \le W \le b\} \le \Pr\{a \le Z \le b\}.\tag{33}$$

*In other words, the probability distribution of* $W$ *is more dispersed around its mean* $n\bar{p}$ *than is the probability distribution of* $Z$*. Except in the trivial case when* $a = 0$ *and* $b = n$*, the bound in* (33) *holds with equality if and only if* $p\_1 = \dots = p\_n = \bar{p}$*.*

**Proof of Theorem 2.** Please refer to [41].

Setting $a = n(\bar{p} - \delta)$ and $b = n(\bar{p} + \delta)$ in (33), we obtain

$$\Pr\{n(\bar{p}-\delta) \le W \le n(\bar{p}+\delta)\} \le \Pr\{n(\bar{p}-\delta) \le Z \le n(\bar{p}+\delta)\}.\tag{34}$$

Using (34), we have the following upper bound

$$\begin{split} \Pr\left\{ \left| \frac{Z}{n} - \bar{p} \right| > \delta \right\} &= 1 - \Pr\{ n(\bar{p} - \delta) \le Z \le n(\bar{p} + \delta) \} \\ &\stackrel{(a)}{\le} 1 - \Pr\{ n(\bar{p} - \delta) \le W \le n(\bar{p} + \delta) \} \\ &= \Pr\left\{ \left| \frac{W}{n} - \bar{p} \right| > \delta \right\}, \end{split} \tag{35}$$

where (*a*) follows from (34).
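As a numerical illustration of Theorem 2, the following sketch computes both distributions exactly for a small set of success probabilities and compares the interval probabilities in (33). The probabilities in `probs`, the chosen intervals, and all function names are hypothetical and serve only as an example; they do not come from [41] or from the text.

```python
from math import comb


def poisson_binomial_pmf(probs):
    """Exact pmf of Z = Z_1 + ... + Z_n for independent Bernoulli(p_i) variables."""
    pmf = [1.0]
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)   # the new variable equals 0
            nxt[k + 1] += mass * p       # the new variable equals 1
        pmf = nxt
    return pmf


def binomial_pmf(n, p):
    """Exact pmf of W ~ Binomial(n, p)."""
    return [comb(n, k) * p**k * (1.0 - p) ** (n - k) for k in range(n + 1)]


def interval_probability(pmf, a, b):
    """Pr{a <= S <= b} for an integer-valued S with the given pmf."""
    return sum(mass for k, mass in enumerate(pmf) if a <= k <= b)


probs = [0.1, 0.2, 0.3, 0.6, 0.8]   # hypothetical p_1, ..., p_n
n = len(probs)
p_bar = sum(probs) / n              # average success probability, about 0.4
pmf_z = poisson_binomial_pmf(probs)
pmf_w = binomial_pmf(n, p_bar)

# Every interval below contains n * p_bar = 2, as Theorem 2 requires.
for a, b in [(1, 3), (2, 2), (0, 4), (2, 5)]:
    pr_w = interval_probability(pmf_w, a, b)
    pr_z = interval_probability(pmf_z, a, b)
    print(a, b, round(pr_w, 4), round(pr_z, 4))
```

For each interval, the binomial probability printed first should not exceed the second, which is the statement of (33).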

We now return to the proof of Theorem 1. According to Theorem 2, the probability distribution of $\mathcal{W}[y\_k]$ is more dispersed around its mean $n\bar{p}(y\_k|x\_1)$ than is the probability distribution of $\sum\_{1 \leq j \leq n} \mathcal{Z}[y\_j = y\_k]$. Therefore, we can upper bound the probability in the last line of (32) as

$$\Pr\left\{ \left| \sum\_{j=1}^{n} \frac{\mathcal{Z}[y\_j = y\_k]}{n} - \bar{p}(y\_k|x\_1) \right| > \epsilon \,\middle|\, \epsilon \in \mathcal{S} \right\} \stackrel{(a)}{\leq} \Pr\left\{ \left| \frac{\mathcal{W}[y\_k]}{n} - \bar{p}(y\_k|x\_1) \right| > \epsilon \,\middle|\, \epsilon \in \mathcal{S} \right\},\tag{36}$$

where $\mathcal{S}$ is defined in (30) and (*a*) follows from (35). Now, let us introduce another well-known result, Hoeffding's inequality, from [42].

**Theorem 3** (Hoeffding's inequality [42])**.** *Let* $W\_1, W\_2, \dots, W\_n$ *be* $n$ *independent random variables such that for each* $1 \leq i \leq n$*, we have* $\Pr\{W\_i \in [a\_i, b\_i]\} = 1$*. Then, for* $S\_n$ *defined as* $S\_n = \sum\_{i=1}^{n} W\_i$*, we have*

$$\Pr\left\{S\_n - \mathbb{E}\left[S\_n\right] \ge \delta\right\} \le \exp\left(-\frac{2\delta^2}{\sum\_{i=1}^{n} (b\_i - a\_i)^2}\right),\tag{37}$$

*where* $\mathbb{E}[S\_n]$ *is the expectation of* $S\_n$*.*

**Proof of Theorem 3.** Please refer to [42].

Returning to (36), we note that the binomial random variable $\mathcal{W}[y\_k]$ is a sum of $n$ independent Bernoulli random variables, each taking values in $\{0, 1\}$. Applying (37) with $a\_i = 0$, $b\_i = 1$, and $\delta = n\epsilon$ to both tails, we have

$$\begin{split} \Pr\left\{ \left| \sum\_{j=1}^{n} \frac{\mathcal{Z}[y\_j = y\_k]}{n} - \bar{p}(y\_k|x\_1) \right| > \epsilon \,\middle|\, \epsilon \in \mathcal{S} \right\} &\le 2 \exp\left( -\frac{2n^2 \epsilon^2}{\sum\_{1 \le i \le n} (1 - 0)^2} \right) \\ &\le 2\mathrm{e}^{-2n\epsilon^2}, \end{split} \tag{38}$$

where $\mathcal{S}$ is defined in (30). Inserting (38) into (32), we obtain the following upper bound

$$\Pr\left\{y^n \notin \mathcal{A}^{\epsilon} \,\middle|\, \epsilon \in \mathcal{S}\right\} \leq 2|\mathcal{Y}|\mathrm{e}^{-2n\epsilon^2}.\tag{39}$$
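To get a feel for the per-symbol bound in (38), the sketch below compares the exact two-sided tail probability of a binomial random variable with $2\mathrm{e}^{-2n\epsilon^2}$. The values $\bar{p} = 0.3$ and $\epsilon = 0.11$, the list of lengths $n$, and the function names are hypothetical choices for illustration only.

```python
from math import comb, exp


def binomial_two_sided_tail(n, p, eps):
    """Pr{|W/n - p| > eps} for W ~ Binomial(n, p), summed term by term."""
    total = 0.0
    for k in range(n + 1):
        if abs(k / n - p) > eps:
            total += comb(n, k) * p**k * (1.0 - p) ** (n - k)
    return total


def hoeffding_bound(n, eps):
    """Two-sided Hoeffding bound 2 * exp(-2 * n * eps^2), as used in (38)."""
    return 2.0 * exp(-2.0 * n * eps**2)


for n in (10, 50, 150):
    tail = binomial_two_sided_tail(n, p=0.3, eps=0.11)
    bound = hoeffding_bound(n, eps=0.11)
    print(n, tail, bound)
```

The bound is loose for small $n$ (it can exceed one), but both quantities decay exponentially in $n$, which is the property the proof relies on.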

Similarly, we have the following result for the second expression in the right-hand side of (31)

$$\begin{split} \Pr\left\{\hat{y}\_1^{nt} \notin \mathcal{B}^{\epsilon} \,\middle|\, \epsilon \in \mathcal{S} \right\} &= \Pr\left\{\hat{y}\_1^{nt} \notin \bigcap\_{k=1}^{|\mathcal{Y}|} \mathcal{B}\_k^{\epsilon} \,\middle|\, \epsilon \in \mathcal{S} \right\} \\ &= \Pr\left\{\hat{y}\_1^{nt} \in \bigcup\_{k=1}^{|\mathcal{Y}|} \overline{\mathcal{B}}\_k^{\epsilon} \,\middle|\, \epsilon \in \mathcal{S} \right\} \\ &\overset{(a)}{\leq} \sum\_{k=1}^{|\mathcal{Y}|} \Pr\left\{\hat{y}\_1^{nt} \in \overline{\mathcal{B}}\_k^{\epsilon} \,\middle|\, \epsilon \in \mathcal{S} \right\} \\ &= \sum\_{k=1}^{|\mathcal{Y}|} \Pr\left\{ \left| \frac{\mathcal{Z}[\hat{y}\_1^{nt} = y\_k]}{nt} - \bar{p}(y\_k|x\_1) \right| > \frac{\epsilon}{\sqrt[3]{t}} \,\middle|\, \epsilon \in \mathcal{S} \right\} \\ &= \sum\_{k=1}^{|\mathcal{Y}|} \Pr\left\{ \left| \sum\_{j=1}^{nt} \frac{\mathcal{Z}[y\_j = y\_k]}{nt} - \bar{p}(y\_k|x\_1) \right| > \frac{\epsilon}{\sqrt[3]{t}} \,\middle|\, \epsilon \in \mathcal{S} \right\}, \end{split} \tag{40}$$

where again (*a*) follows from Boole's inequality. Note that, due to (5), for any integer $l$ such that $0 \leq l \leq t - 1$, the random variables $\mathcal{Z}[y\_{nl+1} = y\_k], \mathcal{Z}[y\_{nl+2} = y\_k], \dots, \mathcal{Z}[y\_{nl+n} = y\_k]$ in (40) are $n$ independent Bernoulli random variables with probabilities of success $p\_1(y\_k|x\_1), p\_2(y\_k|x\_1), \dots, p\_n(y\_k|x\_1)$, respectively, where $y\_{nl+1}, y\_{nl+2}, \dots, y\_{nl+n}$ are the elements of $\hat{y}\_{1\_{l+1}}^{n}$. In addition, note that

$$\begin{split} \bar{p}(y\_k|x\_1) &= \frac{1}{n} \sum\_{j=1}^n p\_j(y\_k|x\_1) \\ &= \frac{1}{nt} \left( \sum\_{l=0}^{t-1} \sum\_{j=1}^n p\_j(y\_k|x\_1) \right). \end{split} \tag{41}$$

Notice that, for each $0 \leq l \leq t - 1$, $p\_1(y\_k|x\_1) + p\_2(y\_k|x\_1) + \dots + p\_n(y\_k|x\_1)$ is the sum of the probabilities of success of the random variables $\mathcal{Z}[y\_{nl+1} = y\_k], \mathcal{Z}[y\_{nl+2} = y\_k], \dots, \mathcal{Z}[y\_{nl+n} = y\_k]$. Thereby, the last expression on the right-hand side of (41) is the average probability of success of the random variables $\mathcal{Z}[y\_j = y\_k]$ for $1 \leq j \leq nt$. Now, let $\mathcal{W}[y\_k]$ be a binomial random variable with parameters $\left(nt, \bar{p}(y\_k|x\_1)\right)$. Once again, according to Theorem 2, the probability distribution of $\mathcal{W}[y\_k]$ is more dispersed around its mean $nt\,\bar{p}(y\_k|x\_1)$ than is the probability distribution of $\sum\_{1 \leq j \leq nt} \mathcal{Z}[y\_j = y\_k]$. Therefore, the probability in the last line of (40) can be upper bounded as

$$\begin{split} \Pr\left\{ \left| \sum\_{j=1}^{nt} \frac{\mathcal{Z}[y\_j = y\_k]}{nt} - \bar{p}(y\_k|x\_1) \right| > \frac{\epsilon}{\sqrt[3]{t}} \,\middle|\, \epsilon \in \mathcal{S} \right\} &\stackrel{(a)}{\leq} \Pr\left\{ \left| \frac{\mathcal{W}[y\_k]}{nt} - \bar{p}(y\_k|x\_1) \right| > \frac{\epsilon}{\sqrt[3]{t}} \,\middle|\, \epsilon \in \mathcal{S} \right\} \\ &\stackrel{(b)}{\leq} 2 \exp\left( -\frac{2(nt)^2 t^{-2/3} \epsilon^2}{\sum\_{1 \leq i \leq nt} (1 - 0)^2} \right) \\ &\leq 2\mathrm{e}^{-2nt \left(t^{-2/3} \epsilon^2\right)} \\ &= 2\mathrm{e}^{-2nt^{1/3} \epsilon^2}, \end{split} \tag{42}$$

where $\mathcal{S}$ is defined in (30), (*a*) follows from (35) (in which $n$ is replaced by $nt$), and (*b*) results from applying (37) with $a\_i = 0$ and $b\_i = 1$ to both tails, since the binomial random variable $\mathcal{W}[y\_k]$ is a sum of $nt$ independent Bernoulli random variables, each taking values in $\{0, 1\}$. Inserting (42) into (40), we have the following upper bound

$$\Pr\left\{\hat{y}\_1^{nt} \notin \mathcal{B}^{\epsilon} \,\middle|\, \epsilon \in \mathcal{S}\right\} \leq 2|\mathcal{Y}|\mathrm{e}^{-2nt^{1/3}\epsilon^2}.\tag{43}$$

Inserting (39) and (43) into (31), and then inserting (31) into (29), we obtain the following upper bound for the error probability

$$\mathbb{P}\_{\mathrm{e}} \le 2|\mathcal{Y}| \mathrm{e}^{-2n\epsilon^2} + 2|\mathcal{Y}| \mathrm{e}^{-2nt^{1/3}\epsilon^2},\tag{44}$$

where

$$\epsilon = \min\_{\substack{i,j\\i\neq j}} \frac{\left\|P\_{\hat{y}\_i^{nt}} - \bar{P}\_j\right\|\_r}{(2 + t^{-1/3}) |\mathcal{Y}|^{1/r}},\tag{45}$$

which is the optimal value of $\epsilon$, i.e., the one that yields the tightest upper bound on the error probability $\mathbb{P}\_{\mathrm{e}}$ given by (44). This completes the proof of Theorem 1.
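The bound of Theorem 1 can be evaluated numerically once the vectors $P\_{\hat{y}\_i^{nt}}$ and $\bar{P}\_j$ are available, which in practice is only the case in simulations where the true distributions are known. The following sketch is one possible way to do so; the function names and the two toy distributions are assumptions for illustration and are not taken from the text.

```python
from math import exp


def minkowski_norm(vec, r):
    """Minkowski norm of order r >= 1."""
    return sum(abs(v) ** r for v in vec) ** (1.0 / r)


def theorem1_bound(train_dists, mean_dists, n, t, r=2.0):
    """Evaluate the right-hand side of (11) with epsilon taken from (12).

    train_dists[i]: empirical distribution of the concatenated training vector of label i
    mean_dists[j]:  averaged true distribution (P-bar_j) of label j
    """
    size = len(mean_dists[0])  # alphabet size |Y|
    scale = (2.0 + t ** (-1.0 / 3.0)) * size ** (1.0 / r)
    eps = min(
        minkowski_norm([a - b for a, b in zip(train_dists[i], mean_dists[j])], r) / scale
        for i in range(len(train_dists))
        for j in range(len(mean_dists))
        if i != j
    )
    first = 2 * size * exp(-2 * n * eps**2)
    second = 2 * size * exp(-2 * n * t ** (1.0 / 3.0) * eps**2)
    return eps, first + second


# Hypothetical example with two labels and a three-symbol alphabet.
train = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
mean = [[0.6, 0.3, 0.1], [0.1, 0.2, 0.7]]
for n in (100, 1000):
    print(n, theorem1_bound(train, mean, n=n, t=1))
```

Since the right-hand side of (11) can exceed one for small $n$, the bound is informative mainly through its exponential decay in $n$.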

The following corollary provides a simplified upper bound on the error probability when *t* → ∞.

**Corollary 1.** *When the number of training vectors per label goes to infinity, i.e., when* $t \to \infty$*, which is equivalent to the case when the probability distribution* $p(y^n|x)$ *is known at the classifier, the error probability of the proposed classifier is upper bounded as*

$$\mathbb{P}\_{\mathrm{e}} \le 2|\mathcal{Y}| \mathrm{e}^{-2n\epsilon^2}, \tag{46}$$

*where* $\epsilon$ *is given by*

$$\epsilon = \min\_{\substack{i,j\\i\neq j}} \frac{\left\|\bar{P}\_i - \bar{P}\_j\right\|\_r}{2|\mathcal{Y}|^{1/r}}.\tag{47}$$

**Proof.** The proof is straightforward.
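One way to fill in the omitted argument is sketched below. It assumes, beyond what the text states explicitly, that $P\_{\hat{y}\_i^{nt}}$ converges to $\bar{P}\_i$ as $t \to \infty$ (a law-of-large-numbers argument) and that the vectors $\bar{P}\_i$ are distinct, so that the limiting $\epsilon$ is strictly positive. Under these assumptions, the quantities in (11) and (12) behave as

$$\begin{aligned} t^{-1/3} &\to 0, \\ \min\_{\substack{i,j\\i\neq j}} \frac{\left\|P\_{\hat{y}\_i^{nt}} - \bar{P}\_j\right\|\_r}{(2 + t^{-1/3})|\mathcal{Y}|^{1/r}} &\to \min\_{\substack{i,j\\i\neq j}} \frac{\left\|\bar{P}\_i - \bar{P}\_j\right\|\_r}{2|\mathcal{Y}|^{1/r}}, \\ 2|\mathcal{Y}| \mathrm{e}^{-2nt^{1/3}\epsilon^2} &\to 0, \end{aligned}$$

so that only the first term on the right-hand side of (11) survives in the limit.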

As can be seen from (8) and (11), the performance of the proposed classifier depends on $r$. We cannot derive the value of $r$ that minimizes the error probability, since we do not have an exact expression for the error probability but only an upper bound on it. Moreover, in practice, the value of $r$ that minimizes this upper bound cannot be derived either, since the upper bound depends on $\bar{P}\_j$, which would be unknown in practice because $p\_{Y^n|X}(y^n|x)$ is unknown. As a result, for our numerical examples, we consider the Euclidean distance ($r = 2$), which is one of the most widely used distance metrics in practice.

The following corollary establishes the asymptotic optimality of the proposed classifier with respect to *n*.

**Corollary 2.** *The proposed classifier has an error probability that satisfies* $\mathbb{P}\_{\mathrm{e}} \to 0$ *as* $n \to \infty$ *if* $|\mathcal{Y}| \leq \mathcal{O}(n^m)$*,* $m$ *is fixed, and* $r > 2m$*. Here,* $n^m$ *indicates the dimension of our space, i.e., the maximum number of alphabets each element in the feature vector* $y^n$ *can take. Thereby, the proposed classifier is asymptotically optimal.*

**Proof.** For the proof, please see Appendix A.

#### **4. Simulation Results**

In this section, we provide simulation results on the performance of the proposed classifier for $r = 2$ and compare it to benchmark schemes. The benchmark schemes that we adopt for comparison are the naive Bayes classifier and the KNN algorithm. We do not adopt a classifier based on a neural network, since neural networks require a very large training set, which we assume is not available. For the naive Bayes classifier, the probability distribution $p\_{Y^n|X}(y^n|x)$ is estimated from the training vectors as follows. Let again $\hat{y}\_i^{nt}$ be the vector obtained by concatenating all training feature vectors for the input label $x\_i$, as in (5). Then, the estimate of $p(y\_j = y|x\_i)$, denoted by $\hat{p}(y\_j = y|x\_i)$, is found as

$$
\hat{p}(y\_j = y | x\_i) = \frac{\mathcal{Z}[\hat{y}\_i^{nt} = y]}{nt},
\tag{48}
$$

and the naive Bayes classifier decides according to

$$\hat{x} = \arg\max\_{x\_i} \prod\_{k=1}^{n} \hat{p}(y\_k|x\_i). \tag{49}$$

The main problem of the naive Bayes classifier occurs when an alphabet $y\_j \in \mathcal{Y}$ is not present in the training feature vectors. In that case, $\hat{p}(y\_j|x\_i)$ in (48) is $\hat{p}(y\_j|x\_i) = 0$, $\forall x\_i \in \mathcal{X}$, and, as a result, the right-hand side of (49) is zero, since at least one of the factors in the product in (49) is zero. In this case, the naive Bayes classifier fails to provide an accurate classification of the labels. In what follows, we see that this issue of the naive Bayes classifier appears frequently when we have a small number of training feature vectors. On the other hand, the KNN classifier works as follows. For the observed feature vector $y^n$, the KNN classifier looks for the $K$ nearest feature vectors to $y^n$ among all training feature vectors $\hat{y}\_{r\_s}^{n}$, for all $1 \leq r \leq |\mathcal{X}|$ and $1 \leq s \leq t$. Then, by considering the resulting set of $K$ input–output pairs $(x\_k, \hat{y}\_{k\_l}^{n})$, for $k \in \{1, 2, \dots, |\mathcal{X}|\}$ and $l \in \{1, 2, \dots, t\}$, the KNN classifier decides on the label that is the most frequent among the labels $x\_k$ in this set. The optimum value of $K$ for $t = 1$ is $K = 1$.
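The zero-probability issue of the naive Bayes classifier described above can be reproduced with a few lines of Python. The sketch below is illustrative only: the function names, the toy alphabet, and the training vectors are hypothetical, and no smoothing is applied, in line with the estimate in (48).

```python
from collections import Counter


def estimate_distribution(training_vectors, alphabet):
    """Empirical estimate as in (48): pooled relative frequencies per label."""
    pooled = [symbol for vec in training_vectors for symbol in vec]
    counts = Counter(pooled)
    return {symbol: counts[symbol] / len(pooled) for symbol in alphabet}


def naive_bayes_scores(observed, training, alphabet):
    """Likelihood product in (49), evaluated for every label."""
    scores = {}
    for label, vectors in training.items():
        p_hat = estimate_distribution(vectors, alphabet)
        score = 1.0
        for symbol in observed:
            score *= p_hat[symbol]
        scores[label] = score
    return scores


alphabet = [1, 2, 3, 4, 5, 6]
training = {"x1": [[1, 1, 2, 3]], "x2": [[4, 5, 5, 6]]}
observed = [1, 2, 4, 1]  # contains symbols that are unseen under each label
scores = naive_bayes_scores(observed, training, alphabet)
print(scores)  # {'x1': 0.0, 'x2': 0.0}
```

Both products vanish because each label has at least one observed symbol that never appeared in its training vector, so the decision in (49) reduces to an arbitrary choice.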

In the following, we provide numerical examples where we illustrate the performance of the proposed classifier when $p\_{Y^n|X}(y^n|x)$ is artificially generated.

#### *4.1. The I.I.D. Case with One Training Sample per Label*

In the following examples, we assume that the classifiers have access to only one training feature vector for each label, the elements of the feature vectors are generated i.i.d., and the alphabet size of the feature vector, |Y|, is fixed.

In Figures 4 and 5, we compare the error probability of the proposed classifier with that of the naive Bayes classifier and the KNN algorithm for the cases $|\mathcal{Y}| = 6$ and $|\mathcal{Y}| = 20$, respectively. In both examples, we have two different labels, i.e., $|\mathcal{X}| = 2$. As a result, we have two different probability distributions $p\_{Y^n|X\_1}(y^n|x\_1)$ and $p\_{Y^n|X\_2}(y^n|x\_2)$. These probability distributions are randomly generated as follows. We first generate two random vectors of length 6 and length 20 for Figures 4 and 5, respectively, where the elements of these vectors are drawn independently from a uniform probability distribution. Then we normalize these vectors such that the sum of their elements is equal to one. These two normalized randomly generated vectors then represent the two probability distributions $p\_{Y\_i|X\_1}(y\_i|x\_1) = p\_{Y|X\_1}(y|x\_1)$ and $p\_{Y\_i|X\_2}(y\_i|x\_2) = p\_{Y|X\_2}(y|x\_2)$, $\forall i$. Then, $p\_{Y^n|X\_k}(y^n|x\_k)$ is obtained as $p\_{Y^n|X\_k}(y^n|x\_k) = \prod\_{i=1}^{n} p\_{Y\_i|X\_k}(y\_i|x\_k)$, for $k = 1, 2$. The simulation is carried out as follows. For each $n$, we generate one training vector for each label, using the aforementioned probability distributions. Then, as test samples, we generate 1000 feature vectors for each label and pass these feature vectors through our proposed classifier, the naive Bayes classifier, and the KNN algorithm, and compute the errors. The length of the feature vector $n$ is varied from $n = 1$ to $n = 100$. We repeat the simulation 5000 times and then plot the error probability. Figures 4 and 5 show that the proposed classifier outperforms both the naive Bayes classifier and KNN.
The main reason for this performance gain is that, when only one training vector per label is available, the proposed classifier is more resilient to errors than the naive Bayes classifier, whereas the KNN algorithm has very poor performance because of the "curse of dimensionality". Specifically, the naive Bayes classifier cannot perform an accurate classification for small $n$ compared to $|\mathcal{Y}|$, since the chance that an alphabet will not be present in one of the training feature vectors is close to 1. On the other hand, the KNN algorithm cannot perform an accurate classification for large $n$, since the dimension of the input feature vector becomes much larger than the size of the training data and the "curse of dimensionality" occurs.
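A scaled-down version of the experiment described above, restricted to the proposed classifier, can be sketched as follows. This is not the authors' simulation code: the function names, the random seed, and the reduced numbers of repetitions and test vectors (far fewer than the 5000 repetitions and 1000 test vectors used for the figures) are assumptions chosen to keep the example fast.

```python
import random
from collections import Counter


def random_distribution(size, rng):
    """Uniform draws normalized to sum to one, as described for Figures 4 and 5."""
    weights = [rng.random() for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]


def empirical(vector, size):
    """Empirical distribution of a vector over the alphabet {0, ..., size - 1}."""
    counts = Counter(vector)
    return [counts[s] / len(vector) for s in range(size)]


def minkowski(p, q, r):
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1.0 / r)


def error_rate(n, size=6, trials=200, tests_per_label=20, r=2.0, seed=0):
    """Monte Carlo estimate of the error probability for two labels and t = 1."""
    rng = random.Random(seed)
    symbols = list(range(size))
    errors = 0
    total = 0
    for _ in range(trials):
        true_dists = [random_distribution(size, rng) for _ in range(2)]
        # One i.i.d. training vector of length n per label.
        train = [empirical(rng.choices(symbols, weights=d, k=n), size) for d in true_dists]
        for label, dist in enumerate(true_dists):
            for _ in range(tests_per_label):
                observed = empirical(rng.choices(symbols, weights=dist, k=n), size)
                dists = [minkowski(observed, q, r) for q in train]
                best = min(dists)
                # Ties are broken uniformly at random, as in Proposition 1.
                guess = rng.choice([i for i, d in enumerate(dists) if d == best])
                errors += int(guess != label)
                total += 1
    return errors / total


for n in (2, 10, 50):
    print(n, error_rate(n))
```

With these settings the estimated error rate should fall as $n$ grows, in qualitative agreement with the behavior reported for the proposed classifier; the exact numbers depend on the seed and on the reduced sample sizes.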

**Figure 4.** Comparison in error probability between the naive Bayes classifier, KNN, and the proposed classifier.

**Figure 5.** Comparison in error probability between the naive Bayes classifier, KNN, and the proposed classifier.

In Figure 6, we compare the performance of the proposed classifier for different values of $r$ when $|\mathcal{Y}| = 6$ with the derived upper bounds. As can be seen, for this example, the derived theoretical upper bounds have a similar slope to the exact error probabilities. Moreover, we can see that, for this example, the optimal $r$ is $r = 1$. However, this is not always the case, as the optimal $r$ depends on $p\_{Y^n|X\_k}(y^n|x\_k)$, $|\mathcal{Y}|$, and $|\mathcal{X}|$.

**Figure 6.** Comparison in error probability of the proposed classifier for different values of $r$ when $|\mathcal{Y}| = 6$. The related theoretical upper bounds for each value of $r$ are also given.

#### *4.2. The Overlapping I.Non-I.D. Case with One Training Sample per Label*

In this example, we consider the i.non-i.d. case where the probability distributions $p\_i(y\_i|x\_k)$ are overlapping for all $i$, as shown in Figure 7. The small orthogonal lines on the x-axis in Figure 7 represent alphabets, i.e., the elements in $\mathcal{Y}$, and the probability of occurrence of an alphabet $y\_i$ is given by the intersection of the corresponding orthogonal line with the represented probability distribution $p\_i(y\_i|x\_k)$, for $k = 1, 2$. By "overlapping", we mean the following. Let $\mathcal{Y}\_v$ and $\mathcal{Y}\_u$ denote the sets of outputs generated by $p\_v(y\_v|x\_k)$ and $p\_u(y\_u|x\_k)$, respectively. If, for any $v$ and $u$, $\mathcal{Y}\_v \cap \mathcal{Y}\_u \neq \emptyset$ holds, we say that the output alphabets are overlapping.

**Figure 7.** Illustration of the probability distributions $p\_i(y\_i|x\_1)$ (upper figure) and $p\_i(y\_i|x\_2)$ (lower figure), for $i = 1, 2, \dots, n$.

To demonstrate the performance of our proposed classifier in the overlapping case, we assume that we have two different labels, $\mathcal{X} = \{x\_1, x\_2\}$, where the corresponding conditional probability distributions $p\_i(y\_i|x\_1)$ and $p\_i(y\_i|x\_2)$ are obtained as follows. For a given $n$, let $\mathcal{Y} = \{-n, -n + 1, \dots, 0, \dots, n - 1, n\}$ be the set of all alphabets. Note that the size of $\mathcal{Y}$ grows with $n$. Moreover, let $\mathbf{u}\_i$ and $\mathbf{v}\_i$ ($1 \leq i \leq n$) be vectors of length $2n + 1$, given by

$$\mathbf{u}\_{i} = \left[ 0, \ldots, 0, \frac{1}{i(i+1)}, \frac{2}{i(i+1)}, \ldots, \frac{i}{i(i+1)}, \frac{i+1}{i(i+1)}, \frac{i}{i(i+1)}, \ldots, \frac{1}{i(i+1)}, 0, \ldots, 0 \right],\tag{50}$$

$$\mathbf{v}\_{i} = \left[ 0, \dots, 0, \frac{1}{i(i+1)}, \frac{1}{i(i+1)}, \dots, \frac{1}{i(i+1)}, \frac{1}{i(i+1)}, 0, \dots, 0 \right]. \tag{51}$$

The number of zeros on each side of the vectors $\mathbf{u}_i$ and $\mathbf{v}_i$ is $(n - i)$. To generate a feature vector from label $x_1$ ($x_2$), we generate the vector $y^n = (y_1, y_2, \ldots, y_n)$, where $y_i$ takes values from the set $\mathcal{Y}$ with probability distribution $p_i(y_i|x_1) = \mathbf{u}_i\big(1 + 2(n + y_i)\big)$ $\big(p_i(y_i|x_2) = \mathbf{v}_i\big(1 + 2(n + y_i)\big)\big)$.
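The construction of the overlapping distributions can be sketched in Python as follows. This is a minimal illustration rather than the authors' simulation code: the function names are hypothetical, each vector is renormalized to sum to one (an assumption of this sketch), and symbols are mapped to vector positions so that $-n$ corresponds to the first entry and $n$ to the last (also an assumption).

```python
import numpy as np

def overlapping_pmfs(i, n):
    """Return (u_i, v_i) of length 2n+1 for 1 <= i <= n."""
    if not 1 <= i <= n:
        raise ValueError("need 1 <= i <= n")
    ramp = np.arange(1, i + 1)                          # 1, 2, ..., i
    tri = np.concatenate([ramp, [i + 1], ramp[::-1]])   # triangular shape, 2i+1 entries
    flat = np.ones(2 * i + 1)                           # flat shape, 2i+1 entries
    pad = np.zeros(n - i)                               # n - i zeros on each side
    u = np.concatenate([pad, tri, pad]).astype(float)
    v = np.concatenate([pad, flat, pad])
    # Assumption of this sketch: renormalize so each vector is a valid pmf.
    return u / u.sum(), v / v.sum()

def sample_feature_vector(label, n, rng):
    """Draw y^n = (y_1, ..., y_n) with y_i in {-n, ..., n}; label is 1 or 2."""
    symbols = np.arange(-n, n + 1)     # assumed mapping: entry n + y holds symbol y
    y = np.empty(n, dtype=int)
    for i in range(1, n + 1):
        u, v = overlapping_pmfs(i, n)
        y[i - 1] = rng.choice(symbols, p=u if label == 1 else v)
    return y
```

With this mapping, the $i$-th element can only take values in $\{-i, \ldots, i\}$, so the supports of successive elements are nested, which is what makes this the "overlapping" case.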

The simulation is carried out as follows. For each *n*, we generate one training feature vector for each label. Then, we generate 1000 feature vectors for each label and pass them through our proposed classifier, the naive Bayes classifier, and the KNN algorithm and calculate the error probability. We change the length of the feature vector from *n* = 1 to *n* = 100 and repeat the simulation 1000 times and then plot the error probability.
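The decision rule implied by (A12) in Appendix A, namely choosing the label whose training vectors have the empirical distribution closest in $r$-norm to that of the observed vector, can be sketched as follows. The function names, the pooling of the $T$ training vectors of a label into one empirical distribution, and the caller-supplied `sample` function are assumptions of this sketch, not the authors' implementation.

```python
import numpy as np

def empirical_pmf(values, alphabet):
    """Fraction of entries of `values` equal to each symbol in `alphabet`."""
    values = np.asarray(values).ravel()
    counts = np.array([(values == a).sum() for a in alphabet], dtype=float)
    return counts / values.size

def classify(y, training, alphabet, r=2.0):
    """Return the index of the label minimizing ||P_y - P_train||_r.

    `training[l]` holds the training vectors of label l (shape (T, n));
    they are pooled into one empirical distribution per label.
    """
    p_test = empirical_pmf(y, alphabet)
    dists = [np.linalg.norm(p_test - empirical_pmf(tr, alphabet), ord=r)
             for tr in training]
    return int(np.argmin(dists))

def error_rate(sample, training, alphabet, labels, trials, r=2.0):
    """Monte Carlo estimate of the error probability.

    `sample(label)` is a caller-supplied function that draws one test vector.
    """
    errors = 0
    for _ in range(trials):
        for idx, label in enumerate(labels):
            errors += classify(sample(label), training, alphabet, r) != idx
    return errors / (trials * len(labels))
```

Sweeping `error_rate` over the vector length *n* with freshly drawn training vectors at each repetition would mirror the procedure described above.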

As shown in Figure 8, there is a huge difference between the performance of the two benchmark classifiers and the proposed classifier. The error probability of the naive Bayes classifier is almost 0.5 for all shown values of *n* as it is susceptible to the problem of unseen alphabets in the training vectors. The error probability of the KNN classifier is also almost 0.5 for *n* > 20 as it is susceptible to the "curse of dimensionality". However, the error probability of our proposed classifier continuously decays as *n* increases.
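The unseen-alphabet problem can be illustrated with a small sketch. It assumes a naive Bayes classifier that estimates a separate categorical distribution for each position by relative frequency, without smoothing; the configuration of the benchmark used in the experiments is not stated in this section, so this is only an illustration of the mechanism.

```python
import numpy as np

def unsmoothed_nb_likelihood(y, train_vectors):
    """Product over positions i of the empirical frequency of y[i] at position i.

    Hypothetical helper: per-position maximum-likelihood estimates, no smoothing.
    """
    train = np.atleast_2d(np.asarray(train_vectors))   # shape (T, n)
    likelihood = 1.0
    for i, symbol in enumerate(y):
        likelihood *= float(np.mean(train[:, i] == symbol))
    return likelihood

# One training vector per label (T = 1), as in the setting of Figure 8.
train_x1 = [0, 1, -1]
train_x2 = [0, -2, 2]
test_y = [0, 0, 1]   # positions 2 and 3 carry symbols unseen in training

like_x1 = unsmoothed_nb_likelihood(test_y, train_x1)
like_x2 = unsmoothed_nb_likelihood(test_y, train_x2)
```

Both likelihoods are zero here, so the classifier has no basis for preferring either label. With a single training vector per label this situation becomes increasingly likely as the alphabet grows with *n*.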

**Figure 8.** Comparison in error probability between the naive Bayes classifier, KNN, and the proposed classifier (*T* = 1).

In Figure 9, we run the same experiments as in Figure 8 but with *T* = 100, i.e., 100 training feature vectors per label. As can be seen from Figure 9, the proposed classifier performs better than the naive Bayes classifier for $n > 15$. Since $|\mathcal{Y}| = 2n + 1$, for small values of *n* the naive Bayes classifier has access to many training samples and, thereby, its performance is very close to the case when the probability distribution $p_{Y^n|X}(y^n|x)$ is known, i.e., to the maximum-likelihood classifier, which has the optimal performance. As *n* increases, the number of alphabets $|\mathcal{Y}|$ rises and, due to the aforementioned issue of the naive Bayes classifier with unseen alphabets, our proposed classifier performs much better than the naive Bayes classifier. Furthermore, note that the error probability of our proposed classifier decays exponentially as *n* increases, which is not the case with the naive Bayes classifier. Moreover, Figure 9 also shows the theoretical upper bound on the error probability that we derived in (11).

**Figure 9.** Comparison in error probability between the naive Bayes classifier and the proposed classifier (*T* = 100).

#### *4.3. The Non-Overlapping I.Non-I.D. Case with One Training Sample for Each Label*

In this example, we consider the i.non-i.d. case where the probability distributions $p_j(y_j|x_i)$ are non-overlapping for all $j$, as shown in Figure 10, with "overlapping" as defined in Section 4.2. Hence, we test the other extreme in terms of the possible distribution of the elements in the feature vectors $Y^n$.

To demonstrate the performance of our proposed classifier in the non-overlapping case, we assume that we have two different labels, $\mathcal{X} = \{x_1, x_2\}$, where the corresponding conditional probability distributions $p_i(y_i|x_1)$ and $p_i(y_i|x_2)$ are obtained as follows. For a given $n$, let $\mathcal{Y} = \{1, 2, 3, \ldots, (n+1)^2 - 1\}$ be the set of all alphabets of the elements in the feature vectors. Note again that the size of $\mathcal{Y}$ grows with $n$. In addition, let $\mathbf{u}_i$ and $\mathbf{v}_i$, for $1 \le i \le n$, be vectors of length $(n+1)^2 - 1$, given by

$$\mathbf{u}\_{i} = \left[ 0, \dots, 0, \frac{1}{i(i+1)}, \frac{2}{i(i+1)}, \dots, \frac{i}{i(i+1)}, \frac{i+1}{i(i+1)}, \frac{i}{i(i+1)}, \dots, \frac{1}{i(i+1)}, 0, \dots, 0 \right],\tag{52}$$

$$\mathbf{v}_{i} = \left[ 0, \dots, 0, \frac{1}{i(i+1)}, \frac{1}{i(i+1)}, \dots, \frac{1}{i(i+1)}, \frac{1}{i(i+1)}, 0, \dots, 0 \right]. \tag{53}$$

The number of zeros on the left-hand side of $\mathbf{u}_i$ and $\mathbf{v}_i$ is $i^2 - 1$. To generate a feature vector from the label $x_1$ ($x_2$), we generate the vector $y^n = (y_1, y_2, \ldots, y_n)$, where $y_i$ takes values from the set $\mathcal{Y}$ with probability distribution $p_i(y_i|x_1) = \mathbf{u}_i(y_i)$ $\big(p_i(y_i|x_2) = \mathbf{v}_i(y_i)\big)$.
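The following sketch builds the vectors of the non-overlapping case and exposes their supports, so that the disjointness across positions can be checked directly. The function names are hypothetical, and the renormalization of each vector to sum to one is an assumption of this sketch.

```python
import numpy as np

def nonoverlapping_pmfs(i, n):
    """Return (u_i, v_i) of length (n+1)^2 - 1 with i^2 - 1 leading zeros."""
    if not 1 <= i <= n:
        raise ValueError("need 1 <= i <= n")
    length = (n + 1) ** 2 - 1
    ramp = np.arange(1, i + 1)
    tri = np.concatenate([ramp, [i + 1], ramp[::-1]]).astype(float)
    flat = np.ones(2 * i + 1)
    u = np.zeros(length)
    v = np.zeros(length)
    start = i * i - 1                      # number of zeros on the left
    u[start:start + 2 * i + 1] = tri
    v[start:start + 2 * i + 1] = flat
    # Assumption of this sketch: renormalize so each vector is a valid pmf.
    return u / u.sum(), v / v.sum()

def support(pmf):
    """Symbols (1-based, as in Y = {1, ..., (n+1)^2 - 1}) with positive mass."""
    return set((np.flatnonzero(pmf) + 1).tolist())
```

Under this construction the $i$-th element takes values only in $\{i^2, \ldots, (i+1)^2 - 1\}$, so no symbol can appear at two different positions.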

**Figure 10.** Illustration of the probability distributions *pi*(*yi*|*x*1) (upper figure) and *pi*(*yi*|*x*2) (lower figure), for *i* = 1, 2, . . . , *n*.

The simulation is carried out as follows. For each *n*, we generate one training feature vector for each label. Then, we generate 250 feature vectors for each label, pass them through our proposed classifier, the naive Bayes classifier, and the KNN algorithm, and calculate the error probabilities. We change the length of the feature vector from *n* = 1 to *n* = 80, repeat the simulation 250 times, and then plot the error probability. As shown in Figure 11, there is a large difference between the performance of the proposed classifier and that of the two benchmark classifiers. The error probability of the naive Bayes classifier is almost 0.5 for all shown values of *n*, as it is susceptible to the issue of unseen alphabets in the training feature vector. The error probability of the KNN classifier is almost 0.5 for *n* > 30, as it becomes susceptible to the "curse of dimensionality". However, the error probability of our proposed classifier still decays continuously as *n* increases.

Note that, in our numerical examples, we compared our algorithm with the benchmark schemes in two extreme cases of i.non-i.d. vectors, referred to as "overlapping" and "non-overlapping". Any other i.non-i.d. vector can be represented as a combination of the "overlapping" and "non-overlapping" vectors. Since our algorithm works better than the benchmark schemes for small *T* in both of these cases, it will work better than the benchmark schemes for any combination of "overlapping" and "non-overlapping" vectors, i.e., for any other i.non-i.d. vectors.

**Figure 11.** Comparison in error probability between the naive Bayes classifier and the proposed classifier (*T* = 1).

#### **5. Conclusions**

In this paper, we proposed a supervised classification algorithm that assigns labels to input feature vectors with independent but non-identically distributed elements, a statistical property found in practice. We proved that the proposed classifier is asymptotically optimal, since the error probability tends to zero as the length of the input feature vectors grows. We showed that this asymptotic optimality is achievable even when only one training feature vector per label is available. In the numerical examples, we compared the proposed classifier with the naive Bayes classifier and the KNN algorithm. Our numerical results show that the proposed classifier outperforms the benchmark classifiers when the number of training samples is small and the length of the input feature vectors is sufficiently large.

**Author Contributions:** Methodology, F.S. and N.Z.; software, F.S.; formal analysis, F.S. and N.Z.; investigation, F.S.; supervision, N.Z.; writing—original draft preparation, F.S.; writing—review and editing, N.Z. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Data sharing is not applicable to this article as no new data were created or analyzed in this study.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A. Proof of Corollary 2**

The proof is almost identical to the proof of Theorem 1; however, here we derive a looser upper bound on the error probability than that in (11), which is independent of $P_{\hat{y}_i^{nt}}$.

Without loss of generality, we assume that $x_1$ is the input to $p_{Y^n|X}(y^n|x)$ and $y^n$ is observed at the classifier.

Let $\mathcal{B}_{k,l}^{\epsilon}$, for $1 \le k \le |\mathcal{Y}|$ and $1 \le l \le |\mathcal{X}|$, be a set defined as

$$\mathcal{B}_{k,l}^{\epsilon} = \left\{ \hat{y}^{nt} : \left| \frac{\mathcal{I}[\hat{y}^{nt} = y_k]}{nt} - \bar{p}(y_k|x_l) \right| \le \frac{\epsilon}{\sqrt[3]{t}} \right\}. \tag{A1}$$

Let $\mathcal{B}_{l}^{\epsilon} = \bigcap_{k=1}^{|\mathcal{Y}|} \mathcal{B}_{k,l}^{\epsilon}$. For $\hat{y}_1^{nt} \in \mathcal{B}_1^{\epsilon}$, we have

$$\left(\sum_{k=1}^{|\mathcal{Y}|} \left|\frac{\mathcal{I}[\hat{y}_1^{nt} = y_k]}{nt} - \bar{p}(y_k|x_1)\right|^{r}\right)^{1/r} \stackrel{(i)}{\leq} \left(\sum_{k=1}^{|\mathcal{Y}|} \left(\frac{\epsilon}{\sqrt[3]{t}}\right)^{r}\right)^{1/r}, \tag{A2}$$

where (*i*) follows from (A1).

Using the same derivation as (18), for any $y^n \in \mathcal{A}^{\epsilon}$ and for $\hat{y}_1^{nt} \in \mathcal{B}_1^{\epsilon}$, we have:

$$\left(\sum_{k=1}^{|\mathcal{Y}|} \left| \frac{\mathcal{I}[y^{n} = y_k]}{n} - \frac{\mathcal{I}[\hat{y}_1^{nt} = y_k]}{nt} \right|^r \right)^{1/r} \le |\mathcal{Y}|^{1/r} \epsilon + |\mathcal{Y}|^{1/r} \frac{\epsilon}{\sqrt[3]{t}}.\tag{A3}$$

On the other hand, following the derivation in (21), for each $i \neq 1$, we have:

$$\left(\sum_{k=1}^{|\mathcal{Y}|} \left| \frac{\mathcal{I}[y^n = y_k]}{n} - \frac{\mathcal{I}[\hat{y}_i^{nt} = y_k]}{nt} \right|^r \right)^{1/r} \ge \left\| P_{\hat{y}_i^{nt}} - \bar{\mathbf{P}}_1 \right\|_r - |\mathcal{Y}|^{1/r} \epsilon. \tag{A4}$$

Now, for any $\hat{y}_i^{nt} \in \mathcal{B}_i^{\epsilon}$, we have

$$\begin{split} & \left\| P_{\hat{y}_i^{nt}} - \bar{\mathbf{P}}_1 \right\|_r + \left( \sum_{k=1}^{|\mathcal{Y}|} \left( \frac{\epsilon}{\sqrt[3]{t}} \right)^{r} \right)^{1/r} \\ & \stackrel{(i)}{\geq} \left( \sum_{k=1}^{|\mathcal{Y}|} \left| \frac{\mathcal{I}[\hat{y}_i^{nt} = y_k]}{nt} - \bar{p}(y_k|x_1) \right|^r \right)^{1/r} + \left( \sum_{k=1}^{|\mathcal{Y}|} \left| \frac{\mathcal{I}[\hat{y}_i^{nt} = y_k]}{nt} - \bar{p}(y_k|x_i) \right|^r \right)^{1/r} \\ & \stackrel{(ii)}{\geq} \left( \sum_{k=1}^{|\mathcal{Y}|} \left| \bar{p}(y_k|x_1) - \bar{p}(y_k|x_i) \right|^r \right)^{1/r}, \end{split} \tag{A5}$$

where (*i*) follows from (A1) and (*ii*) is again due to the Minkowski inequality. The expression in (A5) can be written equivalently as

$$\left\| P_{\hat{y}_i^{nt}} - \bar{\mathbf{P}}_1 \right\|_r \geq \left\| \bar{\mathbf{P}}_i - \bar{\mathbf{P}}_1 \right\|_r - |\mathcal{Y}|^{1/r} \frac{\epsilon}{\sqrt[3]{t}}, \tag{A6}$$

where $i \neq 1$. Using the bounds in (A6) and (A4), for any $i \neq 1$ we have

$$\left(\sum_{k=1}^{|\mathcal{Y}|} \left| \frac{\mathcal{I}[y^n = y_k]}{n} - \frac{\mathcal{I}[\hat{y}_i^{nt} = y_k]}{nt} \right|^r \right)^{1/r} \ge \left\| \bar{\mathbf{P}}_i - \bar{\mathbf{P}}_1 \right\|_r - |\mathcal{Y}|^{1/r} \epsilon \left(1 + \frac{1}{\sqrt[3]{t}}\right). \tag{A7}$$

Using the bounds in (A3) and (A7), we now relate the left-hand sides of (A3) and (A7) as follows. As long as the following inequality holds for each $i \neq 1$,

$$|\mathcal{Y}|^{1/r} \epsilon \left(1+\frac{1}{\sqrt[3]{t}}\right) < \left\|\bar{\mathbf{P}}_i - \bar{\mathbf{P}}_1\right\|_r - |\mathcal{Y}|^{1/r} \epsilon \left(1+\frac{1}{\sqrt[3]{t}}\right),\tag{A8}$$

which is equivalent to the following for $i \neq 1$,

$$\epsilon < \frac{\left\|\bar{\mathbf{P}}_i - \bar{\mathbf{P}}_1\right\|_r}{2(1 + t^{-1/3})|\mathcal{Y}|^{1/r}},\tag{A9}$$

we have the following for $i \neq 1$:

$$\begin{split} \left(\sum_{k=1}^{|\mathcal{Y}|} \left|\frac{\mathcal{I}[y^n = y_k]}{n} - \frac{\mathcal{I}[\hat{y}_1^{nt} = y_k]}{nt}\right|^r\right)^{1/r} &\stackrel{(a)}{\leq} |\mathcal{Y}|^{1/r} \epsilon \left(1 + \frac{1}{\sqrt[3]{t}}\right) \\ &\stackrel{(b)}{\leq} \left\|\bar{\mathbf{P}}_i - \bar{\mathbf{P}}_1\right\|_r - |\mathcal{Y}|^{1/r} \epsilon \left(1 + \frac{1}{\sqrt[3]{t}}\right) \\ &\stackrel{(c)}{\leq} \left(\sum_{k=1}^{|\mathcal{Y}|} \left|\frac{\mathcal{I}[y^n = y_k]}{n} - \frac{\mathcal{I}[\hat{y}_i^{nt} = y_k]}{nt}\right|^r\right)^{1/r}, \end{split} \tag{A10}$$

where (*a*), (*b*), and (*c*) follow from (A3), (A8), and (A7), respectively. Thereby, from (A10), we have the following for $i \neq 1$:

$$\left(\sum_{k=1}^{|\mathcal{Y}|} \left| \frac{\mathcal{I}[y^n = y_k]}{n} - \frac{\mathcal{I}[\hat{y}_1^{nt} = y_k]}{nt} \right|^r \right)^{1/r} \le \left( \sum_{k=1}^{|\mathcal{Y}|} \left| \frac{\mathcal{I}[y^n = y_k]}{n} - \frac{\mathcal{I}[\hat{y}_i^{nt} = y_k]}{nt} \right|^r \right)^{1/r}, \tag{A11}$$

or, equivalently,

$$\left\| P_{y^n} - P_{\hat{y}_1^{nt}} \right\|_{r} \leq \left\| P_{y^n} - P_{\hat{y}_i^{nt}} \right\|_{r}.\tag{A12}$$

Once again, we obtain that if there is an $\epsilon$ for which (A9) holds for $i \neq 1$, and for that $\epsilon$ there are sets $\mathcal{A}^{\epsilon}$ and $\mathcal{B}_l^{\epsilon}$ for which $y^n \in \mathcal{A}^{\epsilon}$ and $\hat{y}_l^{nt} \in \mathcal{B}_l^{\epsilon}$ for all $1 \le l \le |\mathcal{X}|$, then (A12) holds for $i \neq 1$, and thereby our classifier will detect that $x_1$ is the correct label. Using this, we can upper-bound the error probability as

$$\begin{split} \mathbb{P}_{\mathrm{e}} &= 1 - \Pr\{\hat{x}_{1} = x_{1}\} \\ &\leq 1 - \Pr\left\{ (y^{n} \in \mathcal{A}^{\epsilon}) \cap \left( \bigcap_{l=1}^{|\mathcal{X}|} \hat{y}_{l}^{nt} \in \mathcal{B}_{l}^{\epsilon} \right) \middle| \epsilon \in \mathcal{S} \right\}, \end{split} \tag{A13}$$

where S is a set defined as

$$\mathcal{S} = \left\{ \epsilon : \epsilon \le \min_{i \neq 1} \frac{\left\|\bar{\mathbf{P}}_{i} - \bar{\mathbf{P}}_{1}\right\|_{r}}{2(1 + t^{-1/3}) |\mathcal{Y}|^{1/r}} \right\}. \tag{A14}$$

The right-hand side of (A13) can be upper-bounded as

$$\begin{split} 1 - \Pr\left\{ (y^n \in \mathcal{A}^{\epsilon}) \cap \left( \bigcap_{l=1}^{|\mathcal{X}|} \hat{y}_l^{nt} \in \mathcal{B}_l^{\epsilon} \right) \middle| \epsilon \in \mathcal{S} \right\} &= \Pr\left\{ (y^n \notin \mathcal{A}^{\epsilon}) \cup \left( \bigcup_{l=1}^{|\mathcal{X}|} \hat{y}_l^{nt} \notin \mathcal{B}_l^{\epsilon} \right) \middle| \epsilon \in \mathcal{S} \right\} \\ &\stackrel{(a)}{\leq} \Pr\left\{ y^n \notin \mathcal{A}^{\epsilon} \middle| \epsilon \in \mathcal{S} \right\} + \sum_{l=1}^{|\mathcal{X}|} \Pr\left\{ \hat{y}_l^{nt} \notin \mathcal{B}_l^{\epsilon} \middle| \epsilon \in \mathcal{S} \right\}, \end{split} \tag{A15}$$

where (*a*) follows from the union bound.

Using the same derivation as (39), we have:

$$\Pr\left\{y^n \notin \mathcal{A}^{\epsilon} \middle| \epsilon \in \mathcal{S}\right\} \le 2|\mathcal{Y}|\mathrm{e}^{-2n\epsilon^2}.\tag{A16}$$

Similarly, following the derivation in (43), we have the following result for the second expression on the right-hand side of (A15):

$$\Pr\left\{\hat{y}_{l}^{nt} \notin \mathcal{B}_{l}^{\epsilon} \middle| \epsilon \in \mathcal{S}\right\} \leq 2|\mathcal{Y}| \mathrm{e}^{-2nt^{1/3}\epsilon^{2}}.\tag{A17}$$
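The exponent in (A17) can be checked directly. The following worked step assumes that the derivation in (43) rests on Hoeffding's inequality applied to the $nt$ independent indicator variables that make up the symbol count, with the relative frequency having mean $\bar{p}(y_k|x_l)$; this is an assumption of the sketch, since (43) is not restated in this appendix. For a single symbol $y_k$ and a deviation $\delta$,

$$\begin{aligned} \Pr\left\{ \left| \frac{\mathcal{I}[\hat{y}_l^{nt} = y_k]}{nt} - \bar{p}(y_k|x_l) \right| > \delta \right\} &\le 2\mathrm{e}^{-2nt\delta^{2}}, \\ \delta = \frac{\epsilon}{\sqrt[3]{t}} \;\Longrightarrow\; 2\mathrm{e}^{-2nt\,\epsilon^{2} t^{-2/3}} &= 2\mathrm{e}^{-2nt^{1/3}\epsilon^{2}}. \end{aligned}$$

A union bound over the $|\mathcal{Y}|$ symbols then gives the factor $2|\mathcal{Y}|$. The tolerance $\epsilon/\sqrt[3]{t}$ in (A1) therefore shrinks with the number of training vectors while the exponent still grows as $t^{1/3}$.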

Inserting (A16) and (A17) into (A15), and then inserting (A15) into (A13), we obtain the following upper-bound for the error probability

$$\mathbb{P}_{\mathrm{e}} \le 2|\mathcal{Y}| \mathrm{e}^{-2n\epsilon^{2}} + 2|\mathcal{X}||\mathcal{Y}| \mathrm{e}^{-2nt^{1/3}\epsilon^{2}},\tag{A18}$$

where

$$\epsilon = \min_{\substack{i,j\\i\neq j}} \frac{\left\|\bar{\mathbf{P}}_i - \bar{\mathbf{P}}_j\right\|_r}{2(1 + t^{-1/3}) |\mathcal{Y}|^{1/r}}. \tag{A19}$$
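The bound in (A18), with $\epsilon$ chosen as in (A19), is straightforward to evaluate numerically. The sketch below is illustrative only: the function name is hypothetical, and the input `P_bar` is assumed to be a matrix whose row $i$ holds the distribution $\bar{\mathbf{P}}_i$ over the $|\mathcal{Y}|$ alphabets.

```python
import numpy as np

def error_bound(P_bar, n, t, r):
    """Evaluate the right-hand side of (A18) with epsilon chosen as in (A19).

    P_bar: array of shape (|X|, |Y|); row i is assumed to hold P_bar_i.
    Returns the pair (epsilon, bound).
    """
    P_bar = np.asarray(P_bar, dtype=float)
    num_labels, alphabet_size = P_bar.shape
    # Smallest r-norm distance between the distributions of two different labels.
    min_dist = min(
        np.linalg.norm(P_bar[i] - P_bar[j], ord=r)
        for i in range(num_labels) for j in range(i + 1, num_labels)
    )
    eps = min_dist / (2 * (1 + t ** (-1 / 3)) * alphabet_size ** (1 / r))
    bound = (2 * alphabet_size * np.exp(-2 * n * eps ** 2)
             + 2 * num_labels * alphabet_size
             * np.exp(-2 * n * t ** (1 / 3) * eps ** 2))
    return eps, bound
```

For small *n* the returned value can exceed one, in which case the bound is uninformative; it becomes useful once $n\epsilon^2$ is large relative to $\ln|\mathcal{Y}|$.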

Now, if $|\mathcal{Y}| \le n^m$, (A18) can be written as

$$\begin{split} \mathbb{P}_{\mathrm{e}} &\leq 2|\mathcal{Y}|\mathrm{e}^{-2n\epsilon^{2}} + 2|\mathcal{X}||\mathcal{Y}|\mathrm{e}^{-2nt^{1/3}\epsilon^{2}} \\ &\leq 2n^{m}\exp\left(-2n\min_{\substack{i,j\\i\neq j}}\frac{\left\|\bar{\mathbf{P}}_{i}-\bar{\mathbf{P}}_{j}\right\|_{r}^{2}}{4(1+t^{-1/3})^{2}n^{2m/r}}\right) \\ &\quad + 2|\mathcal{X}|n^{m}\exp\left(-2nt^{1/3}\min_{\substack{i,j\\i\neq j}}\frac{\left\|\bar{\mathbf{P}}_{i}-\bar{\mathbf{P}}_{j}\right\|_{r}^{2}}{4(1+t^{-1/3})^{2}n^{2m/r}}\right) \\ &\leq \mathcal{O}\left(n^{m}\exp\left(-n^{1-\frac{2m}{r}}\right)\right). \end{split} \tag{A20}$$

According to (A20), for a fixed *r* > 2*m*, the right-hand side of (A20) tends to zero as *n* → ∞ and, thereby, the classifier is asymptotically optimal.
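The role of the condition *r* > 2*m* can be seen by evaluating the order term of (A20) on a logarithmic scale. The helper below is a hypothetical illustration that ignores the constants hidden in the $\mathcal{O}(\cdot)$ notation.

```python
import math

def log_order_term(n, m, r):
    """Natural log of n^m * exp(-n^(1 - 2m/r)), the order term in (A20)."""
    return m * math.log(n) - n ** (1 - 2 * m / r)

# r > 2m: the exponential factor dominates and the term decays with n.
decaying = [log_order_term(n, m=1, r=4) for n in (10**2, 10**4, 10**6)]

# r < 2m: the exponent 1 - 2m/r is negative and the term grows with n.
growing = [log_order_term(n, m=1, r=1) for n in (10**2, 10**4, 10**6)]
```

Working with the logarithm avoids numerical underflow for large *n* and makes the competition between the polynomial factor $n^m$ and the exponential factor explicit.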



MDPI St. Alban-Anlage 66 4052 Basel Switzerland Tel. +41 61 683 77 34 Fax +41 61 302 89 18 www.mdpi.com

*Entropy* Editorial Office E-mail: entropy@mdpi.com www.mdpi.com/journal/entropy
