Abstract
We consider a binary classification problem in which a test sequence must be attributed to one of two sources. The system classifies the test sequence based on empirically observed (training) sequences obtained from the two unknown sources. We analyze the asymptotic fundamental limits of statistical classification for sources with multiple subclasses. We investigate the first- and second-order maximum error exponents under the constraint that the type-I error probability for all pairs of distributions decays exponentially fast and the type-II error probability is upper bounded by a small constant. In this paper, we first give a classifier which achieves the asymptotically maximum error exponent in the class of deterministic classifiers for sources with multiple subclasses, and then provide a characterization of the first-order error exponent. We next provide a characterization of the second-order error exponent in the case where only one of the two sources has multiple subclasses. Finally, we generalize our results to classification in the case where one source is stationary and memoryless and the other is a mixed memoryless source with general mixture.
1. Introduction
1.1. Background
The problem of learning sources from training sequences and estimating the source from which a test sequence is generated is known as a classification problem. Recently, this problem has been actively studied in fields such as machine learning, and it is desirable to conduct studies that guarantee the performance of such systems. In the field of information theory, studies have been conducted mainly to analyze the performance of optimal tests. When the number of sources is two, the binary classification problem can be regarded as a binary hypothesis testing problem with training sequences. In the setting of binary hypothesis testing, the sources are assumed to be known, but in real-world applications the sources are generally unknown. It is therefore important to consider the binary classification problem.
Hypothesis testing includes approaches such as the Bayesian test [1,2] and the Neyman–Pearson test [3,4,5]. In this paper, we take the latter approach to formulate the best asymptotic error exponent (the exponential part of an error probability).
Many studies are related to the classification problem; here we highlight the previous results most closely connected to this work. Gutman [3] has shown that type-based (empirical distribution-based) tests asymptotically achieve the maximum type-II error exponent for stationary Markov sources, while the type-I error probability converges to zero exponentially as the length of a test sequence goes to infinity. Zhou et al. [5] derived second-order approximations of the maximum type-I error exponent for stationary and memoryless sources when the type-II error probability is upper bounded by a small constant. On the other hand, for the hypothesis testing problem, Han and Nomura [6] characterized the first-order maximum error exponent when each sequence is generated from a mixed memoryless source, which is a mixture of stationary and memoryless sources. In addition, they also characterized the second-order maximum error exponent in the case where one source is a stationary and memoryless source and the other is a mixed memoryless source.
1.2. Contributions
In this paper, we investigate the binary classification problem for stationary memoryless sources with multiple subclasses. The class of sources with multiple subclasses is important in binary classification because there are many such settings in real-world applications. For example, newspaper articles with science headlines consist of topics of physics, chemistry, biology, etc. We assume that sources (subclasses) are characterized by a mixture with some unknown prior distribution (cf. Equation (2)), and the overall sources can be regarded as mixed memoryless sources [6]. The purpose of this paper is to characterize the first- and second-order maximum error exponents in a single-letter form (the term “single-letter form” means an expression which does not depend on lengths of sequences n or N (cf. the formulas for error exponents in Theorems 2–4)).
To this end, we generalize Gutman’s classifier [3], which was shown to be first- and second-order optimal for memoryless sources (with no multiple subclasses) in [5]. This classifier uses training sequences from only one of the two sources, as in [3,4,5], making a type-based decision for the source (subclass) with the smallest skewed Jensen–Shannon divergence [7] among the subclasses. We show in Theorem 1 that this classifier asymptotically achieves the maximum type-II error exponent in the class of deterministic classifiers for a given pair of distributions when the type-I error probability decays exponentially fast for all pairs of distributions. We also demonstrate in Theorem 2 that the structure of this classifier leads to a reversed and more relaxed relation: the maximization of the type-I error exponent when the type-II error probability is upper bounded by a small constant for sources with multiple subclasses. In addition, using the Berry–Esseen theorem [8], we derive in Theorem 3 the second-order maximum error exponent in the case where only one of the sources has subclasses. Finally, the fact that the classifier uses the training sequences from only one of the two sources motivates us to consider a more general case: the first source has no multiple subclasses, while the second source is given by a general mixture [6,9]. That is, the number of subclasses is not necessarily finite, and the prior distribution of subclasses may not be discrete (cf. Equation (75)). We give characterizations of the first- and second-order maximum error exponents in Theorem 4.
1.3. Related Work
Ziv [10] proposed a classifier based on empirical entropy and discussed the relationship between binary classification and universal source coding. Hsu and Wang [4] characterized the maximum error exponent with mismatched empirically observed statistics. In their achievability proof, a generalization of Gutman’s classifier is also used. Kelly et al. [11] investigated binary classification with large alphabets. Unnikrishnan and Huang [12] investigated the type-I error probability of binary classification using the analysis of weak convergence. Generalizing the binary classification problem, He et al. [13] discussed the binary distribution detection problem, in which a different generalization of Gutman’s classifier is also discussed.
There are also studies which take a Bayesian approach. Merhav and Ziv [1] analyzed the weighted sum of type-I and type-II error probabilities, and subsequently, Saito and Matsushima [2,14] gave a different result via the analysis for the Bayes code.
1.4. Organization
The rest of this paper is organized as follows: In Section 2, we define the notation used in this paper and describe the details of the source and system model. Moreover, we state the problem setting, defining the first- and second-order maximum error exponents. In Section 3, we first give a classifier which achieves the asymptotically maximum error exponent in the class of deterministic classifiers for sources with multiple subclasses. Next, we characterize the first- and second-order maximum error exponents and give detailed proofs for the first-order characterization. In Section 4, we generalize the obtained results to the classification of a mixed memoryless source with general mixture. In Section 5, we present numerical examples. Finally, in Section 6, we provide some concluding remarks and future work.
2. Problem Formulation
2.1. Notation
The set of non-negative real numbers is denoted by . Calligraphic stands for a finite alphabet. Upper-case X denotes a random variable taking values in , and lower-case denotes its realization. Throughout this paper, logarithms are of base e. For integers a and b such that denotes the set . The set of all probability distributions on a finite set is denoted as . Notation regarding the method of types [15] is as follows: Given a vector , the type is denoted as
The set of types formed from length-n sequences with alphabet is denoted as . The probability that n independent drawings from a probability distribution give is denoted by .
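As an illustrative sketch of the method of types (the function and variable names below are ours, not the paper's), the type of a sequence is simply its empirical distribution:

```python
from collections import Counter

def empirical_type(sequence, alphabet):
    """Return the type (empirical distribution) of a sequence over a finite
    alphabet: each symbol a gets probability count(a) / n."""
    counts = Counter(sequence)
    n = len(sequence)
    return {a: counts.get(a, 0) / n for a in alphabet}

# A length-8 binary sequence: three zeros and five ones.
t = empirical_type([0, 1, 1, 0, 1, 1, 1, 0], alphabet=[0, 1])
```

Here `t` assigns probability 3/8 to symbol 0 and 5/8 to symbol 1; all type-based quantities in this paper are functions of such empirical distributions.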
2.2. Source with Multiple Subclasses
Consider a source consisting of multiple subclasses. Each subclass is distributed according to a given probability (weight). Let be a family of probability distributions on a finite alphabet , where is a probability space with probability measure . That is, the probability of is given by
where the i-th subclass is a stationary and memoryless source. That is, for ,
(for notational simplicity, we denote both the multi-letter and single-letter probabilities by with a slight abuse of notation).
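The two-stage structure of such a source can be sketched as follows, assuming finitely many subclasses with discrete weights; all names and numbers below are illustrative, not the paper's:

```python
import random

def sample_mixed_memoryless(subclass_dists, weights, n, rng=None):
    """Draw one length-n sequence from a mixed memoryless source: first pick
    a subclass i with probability w_i, then draw n i.i.d. symbols from that
    subclass distribution (the subclass stays fixed for the whole sequence)."""
    rng = rng or random.Random(0)
    i = rng.choices(range(len(weights)), weights=weights)[0]
    symbols, probs = zip(*subclass_dists[i].items())
    return [rng.choices(symbols, weights=probs)[0] for _ in range(n)]

# Two Bernoulli-like subclasses mixed with equal weights (hypothetical values).
seq = sample_mixed_memoryless(
    [{0: 0.9, 1: 0.1}, {0: 0.2, 1: 0.8}], weights=[0.5, 0.5], n=10)
```

Note the contrast with drawing a fresh subclass per symbol: here the entire sequence is memoryless conditionally on the subclass, which is what makes the overall source a mixture of stationary and memoryless sources.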
In view of (2), the sequence can be regarded as an output from a mixed memoryless source , and it is called a test sequence. Similarly, let be a family of probability distributions, where is a probability space with probability measure . For these mixed sources, if the sources are known, that is, the addressed problem is hypothesis testing, the first- and second-order error exponents were analyzed by Han and Nomura [6]. In this paper, we assume that the sources are unknown and training sequences are available to learn about the source. Sets of training sequences are denoted by and , where of length N is output from subclass j and for some fixed . Then, the joint probabilities of training sequences are, respectively,
We define the class of sources with multiple subclasses on a probability space with probability measure as
which means , where the set of weights is implicitly fixed. Similarly, we define the class of sources with multiple subclasses on a probability space with probability measure as
which means .
2.3. System Model
The binary classification problem assumed in this paper is shown in Figure 1. It consists of two phases: (I) learning phase and (II) classification phase. We explain the details of each phase.
Figure 1.
System model.
(I) Learning phase: Determine the classifier by learning from the training sequences generated from the two unknown sources.
(II) Classification phase: Judge from which of the two sources the test sequence was generated, according to the classifier determined in (I).
2.4. Maximum Error Exponent
In this section, we define two error probabilities that arise in a binary classification problem and formulate the maximum error exponents. In the binary classification problem, a test is described as a partition of the space . The type-I and type-II error probabilities of a given test are denoted as and , respectively. That is,
Here, is the joint probability of training and testing sequences when the underlying parameter is , given by
We consider the problem of maximizing the type-I error exponent when the type-II error probability is upper bounded by a small constant . In this study, we characterize the following quantities (the first- and second-order maximum type-I error exponent).
Definition 1
(First-order maximum error exponent). For any pair of distributions and , we define
where the weights of are the same as the weights of .
Definition 2
(Second-order maximum error exponent). For any pair of distributions and , we define
where the weights of are the same as the weights of .
Remark 1.
In Definitions 1 and 2, we focus on universal tests that perform well for all pairs of distributions with respect to the type-I error probability, and at the same time, constrain the type-II error probability with respect to a particular pair of distributions. We obtain the same result when the weights of are not fixed.
3. Main Result
3.1. A Test to Achieve Maximum Error Exponent
The rule for estimating from which of the two sources a test sequence was generated is called a decision rule. One of the goals of the classification problem is to design an optimal decision rule which achieves a maximum error exponent based on training sequences. In this section, we present a decision rule that asymptotically achieves the maximum type-II error exponent for any pair of distributions when the type-I error exponent is lower bounded by a constant for all pairs of distributions (cf. Theorem 1).
To define a test that is asymptotically optimum, we define two generalizations of the Jensen–Shannon divergence. These generalizations are related to some variational definitions in [7,16]. For any pair of distributions and any number , let the generalized Jensen–Shannon divergence be
where denotes the Kullback–Leibler divergence for and defined as
The generalized Jensen–Shannon divergence corresponds to a skewed -Jensen–Shannon divergence for . Additionally, for , we define the minimized generalized Jensen–Shannon divergence by
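Since the displayed equations are not preserved in this copy, the following sketch assumes the form of the generalized (skewed) Jensen–Shannon divergence used by Zhou et al. [5]; the function names and that exact form are our assumptions:

```python
import math

def kl(p, q):
    """Kullback–Leibler divergence D(p || q) in nats (with 0 log 0 = 0);
    p and q are dicts mapping symbols of a finite alphabet to probabilities."""
    return sum(pa * math.log(pa / q[a]) for a, pa in p.items() if pa > 0)

def gjs(p, q, alpha):
    """Skewed (generalized) Jensen–Shannon divergence.  The displayed
    definition is lost in this copy; this assumes the form used by Zhou et
    al. [5]: alpha * D(p || m) + D(q || m), m = (alpha*p + q) / (1 + alpha)."""
    support = set(p) | set(q)
    m = {a: (alpha * p.get(a, 0.0) + q.get(a, 0.0)) / (1 + alpha)
         for a in support}
    return alpha * kl(p, m) + kl(q, m)
```

Under this form, the divergence vanishes if and only if the two distributions coincide, and at alpha = 1 it equals twice the ordinary Jensen–Shannon divergence.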
Given a threshold (including ), the decision rule to achieve the maximum error exponent is given by
where , and is the set of determined to be class i by the test . By definition, the discriminant function , appearing on the right-hand side of (17), can also be expressed as
where . From (17) and (18), is a type-based test and implicitly depends on . In addition, this test uses training sequences asymmetrically; only sequence is used, but not (cf. refs. [3,4,5]).
Theorem 1.
For any given and any sequence of tests such that , the sequence of tests given by (17) satisfies for any pair of distributions ,
Proof.
Equation (19) is derived in Section 3.3.1. The proof of (20) follows from Corollary 1. Although there is a deviation between the exponents in Corollary 1 and for the test , the deviation vanishes asymptotically. □
Theorem 1 shows that the test can asymptotically achieve the maximum type-II error exponent among the tests for which the type-I error exponent is greater than or equal to . This test also has a reversed and more relaxed property; it achieves , the maximum type-I error exponent when the type-II error probability is upper bounded by a constant (see the achievability proof of Theorem 2 in Section 3.3.1).
3.2. First-Order Maximum Error Exponent
In this section, we characterize the first-order maximum error exponent in a single-letter form for sources with multiple subclasses.
Theorem 2.
For any pair of distributions , we have
It should be noted that depends on , but not on .
Proof.
The proof is provided in Section 3.3. □
Remark 2.
If and are singletons (that is, and ), Theorem 2 reduces to the following formula given by Zhou et al. [5]:
which means that does not depend on ϵ and the strong converse holds in this case, unlike in the case . On the other hand, for general and but in the special case of , formula (21) reduces to
3.3. Proof of Theorem 2
We divide the proof of Theorem 2 into two parts: the achievability (direct) part and the converse part.
3.3.1. Achievability Part
In the achievability proof, we use the type-based test given by (17). Fix any
Then, for any pair of distributions and for all pairs of distributions , we show
First, we prove (25). For preliminaries, we define the following sets used in the proof:
where is the projection of onto the space . To evaluate the probability of a source sequence being in , the following relationship holds from the method of types [15].
Lemma 1.
Suppose that the sequence is sampled independently from the source . Then,
where denotes the set of types formed from length-n sequences with alphabet and .
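Lemma 1 rests on the standard type-counting bounds of the method of types [15]. As a quick illustration (our own code, not from the paper), the number of types of length-n sequences over an alphabet of size k is the number of compositions of n into k nonnegative parts, which is polynomial in n even though the number of sequences is exponential:

```python
from math import comb

def num_types(n, k):
    """Number of types (empirical distributions) of length-n sequences over
    an alphabet of size k: compositions of n into k nonnegative parts."""
    return comb(n + k - 1, k - 1)

# The standard polynomial bound used in type-based arguments:
# num_types(n, k) <= (n + 1) ** k, while the number of sequences is k ** n.
bound_holds = all(num_types(n, 3) <= (n + 1) ** 3 for n in range(1, 50))
```

This polynomial growth is what allows the union over types in the exponent evaluations below without affecting the first-order exponent.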
Then, an upper bound on the type-II error probability of the test for all pairs of distributions can be evaluated as follows:
where (31) is derived from (4), (32) follows from Lemma 1 and (33) is derived in Appendix A. Minimizing exponents in (33) with respect to , we further obtain
where (34) is derived from (16) and (28), and (35) follows from Lemma 1.
Next, we demonstrate (26). For preliminaries, we define a new typical set and show some properties for the proof. For any given , define the following typical set:
By using [6, Lemma 22], for , , generated from memoryless sources , we have
For any given and any pair of distributions , we define information density as
Furthermore, for any pair of distributions , define the function , as the index of the subclass of , as follows:
Hereafter, we denote simply as when the first argument is clear from the context.
Lemma 2.
Assume that . For , , we have
Proof.
The proof is provided in Appendix B. □
Lemma 3
(Zhou et al. [5]). Assume that . For , , by applying the Taylor expansion to around , we have
where denotes the m-th symbol of sequence and denotes the m-th symbol of sequence .
Note that the probability is calculated by assuming that the test sequence is generated from (cf. Equation (11)). An upper bound on the type-II error probability can be evaluated as follows:
where Equation (43) follows from Equations (14), (38) and (40). By using Lemma 3, Equation (43) can be expanded as follows:
where Equation (44) is derived from Fatou’s lemma. It follows from (38) that
Here, Equation (14) can be also expressed as follows:
Therefore, by the weak law of large numbers, for any given ,
Thus, by (24), we can see that
3.3.2. Converse Part
For any pair of distributions and for all pairs of distributions , fix any test satisfying that
We show
We first give some lemmas, which are useful in the proof of the converse part.
Lemma 4.
Let be a test in which the decision rule depends only on . Then, for any given , we can construct a type-based test satisfying
for any pair of distributions .
Proof.
Lemma 4 can be proved in the same way as (Lemma 7 [5]), the proof of which is inspired by (Lemma 2 [3]). □
Remark 3.
As in the proofs of (Lemma 2 [3]) and (Lemma 7 [5]), the type-based test specified in Lemma 4 is obtained by tailoring and satisfies Equations (54) and (55) for all . In other words, the construction of is universal, in the same spirit as (Lemma 2 [3]). This claim is slightly stronger than the one in (Lemma 7 [5]).
Lemma 5.
For any , any type-based test satisfying the condition that for all pairs of distributions ,
we have that for any pair of distributions
where with .
Proof.
The proof is provided in Appendix C. □
The type-based test specified in Lemma 4 satisfies Equations (54) and (55) for all . If we set in Lemma 4, and combine it with Lemma 5, we can derive the following relation:
Corollary 1.
For any given , any test satisfying the condition that for all pairs of distributions
we have that for any pair of distributions
Proof.
The proof is provided in Appendix D. □
3.4. Second-Order Maximum Error Exponent
In this section, we characterize the second-order maximum error exponent. For simplicity, we assume that only one of the two sources has subclasses (). First, from Theorem 2 with , the first-order maximum error exponent in this setting is characterized as follows: for any pair of distributions , we have
where in (21) is replaced by .
Next, we provide a characterization of the second-order maximum error exponent in Definition 2 with in the case where only has subclasses. By definition, if and if . Therefore, in the discussion of the second-order error exponent, we focus on the case .
Theorem 3.
For any pair of distributions and ,
where ,
which is the cumulative distribution function of the standard Gaussian distribution, and for any pair of distributions ,
where represents the variances with respect to .
Proof.
The proof is provided in Appendix E. □
Remark 4.
If is a singleton (), Theorem 3 reduces to for , which is the same result given by Zhou et al. [5].
Remark 5.
We can summarize the two terms on the right-hand side of (65) into the following single term called the canonical equation [6]:
We focus on the case . From Theorem 2, it holds that
and
Here, let us consider the following canonical equation for r
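Since the displayed canonical equation is not preserved in this copy, we sketch only the generic numerical step: the canonical equation is a monotone one-dimensional equation F(r) = ε, where F stands in for the mixture of Gaussian CDF terms described above, and it can be solved by bisection. All names below are ours.

```python
from math import erf, sqrt

def Phi(x):
    """Standard Gaussian cumulative distribution function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def solve_canonical(F, eps, lo, hi, tol=1e-10):
    """Bisection for a canonical equation F(r) = eps, assuming F is
    nondecreasing on [lo, hi] with F(lo) <= eps <= F(hi).  F is a stand-in
    for the mixture of Gaussian CDF terms in the actual canonical equation."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if F(mid) < eps:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# With a single standard Gaussian term, the solution is simply the Gaussian
# quantile: here Phi(0) = 0.5, so the root is near 0.
r = solve_canonical(Phi, 0.5, -10.0, 10.0)
```

When several weighted Gaussian CDF terms are mixed, F remains nondecreasing, so the same bisection applies; only the evaluation of F changes.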
4. Generalization to Mixed Memoryless Sources with General Mixture
In this section, we consider the classification problem in the case where does not have subclasses and is given by a general mixture model. The general mixture model considered in this problem represents an extension of the source with multiple subclasses defined in Section 2.2. Since the decision rule that achieves the maximum error exponent operates using only one of the training sequences, we assume in this section that only the training sequence is available. Then, we provide a characterization of the maximum error exponents in a single-letter form under this setting. First, we define the source referred to as a mixed memoryless source with general mixture [6,9] as follows. Let be an arbitrary probability space with a general probability measure . Then, the probability of is given by
where is a stationary and memoryless source for each . That is, for
When a test sequence is output from , the probability distribution of the sequence takes the form of (75). Here, type-I and type-II error probabilities of a test are given by
Theorem 4.
(First- and second-order maximum error exponents). For any pair of distributions and , we have
and
Proof.
We can prove this theorem in the same way as Theorems 2 and 3. □
5. Numerical Calculation
5.1. First-Order Maximum Error Exponent
In this section, we present a numerical example to illustrate the first-order maximum type-I error exponent characterized in Theorem 2.
A numerical example of the first-order maximum error exponent is given by calculating the right-hand side of (64) for the following settings. We assume that . We fix the set of probabilities and weights
and
where denotes the Bernoulli distribution. The relation among , and is shown in Figure 2. Additionally, for , the behavior of is depicted in Figure 3. As becomes larger, the value of also increases like a step function. The step increases when and . We can also confirm that is right-continuous in . This is because the limit superior of the type-II error probability is what is constrained in Definition 1.
Figure 2.
The first-order maximum type-I error exponent ().
Figure 3.
The first-order maximum type-I error exponent ().
5.2. Second-Order Maximum Error Exponent
As in the previous subsection, we present a numerical example to illustrate the second-order maximum type-I error exponent characterized in Theorem 3.
A numerical example of the second-order maximum error exponent is given by calculating the right-hand side of (65) for the following settings. We assume that , . We fix and the set of probabilities and weights
and
where is the same as the setting in the previous subsection. The behavior of is shown in Figure 4. The value of takes the inverse of the cumulative distribution function of the standard Gaussian on each interval of such that , and . In contrast to the first-order exponent, is no longer right-continuous in .
Figure 4.
The second-order maximum type-I error exponent .
6. Conclusions
For binary classification of sources with multiple subclasses, we characterized the first- and second-order maximum error exponents. First, we revealed the first-order maximum error exponent in the case where and are sources with multiple subclasses. In order to derive this representation, we gave a classifier which achieves the asymptotically maximum error exponent in the class of deterministic classifiers for sources with multiple subclasses.
Next, we showed the second-order maximum error exponent in the case where only one of the sources has subclasses. The key technique for deriving the second-order maximum error exponent is to apply the Berry–Esseen theorem [8] instead of the weak law of large numbers. One may wonder whether we can also derive the second-order approximation in the case where both sources have multiple subclasses. To this end, we need to evaluate Lemma 2 more rigorously; this is left for future work.
In addition, for binary classification using only a training sequence generated from in the case where does not have subclasses and is given by a general mixture model, we generalized the analysis for the first- and second-order error exponents. From these results, we revealed the asymptotic performance limits of statistical classification for sources with multiple subclasses.
In this paper, we considered a binary classification problem, but in practice, multiclass classification is of importance. In the case where each class is a memoryless source (without multiple subclasses), the first- and second-order maximum error exponents were analyzed in [5]. Extending the obtained results to multiclass classification for sources with multiple subclasses is also a subject of future studies.
Author Contributions
Author H.K. contributed to the conceptualization of the research goals and aims, the visualization, the formal analysis of the results, and the review and editing. Author H.Y. contributed to the conceptualization of the ideas, the validation of the results, and the supervision. All authors have read and agreed to the published version of the manuscript.
Funding
This research was supported in part by JSPS KAKENHI Grant Numbers JP20K04462 and JP18H01438.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Not applicable.
Conflicts of Interest
The authors declare no conflict of interest.
Appendix A
Appendix B
Proof of Lemma 2.
Applying the Taylor expansion to around for any , we obtain
Therefore by (14) for any , ,
holds, and the convergence is uniform in . Since for any , was given in the form
we obtain
□
Appendix C
Proof of Lemma 5.
Using (30) on the right-hand side of (A7), we obtain
where . Taking the negative logarithm of both sides and dividing by n in (A8), we have
Since is arbitrary in (A9), we can set the i-th subclass as and the other as , . Furthermore, we set . Then we obtain
Therefore by (18), (A10) implies that is constrained in
and for any pair of distributions , we obtain
□
Appendix D
Appendix E
Proof of Theorem 3.
We divide the proof of Theorem 3 into two parts: the achievability (direct) part and the converse part. First, for preliminaries, we define the following quantity used in the proof: For ,
Next, we give a lemma that is important in the proof of Theorem 3. □
Lemma A1
(The Berry–Esseen theorem [8]). Let be independent with
Then for any , it holds that
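Lemma A1 controls the uniform distance between the CDF of a standardized sum and the standard Gaussian CDF at rate O(1/√n). As an illustrative numerical check (our own example, not from the paper), the gap for a standardized Binomial(n, 1/2) sum at x = 0 roughly halves when n quadruples:

```python
from math import erf, sqrt, floor, comb

def Phi(x):
    """Standard Gaussian CDF."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def standardized_binomial_cdf(n, p, x):
    """P((S_n - n*p) / sqrt(n*p*(1-p)) <= x) for S_n ~ Binomial(n, p),
    computed exactly by summing the binomial probability mass function."""
    mu, sd = n * p, sqrt(n * p * (1 - p))
    k = floor(mu + x * sd)
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))

# The gap to the Gaussian CDF shrinks like 1/sqrt(n), consistent with the
# Berry-Esseen bound: quadrupling n roughly halves the gap at x = 0.
gap_100 = abs(standardized_binomial_cdf(100, 0.5, 0.0) - Phi(0.0))
gap_400 = abs(standardized_binomial_cdf(400, 0.5, 0.0) - Phi(0.0))
```

In the proof of Theorem 3, this uniform O(1/√n) control replaces the weak law of large numbers used in the first-order analysis.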
Appendix E.1. Achievability Part
In the achievability proof, we use the following test :
where in (17) is now because the first source is assumed to be a stationary and memoryless source. Fix any
Then, for any pair of distributions and for all pairs of distributions , we show
The test defined by (A23) is the same as the test defined by (17) with replacing with and s with 1, and so (A25) can be derived from the argument in Section 3.3.1. Therefore, we will prove only (A26). We use the set defined by (37). An upper bound on the type-II error probability of the test , defined by (A23), for any pair of distributions can be evaluated as follows:
where (A27) follows from Lemma 3 and the proof of (A28) is provided in Appendix F.
where (A29) and the last inequality are derived from Lemma A1 and Fatou’s lemma, respectively. Here,
Therefore,
Thus, by (A24), we can see that
Appendix E.2. Converse Part
For any pair of distributions and for all pairs of distributions , fix any test satisfying that
We show
To prove (A36), we use Corollary 1 with replacing with . A lower bound on the type-II error probability of the test for any pair of distributions can be evaluated as follows:
where (A37) and (A38) follow from Lemma 3 and Lemma A1, respectively. The last inequality is due to Fatou’s lemma. Here,
Therefore,
□
References
- Merhav, N.; Ziv, J. A Bayesian approach for classification of Markov sources. IEEE Trans. Inf. Theory 1991, 37, 1067–1071. [Google Scholar] [CrossRef]
- Saito, S.; Matsushima, T. Evaluation of error probability of classification based on the analysis of the Bayes code. In Proceedings of the 2020 IEEE International Symposium on Information Theory (ISIT), Los Angeles, CA, USA, 21–26 June 2020; pp. 21–26. [Google Scholar]
- Gutman, M. Asymptotically optimal classification for multiple tests with empirically observed statistics. IEEE Trans. Inf. Theory 1989, 35, 401–408. [Google Scholar] [CrossRef]
- Hsu, H.-W.; Wang, I.-H. On binary statistical classification from mismatched empirically observed statistics. In Proceedings of the 2020 IEEE International Symposium on Information Theory (ISIT), Los Angeles, CA, USA, 21–26 June 2020; pp. 2533–2538. [Google Scholar]
- Zhou, L.; Tan, V.Y.F.; Motani, M. Second-order asymptotically optimal statistical classification. Inf. Inference J. IMA 2020, 9, 81–111. [Google Scholar]
- Han, T.S.; Nomura, R. First- and second-order hypothesis testing for mixed memoryless sources. Entropy 2018, 20, 174. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Lin, J. Divergence measures based on the Shannon entropy. IEEE Trans. Inf. Theory 1991, 37, 145–151. [Google Scholar] [CrossRef] [Green Version]
- Polyanskiy, Y.; Poor, H.V.; Verdú, S. Channel coding rate in the finite blocklength regime. IEEE Trans. Inf. Theory 2010, 56, 2307–2359. [Google Scholar] [CrossRef]
- Yagi, H.; Han, T.S.; Nomura, R. First- and second-order coding theorems for mixed memoryless channels with general mixture. IEEE Trans. Inf. Theory 2016, 62, 4395–4412. [Google Scholar] [CrossRef]
- Ziv, J. On classification with empirically observed statistics and universal data compression. IEEE Trans. Inf. Theory 1988, 34, 278–286. [Google Scholar] [CrossRef]
- Kelly, B.G.; Wagner, A.B.; Tularak, T.; Viswanath, P. Classification of homogeneous data with large alphabets. IEEE Trans. Inf. Theory 2013, 59, 782–795. [Google Scholar] [CrossRef] [Green Version]
- Unnikrishnan, J.; Huang, D. Weak convergence analysis of asymptotically optimal hypothesis tests. IEEE Trans. Inf. Theory 2016, 62, 4285–4299. [Google Scholar] [CrossRef]
- He, H.; Zhou, L.; Tan, V.Y.F. Distributed detection with empirically observed statistics. IEEE Trans. Inf. Theory 2020, 66, 4349–4367. [Google Scholar] [CrossRef]
- Saito, S.; Matsushima, T. Evaluation of error probability of classification based on the analysis of the Bayes code: Extension and example. In Proceedings of the 2021 IEEE International Symposium on Information Theory (ISIT), Melbourne, VIC, Australia, 12–20 July 2021; pp. 1445–1450. [Google Scholar]
- Csiszár, I. The method of types. IEEE Trans. Inf. Theory 1998, 44, 2505–2523. [Google Scholar] [CrossRef] [Green Version]
- Nielsen, F. On a variational definition for the Jensen-Shannon symmetrization of distances based on the information radius. Entropy 2021, 23, 464. [Google Scholar] [CrossRef] [PubMed]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).