Article

Characterizing the Asymptotic Per-Symbol Redundancy of Memoryless Sources over Countable Alphabets in Terms of Single-Letter Marginals

by Maryam Hosseini and Narayana Santhanam *
Department of Electrical Engineering, University of Hawaii at Manoa, Honolulu, HI 96822, USA
* Author to whom correspondence should be addressed.
Entropy 2014, 16(7), 4168-4184; https://doi.org/10.3390/e16074168
Submission received: 27 May 2014 / Revised: 24 June 2014 / Accepted: 7 July 2014 / Published: 23 July 2014
(This article belongs to the Section Information Theory, Probability and Statistics)

Abstract:
The minimum expected number of bits needed to describe a random variable is its entropy, assuming knowledge of the distribution of the random variable. On the other hand, universal compression describes data supposing that the underlying distribution is unknown, but that it belongs to a known set 𝒫 of distributions. Since universal descriptions are not matched exactly to the underlying distribution, the number of bits they use on average is higher, and the excess over the entropy is the redundancy. In this paper, we study the redundancy incurred by the universal description of strings of positive integers (ℤ+), the strings being generated independently and identically distributed (i.i.d.) according to an unknown distribution over ℤ+ in a known collection 𝒫. We first show that if describing a single symbol incurs finite redundancy, then 𝒫 is tight, but that the converse does not always hold. If a single symbol can be described with finite worst-case regret (a formulation more stringent than the redundancy above), then it is known that describing length n i.i.d. strings incurs only vanishing (to zero) redundancy per symbol as n increases. In contrast, we show it is possible that describing a single symbol from an unknown distribution of 𝒫 incurs finite redundancy, yet describing length n i.i.d. strings incurs a constant (> 0) redundancy per symbol encoded. We then show a sufficient condition on single-letter marginals, such that length n i.i.d. samples will incur vanishing redundancy per symbol encoded.

1. Introduction

A number of statistical inference problems of significant contemporary interest, such as text classification, language modeling and DNA microarray analysis, are what are called large alphabet problems. They require inference on sequences of symbols, where the symbols come from a set (alphabet) whose size is comparable to, or even larger than, the sequence length. For instance, language models for speech recognition estimate distributions over English words using text samples much smaller than the English vocabulary.
An abstraction behind several of these problems is universal compression over large alphabets. The general idea here is to model the problem at hand with a collection of models 𝒫 instead of a single distribution. The model underlying the data is assumed or known to belong to the collection 𝒫, but the exact identity of the model remains unknown. Instead, we aim to use a universal description of data.
The universal description uses more bits on average (averaged over the random sample) than if the underlying model were known, and the additional number of bits used by the universal description is called the redundancy against the true model. The average excess bits over the entropy of the true model will be referred to as the model redundancy for that model. Since one does not know the true model in general, a common approach is to consider collection redundancy or simply redundancy, which is the supremum of the model redundancy, the supremum being taken over all models of the collection.
Typically, we look at sequences of i.i.d. symbols, and therefore, we usually refer to the redundancy of distributions over length n sequences obtained by i.i.d. sampling from distributions from 𝒫. The length n of sequences considered will typically be referred to as the sample size.
The nuances of prediction, compression, or estimation when the alphabet size and the sample size are roughly equal are not well captured by studying a collection over a fixed finite alphabet as the sample size is increased to infinity. Rather, they are better captured when we begin with a countably infinite support and let the sample size approach infinity, or when we let the alphabet size scale as a function of the sample size. However, the collection of all i.i.d. distributions over countably infinite supports has infinite redundancy, which renders most estimation or prediction problems impossible. Therefore, there are several alternative formulations to tackle language modeling, classification and estimation questions over large alphabets.
Patterns: One line of work is the patterns [1] approach that considers the compression of the pattern of a sequence rather than the sequence itself. Patterns abstract the identities of symbols and indicate only the relative order of appearance. For example, the pattern of TATTLE is 121134, while that of HONOLULU is 12324545. The point to note is that patterns of length n i.i.d. sequences can be compressed (no matter what the underlying countably infinite alphabet is) with redundancy that grows sublinearly in n [1]; therefore, the excess bits needed to describe patterns are asymptotically vanishing per symbol encoded. Indeed, insights learned in this line of work will be used to understand the compression of sequences, as well, in this paper.
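For concreteness, the pattern map is easy to compute; below is a minimal Python sketch (the function name is ours):

```python
def pattern(s):
    """Return the pattern of a sequence: each distinct symbol is replaced
    by the index (1, 2, 3, ...) of its first appearance."""
    index = {}
    out = []
    for sym in s:
        if sym not in index:
            index[sym] = len(index) + 1   # next unused index
        out.append(index[sym])
    return "".join(str(i) for i in out)

assert pattern("TATTLE") == "121134"
assert pattern("HONOLULU") == "12324545"
```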
Envelope on Model Classes: A second line of work considers restricted model classes for applications, particularly where the collection of models can be described in terms of an envelope [2]. This approach leads to an understanding of the worst-case formulations. In particular, we are interested in the result that if the worst-case regret (different from and a more stringent formulation than the redundancy described here) of describing a single sample is finite, then the per-symbol redundancy diminishes to zero. We will interpret this result towards the end of the Introduction. While envelope classes are usually chosen so that they are compressible in the worst case, a natural extension is the possibility of choosing classes that are only average-case, but not worst-case, compressible. For this, we need to understand how the single-letter average case redundancy of a class influences the redundancy of compressing strings sampled i.i.d. from distributions in the class—the focus of this paper.
Data-derived Consistency: A third line of work forgoes the uniform convergence framework underlying redundancy or regret formulations. This is useful for large or infinite alphabet model collections that have poor or no redundancy guarantees, and it asks a question that cannot be answered with the approaches above. In this line of work, one obtains results on the model redundancy described above instead of the (collection) redundancy. For example, a model collection is said to be weakly compressible if there is a universal measure that ensures that, for all models, the model redundancy normalized by the sample size (per-symbol) diminishes to zero. The rate at which the per-symbol model redundancy diminishes to zero depends on the underlying model and for some models could be arbitrarily slower than for others. Hence, given a particular block length n, however large, there may be no non-trivial guarantee that holds over the entire model collection, unlike in the redundancy formulation.
However, if we add the constraint that we should estimate the rate of convergence from the data, we get the data-derived consistency formulations in [3]. Fundamental to further research in this direction is a better understanding of how single-letter redundancy (of 𝒫) relates to the redundancy of length n strings (that of 𝒫n). The primary theme of this paper is to collect such results on the redundancy of classes over countably infinite support.
In the fixed alphabet setting, this connection is well understood. If the alphabet has size k, the redundancy of 𝒫 is easily seen to be always finite (in fact, ≤ log k), and that of 𝒫n scales as ((k − 1)/2) log n. However, when 𝒫 does not have a finite support, the above bounds are meaningless.
Redundancy Capacity Theorem: On the other hand, the redundancy of a collection 𝒫 over a countably infinite support, say the set of positive integers ℤ+, may be infinite. However, what about the case where the redundancy of a collection 𝒫 over ℤ+ is finite? A well-known redundancy-capacity argument [4] can be used to interpret the redundancy: it equates the redundancy to the amount of information we can get about the source from the data. In this case, finite (respectively, infinite) redundancy of 𝒫 implies that a single symbol contains a finite (respectively, infinite) amount of information about the model.
The natural question then is the following. If a collection 𝒫 over ℤ+ has finite redundancy, does it imply that the redundancy of length n i.i.d. strings from 𝒫 grows sublinearly? Equivalently, do finite redundancy collections behave similarly to their fixed alphabet counterparts? If true, roughly speaking, such a result would inform us that as the universal encoder sees more and more of the sequence, it learns less and less about the underlying model. This would be in line with our intuition, where seeing more data pins down the model; therefore, the more data we have already seen, the less there is to learn. Yet, as we will show, that is not the case.
Results: To understand these nuances, we first show that if the redundancy of a collection 𝒫 of distributions over ℤ+ is finite, then 𝒫 is tight. This turns out to be a useful tool for checking whether redundancy is finite, as in [3], for example.
However, in a departure from the worst-case regret setting of [2], we demonstrate that it is possible for a class 𝒫 to have finite redundancy, yet have the asymptotic per-symbol redundancy of strings sampled i.i.d. from 𝒫 bounded away from zero. Therefore, roughly speaking, no matter how much of the sequence the universal encoder has seen, it learns at least a constant number of bits about the underlying model each time it sees an additional symbol. No matter how much data we see, there is more to learn about the underlying model! We finally obtain a sufficient condition on a class 𝒫 such that the asymptotic per-symbol redundancy of length n i.i.d. strings diminishes to zero.

2. Notation and Background

We introduce the notation used in the paper, as well as some prior results that will be used. Following information theoretic conventions, log indicates logarithms to base two and ln to base e. In this paper we let ℤ+ = {1, 2, 3, ...} be the set of positive integers and ℕ = {0, 1, 2, ... } be the set of non-negative integers.

2.1. Redundancy

The notation used here is mostly standard, but we include it for completeness. Let 𝒫 be a collection of distributions over ℤ+. Let 𝒫n be the set of distributions over length-n sequences obtained by i.i.d. sampling from distributions in 𝒫.
𝒫^∞ is the collection of measures over infinite-length sequences of ℤ+ obtained by i.i.d. sampling, constructed as follows. Observe that ℤ+^n is countable for every n. For simplicity of exposition, we will think of each length n string x as a subset of ℤ+^∞—the set of all semi-infinite strings of positive integers that begin with x. Each subset of ℤ+^n is therefore a subset of ℤ+^∞. Now, the collection 𝒥 of all subsets of ℤ+^n, over all n ∈ ℤ+, is a semi-algebra [5]. The probability that i.i.d. sampling assigns to a finite union of disjoint sets in 𝒥 is the sum of the probabilities assigned to the components of the union. Therefore, there is a sigma-algebra over the uncountable set ℤ+^∞ that extends 𝒥 and matches the probabilities assigned to sets in 𝒥 by i.i.d. sampling. The reader can assume that each measure in 𝒫^∞ is defined on the minimal sigma-algebra that extends 𝒥 and matches the probabilities i.i.d. sampling gives to sets in 𝒥. See, e.g., [5] for a development of elementary measure theory that lays out the above steps.
Let q be a measure over infinite sequences. We call

$$R_n(\mathcal{P}) = \inf_q \sup_{p \in \mathcal{P}} E_p \log \frac{p(X^n)}{q(X^n)} \tag{1}$$
the redundancy of length n sequences, or length n i.i.d. redundancy, or simply length n redundancy. The single-letter redundancy refers to the special case when n = 1. We often normalize Rn(𝒫) in (1) by the block length n. We will call Rn(𝒫)/n the per-symbol length n redundancy.
In particular, note the distinction between the single-letter and the per-symbol length n redundancy. In the definition (1), we do not require q to be i.i.d. Obtaining the infimum in (1) only over the restricted class of i.i.d. measures would yield exactly the single-letter redundancy per symbol, while the per-symbol length n redundancy allows the infimum over all possible measures q. Thus, the per-symbol length n redundancy is upper bounded by the single-letter redundancy. Any difference between the two can be thought of as the advantage accrued because the universal measure learns the underlying measure p.
In this paper, our primary goal is to understand the connections between the single-letter redundancy, on the one hand, and the behavior of length n i.i.d. redundancy, on the other. As mentioned in the Introduction, length n redundancy is the capacity of a channel from 𝒫 to ℤ+^n, where the conditional probability distribution over ℤ+^n given p ∈ 𝒫 is simply the distribution p over length n sequences. Roughly speaking, it quantifies how much information about the source we can extract from the sequence.
We will often speak of the per-symbol length n redundancy, which is simply the length n redundancy normalized by n, i.e., Rn(𝒫)/n. Furthermore, the limit lim sup_{n→∞} Rn(𝒫)/n is the asymptotic per-symbol redundancy. Whether the asymptotic per-symbol redundancy is zero (we will equivalently say that the asymptotic per-symbol redundancy diminishes to zero, to keep in line with prior literature) is in many ways a litmus test for compression, estimation and other related problems. Loosely speaking, if Rn(𝒫)/n → 0, the redundancy-capacity interpretation [4] mentioned above implies that after a point, there is little further information to be learned when we see an additional symbol, no matter what the underlying source is. In this sense, this is the case where we can actually learn the underlying model at a uniform rate over the entire class.
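To make the redundancy-capacity correspondence concrete, the sketch below computes the redundancy of a toy finite class via the Blahut-Arimoto algorithm (our own illustration; the classes studied in this paper are infinite, where no such direct computation is available):

```python
import numpy as np

def redundancy_capacity(P, iters=2000):
    """Redundancy of a finite class = capacity of the 'channel' whose
    rows are the distributions in the class (Blahut-Arimoto iteration).
    P is an (m, k) array of m distributions over k symbols; returns bits."""
    w = np.full(P.shape[0], 1.0 / P.shape[0])    # prior over the models
    for _ in range(iters):
        q = w @ P                                # induced mixture over symbols
        D = np.sum(P * np.log2(P / q), axis=1)   # D(p_i || q) in bits
        w *= np.exp2(D)
        w /= w.sum()
    q = w @ P
    return float(np.max(np.sum(P * np.log2(P / q), axis=1)))

# Two coins of bias 0.1 and 0.9: one observation carries
# 1 - h(0.1), about 0.53 bits, of information about which model holds.
P = np.array([[0.1, 0.9], [0.9, 0.1]])
print(redundancy_capacity(P))
```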
We note that it is possible to define an even more stringent notion, the worst-case regret. For length n sequences, this is:

$$\inf_q \sup_{p \in \mathcal{P}} \sup_{X^n} \log \frac{p(X^n)}{q(X^n)}.$$

Single-letter regret is the special case where n = 1, and asymptotic per-symbol regret is the limit as n → ∞ of the length n regret normalized by n. We will not concern ourselves with the worst-case formulation in this paper, but mention it in passing for comparison. In the worst-case setting, finite single-letter regret is necessary and sufficient [2] for the asymptotic per-symbol worst-case regret to diminish to zero.
Yet, we show in this paper that the analogous statement fails for redundancy: it is quite possible that collections with finite single-letter redundancy have asymptotic per-symbol redundancy bounded away from zero.

2.2. Patterns

Recent work [1] has formalized a similar framework for countably infinite alphabets. This framework is based on the notion of patterns of sequences, which abstract the identities of symbols and indicate only the relative order of appearance. For example, the pattern of PATTERN is 1233456. The k-th distinct symbol of a string is given the index k when it first appears, and that index is used every time the symbol appears henceforth. The crux of the patterns approach is to consider the set of measures induced over patterns of the sequences instead of the set of measures 𝒫 over infinite sequences.
Denote the pattern of a string x by Ψ(x). There is only one possible pattern of strings of length one (no matter what the alphabet, the pattern of a length one string is 1), two possible patterns of strings of length two (11 and 12), and so on. The number of possible patterns of length n is the n-th Bell number [1], and we denote the set of all possible length n patterns by Ψn. The measure induced on patterns by a corresponding measure p on infinite sequences of positive integers assigns to any pattern ψ the probability:
$$p(\psi) = p(\{x : \Psi(x) = \psi\}).$$
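As an aside, the Bell numbers that count patterns are easy to generate; a small sketch using the Bell triangle (our own illustration):

```python
def bell_numbers(n):
    """First n Bell numbers B_0, B_1, ... via the Bell triangle.
    B_k counts the patterns of length-k strings, since each pattern
    corresponds to a set partition of the k positions."""
    bells, row = [], [1]
    for _ in range(n):
        bells.append(row[0])
        new = [row[-1]]                 # next row starts with last entry
        for x in row:
            new.append(new[-1] + x)     # each entry adds the one above
        row = new
    return bells

print(bell_numbers(6))   # [1, 1, 2, 5, 15, 52]
```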
In [1], the length n pattern redundancy,

$$\inf_q \sup_{p \in \mathcal{P}} E_p \log \frac{p(\Psi(X^n))}{q(\Psi(X^n))},$$

was shown to be upper bounded by $\pi (\log e)\sqrt{2n/3}$. It was also shown in [6] that there is a measure q over infinite-length sequences that satisfies, for all n simultaneously:

$$\sup_{p \in \mathcal{P}} \sup_{X^n} \log \frac{p(\Psi(X^n))}{q(\Psi(X^n))} \le \pi (\log e)\sqrt{\frac{2n}{3}} + \log\bigl(n(n+1)\bigr).$$
Let the measure induced on patterns by q be denoted as qΨ for convenience.
We can interpret the probability estimator qΨ as a sequential prediction procedure that estimates the probability that the symbol X_{n+1} will be “new” (has not appeared in X_1^n) and the probability that X_{n+1} takes a value that has been seen so far. This view of estimation also appears in the statistical literature on Bayesian nonparametrics that focuses on exchangeability. Kingman [7] advocated the use of exchangeable random partitions to accommodate the analysis of data from an alphabet that is not bounded or known in advance. A more detailed discussion of the history and philosophy of this problem can be found in the works of Zabell [8,9] collected in [10].

2.3. Cumulative Distributions and Tight Collections

For our purposes, the cumulative distribution function of any probability distribution p on ℤ+ (respectively, ℕ) is a function Fp : ℝ ∪ {−∞, ∞} → [0, 1], defined in the following (slightly unconventional) way. We let Fp(0) = 0 in case the support is ℤ+ (respectively, Fp(−1) = 0 if the support is ℕ). We then define Fp on points in the support of p in the way cumulative distribution functions are normally defined. Specifically, for all y in the support of p,
$$F_p(y) = \sum_{j=0}^{y} p(j).$$
We let Fp(−∞) ≔ 0 and Fp(∞) ≔ 1. Finally, we extend the definition of Fp to all real numbers by linearly interpolating between the values defined already.
Let F_p^{-1} : [0, 1] → ℝ ∪ {∞} denote the inverse function of Fp, defined as follows. To begin with,

$$F_p^{-1}(0) = \sup\{y : F_p(y) = 0\}.$$

If p has infinite support, then F_p^{-1}(1) = ∞; else, F_p^{-1}(1) is the smallest positive integer y such that Fp(y) = 1. It follows [11] then that:

$$p\{x \in \mathbb{Z}_+ : x \le F_p^{-1}(1-\gamma)\} \ge 1 - \gamma \quad \text{and} \quad p\{x \in \mathbb{Z}_+ : x > 2F_p^{-1}\bigl(1 - \tfrac{\gamma}{2}\bigr)\} \le \gamma.$$
A collection 𝒫 of distributions on ℤ+ is defined to be tight if, for all γ > 0,

$$\sup_{p \in \mathcal{P}} F_p^{-1}(1-\gamma) < \infty.$$
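A small sketch of these objects for intuition (helper names are ours; we use the integer quantile rather than the paper's interpolated F_p, and assume the mass accumulates within the search cap):

```python
import math

def F_inv(p_mass, gamma, cap=10**6):
    """Smallest integer y with F_p(y) >= gamma, i.e. an integer version
    of the inverse cumulative distribution function on Z+."""
    total = 0.0
    for y in range(1, cap):
        total += p_mass(y)
        if total >= gamma:
            return y
    return math.inf

# Geometric(1/2) on Z+: p(x) = 2^(-x).  Its (1-gamma)-quantile grows only
# like log(1/gamma), uniformly over all biases >= 1/2, so the collection
# {Geometric(theta) : theta >= 1/2} is tight.
geom = lambda x: 0.5 ** x
print(F_inv(geom, 1 - 1e-3))   # 10, since 1 - 2^(-10) >= 1 - 1e-3
```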

3. Redundancy and Tightness

We focus on the single-letter redundancy in this section and explore the connections between the single-letter redundancy of a collection 𝒫 and the tightness of 𝒫.
Lemma 1. A collection 𝒫 over ℕ with bounded single-letter redundancy is tight. Namely, if the single-letter redundancy of 𝒫 is finite, then for any γ > 0:

$$\sup_{p \in \mathcal{P}} F_p^{-1}(1-\gamma) < \infty.$$
Proof. Since 𝒫 has bounded single-letter redundancy, fix a distribution q over ℕ such that:

$$\sup_{p \in \mathcal{P}} D(p\|q) < \infty.$$

We define R ≝ sup_{p∈𝒫} D(p‖q), where D(p‖q) is the Kullback–Leibler distance between p and q. We will first show that for all p ∈ 𝒫 and any m > 0,

$$p\Bigl(\Bigl|\log \frac{p(X)}{q(X)}\Bigr| > m\Bigr) \le \frac{R + (2\log e)/e}{m}. \tag{2}$$
To see Equation (2), let S be the set of all x ∈ ℕ such that p(x) < q(x). A well-known convexity argument (the log-sum inequality, together with q(S) ≤ 1 and the fact that t log t ≥ −(log e)/e for 0 ≤ t ≤ 1) shows that the partial contribution to the KL divergence from S satisfies:

$$\sum_{x \in S} p(x) \log\frac{p(x)}{q(x)} \ge p(S)\log\frac{p(S)}{q(S)} \ge -\frac{\log e}{e},$$

and hence:

$$\sum_{x \in \mathbb{N}} p(x)\Bigl|\log\frac{p(x)}{q(x)}\Bigr| = \sum_{x \in \mathbb{N}} p(x)\log\frac{p(x)}{q(x)} - 2\sum_{x \in S} p(x)\log\frac{p(x)}{q(x)} \le R + \frac{2\log e}{e}.$$
Then, Equation (2) follows by a simple application of Markov’s inequality.
We will now use Equation (2) to complete the proof of the lemma. Specifically, we will show that for all γ > 0,

$$\sup_{p \in \mathcal{P}} F_p^{-1}(1-\gamma) < 2 F_q^{-1}\bigl(1 - \gamma/2^{m^*+2}\bigr),$$

where m* is the smallest integer such that (R + (2 log e)/e)/m* < γ/2. Equivalently, for all γ > 0 and p ∈ 𝒫, we show that:

$$p\{x : x > 2F_q^{-1}(1 - \gamma/2^{m^*+2})\} \le \gamma.$$

We prove the above by partitioning the tail of q, the numbers x > 2F_q^{-1}(1 − γ/2^{m*+2}), into two parts:
(i)
the set W_1 = {x : x > 2F_q^{-1}(1 − γ/2^{m*+2}) and log (p(x)/q(x)) > m*}. Clearly,

$$W_1 \subseteq \Bigl\{y : \Bigl|\log\frac{p(y)}{q(y)}\Bigr| > m^*\Bigr\},$$

and thus:

$$p(W_1) \le p\Bigl\{y : \Bigl|\log\frac{p(y)}{q(y)}\Bigr| > m^*\Bigr\} \le \frac{\gamma}{2},$$

where the second inequality follows from Equation (2).
(ii)
the set W_2 = {x : x > 2F_q^{-1}(1 − γ/2^{m*+2}) and log (p(x)/q(x)) ≤ m*}. Clearly,

$$W_2 \subseteq \{y : y > 2F_q^{-1}(1 - \gamma/2^{m^*+2})\},$$

and therefore:

$$q(W_2) \le q\{y : y > 2F_q^{-1}(1 - \gamma/2^{m^*+2})\} \le \frac{\gamma}{2^{m^*+1}}.$$

By definition, all x ∈ W_2 satisfy log (p(x)/q(x)) ≤ m*, or equivalently, p(x) ≤ q(x) 2^{m*}. Hence, we have:

$$p(W_2) \le q(W_2)\, 2^{m^*} \le \frac{\gamma\, 2^{m^*}}{2^{m^*+1}} = \frac{\gamma}{2}.$$
The lemma follows.
The converse is not necessarily true. Tight collections need not have finite single-letter redundancy, as the following example demonstrates.
Construction: Consider the following collection ℳ of distributions over ℤ+. First, partition the set of positive integers into the sets Ti, i ∈ ℕ, where:

$$T_i = \{2^i, \ldots, 2^{i+1} - 1\}.$$

Note that |Ti| = 2^i. Now, ℳ is the collection of all possible distributions that can be formed as follows: for each i ∈ ℕ, pick exactly one element of Ti and assign probability 1/((i + 1)(i + 2)) to the chosen element of Ti. (Choosing the support as above implicitly assumes the axiom of choice.) Note that the set ℳ is uncountably infinite.
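A sketch of one such distribution (a pseudorandom choice stands in for the arbitrary selection, and the support is truncated to finitely many blocks):

```python
import random

def sample_distribution(seed=0, levels=40):
    """One distribution from the collection: a single support point per
    block T_i = {2^i, ..., 2^(i+1)-1}, carrying mass 1/((i+1)(i+2))."""
    rng = random.Random(seed)
    support = {}
    for i in range(levels):
        x = rng.randrange(2**i, 2**(i + 1))      # one element of T_i
        support[x] = 1.0 / ((i + 1) * (i + 2))
    return support

p = sample_distribution()
# the masses telescope: sum over i < 40 of 1/((i+1)(i+2)) = 1 - 1/41
print(sum(p.values()))   # ~0.9756
```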
Corollary 2. The set of distributions ℳ is tight.
Proof. For all p ∈ ℳ and all k ∈ ℕ,

$$\sum_{x \ge 2^k} p(x) = \frac{1}{k+1},$$

namely, all tails are uniformly bounded over the collection ℳ. Put another way, for all δ > 0 and all distributions p ∈ ℳ,

$$F_p^{-1}(1-\delta) \le 2^{1/\delta}.$$
On the other hand:
Proposition 1. The collection ℳ does not have finite redundancy.
Proof. Suppose q is any distribution over ℤ+. We will show that there exists p ∈ ℳ such that:

$$\sum_{x \in \mathbb{Z}_+} p(x)\log\frac{p(x)}{q(x)}$$

is not finite. Since the entropy of every p ∈ ℳ is finite, we just have to show that for any distribution q over ℤ+, there exists p ∈ ℳ such that:

$$\sum_{x \in \mathbb{Z}_+} p(x)\log\frac{1}{q(x)}$$

is not finite.
Consider any distribution q over ℤ+. Observe that for all i, |Ti| = 2^i. It follows that for all i, there is x_i ∈ T_i such that:

$$q(x_i) \le \frac{1}{2^i}.$$

However, by construction, ℳ contains a distribution p* that has for its support the set {x_i : i ∈ ℕ} identified above. Furthermore, p* assigns:

$$p^*(x_i) = \frac{1}{(i+1)(i+2)}, \quad i \in \mathbb{N}.$$
The KL divergence from p* to q is then not finite, since ∑_i p*(x_i) log (1/q(x_i)) ≥ ∑_i i/((i + 1)(i + 2)) diverges while the entropy of p* is finite. The proposition follows, since q is arbitrary.

4. Length n Redundancy

We study how the single-letter properties of a collection 𝒫 of distributions influence the compression of length n strings obtained by i.i.d. sampling from distributions in 𝒫. Namely, we try to characterize when the length n redundancy of 𝒫 grows sublinearly in the block length n.
Lemma 3. Let 𝒫 be a collection of distributions over a countable support 𝒳. For some m ∈ ℤ+, consider m pairwise disjoint subsets Si ⊆ 𝒳 (1 ≤ i ≤ m), and let δ > 1/2. If there exist p1, ..., pm ∈ 𝒫 such that:

$$p_i(S_i) \ge \delta,$$

then for all distributions q over 𝒳,

$$\sup_{p \in \mathcal{P}} D(p\|q) \ge \delta \log m.$$

In particular, if there are infinitely many sets Si, i ∈ ℤ+, and distributions pi ∈ 𝒫 such that pi(Si) ≥ δ, then the redundancy is infinite.
Proof. This is a simplified formulation of the distinguishability concept in [4]. For a proof, see, e.g., [12].

4.1. Counterexample

We now show that it is possible for the single-letter redundancy of a collection of distributions to be finite, yet the asymptotic per-symbol redundancy (the length n redundancy normalized by n, in the limit as the block length goes to infinity) remains bounded away from zero. To show this, we construct such a collection ℬ.
Construction: As before, partition the set ℤ+ into Ti = {2^i, ..., 2^{i+1} − 1}, i ∈ ℕ. Recall that Ti has 2^i elements. For all 0 < ∊ ≤ 1, let n_∊ = ⌈1/∊⌉. For 1 ≤ j ≤ 2^{n_∊}, let p_{∊,j} be the distribution on ℤ+ that assigns probability 1 − ∊ to the number one (or equivalently, to the set T0) and ∊ to the j-th smallest element of T_{n_∊}, namely the number 2^{n_∊} + j − 1. ℬ (mnemonic for binary, since every distribution has a support of size two) is the collection of distributions p_{∊,j} for all ∊ > 0 and 1 ≤ j ≤ 2^{n_∊}. ℬ^∞ is the set of measures over infinite sequences of numbers corresponding to i.i.d. sampling from ℬ.
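A sketch of the two-point distributions in ℬ (assuming, as above, n_∊ = ⌈1/∊⌉; the function name is ours):

```python
from math import ceil

def p_eps_j(eps, j):
    """The two-point distribution p_{eps,j}: mass 1-eps on the number 1
    and mass eps on the j-th smallest element of T_{n_eps}."""
    n = ceil(1 / eps)
    assert 1 <= j <= 2 ** n, "j must index an element of T_n"
    heavy = 2 ** n + j - 1           # j-th smallest element of T_n
    return {1: 1 - eps, heavy: eps}

print(p_eps_j(0.25, 3))   # {1: 0.75, 18: 0.25}; here T_4 = {16, ..., 31}
```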
We first verify that the single-letter redundancy of ℬ is finite.
Proposition 2. Let q be the distribution that assigns q(T_i) = 1/((i + 1)(i + 2)) and, for all j ∈ T_i,

$$q(j \mid T_i) = \frac{1}{|T_i|}.$$

Then:

$$\sup_{p \in \mathcal{B}} \sum_{x \in \mathbb{Z}_+} p(x)\log\frac{p(x)}{q(x)} \le 2.$$
However, the redundancy of compressing length n sequences from ℬ scales linearly with n.
Proposition 3. For all n ∈ ℤ+,

$$\inf_q \sup_{p \in \mathcal{B}} E_p \log\frac{p(X^n)}{q(X^n)} \ge n\Bigl(1 - \bigl(1 - \tfrac{1}{n}\bigr)^n\Bigr).$$
Proof. Let {1^n} denote the set containing the single length n sequence of all ones. For all n, define 2^n pairwise disjoint subsets S_i of ℤ+^n, 1 ≤ i ≤ 2^n, where:

$$S_i = \{1, 2^n + i - 1\}^n \setminus \{1^n\}$$

is the set of all length n strings containing at most two distinct numbers (one and 2^n + i − 1) and at least one occurrence of 2^n + i − 1. Clearly, for distinct i and j between one and 2^n, S_i and S_j are disjoint. Furthermore, the measure p_{1/n,i} assigns S_i the probability:

$$p_{1/n,i}(S_i) = 1 - \Bigl(1 - \frac{1}{n}\Bigr)^n > 1 - \frac{1}{e}.$$

From Lemma 3, it follows that the length n redundancy of ℬ is lower bounded by:

$$\Bigl(1 - \frac{1}{e}\Bigr)\log 2^n = n\Bigl(1 - \frac{1}{e}\Bigr).$$
In a preview of what is to come, we notice that though the single-letter redundancy of the class ℬ over ℤ+ is finite, the single-letter tail redundancy, as described in the equation below, does not diminish to zero; namely, for all M:

$$\sup_{p \in \mathcal{B}} \sum_{x \ge M} p(x)\log\frac{p(x)}{q(x)} \ge 1.$$
In fact, in the next section, we relate the single-letter tail redundancy above diminishing to zero to sublinear growth of the i.i.d. length n redundancy.

4.2. Sufficient Condition

In this section, we show a sufficient condition on the single-letter marginals of 𝒫 and their redundancy that allows the i.i.d. length n redundancy of 𝒫 to grow sublinearly with n. This condition is, however, not necessary, and the characterization of a condition that is both necessary and sufficient remains open.
For all ∊ > 0, let A_{p,∊} be the set of all elements in the support of p with probability ≥ ∊, and let T_{p,∊} = ℤ+ \ A_{p,∊}. Let G_0 = {ϕ}, where ϕ denotes the empty string. For all i ≥ 1, define the sets:

$$G_i = \Bigl\{x^i : A_{p,\frac{2\ln(i+1)}{i}} \subseteq \{x_1, x_2, \ldots, x_i\}\Bigr\},$$

where, in a minor abuse of notation, we use {x_1, ..., x_i} to denote the set of distinct symbols in the string x_1^i. Let B_0 = {}, and let B_i = ℤ+^i \ G_i. We observe, from an argument similar to the coupon collector problem, that:
Lemma 4. For all i ≥ 2,

$$p(B_i) \le \frac{1}{(i+1)\ln(i+1)}.$$

Proof. The proof follows from an elementary union bound:

$$p(B_i) \le \Bigl|A_{p,\frac{2\ln(i+1)}{i}}\Bigr|\Bigl(1 - \frac{2\ln(i+1)}{i}\Bigr)^i \le \frac{i}{2\ln(i+1)}\Bigl(1 - \frac{2\ln(i+1)}{i}\Bigr)^i \le \frac{i}{2\ln(i+1)}\, e^{-2\ln(i+1)} = \frac{i}{2(i+1)^2\ln(i+1)} \le \frac{1}{(i+1)\ln(i+1)}.$$
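A quick Monte Carlo illustration of the event B_i and the bound of Lemma 4 (our own check, for a geometric marginal; the bound is very loose here, since the high-probability symbols are seen almost surely):

```python
import math, random

def prob_B_i(p, i, trials=4000, rng=random.Random(1)):
    """Estimate p(B_i): the probability that some symbol with mass
    >= 2 ln(i+1)/i fails to appear among i i.i.d. draws from p."""
    eps = 2 * math.log(i + 1) / i
    A = [x for x, px in p.items() if px >= eps]     # must-see symbols
    xs, ws = zip(*p.items())
    misses = 0
    for _ in range(trials):
        seen = set(rng.choices(xs, weights=ws, k=i))
        misses += any(a not in seen for a in A)
    return misses / trials

p = {x: 2.0 ** -x for x in range(1, 30)}            # ~Geometric(1/2)
i = 100
print(prob_B_i(p, i), "<=", 1 / ((i + 1) * math.log(i + 1)))
```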
Theorem 5. Suppose 𝒫 is a collection of distributions over ℤ+. Let the entropy be uniformly bounded over the entire collection, and in addition, let the redundancy of the collection be finite. Namely,

$$\sup_{p \in \mathcal{P}} \sum_{x \in \mathbb{Z}_+} p(x)\log\frac{1}{p(x)} \le H < \infty \quad\text{and}\quad \exists\, q_1 \text{ over } \mathbb{Z}_+ \ \text{s.t.}\ \sup_{p \in \mathcal{P}} \sum_{x \in \mathbb{Z}_+} p(x)\log\frac{p(x)}{q_1(x)} < \infty.$$

We will denote:

$$R = \sup_{p \in \mathcal{P}} \sum_{x \in \mathbb{Z}_+} p(x)\log\frac{p(x)}{q_1(x)}.$$
Recall that for any distribution p, the set T_{p,δ} denotes the elements of the support of p whose probabilities are < δ. Suppose:

$$\lim_{\delta \to 0} \sup_{p \in \mathcal{P}} \sum_{x \in T_{p,\delta}} p(x)\log\frac{1}{p(x)} = 0 \quad\text{and}\quad \exists\, q_1 \text{ over } \mathbb{Z}_+ \ \text{s.t.}\ \lim_{\delta \to 0}\sup_{p \in \mathcal{P}} \sum_{x \in T_{p,\delta}} p(x)\log\frac{p(x)}{q_1(x)} = 0. \tag{3}$$
Then, the redundancy of length n distributions obtained by i.i.d. sampling from distributions in 𝒫, denoted by Rn(𝒫), grows sublinearly:

$$\limsup_{n \to \infty} \frac{1}{n} R_n(\mathcal{P}) = 0.$$
Remark. If the conditions of the theorem are met, we can always assume, without loss of generality, that there is a distribution q1 that satisfies (3) and simultaneously has finite redundancy. To see this, suppose q1′ satisfies the finite-redundancy condition, namely:

$$\sup_{p \in \mathcal{P}} \sum_{x \in \mathbb{Z}_+} p(x)\log\frac{p(x)}{q_1'(x)} < \infty,$$

while a different distribution q1″ satisfies the tail-redundancy condition in (3),

$$\lim_{\delta \to 0} \sup_{p \in \mathcal{P}} \sum_{x \in T_{p,\delta}} p(x)\log\frac{p(x)}{q_1''(x)} = 0.$$

It is easy to verify that the distribution q1 that assigns to any x ∈ ℤ+

$$q_1(x) = \frac{q_1'(x) + q_1''(x)}{2}$$

satisfies both conditions simultaneously (each logarithm changes by at most log 2 when q1′ or q1″ is replaced by q1, and (3) forces p(T_{p,δ}) → 0 uniformly).
Proof. In what follows, x^i represents the string x_1, ..., x_i, and x^0 denotes the empty string. For all n, we write Ψ(x^n) = ψ_1, ..., ψ_n and Ψ(X^n) = Ψ_1, ..., Ψ_n.
We will construct q such that lim sup_{n→∞} (1/n) E_p log (p(X^n)/q(X^n)) = 0. Recall that qΨ is the optimal universal pattern encoder over patterns of i.i.d. sequences defined in Section 2.2. Furthermore, recall that the redundancy of 𝒫 is finite and that q1 is the universal distribution over ℤ+ that attains redundancy R for 𝒫.
The universal encoder q is now defined as follows:
$$q(x^n) = q(x^n, \Psi(x^n)) = q(\psi_1, x_1, \psi_2, x_2, \ldots, \psi_n, x_n) = \prod_{i=1}^{n} q(\psi_i \mid \psi_1^{i-1}, x_1^{i-1}) \prod_{j=1}^{n} q(x_j \mid \psi_1^{j}, x_1^{j-1}) \stackrel{\mathrm{def}}{=} \prod_{i=1}^{n} q_\Psi(\psi_i \mid \psi_1^{i-1}) \prod_{j=1}^{n} q(x_j \mid \psi_1^{j}, x_1^{j-1}).$$
Furthermore, we define, for all x_1^{i-1} ∈ ℤ+^{i-1} and all ψ_1^i ∈ Ψi such that ψ_1^{i-1} = Ψ(x_1^{i-1}),

$$q(x_i \mid \psi_1^{i}, x_1^{i-1}) \stackrel{\mathrm{def}}{=} \begin{cases} 1 & \text{if } x_i \in \{x_1,\ldots,x_{i-1}\} \text{ and } \Psi(x_i) = \psi_i, \\ q_1(x_i) & \text{if } x_i \notin \{x_1,\ldots,x_{i-1}\} \text{ and } \Psi(x_i) = \psi_i. \end{cases}$$
Namely, we use an optimal universal pattern encoder over patterns of i.i.d. sequences and encode any new symbol using a universal distribution over 𝒫. We now bound the redundancy of q as defined above. We have for all p𝒫,
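For intuition, here is a minimal executable sketch of this hybrid encoder. The optimal pattern encoder q_Ψ of [1,6] is replaced by a simple add-one pattern predictor, and q1 by a geometric distribution; both stand-ins are our own simplifications, not the choices used in the proof:

```python
def q_prob(xs, q1, q_pattern_step):
    """Sequential probability of the string xs under the hybrid scheme:
    pattern factors come from q_pattern_step, and each brand-new symbol's
    identity is charged its q1 probability.  A repeated symbol is fully
    determined by its pattern index, so its identity factor is 1."""
    prob, index, pat = 1.0, {}, []
    for x in xs:
        new = x not in index
        nxt = len(index) + 1 if new else index[x]
        prob *= q_pattern_step(tuple(pat), nxt)   # pattern factor
        if new:
            prob *= q1(x)                         # identity of a new symbol
            index[x] = nxt
        pat.append(nxt)
    return prob

def addone_pattern_step(pat, nxt):
    """Toy pattern predictor: add-one smoothing over the indices seen so
    far, plus one extra outcome for 'new'.  Sums to one at every step."""
    seen = max(pat, default=0)
    total = len(pat) + seen + 1
    return (pat.count(nxt) + 1) / total

q1 = lambda x: 0.5 ** x    # a geometric 'universal' distribution on Z+
print(q_prob([5, 5, 2, 5], q1, addone_pattern_step))   # 1/1536
```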
$$E_p \log\frac{p(X^n)}{q(X^n)} = \sum_{x^n} p(x^n)\log\frac{\prod_{i=1}^{n} p(\psi_i \mid \psi_1^{i-1}, x_1^{i-1})\prod_{j=1}^{n} p(x_j \mid \psi_1^{j}, x_1^{j-1})}{\prod_{i=1}^{n} q_\Psi(\psi_i \mid \psi_1^{i-1})\prod_{j=1}^{n} q(x_j \mid \psi_1^{j}, x_1^{j-1})} = \sum_{x^n} p(x^n)\sum_{i=1}^{n}\log\frac{p(\psi_i \mid \psi_1^{i-1}, x_1^{i-1})}{q_\Psi(\psi_i \mid \psi_1^{i-1})} + \sum_{x^n} p(x^n)\sum_{j=1}^{n}\log\frac{p(x_j \mid \psi_1^{j}, x_1^{j-1})}{q(x_j \mid \psi_1^{j}, x_1^{j-1})}.$$
Since ψ_1 is always one, p(ψ_1) = q_Ψ(ψ_1) = 1. Therefore, we have:

$$\sum_{x^n} p(x^n)\sum_{i=1}^{n}\log\frac{p(\psi_i \mid \psi_1^{i-1}, x_1^{i-1})}{q_\Psi(\psi_i \mid \psi_1^{i-1})} = \sum_{x^n} p(x^n)\sum_{i=2}^{n}\log\frac{p(\psi_i \mid \psi_1^{i-1}, x_1^{i-1})}{q_\Psi(\psi_i \mid \psi_1^{i-1})}.$$
The first term, normalized by n, can be upper bounded as follows:

$$\frac{1}{n}\sum_{x^n} p(x^n)\sum_{i=2}^{n}\log\frac{p(\psi_i \mid \psi_1^{i-1}, x_1^{i-1})}{q_\Psi(\psi_i \mid \psi_1^{i-1})} \le \frac{1}{n}\sum_{i=2}^{n}\sum_{x_1^i} p(x_1^i)\log\frac{p(\psi_i \mid \psi_1^{i-1}, x_1^{i-1})}{p(\psi_i \mid \psi_1^{i-1})} + \frac{1}{n}\Bigl(\pi(\log e)\sqrt{\tfrac{2n}{3}} + \log n(n+1)\Bigr)$$
$$= \frac{1}{n}\sum_{i=2}^{n}\bigl(H(\Psi_i \mid \Psi_1^{i-1}) - H(\Psi_i \mid X_1^{i-1})\bigr) + \frac{1}{n}\Bigl(\pi(\log e)\sqrt{\tfrac{2n}{3}} + \log n(n+1)\Bigr)$$
$$\le \frac{1}{n}(n H_p) - \frac{1}{n}\sum_{i=2}^{n} H(\Psi_i \mid X_1^{i-1}) + \frac{1}{n}\Bigl(\pi(\log e)\sqrt{\tfrac{2n}{3}} + \log n(n+1)\Bigr),$$
where we define H_p as:

$$H_p \stackrel{\mathrm{def}}{=} \sum_{x \in \mathbb{Z}_+} p(x)\log\frac{1}{p(x)},$$

and the last inequality follows since:

$$H(\Psi^n) \le H(X^n) = n H_p.$$
Now, for i ≥ 2,

$$H(\Psi_i \mid X_1^{i-1}) = \sum_{x^{i-1},\,\psi_i} p(x^{i-1})\, p(\psi_i \mid x_1^{i-1})\log\frac{1}{p(\psi_i \mid x_1^{i-1})} = \sum_{x^{i-1}} p(x^{i-1})\Biggl(\sum_{x \in \{x_1,\ldots,x_{i-1}\}} p(x)\log\frac{1}{p(x)} + \sum_{y \notin \{x_1,\ldots,x_{i-1}\}} p(y)\log\frac{1}{\sum_{y \notin \{x_1,\ldots,x_{i-1}\}} p(y)}\Biggr).$$
Then,

$$H_p - H(\Psi_i \mid X_1^{i-1}) = \sum_{x^{i-1}} p(x^{i-1})\sum_{x_i \notin \{x_1,\ldots,x_{i-1}\}} p(x_i)\log\frac{1}{p(x_i)} - \sum_{x^{i-1}} p(x^{i-1})\Bigl(\sum_{x_i \notin \{x_1,\ldots,x_{i-1}\}} p(x_i)\Bigr)\log\frac{1}{\sum_{x_i \notin \{x_1,\ldots,x_{i-1}\}} p(x_i)}$$
$$\le \sum_{x^{i-1}} p(x_1^{i-1})\sum_{x_i \notin \{x_1,\ldots,x_{i-1}\}} p(x_i)\log\frac{1}{p(x_i)}$$
$$\le p(G_{i-1})\sum_{x_i \in T_{p,\frac{2\ln i}{i-1}}} p(x_i)\log\frac{1}{p(x_i)} + p(B_{i-1})\, H$$
$$\le \sum_{x_i \in T_{p,\frac{2\ln i}{i-1}}} p(x_i)\log\frac{1}{p(x_i)} + \frac{H}{i\ln i}.$$
We have split the length i − 1 sequences into the sets G_{i−1} and B_{i−1}, and used separate bounds on each set that hold uniformly over the entire model collection. The last inequality above follows from Lemma 4. From Condition (3) of the theorem, we have that:
$$\lim_{i \to \infty}\sup_{p \in \mathcal{P}}\sum_{x \in T_{p,\frac{2\ln i}{i-1}}} p(x)\log\frac{1}{p(x)} = 0.$$
Therefore, we have:

$$\lim_{n \to \infty}\sup_{p \in \mathcal{P}}\frac{1}{n}\sum_{i=2}^{n}\Biggl(\sum_{x \in T_{p,\frac{2\ln i}{i-1}}} p(x)\log\frac{1}{p(x)} + \frac{H}{i\ln i}\Biggr) \le \lim_{n \to \infty}\frac{1}{n}\sum_{i=2}^{n}\Biggl(\sup_{p \in \mathcal{P}}\sum_{x \in T_{p,\frac{2\ln i}{i-1}}} p(x)\log\frac{1}{p(x)} + \frac{H}{i\ln i}\Biggr) \stackrel{(a)}{=} 0.$$
The term on the left in the equation above is non-negative; hence, the limit has to equal zero. The equality (a) follows from Cesàro's lemma, which asserts that for any sequence {a_i, i ∈ ℤ+} with a_i < ∞ for all i, if lim_{i→∞} a_i exists, then:

$$\lim_{i \to \infty} a_i = \lim_{n \to \infty}\frac{1}{n}\sum_{j=1}^{n} a_j.$$
Therefore,

$$\lim_{n \to \infty}\sup_{p \in \mathcal{P}}\frac{1}{n}\sum_{x^n} p(x^n)\sum_{i=2}^{n}\log\frac{p(\psi_i \mid \psi_1^{i-1}, x_1^{i-1})}{q_\Psi(\psi_i \mid \psi_1^{i-1})} = 0.$$
For the second term, observe that:

$$\sum_{x^n} p(x^n)\sum_{j=1}^{n}\log\frac{p(x_j \mid \psi_1^{j}, x_1^{j-1})}{q(x_j \mid \psi_1^{j}, x_1^{j-1})} \le R + \sum_{x^n} p(x^n)\sum_{j=2}^{n}\log\frac{p(x_j \mid \psi_1^{j}, x_1^{j-1})}{q(x_j \mid \psi_1^{j}, x_1^{j-1})}.$$
Furthermore,

$$\sum_{x^n} p(x^n)\sum_{j=2}^{n}\log\frac{p(x_j \mid \psi_1^{j}, x_1^{j-1})}{q(x_j \mid \psi_1^{j}, x_1^{j-1})} = \sum_{j=2}^{n}\sum_{x^j} p(x^j)\log\frac{p(x_j \mid \psi_1^{j}, x_1^{j-1})}{q(x_j \mid \psi_1^{j}, x_1^{j-1})} \le \sum_{j=2}^{n}\sum_{x^{j-1}} p(x^{j-1})\sum_{x_j \notin \{x_1,\ldots,x_{j-1}\}} p(x_j)\log\frac{p(x_j)}{q_1(x_j)}$$
$$\le \sum_{j=2}^{n}\Biggl(p(G_{j-1})\sum_{x_j \in T_{p,\frac{2\ln j}{j-1}}} p(x_j)\log\frac{p(x_j)}{q_1(x_j)} + R\, p(B_{j-1})\Biggr) \le \sum_{j=2}^{n}\Biggl(\sum_{x_j \in T_{p,\frac{2\ln j}{j-1}}} p(x_j)\log\frac{p(x_j)}{q_1(x_j)} + \frac{R}{j\ln j}\Biggr).$$
As before, the last inequality is from Lemma 4. Again, from Condition (3), we have:

$$\lim_{j \to \infty}\Biggl(\sup_{p \in \mathcal{P}}\sum_{x_j \in T_{p,\frac{2\ln j}{j-1}}} p(x_j)\log\frac{p(x_j)}{q_1(x_j)} + \frac{R}{j\ln j}\Biggr) = 0.$$
Therefore, as before,

$$\lim_{n \to \infty}\sup_{p \in \mathcal{P}}\frac{1}{n}\Biggl(\sum_{j=2}^{n}\sum_{x_j \in T_{p,\frac{2\ln j}{j-1}}} p(x_j)\log\frac{p(x_j)}{q_1(x_j)} + \sum_{j=2}^{n}\frac{R}{j\ln j}\Biggr) = 0$$

as well. The theorem follows.
A few comments about (3) in Theorem 5 are in order. Neither condition automatically implies the other. The set ℬ of distributions in Section 4.1 is an example where every distribution has finite entropy, the redundancy of ℬ is finite, and

$$\lim_{\delta \to 0}\sup_{p \in \mathcal{B}}\sum_{x \in T_{p,\delta}} p(x)\log\frac{1}{p(x)} = 0, \quad\text{but for every } q \text{ over } \mathbb{Z}_+,\ \lim_{\delta \to 0}\sup_{p \in \mathcal{B}}\sum_{x \in T_{p,\delta}} p(x)\log\frac{p(x)}{q(x)} > 0.$$
We will now construct another set 𝒰 of distributions over ℤ+, such that every distribution in 𝒰 has finite entropy, the redundancy of 𝒰 is finite, and

$$\lim_{\delta \to 0}\sup_{p \in \mathcal{U}}\sum_{x \in T_{p,\delta}} p(x)\log\frac{1}{p(x)} > 0, \quad\text{but}\ \exists\, q \text{ over } \mathbb{Z}_+\ \text{s.t.}\ \lim_{\delta \to 0}\sup_{p \in \mathcal{U}}\sum_{x \in T_{p,\delta}} p(x)\log\frac{p(x)}{q(x)} = 0.$$
At the same time, the length n redundancy of 𝒰 grows sublinearly. This is therefore also an example showing that the conditions in Theorem 5 are only sufficient, but, in fact, not necessary. It remains open to find a condition on single-letter marginals that is both necessary and sufficient for the asymptotic per-symbol redundancy to diminish to zero.
Construction: 𝒰 is a countable collection of distributions p_k, k ∈ ℤ+, on ℕ, where:

$$p_k(x) = \begin{cases} 1 - \dfrac{1}{k^2} & x = 0, \\[4pt] \dfrac{1}{k^2\, 2^{k^2}} & 1 \le x \le 2^{k^2}, \\[4pt] 0 & x > 2^{k^2}. \end{cases}$$
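A sketch of p_k, with a numeric check of its entropy (helper names are ours):

```python
import math

def p_k(k, x):
    """Mass function of p_k: mass 1 - 1/k^2 at 0, and mass
    1/(k^2 * 2^(k^2)) on each of x = 1, ..., 2^(k^2)."""
    if x == 0:
        return 1 - 1 / k**2
    return 1 / (k**2 * 2 ** (k**2)) if x <= 2 ** (k**2) else 0.0

def entropy_pk(k):
    """H(p_k) = h(1/k^2) + 1 bits: the binary entropy of the split
    between 0 and the uniform block, plus (1/k^2) * k^2 = 1 bit
    for the uniform block of size 2^(k^2)."""
    a = 1 / k**2
    return -(1 - a) * math.log2(1 - a) - a * math.log2(a) + a * k**2

print(entropy_pk(3))   # h(1/9) + 1, about 1.503 bits
```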
The entropy of p_k ∈ 𝒰 is therefore 1 + h(1/k²), where h(·) denotes the binary entropy function. Note that the redundancy of 𝒰 is finite, too. To see this, first note that each p_k places only the mass 1/k² on ℤ+ (the rest sits at x = 0), so:

$$\sum_{x \in \mathbb{Z}_+}\sup_{k \in \mathbb{Z}_+} p_k(x) \le \sum_{x \in \mathbb{Z}_+}\sum_{k \in \mathbb{Z}_+} p_k(x) = \sum_{k \in \mathbb{Z}_+}\sum_{x \in \mathbb{Z}_+} p_k(x) = \sum_{k \in \mathbb{Z}_+}\frac{1}{k^2} = \frac{\pi^2}{6}.$$
Now, letting R+ ≝ log(∑_{x∈ℤ+} sup_{k∈ℤ+} p_k(x)), observe that the distribution:

$$q(x) = \begin{cases} 1/2 & x = 0, \\[4pt] \dfrac{\sup_{k \in \mathbb{Z}_+} p_k(x)}{2^{R_+ + 1}} & x \in \mathbb{Z}_+ \end{cases}$$
satisfies, for all p_k ∈ 𝒰:

$$\sum_{x} p_k(x)\log\frac{p_k(x)}{q(x)} \le 1 + \frac{R_+ + 1}{k^2} \le R_+ + 2, \tag{5}$$
implying that the redundancy of 𝒰 is ≤ R+ + 2. Furthermore, Equation (5) implies that the worst-case regret is finite, and from [2], the length n redundancy of 𝒰 grows sublinearly. Now, pick an integer m ∈ ℤ+. We have, for all p ∈ 𝒰,
$$\sum_{x \in T_{p,\, 1/(m^2 2^{m^2})}} p(x)\log\frac{p(x)}{q(x)} \le \frac{R_+ + 1}{m^2},$$
yet, for all k > m, we have:

$$\sum_{x \in T_{p_k,\, 1/(m^2 2^{m^2})}} p_k(x)\log\frac{1}{p_k(x)} \ge 1.$$
Thus, the per-symbol length n redundancy of 𝒰 diminishes to zero, while 𝒰 does not satisfy all of the requirements of Theorem 5. Therefore, the conditions of Theorem 5 are only sufficient, not necessary.

5. Open Problems

We have demonstrated that finite single-letter redundancy of a collection 𝒫 of distributions over a countably infinite support does not imply that the asymptotic per-symbol redundancy of i.i.d. samples from 𝒫 diminishes to zero. This is in contrast to the scenario for worst-case regret, where finite single-letter worst-case regret is both necessary and sufficient for the asymptotic per-symbol regret to diminish to zero. We have also demonstrated sufficient conditions on the collection 𝒫 under which the asymptotic per-symbol redundancy of i.i.d. samples diminishes to zero. However, as we show, the sufficient conditions we provide are not necessary. It remains open to find a condition on single-letter marginals that is both necessary and sufficient for the asymptotic per-symbol redundancy to diminish to zero.

Acknowledgments

This research was supported by National Science Foundation Grants CCF-1065632 and CCF-1018984.

Conflicts of Interest

The authors do not have any conflicts of interest.

References

  1. Orlitsky, A.; Santhanam, N.P.; Zhang, J. Universal compression of memoryless sources over unknown alphabets. IEEE Trans. Inf. Theory 2004, 50, 1469–1481.
  2. Boucheron, S.; Garivier, A.; Gassiat, E. Coding on countably infinite alphabets. arXiv 2008, arXiv:0801.2456.
  3. Santhanam, N.; Anantharam, V.; Kavcic, A.; Szpankowski, W. Data driven weak universal redundancy. In Proceedings of the 2014 IEEE International Symposium on Information Theory (ISIT), Honolulu, HI, USA, 29 June–4 July 2014.
  4. Merhav, N.; Feder, M. Universal prediction. IEEE Trans. Inf. Theory 1998, 44, 2124–2147.
  5. Rosenthal, J.S. A First Look at Rigorous Probability Theory, 2nd ed.; World Scientific: Singapore, 2008.
  6. Santhanam, N. Probability Estimation and Compression Involving Large Alphabets. Ph.D. Thesis, University of California San Diego, San Diego, CA, USA, 2006.
  7. Kingman, J.F.C. The Mathematics of Genetic Diversity; SIAM: Philadelphia, PA, USA, 1980.
  8. Zabell, S.L. Predicting the unpredictable. Synthese 1992, 90, 205–232.
  9. Zabell, S.L. The continuum of inductive methods revisited. In The Cosmos of Science: Essays of Exploration; Earman, J., Norton, J.D., Eds.; The University of Pittsburgh Press: Pittsburgh, PA, USA, 1997; Chapter 12.
  10. Zabell, S.L. Symmetry and Its Discontents: Essays on the History of Inductive Probability; Cambridge Studies in Probability, Induction, and Decision Theory; Cambridge University Press: Cambridge, UK, 2005.
  11. Santhanam, N.; Anantharam, V. Agnostic insurance of model classes. arXiv 2012, arXiv:1212.3866.
  12. Orlitsky, A.; Santhanam, N. Lecture notes on universal compression. Available online: http://www-ee.eng.hawaii.edu/~prasadsn/ (accessed on 9 July 2014).
