1. Introduction
Logarithmic loss is a unique distortion measure in the sense that it allows a “soft” estimation (or reconstruction) of the source. Although logarithmic loss plays a crucial role in learning theory, not much work had been published regarding lossy compression until recently. A few exceptions are a line of work on multiterminal source coding [1,2,3], the single-shot approach to lossy source coding under logarithmic loss [4], and several universal properties of logarithmic loss in information theory [5,6,7]. In [4], Shkel and Verdú focused on the lossy-compression problem when the distortion measure is given by logarithmic loss. Jiao et al. justified logarithmic loss by showing that it is the only loss function that satisfies a natural data-processing requirement [5]. Painsky and Wornell provided a universal property of logarithmic loss in the context of classification [6]. In [7], No focused on the universal property of logarithmic loss in the successive refinement problem. We would also like to point out that the information bottleneck method [8,9,10,11] is related to lossy compression under logarithmic loss; indeed, it is equivalent to the noisy lossy-compression problem under logarithmic loss [12].
In this paper, we present a new universal property of logarithmic loss in fixed-length lossy-compression problems. Consider an arbitrary fixed-length lossy-compression problem, where the source and reconstruction alphabets $\mathcal{X}$ and $\hat{\mathcal{X}}$ are discrete, and suppose an arbitrary distortion measure $d:\mathcal{X}\times\hat{\mathcal{X}}\to\mathbb{R}^{+}$ is given. Then, we show that there exists a corresponding fixed-length lossy-compression problem where the source alphabet remains the same, but the reconstruction alphabet is a set of distributions on $\mathcal{X}$, and the distortion measure is logarithmic loss. This implies that there is a correspondence between any fixed-length lossy-compression problem under an arbitrary distortion measure and one under logarithmic loss. The correspondence is strong in the following sense: the optimal scheme for one problem is also optimal for the other, and a good (suboptimal) scheme for one problem is also a good scheme for the other.
We make the notions of “optimal” and “good” schemes precise in later sections. This finding essentially implies that it is enough to consider the lossy-compression problem under logarithmic loss.
The above correspondence provides new insights into the fixed-length lossy-compression problem. In general, the reconstruction alphabet in a lossy-compression problem does not have any well-defined operations. However, in the corresponding lossy compression under logarithmic loss, reconstruction symbols are probability distributions, which have their own algebraic structure. Thus, under the corresponding setting, we can apply various techniques, such as the information-geometric approach, clustering with Bregman divergence, and relaxation of the optimization problem. Furthermore, the equivalence relation suggests a new algorithm for the categorical data-clustering problem, where data do not lie in a continuous space.
The remainder of the paper is organized as follows. In Section 2, we revisit some of the known results on logarithmic loss and fixed-length lossy compression. Section 3 is dedicated to the equivalence between lossy compression under arbitrary distortion measures and that under logarithmic loss. In Section 4, we present the geometric interpretation of our result. We provide the log-convex relaxation of lossy compression and its connection to clustering problems in Section 5. Finally, we conclude in Section 6.
Notation: Uppercase $X$ denotes a random variable, and $\mathcal{X}$ denotes its alphabet. On the other hand, lowercase $x$ denotes a specific possible realization of random variable $X$, i.e., $x\in\mathcal{X}$. Similarly, $X^{n}$ denotes an $n$-dimensional random vector, while lowercase $x^{n}$ denotes a realization of $X^{n}$. The absolute value of a function, $|f|$, denotes the size of the image of function $f$, i.e., $|f|=|\{f(x):x\in\mathcal{X}\}|$. When it is clear from the context, we drop subscripts from probability distributions. We use the natural logarithm and nats instead of bits.
2. Preliminaries
2.1. Logarithmic Loss
Suppose $\mathcal{X}$ is a finite set of discrete symbols, and $\mathcal{P}(\mathcal{X})$ is the set of probability measures on $\mathcal{X}$. For $x\in\mathcal{X}$ and $q\in\mathcal{P}(\mathcal{X})$, the logarithmic loss is defined by
$$\ell_{\log}(x,q)=\log\frac{1}{q(x)}.$$
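As a quick, hedged illustration (not part of the original development), the following minimal Python snippet evaluates the logarithmic loss of a soft reconstruction; the distribution `q` and the symbol indices are arbitrary examples.

```python
import numpy as np

def log_loss(x, q):
    """Logarithmic loss log(1/q(x)) in nats for symbol index x and distribution q."""
    q = np.asarray(q, dtype=float)
    return -np.log(q[x])

# Example: a "soft" reconstruction over a ternary alphabet.
q = np.array([0.7, 0.2, 0.1])
print(log_loss(0, q))   # ≈ 0.357 nats
print(log_loss(2, q))   # ≈ 2.303 nats (a less likely symbol costs more)
```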
2.2. Fixed-Length Lossy Compression
In this section, we briefly introduce the basic setting of the fixed-length lossy-compression problem [13]. In a fixed-length lossy-compression setting, we have a source $X$ with finite alphabet $\mathcal{X}$ and source distribution $P_{X}$. An encoder $f:\mathcal{X}\to\{1,\dots,M\}$ maps the source symbol to one of $M$ messages. On the other side, a decoder $g:\{1,\dots,M\}\to\hat{\mathcal{X}}$ maps the message to an actual reconstruction $\hat{x}$, where the reconstruction alphabet $\hat{\mathcal{X}}$ is also finite. Let $d:\mathcal{X}\times\hat{\mathcal{X}}\to\mathbb{R}^{+}$ be a distortion measure between source and reconstruction.
First, we define a code for which the expected distortion does not exceed a given distortion level.
Definition 1 (Average distortion criterion). An $(M,D)$ code is a pair of an encoder $f$ with $|f|\le M$ and a decoder $g$, such that
$$E[d(X,g(f(X)))]\le D.$$
The minimum number of codewords required to achieve average distortion not exceeding $D$ is defined by
$$M^{\star}(D)=\min\{M:\exists\text{ an }(M,D)\text{ code}\}.$$
Similarly, we can define the minimum achievable average distortion $D^{\star}(M)$ given the number of codewords $M$.
One may consider a stronger criterion that restricts the probability of exceeding a given distortion level.
Definition 2 (Excess distortion criterion). An $(M,D,\epsilon)$ code is a pair of an encoder $f$ with $|f|\le M$ and a decoder $g$, such that
$$P[d(X,g(f(X)))>D]\le\epsilon.$$
The minimum number of codewords required to achieve excess-distortion probability $\epsilon$ at distortion level $D$ is defined by
$$M^{\star}(D,\epsilon)=\min\{M:\exists\text{ an }(M,D,\epsilon)\text{ code}\}.$$
Similarly, we can define the minimum achievable excess-distortion probability $\epsilon^{\star}(M,D)$ given target distortion $D$ and number of codewords $M$.
Given target distortion $D$ and source distribution $P_{X}$, the information rate-distortion function is defined by
$$R(D)=\min_{P_{\hat{X}|X}:\,E[d(X,\hat{X})]\le D}I(X;\hat{X}).\qquad(1)$$
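Because the rate-distortion-achieving conditional distribution is used repeatedly below, a numerical sketch may be helpful. The following Python snippet runs a standard Blahut–Arimoto iteration at a fixed Lagrange multiplier to approximate one point of the curve in Equation (1); it is an illustrative sketch under assumed inputs (`p_x`, `d`, `lam`), not an algorithm from the paper.

```python
import numpy as np

def blahut_arimoto(p_x, d, lam, n_iter=500):
    """Approximate the rate-distortion-achieving channel P_{X̂|X} at slope -lam,
    i.e., one point on the information rate-distortion curve (Equation (1))."""
    q = np.full(d.shape[1], 1.0 / d.shape[1])   # output marginal on reconstruction symbols
    for _ in range(n_iter):
        w = q[None, :] * np.exp(-lam * d)       # unnormalized P_{X̂|X}(x̂|x)
        w /= w.sum(axis=1, keepdims=True)
        q = p_x @ w                             # induced reconstruction marginal
    rate = np.sum(p_x[:, None] * w * np.log(w / q[None, :]))
    dist = np.sum(p_x[:, None] * w * d)
    return w, rate, dist

# Toy example: ternary source, Hamming distortion, an arbitrarily chosen slope.
p_x = np.array([0.5, 0.3, 0.2])
d = 1.0 - np.eye(3)
w, rate, dist = blahut_arimoto(p_x, d, lam=2.0)
print(rate, dist)   # one (R(D), D) pair; sweeping lam traces the whole curve
```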
We make the following benign assumptions:
There exists a unique rate-distortion-function-achieving conditional distribution $P_{\hat{X}^{\star}|X}$.
We assume that $P_{\hat{X}^{\star}}(\hat{x})>0$ for all $\hat{x}\in\hat{\mathcal{X}}$, since we can always discard a reconstruction symbol with zero probability.
If $d(x,\hat{x}_{1})=d(x,\hat{x}_{2})$ for all $x\in\mathcal{X}$, then $\hat{x}_{1}=\hat{x}_{2}$. (If $d(x,\hat{x}_{1})=d(x,\hat{x}_{2})$ for all $x$, then there is no difference between $\hat{x}_{1}$ and $\hat{x}_{2}$ in terms of loss. Thus, we can always discard $\hat{x}_{2}$ without loss of generality.)
2.3. D-Tilted Information
Define the information density of joint distribution $P_{X\hat{X}}$ by
$$\imath_{X;\hat{X}}(x;\hat{x})=\log\frac{P_{X\hat{X}}(x,\hat{x})}{P_{X}(x)P_{\hat{X}}(\hat{x})}.$$
Then, we are ready to define the $D$-tilted information, which plays a key role in fixed-length lossy compression.
Definition 3 ([13] (Definition 6)). The $D$-tilted information in $x\in\mathcal{X}$ is defined as
$$\jmath_{X}(x,D)=\log\frac{1}{E\left[\exp\left(\lambda^{\star}D-\lambda^{\star}d(x,\hat{X}^{\star})\right)\right]},$$
where the expectation is with respect to the marginal distribution of $\hat{X}^{\star}$, and $\lambda^{\star}=-R'(D)$. Note that $\hat{X}^{\star}$ is a random variable that has the marginal distribution induced by $P_{X}\times P_{\hat{X}^{\star}|X}$, and $R'(D)$ is the first derivative of the rate-distortion function $R(D)$.
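Definition 3 translates directly into code. The following hedged Python helper evaluates the D-tilted information from the definition, assuming the marginal `p_xhat_star`, the slope `lam_star`, the target distortion `D`, and a distortion matrix `d` are available (for instance, from a Blahut–Arimoto run such as the sketch above).

```python
import numpy as np

def d_tilted_information(x, p_xhat_star, d, lam_star, D):
    """ȷ_X(x, D) = -log E[exp(λ*·D − λ*·d(x, X̂*))], expectation over P_{X̂*}."""
    return -np.log(np.sum(p_xhat_star * np.exp(lam_star * D - lam_star * d[x, :])))
```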
Theorem 1 ([14] (Lemma 1.4)). For all $x\in\mathcal{X}$ and $\hat{x}\in\hat{\mathcal{X}}$,
$$\jmath_{X}(x,D)=\imath_{X;\hat{X}^{\star}}(x;\hat{x})+\lambda^{\star}d(x,\hat{x})-\lambda^{\star}D;\qquad(2)$$
therefore, we have $E[\jmath_{X}(X,D)]=R(D)$.
Let $P_{X|\hat{X}^{\star}}$ be the conditional probability induced from $P_{X}$ and $P_{\hat{X}^{\star}|X}$. Since $\imath_{X;\hat{X}^{\star}}(x;\hat{x})=\log\bigl(P_{X|\hat{X}^{\star}}(x|\hat{x})/P_{X}(x)\bigr)$, Equation (2) can equivalently be expressed as
$$P_{X|\hat{X}^{\star}=\hat{x}}(x)=P_{X}(x)\exp\left(\jmath_{X}(x,D)+\lambda^{\star}D-\lambda^{\star}d(x,\hat{x})\right).\qquad(3)$$
The following lemma shows that the conditional distributions $P_{X|\hat{X}^{\star}=\hat{x}}$ are all distinct.
Lemma 1 ([7] (Lemma 2)). For all distinct $\hat{x}_{1},\hat{x}_{2}\in\hat{\mathcal{X}}$, there exists $x\in\mathcal{X}$ such that $P_{X|\hat{X}^{\star}=\hat{x}_{1}}(x)\neq P_{X|\hat{X}^{\star}=\hat{x}_{2}}(x)$.
3. One-to-One Correspondence Between General Distortion and Logarithmic Loss
3.1. Main Results
Consider fixed-length lossy compression under an arbitrary distortion measure $d:\mathcal{X}\times\hat{\mathcal{X}}\to\mathbb{R}^{+}$, as described in Section 2.2. We have a source $X$ with finite alphabet $\mathcal{X}$, source distribution $P_{X}$, and finite reconstruction alphabet $\hat{\mathcal{X}}$. For a fixed number of messages $M$, let $f^{\star}$ and $g^{\star}$ be the encoder and decoder that achieve the optimal average distortion $D^{\star}(M)$, i.e.,
$$(f^{\star},g^{\star})=\operatorname*{arg\,min}_{f,g:\,|f|\le M}E[d(X,g(f(X)))].$$
Let $P_{\hat{X}^{\star}|X}$ denote the rate-distortion-function-achieving conditional distribution at distortion $D^{\star}(M)$. In other words, $P_{\hat{X}^{\star}|X}$ achieves the infimum in
$$R(D^{\star}(M))=\inf_{P_{\hat{X}|X}:\,E[d(X,\hat{X})]\le D^{\star}(M)}I(X;\hat{X}).\qquad(4)$$
Note that $R(D^{\star}(M))$ may be strictly smaller than $\log M$ in general, since $R(\cdot)$ is an information rate-distortion function that does not characterize the best achievable performance in the “one-shot” setting in which $D^{\star}(M)$ is defined.
Now, we define the corresponding fixed-length lossy-compression problem under logarithmic loss. In the corresponding problem, the source alphabet $\mathcal{X}$, source distribution $P_{X}$, and number of messages $M$ remain the same. However, we have a different reconstruction alphabet
$$\hat{\mathcal{X}}_{\log}=\left\{P_{X|\hat{X}^{\star}=\hat{x}}:\hat{x}\in\hat{\mathcal{X}}\right\}\subseteq\mathcal{P}(\mathcal{X}),$$
where $P_{\hat{X}^{\star}|X}$ pertains to the achiever of the infimum in Equation (4) associated with the original loss function. Recall that $\mathcal{P}(\mathcal{X})$ is the set of all probability measures on $\mathcal{X}$. Let the distortion measure of the corresponding problem be the logarithmic loss.
We now further connect the encoding and decoding schemes of the two problems. Suppose $f$ and $g$ are an encoder and decoder pair in the original problem. When $f$ and $g$ are given in the original problem, we define the corresponding encoder and decoder in the corresponding problem as follows. We let the encoder be the same, $\hat{f}=f$, and define the decoder $\hat{g}$ by
$$\hat{g}(m)=P_{X|\hat{X}^{\star}=g(m)}.$$
Then, $\hat{f}$ and $\hat{g}$ are a valid encoder and decoder pair for the corresponding fixed-length lossy-compression problem under logarithmic loss. Conversely, given $\hat{f}$ and $\hat{g}$, we can find the corresponding $f$ and $g$ because Lemma 1 guarantees that the distributions $P_{X|\hat{X}^{\star}=\hat{x}}$ are distinct.
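The construction of the corresponding code is mechanical. The following hedged Python sketch builds the corresponding decoder $\hat{g}(m)=P_{X|\hat{X}^{\star}=g(m)}$ from an original decoder; the reverse channel `p_x_given_xhat` and the decoder `g` are hypothetical inputs, and the encoder is reused unchanged.

```python
import numpy as np

def corresponding_decoder(g, p_x_given_xhat):
    """Map an original decoder g (message -> reconstruction index) to the
    log-loss decoder ĝ(m) = P_{X | X̂* = g(m)}."""
    return {m: p_x_given_xhat[:, xhat] for m, xhat in g.items()}

# Hypothetical reverse channel: column j is the distribution P_{X | X̂* = j}.
p_x_given_xhat = np.array([[0.8, 0.1],
                           [0.1, 0.2],
                           [0.1, 0.7]])
g = {0: 0, 1: 1}                          # original decoder
g_hat = corresponding_decoder(g, p_x_given_xhat)
print(g_hat[0])                           # soft reconstruction [0.8, 0.1, 0.1]
# The corresponding encoder is reused unchanged: f̂ = f.
```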
The following result shows the relation between the corresponding schemes.
Theorem 2. For any encoder–decoder pair $(\hat{f},\hat{g})$ for the corresponding fixed-length lossy-compression problem under logarithmic loss, we have
$$E\left[\ell_{\log}(X,\hat{g}(\hat{f}(X)))\right]=H(X|\hat{X}^{\star})+\lambda^{\star}\left(E[d(X,g(f(X)))]-D^{\star}(M)\right)\ge H(X|\hat{X}^{\star}),$$
where $(f,g)$ is the corresponding encoder–decoder pair for the original lossy-compression problem. Note that $\lambda^{\star}=-R'(D^{\star}(M))$, and the expectations are with respect to distribution $P_{X}$. Moreover, equality holds if and only if the corresponding pair $(f,g)$ achieves the optimal distortion $D^{\star}(M)$; in particular, it holds for $(f,g)=(f^{\star},g^{\star})$.
Proof. Let $(f,g)$ be the encoder–decoder pair in the original problem that corresponds to $(\hat{f},\hat{g})$. Then, Equation (3) implies that
$$E\left[\ell_{\log}(X,\hat{g}(\hat{f}(X)))\right]=E\left[\log\frac{1}{P_{X|\hat{X}^{\star}=g(f(X))}(X)}\right]=H(X)-E[\jmath_{X}(X,D^{\star}(M))]+\lambda^{\star}\left(E[d(X,g(f(X)))]-D^{\star}(M)\right)$$
$$=H(X|\hat{X}^{\star})+\lambda^{\star}\left(E[d(X,g(f(X)))]-D^{\star}(M)\right)\qquad(5)$$
$$\ge H(X|\hat{X}^{\star}),\qquad(6)$$
where Equation (5) is because $E[\jmath_{X}(X,D^{\star}(M))]=R(D^{\star}(M))=I(X;\hat{X}^{\star})$ with respect to distribution $P_{X}\times P_{\hat{X}^{\star}|X}$. Inequality (6) is because $D^{\star}(M)$ is the minimum achievable average distortion with $M$ codewords. Equality holds if and only if $E[d(X,g(f(X)))]=D^{\star}(M)$, which can be achieved by the optimal scheme for the original lossy-compression problem. In other words, the equality holds if $(f,g)=(f^{\star},g^{\star})$. □
In the above theorem, distortion $D^{\star}(M)$ plays a critical role, which is the minimal achievable distortion in the one-shot setting. We also use $P_{\hat{X}^{\star}|X}$ in the corresponding problem, which is the rate-distortion-achieving conditional distribution. This might be confusing, since the rate-distortion function provides the optimal rate in the asymptotic setting. However, recall that the minimal mutual information between $X$ and $\hat{X}$ in Equation (1) is the “information” rate-distortion function. The “information” rate-distortion function is equal to the optimum rate in the asymptotic case if the source is independent and identically distributed.
On the other hand, we view the “information” rate-distortion function differently. We consider the one-shot setting where source $X$ and reconstruction $\hat{X}$ are single variables. Given the number of messages $M$, the minimal achievable distortion is given by $D^{\star}(M)$. Under this setting, we focus on the minimal mutual information between $X$ and $\hat{X}$ when the distortion between $X$ and $\hat{X}$ is restricted by $D^{\star}(M)$. Our theorem implies that the minimum-achieving distribution $P_{\hat{X}^{\star}|X}$ provides the corresponding one-shot lossy-compression problem under logarithmic loss.
Remark 1. In the corresponding fixed-length lossy-compression problem under logarithmic loss, the minimal achievable average distortion given the number of codewords $M$ is
$$\min_{\hat{f},\hat{g}}E\left[\ell_{\log}(X,\hat{g}(\hat{f}(X)))\right]=H(X|\hat{X}^{\star}),$$
where the conditional entropy is with respect to distribution $P_{X}\times P_{\hat{X}^{\star}|X}$.
Remark 2. From now on, we refer to the original lossy-compression problem under the given distortion measure $d$ with reconstruction alphabet $\hat{\mathcal{X}}$ as the “original problem”. On the other hand, we refer to the corresponding lossy-compression problem under logarithmic loss with reconstruction alphabet $\hat{\mathcal{X}}_{\log}$ as the “corresponding problem”.
3.2. Example: Memoryless Bernoulli Source with Hamming Distortion Measure
In this section, we consider the memoryless Bernoulli source under the Hamming distortion measure as an example of the above equivalence. Let $X^{n}$ be a memoryless Bernoulli source with probability $p$, where $P_{X_{i}}(1)=1-P_{X_{i}}(0)=p$, and reconstruction $\hat{x}^{n}$ is also an $n$-dimensional binary vector, where $\hat{x}^{n}\in\{0,1\}^{n}$. Note that block length $n$ is fixed, so the problem is in the one-shot setting. Distortion measure $d$ is the separable Hamming distortion, i.e.,
$$d(x^{n},\hat{x}^{n})=\frac{1}{n}\sum_{i=1}^{n}d_{\mathrm{H}}(x_{i},\hat{x}_{i}),$$
where $d_{\mathrm{H}}(x_{i},\hat{x}_{i})=0$ if $x_{i}=\hat{x}_{i}$ and $d_{\mathrm{H}}(x_{i},\hat{x}_{i})=1$ if $x_{i}\neq\hat{x}_{i}$. Let $M$ be the number of messages. Then, we are interested in optimal encoding and decoding schemes that achieve distortion $D^{\star}(M)$.
In this scenario, the information rate-distortion function is not hard to compute [15]:
$$R(D)=\min_{P_{\hat{X}^{n}|X^{n}}:\,E[d(X^{n},\hat{X}^{n})]\le D}I(X^{n};\hat{X}^{n})=n\bigl(h(p)-h(D)\bigr)\quad\text{for }0\le D\le\min(p,1-p),\qquad(7)$$
where $h(\cdot)$ is the binary entropy function. Let $P_{\hat{X}^{n\star}|X^{n}}$ be the distribution that achieves the infimum in Equation (7) at $D=D^{\star}(M)$. We have an analytic formula for the rate-distortion-achieving distribution: the induced reverse channel $P_{X^{n}|\hat{X}^{n\star}}$ is a product of binary symmetric channels with crossover probability $D^{\star}(M)$. For $x^{n}\in\{0,1\}^{n}$ and $\hat{x}^{n}\in\{0,1\}^{n}$, we have
$$P_{X^{n}|\hat{X}^{n\star}=\hat{x}^{n}}(x^{n})=\bigl(D^{\star}(M)\bigr)^{n\,d(x^{n},\hat{x}^{n})}\bigl(1-D^{\star}(M)\bigr)^{n\,(1-d(x^{n},\hat{x}^{n}))}.$$
Then, the corresponding problem is the rate-distortion problem under logarithmic loss where the set of reconstruction symbols is
$$\hat{\mathcal{X}}_{\log}=\left\{P_{X^{n}|\hat{X}^{n\star}=\hat{x}^{n}}:\hat{x}^{n}\in\{0,1\}^{n}\right\}.\qquad(8)$$
Remark 3. We can rewrite Equation (3) in this case:
$$\ell_{\log}\bigl(x^{n},P_{X^{n}|\hat{X}^{n\star}=\hat{x}^{n}}\bigr)=\log\frac{1}{P_{X^{n}|\hat{X}^{n\star}=\hat{x}^{n}}(x^{n})}=n\log\frac{1}{1-D^{\star}(M)}+n\,d(x^{n},\hat{x}^{n})\log\frac{1-D^{\star}(M)}{D^{\star}(M)}.$$
The above equation explicitly shows the correspondence between logarithmic loss and the original distortion measure: the logarithmic loss is an increasing affine function of the Hamming distortion.
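The affine relation in Remark 3 is easy to verify numerically. The following hedged Python sketch, with an arbitrary block length and distortion level, checks that the logarithmic loss of the corresponding soft reconstruction is an affine function of the Hamming distortion.

```python
import numpy as np
from itertools import product

n, D = 4, 0.11                       # hypothetical block length and distortion level

def log_loss_corresponding(x, xhat):
    """log 1/P_{X^n|X̂^n*=x̂^n}(x^n) for the product BSC(D) reverse channel."""
    k = np.sum(np.array(x) != np.array(xhat))        # Hamming distance
    return -(k * np.log(D) + (n - k) * np.log(1 - D))

xhat = (0, 1, 0, 1)
for x in product([0, 1], repeat=n):
    dist = np.mean(np.array(x) != np.array(xhat))    # separable Hamming distortion
    affine = n * np.log(1 / (1 - D)) + n * dist * np.log((1 - D) / D)
    assert np.isclose(log_loss_corresponding(x, xhat), affine)
print("logarithmic loss is an affine function of the Hamming distortion")
```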
3.3. Discussion
3.3.1. One-to-One Correspondence
Theorem 2 implies that, for any fixed-length lossy-compression problem, we can find an equivalent problem under logarithmic loss in which the optimal encoding schemes are the same. Thus, without loss of generality, we can restrict our attention to problems under logarithmic loss with reconstruction alphabet $\{P_{X|\hat{X}=\hat{x}}:\hat{x}\in\hat{\mathcal{X}}\}$ for some conditional distribution $P_{\hat{X}|X}$.
3.3.2. Scheme Suboptimality
Suppose $f$ and $g$ are a suboptimal encoder and decoder for the original fixed-length lossy-compression problem. Then, the theorem implies
$$E\left[\ell_{\log}(X,\hat{g}(\hat{f}(X)))\right]-H(X|\hat{X}^{\star})=\lambda^{\star}\left(E[d(X,g(f(X)))]-D^{\star}(M)\right).\qquad(9)$$
The left-hand side of Equation (9) is the cost of suboptimality for the corresponding lossy-compression problem. On the other hand, the right-hand side is proportional to the cost of suboptimality for the original problem. In Section 3.3.1, we discussed that the optimal schemes of the two problems coincide. Equation (9) shows a stronger equivalence in which the costs of suboptimality are linearly related. This implies that a good code for one problem is also good for the other.
3.3.3. Operations on the Reconstruction Alphabet
In general, reconstruction alphabet $\hat{\mathcal{X}}$ does not have an algebraic structure. However, in the corresponding rate-distortion problem, the reconstruction alphabet is a set of probability measures, for which we have natural operations such as convex combinations of elements or projection onto a convex hull. We discuss such operations more closely in Section 5.
3.4. Exact Performance of Optimal Scheme
In the previous section, we showed that there is a corresponding lossy-compression problem under logarithmic loss that shares the same optimal coding scheme. In this section, we investigate the exact performance of the optimal scheme for the fixed-length lossy-compression problem under logarithmic loss when the reconstruction alphabet is the set of all probability measures on $\mathcal{X}$, i.e., $\hat{\mathcal{X}}=\mathcal{P}(\mathcal{X})$. (Recently, Shkel and Verdú [4] independently proposed similar results. The result was also presented in our conference version of the paper [16].) We also characterize the minimal average distortion when we have a fixed number of messages $M$. Note that this is a single-letter version of ([2], Lemma 1). Although the optimal scheme associated with $\mathcal{P}(\mathcal{X})$ may differ from the optimal scheme with the restricted reconstruction alphabet $\hat{\mathcal{X}}_{\log}$, it provides an insight, as we show in Section 4. In this section, we restrict our attention to deterministic schemes. However, it is not hard to show that the same result holds even if we allow a stochastic encoder and decoder.
Let an encoder and a decoder be $f:\mathcal{X}\to\{1,\dots,M\}$ and $g:\{1,\dots,M\}\to\mathcal{P}(\mathcal{X})$, where $g(m)=q_{m}\in\mathcal{P}(\mathcal{X})$. Then, we have
$$E\left[\ell_{\log}(X,g(f(X)))\right]=E\left[\log\frac{1}{q_{f(X)}(X)}\right]=H(X|f(X))+\sum_{m=1}^{M}P[f(X)=m]\,D\bigl(P_{X|f(X)=m}\,\big\|\,q_{m}\bigr),$$
where $P_{X|f(X)=m}$ is the conditional distribution of $X$ given $f(X)=m$, and $D(\cdot\|\cdot)$ denotes the Kullback–Leibler divergence. Since $D\bigl(P_{X|f(X)=m}\,\|\,q_{m}\bigr)\ge0$ for all $m$, we have
$$E\left[\ell_{\log}(X,g(f(X)))\right]\ge H(X|f(X)).$$
Equality can be achieved by choosing $q_{m}=P_{X|f(X)=m}$, which can be done no matter what $f$ is. Thus, we have
$$\min_{f,g}E\left[\ell_{\log}(X,g(f(X)))\right]=\min_{f:\,|f|\le M}H(X|f(X))=H(X)-\max_{f:\,|f|\le M}H(f(X)).$$
This implies that the optimal encoder is the function $f$ that maximizes $H(f(X))$ (equivalently, minimizes $H(X|f(X))$), and the optimal decoder is given by $g(m)=P_{X|f(X)=m}$. The above result provides a trivial lower bound:
$$\min_{f,g}E\left[\ell_{\log}(X,g(f(X)))\right]\ge H(X)-\log M.$$
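For a small alphabet, this characterization can be verified by brute force. The following hedged Python sketch (with a hypothetical source) enumerates all encoders, reports the optimal expected logarithmic loss $\min_{f}H(X|f(X))$, and compares it with the trivial lower bound $H(X)-\log M$.

```python
import numpy as np
from itertools import product

p_x = np.array([0.7, 0.15, 0.1, 0.05])     # hypothetical source
M = 2

def cond_entropy(f):
    """H(X | f(X)) in nats for a labeling f of the source symbols into M messages."""
    h = 0.0
    for m in range(M):
        mask = np.array(f) == m
        pm = p_x[mask].sum()
        if pm > 0:
            post = p_x[mask] / pm           # optimal decoder output P_{X|f(X)=m}
            h -= pm * np.sum(post * np.log(post))
    return h

best = min(cond_entropy(f) for f in product(range(M), repeat=len(p_x)))
H_X = -np.sum(p_x * np.log(p_x))
print("optimal expected log loss:", best)             # min_f H(X|f(X))
print("trivial lower bound H(X) - log M:", H_X - np.log(M))
```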
The optimal scheme under an excess distortion criterion is given in Appendix A.
4. Geometrical Interpretation
In this section, we present another, geometrical interpretation of the decoder in lossy-compression problems. Consider the original lossy-compression problem with discrete reconstruction alphabet $\hat{\mathcal{X}}$ and distortion measure $d$. Suppose an encoding function $f$ with $|f|\le M$ is given, which may or may not be optimal. Let $\mathcal{C}_{m}=f^{-1}(m)$, which is the set of source symbols that are mapped to message $m$. Then, the optimal reconstruction $g(m)$ is given by
$$g(m)=\operatorname*{arg\,min}_{\hat{x}\in\hat{\mathcal{X}}}E[d(X,\hat{x})\mid X\in\mathcal{C}_{m}].\qquad(10)$$
Now, consider the corresponding lossy-compression problem under logarithmic loss. Recall that the set of reconstruction symbols is given by $\hat{\mathcal{X}}_{\log}=\{P_{X|\hat{X}^{\star}=\hat{x}}:\hat{x}\in\hat{\mathcal{X}}\}$, where $P_{\hat{X}^{\star}|X}$ achieves the infimum in Equation (4). As we have seen in Section 3.4, the optimal reconstruction is $P_{X|f(X)=m}$ if we have the extended reconstruction alphabet $\mathcal{P}(\mathcal{X})$. Thus, it is natural to find the probability distribution in $\hat{\mathcal{X}}_{\log}$ that is the nearest distribution to $P_{X|f(X)=m}$. We propose the Kullback–Leibler divergence to measure the distance between probability distributions. In other words, we want to find $\hat{g}(m)\in\hat{\mathcal{X}}_{\log}$, such that
$$\hat{g}(m)=\operatorname*{arg\,min}_{q\in\hat{\mathcal{X}}_{\log}}D\bigl(P_{X|f(X)=m}\,\big\|\,q\bigr).\qquad(11)$$
This can be viewed as projecting the optimal solution from extended set $\mathcal{P}(\mathcal{X})$ to original feasible set $\hat{\mathcal{X}}_{\log}$. Since $q\in\hat{\mathcal{X}}_{\log}$, there exists $\hat{x}\in\hat{\mathcal{X}}$, such that $q=P_{X|\hat{X}^{\star}=\hat{x}}$. Then, the above Kullback–Leibler divergence is given by
$$D\bigl(P_{X|f(X)=m}\,\big\|\,P_{X|\hat{X}^{\star}=\hat{x}}\bigr)=-H\bigl(P_{X|f(X)=m}\bigr)-E\bigl[\log P_{X}(X)+\jmath_{X}(X,D^{\star}(M))+\lambda^{\star}D^{\star}(M)\,\big|\,X\in\mathcal{C}_{m}\bigr]+\lambda^{\star}E\bigl[d(X,\hat{x})\,\big|\,X\in\mathcal{C}_{m}\bigr],$$
where the last equality is from Equation (3). Note that $\lambda^{\star}E[d(X,\hat{x})\mid X\in\mathcal{C}_{m}]$ is the only term that is a function of $\hat{x}$, and $\lambda^{\star}$ is positive. Thus, if $P_{X|\hat{X}^{\star}=\hat{x}_{m}}$ achieves the minimum in Equation (11), then $\hat{x}_{m}$ minimizes the following:
$$E[d(X,\hat{x})\mid X\in\mathcal{C}_{m}].\qquad(12)$$
Since Equation (12) coincides with Equation (10), we have
$$\hat{g}(m)=P_{X|\hat{X}^{\star}=g(m)},\qquad(13)$$
where $g$ is the optimal decoder in Equation (10) for the given encoder $f$.
Remark 4. In Section 3, we directly defined $\hat{g}(m)=P_{X|\hat{X}^{\star}=g(m)}$. However, here we obtained $\hat{g}$ via the following two-step procedure: first, extend the reconstruction set from $\hat{\mathcal{X}}_{\log}$ to $\mathcal{P}(\mathcal{X})$ and characterize the optimal decoding function $m\mapsto P_{X|f(X)=m}$; and second,
find the measure in $\hat{\mathcal{X}}_{\log}$ that is closest to $P_{X|f(X)=m}$ in Kullback–Leibler divergence.
The above result (13) implies that the two procedures yield the same decoder, i.e., $\hat{g}(m)=P_{X|\hat{X}^{\star}=g(m)}$.
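The conclusion of Remark 4 can be checked numerically: project the ideal soft reconstruction $P_{X|f(X)=m}$ onto the candidate set $\hat{\mathcal{X}}_{\log}$ in Kullback–Leibler divergence, and confirm that the winner is the soft symbol of the distortion-optimal reconstruction, as Equation (13) states. The Python sketch below is an illustration under assumed inputs (a toy source, distortion matrix, and slope); it uses a Blahut–Arimoto fixed point so that the tilted structure of Equation (3) holds, and an arbitrary point of the rate-distortion curve suffices because the argument only relies on that structure.

```python
import numpy as np

# Hypothetical source, distortion matrix d(x, x̂), and slope (illustration only).
p_x = np.array([0.4, 0.3, 0.2, 0.1])
d = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.0],
              [2.0, 1.0, 0.0],
              [3.0, 2.0, 1.0]])
lam = 1.5

# Blahut–Arimoto fixed point: forward channel W = P_{X̂|X} on the RD curve.
q = np.full(d.shape[1], 1.0 / d.shape[1])
for _ in range(2000):
    w = q[None, :] * np.exp(-lam * d)
    w /= w.sum(axis=1, keepdims=True)
    q = p_x @ w
rev = (p_x[:, None] * w) / q[None, :]       # reverse channel: column j is P_{X|X̂*=j}

cluster = np.array([0, 1])                  # C_m = f^{-1}(m) for some encoder f
post = p_x[cluster] / p_x[cluster].sum()    # P_{X|f(X)=m}, supported on the cluster

kl = [np.sum(post * np.log(post / rev[cluster, j])) for j in range(d.shape[1])]
dist = [np.sum(post * d[cluster, j]) for j in range(d.shape[1])]
# The KL-nearest soft symbol corresponds to the distortion-optimal reconstruction g(m).
print(np.argmin(kl) == np.argmin(dist))     # True (up to numerical convergence)
```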
5. Log-Convex Relaxation
In the previous section, we obtained the optimal reconstruction symbol from the extended reconstruction alphabet $\mathcal{P}(\mathcal{X})$ and projected it to the feasible set $\hat{\mathcal{X}}_{\log}$. In this section, instead of direct projection to $\hat{\mathcal{X}}_{\log}$, we propose another slight extension of $\hat{\mathcal{X}}_{\log}$, namely, its log-convex hull. As we show in the following sections, the log-convex hull has interesting properties.
5.1. rI-Projection
Before defining the log-convex hull, we need to define the log-convex combination of probability distributions. Let $p$ and $q$ be probability distributions in $\mathcal{P}(\mathcal{X})$. For $0\le\alpha\le1$, the log-convex combination of $p$ and $q$ is given by
$$s_{\alpha}(x)=\frac{p(x)^{\alpha}q(x)^{1-\alpha}}{\sum_{x'\in\mathcal{X}}p(x')^{\alpha}q(x')^{1-\alpha}}.$$
It is clear to see that $\log s_{\alpha}$ is a convex combination of $\log p$ and $\log q$ with a normalizing constant. We can now define the log-convex hull $\operatorname{logconv}(\hat{\mathcal{X}}_{\log})$, which is the set of log-convex combinations of probability measures in the set $\hat{\mathcal{X}}_{\log}$. More precisely,
$$\operatorname{logconv}(\hat{\mathcal{X}}_{\log})=\left\{q_{r}:q_{r}(x)=\frac{1}{Z_{r}}\prod_{\hat{x}\in\hat{\mathcal{X}}}P_{X|\hat{X}^{\star}=\hat{x}}(x)^{r(\hat{x})},\ r\in\mathcal{P}(\hat{\mathcal{X}})\right\},$$
where $r$ is a weight vector (i.e., $r(\hat{x})\ge0$ and $\sum_{\hat{x}}r(\hat{x})=1$), and $Z_{r}=\sum_{x\in\mathcal{X}}\prod_{\hat{x}\in\hat{\mathcal{X}}}P_{X|\hat{X}^{\star}=\hat{x}}(x)^{r(\hat{x})}$ is a normalizing constant. By definition, $\operatorname{logconv}(\hat{\mathcal{X}}_{\log})$ is log-convex since it contains all log-convex combinations of probability distributions in $\hat{\mathcal{X}}_{\log}$.
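A log-convex combination is simply a normalized weighted geometric mean, as the following small Python helper shows; the distributions and weights are arbitrary examples.

```python
import numpy as np

def log_convex_combination(dists, r):
    """Normalized weighted geometric mean: q_r(x) ∝ Π_k p_k(x)^{r_k}."""
    dists = np.asarray(dists, dtype=float)    # rows are probability distributions
    r = np.asarray(r, dtype=float)            # weights: nonnegative, summing to one
    q = np.exp(r @ np.log(dists))             # exp(Σ_k r_k log p_k(x))
    return q / q.sum()

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.2, 0.5, 0.3])
print(log_convex_combination([p, q], [0.5, 0.5]))
```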
Instead of having the projection of $P_{X|f(X)=m}$ to $\hat{\mathcal{X}}_{\log}$, we consider the projection to $\operatorname{logconv}(\hat{\mathcal{X}}_{\log})$. Since $\operatorname{logconv}(\hat{\mathcal{X}}_{\log})$ is log-convex, ([17], Theorem 1) implies that there exists a unique probability distribution $q_{m}^{\star}$ that achieves the following minimum:
$$q_{m}^{\star}=\operatorname*{arg\,min}_{q\in\operatorname{logconv}(\hat{\mathcal{X}}_{\log})}D\bigl(P_{X|f(X)=m}\,\big\|\,q\bigr).\qquad(14)$$
Projection $q_{m}^{\star}$ is called an rI-projection of $P_{X|f(X)=m}$ to $\operatorname{logconv}(\hat{\mathcal{X}}_{\log})$. Let $r^{\star}$ be the corresponding weights, i.e.,
$$q_{m}^{\star}(x)=\frac{1}{Z_{r^{\star}}}\prod_{\hat{x}\in\hat{\mathcal{X}}}P_{X|\hat{X}^{\star}=\hat{x}}(x)^{r^{\star}(\hat{x})}.$$
Csiszár and Matúš ([17], Theorem 1) showed that the rI-projection satisfies the following inequality for all $q\in\operatorname{logconv}(\hat{\mathcal{X}}_{\log})$:
$$D\bigl(P_{X|f(X)=m}\,\big\|\,q\bigr)\ge D\bigl(P_{X|f(X)=m}\,\big\|\,q_{m}^{\star}\bigr)+D\bigl(q_{m}^{\star}\,\big\|\,q\bigr).\qquad(15)$$
On the other hand, the log-convex combination of probability measures is called the geometric mean of probability measures [18]. The author also provided the geometric compensation identity, which is given by
$$\sum_{\hat{x}\in\hat{\mathcal{X}}}r(\hat{x})\,D\bigl(s\,\big\|\,P_{X|\hat{X}^{\star}=\hat{x}}\bigr)=D\bigl(s\,\big\|\,q_{r}\bigr)+\sum_{\hat{x}\in\hat{\mathcal{X}}}r(\hat{x})\,D\bigl(q_{r}\,\big\|\,P_{X|\hat{X}^{\star}=\hat{x}}\bigr).\qquad(16)$$
The above result holds for any distribution $s\in\mathcal{P}(\mathcal{X})$; therefore, Equation (16) also holds when $s=P_{X|f(X)=m}$ and $r=r^{\star}$. Together with Inequality (15), we get the following result. For all $\hat{x}\in\hat{\mathcal{X}}$, if $r^{\star}(\hat{x})>0$, then
$$D\bigl(P_{X|f(X)=m}\,\big\|\,P_{X|\hat{X}^{\star}=\hat{x}}\bigr)=D\bigl(P_{X|f(X)=m}\,\big\|\,q_{m}^{\star}\bigr)+D\bigl(q_{m}^{\star}\,\big\|\,P_{X|\hat{X}^{\star}=\hat{x}}\bigr).\qquad(17)$$
Remark 5. The above result is similar to the projection onto a polytope in Euclidean space. Suppose vectors $v_{1},\dots,v_{k}$ form a polytope, and consider the projection from a vector $w$ to the polytope. Let $h$ be the projection. Then, $h$ is a convex combination of the $v_{i}$'s. Thus, there exist coefficients $c_{1},\dots,c_{k}$, such that
$$h=\sum_{i=1}^{k}c_{i}v_{i},$$
where $\sum_{i=1}^{k}c_{i}=1$, and $c_{i}\ge0$ for all $i$. Let $\mathcal{I}=\{i:c_{i}>0\}$ be the set of indices of nonzero coefficients. Then, projection $h$ is on the plane generated by $\{v_{i}:i\in\mathcal{I}\}$. Thus, the two vectors $w-h$ and $h-v_{i}$ are orthogonal for all $i\in\mathcal{I}$. Then, the Pythagorean theorem implies that, for all $i$, we have either $c_{i}=0$ or
$$\|w-v_{i}\|^{2}=\|w-h\|^{2}+\|h-v_{i}\|^{2}.$$
5.2. Optimization
As we saw in the previous section, we want to find the weight vector $r$ that minimizes $D\bigl(P_{X|f(X)=m}\,\|\,q_{r}\bigr)$. Note that
$$D\bigl(P_{X|f(X)=m}\,\big\|\,q_{r}\bigr)=-H\bigl(P_{X|f(X)=m}\bigr)-E\bigl[\log q_{r}(X)\,\big|\,X\in\mathcal{C}_{m}\bigr].$$
Since the first term is not a function of $r$, it is enough to consider the second term. By the definition of $q_{r}$, we have
$$-E\bigl[\log q_{r}(X)\,\big|\,X\in\mathcal{C}_{m}\bigr]=\log Z_{r}-\sum_{\hat{x}\in\hat{\mathcal{X}}}r(\hat{x})\,E\bigl[\log P_{X|\hat{X}^{\star}=\hat{x}}(X)\,\big|\,X\in\mathcal{C}_{m}\bigr].$$
Thus, minimizing $D\bigl(P_{X|f(X)=m}\,\|\,q_{r}\bigr)$ is equivalent to solving the following optimization problem:
$$\min_{r\in\mathcal{P}(\hat{\mathcal{X}})}\ \log\left(\sum_{x\in\mathcal{X}}\prod_{\hat{x}\in\hat{\mathcal{X}}}P_{X|\hat{X}^{\star}=\hat{x}}(x)^{r(\hat{x})}\right)-\sum_{\hat{x}\in\hat{\mathcal{X}}}r(\hat{x})\,E\bigl[\log P_{X|\hat{X}^{\star}=\hat{x}}(X)\,\big|\,X\in\mathcal{C}_{m}\bigr].$$
Since the objective function is a convex function of $r$ (a log-sum-exp of linear functions of $r$ plus a linear term), the above problem is a convex optimization problem that can be efficiently solved.
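Since the problem is a smooth convex program over the probability simplex, a generic solver suffices. The following hedged Python sketch uses scipy.optimize.minimize with a simplex constraint; the candidate distributions and the cluster posterior are hypothetical placeholders.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical candidate soft symbols (rows) and cluster posterior P_{X|f(X)=m}.
P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.2, 0.7]])
post = np.array([0.3, 0.4, 0.3])
logP = np.log(P)

def objective(r):
    # log Z_r - Σ_x̂ r(x̂) E[log P_{X|X̂*=x̂}(X) | f(X)=m]  (= D(post || q_r) up to a constant)
    return np.log(np.exp(r @ logP).sum()) - r @ (logP @ post)

k = P.shape[0]
res = minimize(objective, np.full(k, 1.0 / k), bounds=[(0.0, 1.0)] * k,
               constraints=[{"type": "eq", "fun": lambda r: r.sum() - 1.0}],
               method="SLSQP")
q_star = np.exp(res.x @ logP)
q_star /= q_star.sum()          # rI-projection of the posterior onto logconv(X̂_log)
print(res.x, q_star)
```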
5.3. Relaxation in Clustering
In the corresponding lossy-compression problem under logarithmic loss, reconstruction symbols are probability measures that have a natural algebraic structure, as we discussed in Section 3.3.3. In this section, we present the benefits of this property when we apply some known techniques from the clustering literature.
Lossy compression is closely related to the clustering problem [19,20,21]. Many works focused on the application of k-means clustering to a lossy-compression problem [22,23,24], which is an extension of the Lloyd–Max algorithm [25,26]. However, k-means clustering is only available when there exists a well-defined operation on $\hat{\mathcal{X}}$ (e.g., $\hat{\mathcal{X}}=\mathbb{R}^{d}$). This is because k-means clustering requires computing the mean of data points, which is the center of each cluster. In general lossy-compression problems, reconstruction alphabet $\hat{\mathcal{X}}$ may not have such an operation. In such cases, we may have to apply k-medoid-like clustering [27], where the center of each cluster has to be a data point. The k-medoid-like algorithm in the context of lossy compression is shown in Algorithm 1.
Algorithm 1: k-medoid-like clustering in lossy compression.
Randomly initialize $g(m)\in\hat{\mathcal{X}}$ for $m=1,\dots,M$.
repeat
  Set $\mathcal{C}_{m}=\emptyset$ for all $m$.
  for $x\in\mathcal{X}$ do
    $f(x)=\operatorname*{arg\,min}_{m}d(x,g(m))$; set $\mathcal{C}_{f(x)}=\mathcal{C}_{f(x)}\cup\{x\}$.
  end for
  for $m=1$ to $M$ do
    $g(m)=\operatorname*{arg\,min}_{\hat{x}\in\hat{\mathcal{X}}}E[d(X,\hat{x})\mid X\in\mathcal{C}_{m}]$.
  end for
until converge
On the other hand, in the corresponding problem, the reconstruction alphabet is a set of probability distributions, where operations such as log-convex combinations are well-defined. This allows us to propose a k-means-like clustering algorithm, as shown in Algorithm 2.
Algorithm 2: k-means-like clustering in lossy compression.
Randomly initialize $q_{m}\in\operatorname{logconv}(\hat{\mathcal{X}}_{\log})$ for $m=1,\dots,M$.
repeat
  Set $\mathcal{C}_{m}=\emptyset$ for all $m$.
  for $x\in\mathcal{X}$ do
    $f(x)=\operatorname*{arg\,min}_{m}\ell_{\log}(x,q_{m})$, where $\ell_{\log}(x,q_{m})=\log\frac{1}{q_{m}(x)}$; set $\mathcal{C}_{f(x)}=\mathcal{C}_{f(x)}\cup\{x\}$.
  end for
  for $m=1$ to $M$ do
    $q_{m}=q_{r_{m}^{\star}}$, the rI-projection of $P_{X|X\in\mathcal{C}_{m}}$ onto $\operatorname{logconv}(\hat{\mathcal{X}}_{\log})$.
  end for
until converge
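A compact Python sketch of the alternation in Algorithm 2 is given below; it reuses the convex program of Section 5.2 for the rI-projection step, and all inputs (source distribution, candidate soft symbols, number of clusters) are hypothetical. It is a sketch of the idea rather than a tuned implementation.

```python
import numpy as np
from scipy.optimize import minimize

def rI_project(post, logP):
    """Weights r minimizing D(post || q_r); returns the projected distribution q_r."""
    k = logP.shape[0]
    obj = lambda r: np.log(np.exp(r @ logP).sum()) - r @ (logP @ post)
    res = minimize(obj, np.full(k, 1.0 / k), bounds=[(0.0, 1.0)] * k,
                   constraints=[{"type": "eq", "fun": lambda r: r.sum() - 1.0}],
                   method="SLSQP")
    q = np.exp(res.x @ logP)
    return q / q.sum()

def kmeans_like(p_x, logP, M, n_iter=20, seed=0):
    """Sketch of Algorithm 2: alternate log-loss assignment and rI-projection centers."""
    rng = np.random.default_rng(seed)
    n = len(p_x)
    centers = [rI_project(rng.dirichlet(np.ones(n)), logP) for _ in range(M)]
    f = np.zeros(n, dtype=int)
    for _ in range(n_iter):
        # Assignment step: f(x) = argmin_m log(1/q_m(x)).
        f = np.array([int(np.argmax([np.log(q[x]) for q in centers])) for x in range(n)])
        # Update step: rI-project each cluster posterior P_{X|f(X)=m} onto logconv(X̂_log).
        for m in range(M):
            mask = f == m
            if mask.any():
                post = np.where(mask, p_x, 0.0)
                centers[m] = rI_project(post / post.sum(), logP)
    return f, centers

# Hypothetical inputs: rows of P are the candidate soft symbols P_{X|X̂*=x̂}.
P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.2, 0.7]])
f, centers = kmeans_like(np.array([0.4, 0.35, 0.25]), np.log(P), M=2)
print(f)
```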
The main idea of the above algorithm is that the log-convex combination $q_{r_{m}^{\star}}$ behaves like the center of cluster $\mathcal{C}_{m}$. In the clustering literature, there are many known variations of k-means clustering [28,29]. The above result shows that we can borrow those techniques and apply them to the lossy-compression problem even without any algebraic structure on the reconstruction alphabet.
5.4. Application to General Clustering Problems
The idea of the previous section can be applied to an actual clustering problem. We mainly focus on clustering categorical data, where data points are not in a continuous space [30,31,32,33,34]. Since operations such as the mean are not well-defined in this case, it is hard to apply known data-clustering algorithms for continuous spaces. The key idea is that the equivalence relation with logarithmic loss allows an algebraic structure on any set. More precisely, we can transform any clustering problem into a clustering problem in continuous space and apply known techniques such as variations of k-means.
A more rigorous definition of the problem is given below. Assume that we have a finite set of data points $\mathcal{A}=\{a_{1},\dots,a_{N}\}$, and each data point $a_{i}$ has weight $w_{i}$. We normalize the weights so that $\sum_{i=1}^{N}w_{i}=1$; the weights may or may not be uniform. The distance between two points is given by a measure $d(a_{i},a_{j})$. Suppose we want to partition the data points into $M$ clusters.
If we let $\mathcal{X}=\hat{\mathcal{X}}=\mathcal{A}$ and $P_{X}(a_{i})=w_{i}$, then the clustering problem turns out to be a lossy-compression problem under distortion measure $d$, where the number of messages is $M$. Let $D^{\star}(M)$ be the optimal achievable distortion, and let $P_{\hat{X}^{\star}|X}$ be the distribution that achieves rate-distortion function $R(D^{\star}(M))$ as defined in Equation (4). Then, we can find the corresponding lossy-compression problem under logarithmic loss. Finally, we can apply clustering algorithms for continuous spaces, such as k-means, to the corresponding problem. For example, Algorithm 2 can be applied to the corresponding problem.
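Putting this together, a categorical dataset with weights is converted into a pair $(P_{X},d)$, after which the machinery above applies. The hedged Python sketch below prepares these inputs for a toy dataset and approximates the soft reconstruction symbols with a Blahut–Arimoto run at an assumed slope; the resulting distributions can then be fed to the k-means-like routine sketched earlier.

```python
import numpy as np

# Hypothetical categorical data: items, normalized weights, and a dissimilarity matrix.
items = ["red", "green", "blue", "cyan"]
w = np.array([0.4, 0.3, 0.2, 0.1])                  # plays the role of P_X
d = np.array([[0.0, 1.0, 1.0, 1.0],
              [1.0, 0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0, 0.5],
              [1.0, 1.0, 0.5, 0.0]])                # d(a_i, a_j), with X = X̂ = items

# Approximate the reverse channel P_{X|X̂*} with Blahut–Arimoto at an assumed slope;
# per Remark 6 below, an approximate choice is good enough for the corresponding problem.
lam = 3.0
q = np.full(len(items), 1.0 / len(items))
for _ in range(1000):
    fwd = q[None, :] * np.exp(-lam * d)
    fwd /= fwd.sum(axis=1, keepdims=True)
    q = w @ fwd
rev = (w[:, None] * fwd) / np.maximum(q[None, :], 1e-12)   # columns: soft symbols P_{X|X̂*=a_j}

# rev.T plays the role of X̂_log; e.g., pass logP = np.log(rev.T) to the k-means-like sketch.
print(np.round(rev, 3))
```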
Remark 6. Note that it is hard to obtain an exact analytic formula for $D^{\star}(M)$ or $P_{\hat{X}^{\star}|X}$. However, as we mentioned in Section 3.3.2, we do not have to find an optimal scheme under the exact problem formulation. If we can provide a good scheme for the corresponding problem built with an approximate choice of $P_{\hat{X}^{\star}|X}$, it should be a good enough scheme for the original problem.