Qualitative Evaluation

Figure 2 shows t-SNE plots for embeddings produced with *LSC* versus *Lspread* on the CIFAR10 test set. *Lspread* produces embeddings that are more spread out than those produced by *LSC* and avoids class collapse. As a result, images from different strata can be better differentiated in embedding space. For example, we show two dogs, one from a common stratum and one from a rare stratum (rare pose). The two dogs are much more distinguishable by distance in the *Lspread* embedding space, which suggests that it helps preserve distinctions between strata.
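A comparison like the one in Figure 2 can be produced by projecting each set of embeddings to 2D with t-SNE. A minimal sketch (random arrays stand in here for the encoder outputs on the CIFAR10 test set; the sizes are illustrative assumptions):

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-ins for encoder outputs; in practice these would be f(x) over the
# test set for encoders trained with each loss.
rng = np.random.default_rng(0)
emb_sc = rng.normal(size=(200, 32))      # embeddings from the L_SC-trained encoder
emb_spread = rng.normal(size=(200, 32))  # embeddings from the L_spread-trained encoder

# Project each embedding set to 2D for visualization.
proj_sc = TSNE(n_components=2, init="pca", random_state=0).fit_transform(emb_sc)
proj_spread = TSNE(n_components=2, init="pca", random_state=0).fit_transform(emb_spread)

print(proj_sc.shape, proj_spread.shape)  # (200, 2) (200, 2)
```

The two projections can then be scattered side by side, colored by class or stratum, to inspect spread and collapse.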

### **4. Geometry of Strata**

We first discuss some existing theoretical tools for analyzing contrastive loss geometrically and their shortcomings with respect to understanding how strata are embedded. In Section 4.2, we propose a simple thought experiment about the distances between strata in embedding space when the encoder is trained on a finite subsample of the data, to better understand our prior qualitative observations. Then, in Section 4.3, we discuss implications of representations that preserve strata distinctions, showing theoretically how they can yield better generalization error on both coarse-to-fine transfer and the original task, and empirically how they allow for new downstream applications.

### *4.1. Existing Analysis*

Previous works have studied the geometry of optimal embeddings under contrastive learning [2,8,9], but their techniques cannot analyze strata because strata information is not directly used in the loss function. These works use the *infinite encoder* assumption, under which any distribution on the hypersphere S<sup>*d*−1</sup> is realizable by the encoder *f* applied to the input data. This makes minimizing the contrastive loss equivalent to an optimization problem over probability measures on the hypersphere. Solving this new problem yields a distribution whose characterization is determined solely by information in the loss function (e.g., label information [2,9]) and is decoupled from other information about the input data *x*, and hence decoupled from strata.

More precisely, if we denote the measure of *x* ∈ X as *μ*<sub>X</sub>, minimizing the contrastive loss over the mapping *f* is equivalent (at the population level) to minimizing over the pushforward measure *μ*<sub>X</sub> ◦ *f*<sup>−1</sup> : S<sup>*d*−1</sup> → [0, 1]. The infinite encoder assumption allows us to relax the problem and instead optimize over any *μ* ∈ M(S<sup>*d*−1</sup>), the set of Borel probability measures on the hypersphere. The optimal *μ* learned is then independent of the distribution P of the input data beyond what enters the relaxed objective function.
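Schematically, the relaxation described above can be written as follows (notation as in the text; the exact form of the population objective L depends on the particular contrastive loss):

```latex
\min_{f}\; \mathcal{L}\left(\mu_{\mathcal{X}} \circ f^{-1}\right)
\quad\xrightarrow{\text{infinite encoder}}\quad
\min_{\mu \,\in\, \mathcal{M}(S^{d-1})}\; \mathcal{L}(\mu)
```

The right-hand problem no longer mentions the input data beyond what the objective itself encodes, which is why its solution carries no strata information.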

This approach using the infinite encoder assumption does not allow for analysis of strata. Strata are unknown at training time and thus cannot be incorporated explicitly into the loss function, so their geometry is not reflected in the characterization of the optimal distribution obtained from previous theoretical tools. We therefore need additional reasoning to explain our empirical observation that strata distinctions are preserved in embedding space under *Lspread*.

### *4.2. Subsampling Strata*

We propose a simple thought experiment based on *subsampling the dataset* (randomly sampling a fraction of the training data) to analyze strata. Consider the following: we subsample a fraction *t* ∈ [0, 1] of a training set of *N* points from P. We use this subsampled dataset D<sub>*t*</sub> to learn an encoder *f̂*<sub>*t*</sub>, and we study the average distance under *f̂*<sub>*t*</sub> between two strata *z* and *z*′ as *t* varies.

The average distance between *z* and *z*′ is *δ*(*f̂*<sub>*t*</sub>, *z*, *z*′) = ‖E<sub>*x*∼P<sub>*z*</sub></sub>[*f̂*<sub>*t*</sub>(*x*)] − E<sub>*x*∼P<sub>*z*′</sub></sub>[*f̂*<sub>*t*</sub>(*x*)]‖<sub>2</sub>, and it depends on whether *z* and *z*′ are both in the subsampled dataset. We study the setting where *z* and *z*′ belong to the same class. Based on strata frequency and *t*, there are three cases (with probabilities stated in Appendix C.2): both strata appear in D<sub>*t*</sub>, exactly one does, or neither does.
We make two observations from these cases. First, if *z* and *z*′ are both common strata, then as *t* increases, the distance between them depends on the optimal asymptotic distribution; therefore, if we set *α* = 1 in *Lspread*, these common strata will collapse. Second, if *z* is a common stratum and *z*′ is uncommon, the second case occurs frequently over randomly sampled D<sub>*t*</sub>, and thus the strata are separated based on the difficulty of the respective out-of-distribution problem. We thus arrive at the following insight from our thought experiment:

*Common strata are more tightly clustered together, while rarer and more semantically distinct strata are far away from them.*

Figure 3 demonstrates this insight with a t-SNE visualization of embeddings from training on CIFAR100 with coarse superclass labels and artificially imbalanced subclasses. Points from the largest subclasses (dark blue) cluster tightly, whereas points from the smallest subclasses (light blue) are scattered throughout the embedding space.

**Figure 3.** Points from large subclasses cluster tightly; points from small subclasses scatter (CIFAR100-Coarse, unbalanced subclasses).
