**3. Method**

We now highlight some theoretical problems with class collapse under our generative model of strata (Section 3.1). We then propose and qualitatively analyze the loss function $L_{spread}$ (Section 3.2).

### *3.1. Theoretical Motivation*

We show that the conditions under which collapsed embeddings minimize generalization error on coarse-to-fine transfer and the original task do *not* hold when distinct strata exist.

Consider the downstream *coarse-to-fine transfer* task $(x, z)$ of using embeddings $f(x)$ learned on $(x, y)$ to classify points by fine-grained strata. Formally, coarse-to-fine transfer involves learning an end model with weight matrix $W \in \mathbb{R}^{C \times d}$ and fixed $f(x)$ (as described in Section 2.2) on points $(x, z)$, where we assume the data are class-balanced across $z$.
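To make this setup concrete, here is a minimal PyTorch sketch of such a probe, assuming a cross-entropy end-model loss; the names (`coarse_to_fine_probe`, `num_strata`) are illustrative rather than part of the setup in Section 2.2.

```python
# A minimal sketch of the coarse-to-fine probe, assuming a cross-entropy
# end-model loss; names (coarse_to_fine_probe, num_strata) are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

def coarse_to_fine_probe(emb, z, num_strata, epochs=100, lr=1e-2):
    """Fit W on frozen embeddings emb = f(x) against strata labels z."""
    d = emb.size(1)
    W = nn.Linear(d, num_strata, bias=False)   # the weight matrix W
    opt = torch.optim.SGD(W.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = F.cross_entropy(W(emb), z)      # f(x) stays fixed; only W moves
        loss.backward()
        opt.step()
    return W
```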

**Observation 1.** *Class collapse minimizes $L(x, z, f)$ if, for all $x$: (1) $p(y = h(x) \mid x) = 1$, meaning that each $x$ is deterministically assigned to one class, and (2) $p(z \mid x) = \frac{1}{m}$ where $z \in S_{h(x)}$. The second condition implies that $p(x \mid z) = p(x \mid y)$ for all $z \in S_y$, meaning that there is no distinction among strata from the same class. This contradicts our data model described in Section 2.1.*
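As a quick check of the implication in condition (2) (full proofs are in Appendix D.1), combining the two conditions with Bayes' rule, and writing $m$ for the number of strata per class and $y = h(x)$:

$$p(x \mid z) = \frac{p(z \mid x)\, p(x)}{p(z)} = \frac{\frac{1}{m}\, p(x)\, \mathbf{1}[h(x) = y]}{\frac{1}{m}\, p(y)} = p(x \mid y),$$

since $p(z) = \sum_{x} p(z \mid x)\, p(x) = \frac{1}{m}\, p(y)$ for any $z \in S_y$: each $x$ in class $y$ spreads its mass uniformly over that class's $m$ strata.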

Similarly, we characterize when collapsed embeddings are optimal for the original task $(x, y)$.

**Observation 2.** *Class collapse minimizes $L(x, y, f)$ if, for all $x$, $p(y = h(x) \mid x) = 1$. This contradicts our data model.*

Proofs are in Appendix D.1. We also analyze the transferability of $f$ on arbitrary new distributions $(x', y')$ information-theoretically in Appendix C.1, finding that a one-to-one encoder obeys the Infomax principle [7] better than collapsed embeddings on $(x', y')$. These observations suggest that a distribution over the embeddings that preserves strata distinctions and does not collapse classes is more desirable.

### *3.2. Modified Contrastive Loss Lspread*

We introduce the loss $L_{spread}$, a weighted sum of two contrastive losses $L_{attract}$ and $L_{repel}$. $L_{attract}$ is a supervised contrastive loss, while $L_{repel}$ encourages intra-class separation. For $\alpha \in [0, 1]$,

$$L_{spread} = \alpha L_{attract} + (1 - \alpha) L_{repel}. \tag{1}$$

For a given anchor $x_i$, define $x_i^{aug}$ as an augmentation of the same point as $x_i$. Define the set of positive examples for $i$ to be $P(i, \mathcal{B}) = \{p \in \mathcal{B} \setminus i : h(p) = h(i)\}$ and the set of negative examples to be $N(i, \mathcal{B}) = \{a \in \mathcal{B} \setminus i : h(a) \neq h(i)\}$. Then,

$$\hat{L}_{attract}(f, \mathbf{x}_i, \mathcal{B}) = \frac{-1}{|P(i, \mathcal{B})|} \sum_{p \in P(i, \mathcal{B})} \log \frac{\exp(\sigma(\mathbf{x}_i, \mathbf{x}_p))}{\exp(\sigma(\mathbf{x}_i, \mathbf{x}_p)) + \sum_{a \in N(i, \mathcal{B})} \exp(\sigma(\mathbf{x}_i, \mathbf{x}_a))} \tag{2}$$

$$\hat{L}_{repel}(f, \mathbf{x}_i, \mathcal{B}) = -\log \frac{\exp(\sigma(\mathbf{x}_i, \mathbf{x}_i^{aug}))}{\sum_{p \in P(i, \mathcal{B})} \exp(\sigma(\mathbf{x}_i, \mathbf{x}_p))}. \tag{3}$$

$\hat{L}_{attract}$ is a variant of the SupCon loss, which encourages class separation in embedding space as suggested by Graf et al. [2]. $\hat{L}_{repel}$ is a class-conditional InfoNCE loss, where the positive distribution consists of augmentations and the negative distribution consists of i.i.d. samples from the same class. It encourages points within a class to be spread apart, as suggested by the analysis of the InfoNCE loss by Wang and Isola [8].
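To illustrate Eqs. (1)-(3) concretely, here is a minimal PyTorch sketch computed over a batch. It assumes $\sigma$ is temperature-scaled cosine similarity on normalized embeddings and that every anchor has at least one same-class point in the batch; the names (`spread_loss`, `tau`) and these choices are ours, not fixed by this section.

```python
# A minimal sketch of Eqs. (1)-(3). Assumed (not fixed by this section):
# sigma(u, v) is temperature-scaled cosine similarity on L2-normalized
# embeddings, and every anchor has at least one same-class batch point.
import torch

def spread_loss(z, z_aug, y, alpha=0.5, tau=0.1):
    """Eq. (1): alpha * L_attract + (1 - alpha) * L_repel over a batch.

    z:     (n, d) normalized embeddings f(x_i)
    z_aug: (n, d) normalized embeddings of the augmentations x_i^aug
    y:     (n,)   integer class labels h(x_i)
    """
    n = z.size(0)
    sim = z @ z.t() / tau                   # sigma(x_i, x_p) for all pairs
    sim_aug = (z * z_aug).sum(dim=1) / tau  # sigma(x_i, x_i^aug)

    eye = torch.eye(n, dtype=torch.bool, device=z.device)
    pos = ((y.unsqueeze(0) == y.unsqueeze(1)) & ~eye).float()  # P(i, B)
    neg = (y.unsqueeze(0) != y.unsqueeze(1)).float()           # N(i, B)

    exp_sim = sim.exp()
    # Eq. (2): each positive pair competes against all negatives of anchor i.
    denom_attract = exp_sim + (exp_sim * neg).sum(dim=1, keepdim=True)
    log_frac = sim - denom_attract.log()
    l_attract = -(log_frac * pos).sum(dim=1) / pos.sum(dim=1).clamp(min=1)

    # Eq. (3): the augmentation is the positive; same-class batch points
    # form the denominator, pushing intra-class neighbors apart.
    l_repel = -(sim_aug - (exp_sim * pos).sum(dim=1).log())

    return (alpha * l_attract + (1 - alpha) * l_repel).mean()
```

Setting $\alpha = 1$ recovers the pure attraction (SupCon-style) term, while smaller $\alpha$ trades class separation for intra-class spread.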
