**5. Experiments**

This section evaluates *Lspread* on embedding quality (Sections 5.2–5.4) and end model quality (Section 5.5).


### *5.1. Datasets and Models*

Table 1 lists the datasets we use in our evaluation. CIFAR10, CIFAR100, and MNIST are standard computer vision datasets. We also use coarse versions of each, in which classes are combined into coarse superclasses (animals/vehicles for CIFAR10, the standard superclasses for CIFAR100, and <5 vs. ≥5 for MNIST). In CIFAR100-Coarse-U, some subclasses are artificially imbalanced. Waterbirds, ISIC, and CelebA are image datasets with documented hidden strata [5,14–16]. We use a ViT model [17] (4 × 4 patches, 7 layers) for CIFAR and MNIST and a ResNet50 for the rest. For the ViT models, we jointly optimize the contrastive loss with a cross-entropy loss head. For the ResNets, we train with the contrastive loss alone and use linear probing on the final layer. More details are in Appendix E.
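For concreteness, the sketch below shows one way the joint objective for the ViT models could be set up, with a cross-entropy head optimized alongside the contrastive loss on the embeddings. The `contrastive_loss` callable and the weight `lam` are placeholders, not the paper's exact formulation of *Lspread*.

```python
# Minimal sketch of a joint cross-entropy + contrastive objective (assumed form).
import torch
import torch.nn.functional as F

def joint_loss(embeddings, logits, labels, contrastive_loss, lam=1.0):
    """Cross entropy on the classification head plus a weighted
    contrastive term on the (normalized) embeddings."""
    z = F.normalize(embeddings, dim=1)    # contrastive losses assume unit-norm embeddings
    ce = F.cross_entropy(logits, labels)  # supervised classification term
    con = contrastive_loss(z, labels)     # e.g. Lspread or LSC on the same batch
    return ce + lam * con
```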


**Table 1.** Summary of the datasets we use for evaluation.

### *5.2. Coarse-to-Fine Transfer Learning*

In this section, we use coarse-to-fine transfer learning to evaluate how well *Lspread* retains strata information in the embedding space. We train on coarse superclass labels, freeze the weights, and then train a linear layer with subclass labels. This supervised strata-recovery setting isolates how well the embeddings can recover strata under ideal conditions. As baselines, we compare against training with *LSC* and the SimCLR loss *LSS*.
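The following is a minimal sketch of this transfer protocol, assuming a generic PyTorch `encoder` that maps images to `emb_dim`-dimensional embeddings and a dataloader `train_loader_fine` that yields subclass labels; it is an illustration, not the paper's training code.

```python
# Sketch of coarse-to-fine transfer: freeze the encoder, fit a linear probe on subclass labels.
import torch
import torch.nn as nn

def linear_probe(encoder, emb_dim, num_subclasses, train_loader_fine,
                 epochs=10, lr=1e-3, device="cpu"):
    encoder.eval()                                   # freeze the encoder trained on coarse labels
    for p in encoder.parameters():
        p.requires_grad = False

    probe = nn.Linear(emb_dim, num_subclasses).to(device)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for x, y_fine in train_loader_fine:          # fine-grained subclass labels
            x, y_fine = x.to(device), y_fine.to(device)
            with torch.no_grad():
                z = encoder(x)                       # frozen embeddings
            loss = loss_fn(probe(z), y_fine)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return probe
```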

Table 2 reports the results. We find that *Lspread* produces better embeddings for coarse-to-fine transfer learning than *LSC* and *LSS*. Lift over *LSC* ranges from 0.2 points on MNIST (a 16.7% error reduction) to 23.6 points on CIFAR10. *Lspread* also produces better embeddings than *LSS*, since *LSS* does not encode superclass labels in the embedding space.

**Table 2.** Performance of coarse-to-fine transfer on various datasets compared against contrastive baselines. In these tasks, we first train a model on coarse task labels, then freeze the representation and train a model on fine-grained subclass labels. *Lspread* produces embeddings that transfer better across all datasets. Best in bold.


### *5.3. Robustness: Worst-Group Accuracy and Label Noise*

In this section, we use robustness tasks to measure how well *Lspread* can recover strata in an unsupervised setting. We use clustering to detect rare strata as input to worst-group robustness algorithms, and we use a geometric heuristic over the embeddings to correct noisy labels.

To evaluate worst-group accuracy, we follow the experimental setup and datasets from Sohoni et al. [5]. We first train a model with class labels. We then cluster the embeddings to produce pseudolabels for hidden strata, which we use as input to a Group-DRO algorithm that optimizes worst-group robustness [14]. As baselines, we train the first stage with both *LSC* and a cross-entropy loss [5].
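The pseudolabeling step can be sketched as follows. This simplified version clusters embeddings within each class using k-means with an assumed number of clusters per class, `n_clusters`; the actual clustering pipeline follows Sohoni et al. [5].

```python
# Sketch: per-class clustering of embeddings into strata pseudolabels for Group-DRO.
import numpy as np
from sklearn.cluster import KMeans

def strata_pseudolabels(embeddings, class_labels, n_clusters=2, seed=0):
    """embeddings: (N, d) array; class_labels: (N,) array of class ids.
    Returns (N,) group pseudolabels, unique across classes."""
    groups = np.zeros(len(class_labels), dtype=int)
    offset = 0
    for c in np.unique(class_labels):
        idx = np.where(class_labels == c)[0]
        km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10)
        groups[idx] = km.fit_predict(embeddings[idx]) + offset  # cluster within this class
        offset += n_clusters
    return groups
```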

To evaluate robustness against noise, we introduce noisy labels to the contrastive loss head on CIFAR10. We detect noisy labels with a simple geometric heuristic: points with incorrect labels appear to be small strata, so they should be far away from other points of the same class. We then correct noisy points by assigning the label of the nearest cluster in the batch. More details can be found in Appendix E.
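A minimal sketch of this heuristic is given below, operating on a single batch of embeddings and labels: points far from their own class centroid are flagged and reassigned to the nearest class centroid. The distance threshold `tau` is a hypothetical hyperparameter; the exact procedure is described in Appendix E.

```python
# Sketch of the geometric noise-correction heuristic (batch-level, assumed threshold tau).
import torch
import torch.nn.functional as F

def correct_noisy_labels(embeddings, labels, tau=0.5):
    """embeddings: (B, d) float tensor; labels: (B,) long tensor of class ids.
    Returns a corrected copy of the labels."""
    z = F.normalize(embeddings, dim=1)
    classes = labels.unique()
    # per-class centroids within the batch
    centroids = torch.stack([z[labels == c].mean(dim=0) for c in classes])  # (C, d)
    dists = torch.cdist(z, centroids)                                       # (B, C)

    # distance of each point to the centroid of its own (possibly noisy) class
    own_idx = (labels.unsqueeze(1) == classes.unsqueeze(0)).int().argmax(dim=1)
    own_dist = dists[torch.arange(len(labels)), own_idx]

    corrected = labels.clone()
    suspect = own_dist > tau                        # far from own class: likely mislabeled
    corrected[suspect] = classes[dists[suspect].argmin(dim=1)]
    return corrected
```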

Table 3 shows the performance of unsupervised strata recovery and downstream worst-group robustness. *Lspread* outperforms both *LSC* and Sohoni et al. [5] on strata recovery, which translates to better worst-group robustness on Waterbirds and CelebA.

Figure 4 (left) shows the effect of noisy labels on performance. When noisy labels are uncorrected (purple), performance drops by up to 10 points at 50% noise. Applying our geometric heuristic (red) can recover 4.8 points at 50% noise, even without using *Lspread*. However, *Lspread* recovers an additional 0.9 points at 50% noise, and an additional 1.6 points at 20% noise (blue). In total, *Lspread* recovers 75% performance at 20% noise, whereas *LSC* only recovers 45% performance.

**Table 3.** Unsupervised strata recovery performance (top, F1), and worst-group performance (AUROC for ISIC, Acc for others) using recovered strata. Best in bold.


**Figure 4.** (**Left**) Performance of models under various amounts of label noise for the contrastive loss head. (**Right**) Performance of a ResNet18 trained with coresets of various sizes. Our coreset algorithm is competitive with the state-of-the-art in the large coreset regime (from 40–90% coresets), but maintains performance for small coresets (smaller than 40%). At the 10% coreset, our algorithm outperforms [18] by 32 points and matches random sampling.

### *5.4. Minimal Coreset Construction*

Now we evaluate how well training with *Lspread* on fractional samples of the dataset can distinguish points from large versus small strata by constructing minimal coresets for CIFAR10. We train a ResNet18 on CIFAR10, following Toneva et al. [18], and compare against baselines from Toneva et al. [18] (Forgetting) and Paul et al. [19] (GradNorm, L2Norm). For our coresets, we train with *Lspread* on subsamples of the dataset and record how often each point is correctly classified at the end of each run. We bucket points in the training set by how often they are correctly classified, and then iteratively remove points from the largest bucket in each class. This strategy removes easy examples first for the largest coresets, but maintains a set of easy examples in the smallest coresets.
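The selection procedure can be sketched as follows, assuming per-point correctness counts have already been recorded across runs; the one-point-at-a-time removal loop is a simplification for readability, not the paper's exact implementation.

```python
# Sketch: prune points from the largest correctness bucket of each class until the target size.
import numpy as np

def build_coreset(correct_counts, labels, target_size, seed=0):
    """correct_counts: (N,) ints, how often each point was classified correctly;
    labels: (N,) class ids. Returns indices of the retained coreset."""
    rng = np.random.default_rng(seed)
    keep = np.ones(len(labels), dtype=bool)
    while keep.sum() > target_size:
        removed = False
        for c in np.unique(labels):
            if keep.sum() <= target_size:
                break
            idx = np.where(keep & (labels == c))[0]
            if len(idx) == 0:
                continue
            # the largest bucket: the most common correctness count within this class
            values, sizes = np.unique(correct_counts[idx], return_counts=True)
            bucket = idx[correct_counts[idx] == values[np.argmax(sizes)]]
            keep[rng.choice(bucket)] = False   # drop one point from the largest bucket
            removed = True
        if not removed:
            break
    return np.where(keep)[0]
```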

Figure 4 (right) shows the results at various coreset sizes. For large coresets, our algorithm outperforms both methods from Paul et al. [19] and is competitive with Toneva et al. [18]. For small coresets, our method outperforms the baselines, providing up to 5.2 points of lift over Toneva et al. [18] with 30% of the training data. Our analysis helps explain this gap: removing too many easy examples hurts performance, since the remaining easy examples become rare and hard to classify.

### *5.5. Model Quality*

Finally, we confirm that *Lspread* produces higher-quality models and achieves better sample complexity than both *LSC* and the SimCLR loss *LSS* from [20]. Table 4 reports the performance of models across all our datasets. We find that *Lspread* achieves better overall performance compared to models trained with *LSC* and *LSS* in 7 out of 9 tasks, and matches performance in 1 task. We find up to 4.0 points of lift over *LSC* (Waterbirds), and up to 2.2 points of lift (AUROC) over *LSS* (ISIC). In Appendix F, we additionally evaluate the sample complexity of contrastive losses by training on partial subsamples of CIFAR10. *Lspread* outperforms *LSC* and *LSS* throughout.

**Table 4.** End model performance training with *Lspread* on various datasets compared against contrastive baselines. All metrics are accuracy except for ISIC (AUROC). *Lspread* produces the best performance in 7 out of 9 cases, and matches the best performance in 1 case. Best in bold.


### **6. Related Work and Discussion**

Within **contrastive learning**, we take inspiration from [21], who use a latent-classes view to study self-supervised contrastive learning. Similarly, [22] considers how minimizing the InfoNCE loss recovers a latent data-generating model. We initially approached the effects of noise in supervised contrastive learning from a debiasing angle, inspired by [23], but moved to our current strata-based view of noise instead. Recent work has also analyzed contrastive learning from an information-theoretic perspective [24–26], but this perspective does not fully explain practical behavior [27], so we focus on the geometric perspective because of the downstream applications. On the geometric side, we draw on theoretical tools from [8] and [2], who, along with [9], study representations on the hypersphere.

Our work builds on the recent wave of empirical interest in contrastive learning [20,28–31] and supervised contrastive learning [1]. There has also been empirical work analyzing the transfer performance of contrastive representations and the role of intra-class variability in transfer learning. [32] find that combining supervised and self-supervised contrastive losses improves transfer learning performance, and they hypothesize that this is due to both inter-class separation and intra-class variability. [33] find that combining cross entropy with a self-supervised contrastive loss improves coarse-to-fine transfer, also motivated by preserving intra-class variability.

We derive *Lspread* from motivations similar to the losses proposed in these works, and we further study theoretically why class collapse can hurt downstream performance. In particular, we study why preserving distinctions among strata in embedding space may be important, with theoretical results corroborating their empirical studies. We also propose a new thought experiment for why a combined loss function may lead to better separation of strata.

Our treatment of **strata** is strongly inspired by [5,6], who document empirical consequences of hidden strata. We also draw on empirical work demonstrating that detecting subclasses can be important for performance [4,34] and robustness [14,35,36].

Each of our downstream **applications** is a field in itself, and we take inspiration from recent work in each. Our noise heuristic is similar to ELR [37] and takes inspiration from various work that uses contrastive learning to correct noisy labels and for semi-supervised learning [38–40]. Our coreset algorithm is inspired by recent work on coresets for modern deep networks [19,41,42], and by [18] in particular.
