## 4.3. Implications

We discuss theoretical and practical implications of our subsampling argument. First, we show that on both the coarse-to-fine transfer task (*x*, *z*) and the original task (*x*, *y*), embeddings that preserve strata yield better generalization error. Second, we discuss practical implications arising from our subsampling argument that enable new applications.

### 4.3.1. Theoretical Implications

Consider $\hat{f}_1$, the encoder trained on $\mathcal{D}$ with all $N$ points using $\mathcal{L}_{spread}$, and suppose a mean classifier is used for the end model, e.g., $W_y = \mathbb{E}_{x|y}[\hat{f}_1(x)]$ and $W_z = \mathbb{E}_{x|z}[\hat{f}_1(x)]$. On coarse-to-fine transfer, the generalization error depends on how far each stratum center is from the others.
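
To make the mean-classifier setup concrete, here is a minimal NumPy sketch. The array names and the dot-product scoring rule are illustrative assumptions, not an API from this paper: `emb` holds the frozen embeddings $\hat{f}_1(x)$ and `labels` holds the class (or stratum) label of each point.

```python
import numpy as np

def mean_classifier_weights(emb: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """One weight vector per class: W_y = E_{x|y}[f1_hat(x)],
    estimated by the per-class mean of the embeddings."""
    classes = np.unique(labels)
    return np.stack([emb[labels == c].mean(axis=0) for c in classes])

def predict(emb: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Score each point against every class mean and take the argmax
    (indices follow the np.unique ordering of the classes)."""
    return (emb @ W.T).argmax(axis=1)
```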

**Lemma 1.** *There exists $\lambda_z > 0$ such that the generalization error on the coarse-to-fine transfer task is at most*

$$\mathcal{L}(\mathbf{x}, z, \hat{f}\_1) \le \mathbb{E}\_z \left[ \log \left( \sum\_{z' \in \mathcal{Z}} \exp \left( -\lambda\_z \left( \frac{1}{2} \delta(\hat{f}\_1, z, z')^2 - 1 \right) \right) \right) \right] - 1,\tag{4}$$

*where $\delta(\hat{f}_1, z, z')$ is the average distance between strata $z$ and $z'$ defined in Section 4.2.*

The larger the distances between strata, the smaller the upper bound on the generalization error. We now show that a similar result holds on the original task (*x*, *y*), but with an additional term that penalizes points from the same class being too far apart.
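
To make the bound concrete, the right-hand side of Eq. (4) can be evaluated numerically from a matrix of average inter-stratum distances. The sketch below is illustrative only: it assumes a precomputed matrix `delta` with entries $\delta(\hat{f}_1, z, z')$, a uniform distribution over strata for the outer expectation, and an arbitrary choice of $\lambda_z$, none of which are fixed by the lemma.

```python
import numpy as np

def lemma1_bound(delta: np.ndarray, lam_z: float = 1.0) -> float:
    """RHS of Eq. (4): E_z[log sum_{z'} exp(-lam_z * (delta^2/2 - 1))] - 1.
    `delta[i, j]` is the average distance between strata i and j;
    strata are taken as uniformly likely for the outer expectation."""
    inner = np.exp(-lam_z * (0.5 * delta**2 - 1.0))  # one term per (z, z') pair
    per_stratum = np.log(inner.sum(axis=1))          # log-sum-exp over z' for each z
    return per_stratum.mean() - 1.0                  # uniform expectation over z
```

Increasing the off-diagonal entries of `delta` shrinks each log-sum-exp term, matching the interpretation above.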

**Lemma 2.** *There exists $\lambda_y > 0$ such that the generalization error on the original task is at most*

$$\mathcal{L}(\mathbf{x}, y, \hat{f}\_1) \le \mathbb{E}\_{z} \Bigg[ \mathbb{E}\_{z' \mid S(z)} \left[ \frac{1}{2} \delta(\hat{f}\_1, z, z')^2 - 1 \right] \tag{5}$$

$$+ \log \left( \sum\_{y \in \mathcal{Y}} \exp \left( \mathbb{E}\_{z'|y} \left[ -\lambda\_y \left( \frac{1}{2} \delta(\hat{f}\_1, z, z')^2 - 1 \right) \right] \right) \right) \Bigg]. \tag{6}$$

This result suggests that maximizing distances between strata of different classes is desirable, while the first term in the expression indicates that distances between strata of the same class should not grow too large. Both results illustrate that separating strata to some extent in the embedding space yields better bounds on the generalization error. In Appendix C.3, we provide proofs of these results and, for comparison, derive the generalization error for these two tasks under class collapse.
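
The same kind of numerical check applies to Eqs. (5)-(6). The sketch below assumes, as in the display above, that both terms sit under the outer expectation over $z$, that strata are uniformly likely, and that a `stratum_class` array maps each stratum to its class; these names and the choice of $\lambda_y$ are illustrative assumptions.

```python
import numpy as np

def lemma2_bound(delta: np.ndarray, stratum_class: np.ndarray,
                 lam_y: float = 1.0) -> float:
    """RHS of Eqs. (5)-(6) under a uniform distribution over strata.
    `delta[i, j]` is the average distance between strata i and j;
    `stratum_class[i]` is the class that stratum i belongs to."""
    classes = np.unique(stratum_class)
    g = 0.5 * delta**2 - 1.0                       # the (delta^2 / 2 - 1) terms
    total = 0.0
    for z in range(delta.shape[0]):
        same = stratum_class == stratum_class[z]   # strata z' in the same class as z
        first = g[z, same].mean()                  # E_{z'|S(z)}[delta^2/2 - 1]
        # log of sum over classes y of exp(E_{z'|y}[-lam_y * (delta^2/2 - 1)])
        second = np.log(sum(np.exp(-lam_y * g[z, stratum_class == y].mean())
                            for y in classes))
        total += first + second
    return total / delta.shape[0]                  # uniform expectation over z
```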

### 4.3.2. Practical Implications

Our discussion in Section 4.2 suggests that training with $\mathcal{L}_{spread}$ better distinguishes strata in embedding space. As a result, we can exploit differences between strata of different sizes in downstream applications. For example, unsupervised clustering can help recover pseudolabels for unlabeled, rare strata. These pseudolabels can be used as inputs to worst-group robustness algorithms, or used to detect noisy labels, which appear to be rare strata during training (see Section 5.3 for examples). We can also train over subsampled datasets to heuristically distinguish points that come from common strata from points that come from rare strata. We can then downsample points from common strata to construct minimal coresets (see Section 5.4 for examples).
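
As one illustration of the pseudolabel recipe, the sketch below clusters embeddings with scikit-learn's `KMeans` and flags points in unusually small clusters as candidate rare strata. The function name, cluster count, and rarity threshold are hypothetical choices, not values from this paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def rare_stratum_pseudolabels(emb: np.ndarray, n_clusters: int = 50,
                              rare_frac: float = 0.01) -> np.ndarray:
    """Cluster the embeddings, then return a boolean mask marking points
    assigned to clusters smaller than `rare_frac` of the dataset."""
    assign = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(emb)
    sizes = np.bincount(assign, minlength=n_clusters)
    rare_clusters = np.flatnonzero(sizes < rare_frac * len(emb))
    return np.isin(assign, rare_clusters)
```

The resulting mask can then serve as group labels for a worst-group robustness algorithm, or flag candidate noisy labels, along the lines discussed above.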
