**7. Conclusions**

We propose a new supervised contrastive loss function to prevent class collapse and produce higher-quality embeddings. We discuss how our loss function better maintains strata distinctions in embedding space and explore several downstream applications. Future directions include encoding label hierarchies and other forms of knowledge in contrastive loss functions and extending our work to more modalities, models, and applications. We hope that our work inspires further research on fine-grained supervised contrastive loss functions and new theoretical approaches for reasoning about generalization and strata.

**Author Contributions:** Conceptualization, D.Y.F. and M.F.C.; methodology, D.Y.F. and M.F.C.; software, D.Y.F.; validation, D.Y.F. and M.Z.; formal analysis, M.F.C.; investigation, D.Y.F., M.F.C. and M.Z.; resources, D.Y.F. and M.F.C.; data curation, D.Y.F.; writing—original draft preparation, D.Y.F., M.F.C. and M.Z.; writing—review and editing, D.Y.F., M.F.C. and M.Z.; visualization, D.Y.F.; supervision, K.F. and C.R.; project administration, D.Y.F. and M.F.C.; funding acquisition, D.Y.F. and M.F.C. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by NIH under No. U54EB020405 (Mobilize), NSF under Nos. CCF1763315 (Beyond Sparsity), CCF1563078 (Volume to Velocity), and 1937301 (RTML); ONR under No. N000141712266 (Unifying Weak Supervision); ONR N00014-20-1-2480: Understanding and Applying Non-Euclidean Geometry in Machine Learning; N000142012275 (NEPTUNE); the Moore Foundation, NXP, Xilinx, LETI-CEA, Intel, IBM, Microsoft, NEC, Toshiba, TSMC, ARM, Hitachi, BASF, Accenture, Ericsson, Qualcomm, Analog Devices, the Okawa Foundation, American Family Insurance, Google Cloud, Salesforce, Total, the HAI-GCP Cloud Credits for Research program, the Stanford Data Science Initiative (SDSI), Department of Defense (DoD) through the National Defense Science and Engineering Graduate Fellowship (NDSEG) Program, and members of the Stanford DAWN project: Facebook, Google, and VMWare. The Mobilize Center is a Biomedical Technology Resource Center, funded by the NIH National Institute of Biomedical Imaging and Bioengineering through Grant P41EB027060. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views, policies, or endorsements, either expressed or implied, of NIH, ONR, or the U.S. Government.

**Institutional Review Board Statement:** Not Applicable.

**Informed Consent Statement:** Not Applicable.

**Data Availability Statement:** Datasets used in this paper are publicly available and described in Appendix E.

**Acknowledgments:** We thank Nimit Sohoni for helping with coreset and robustness experiments, and we thank Beidi Chen and Tri Dao for their helpful comments.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

We provide a glossary in Appendix A. Then we provide definitions of terms in Appendix B. We discuss additional theoretical results in Appendix C. We provide proofs in Appendix D. We discuss additional experimental details in Appendix E. Finally, we provide additional experimental results in Appendix F.

### **Appendix A. Glossary**

The glossary is given in Table A1 below.



### **Appendix B. Definitions**

We restate definitions used in our proofs.

**Definition A1** (Regular Simplex)**.** *The points* $\{v\_i\}\_{i=1}^{K}$ *form a regular simplex inscribed in the hypersphere if*

1. $\sum\_{i=1}^{K} v\_i = 0$;
2. $\|v\_i\| = 1$ *for all* $i$;
3. $\exists\, c\_K \le 1$ *s.t.* $v\_i^\top v\_j = c\_K$ *for* $i \ne j$.
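For illustration, the short NumPy sketch below (ours, not from the original paper) constructs such a set of points and checks the three properties; under this construction the simplex constant is $c\_K = -1/(K-1)$.

```python
import numpy as np

def regular_simplex(K: int) -> np.ndarray:
    """Return K unit vectors in R^K (as rows) forming a regular simplex."""
    E = np.eye(K)
    V = E - E.mean(axis=0)                             # center so the vertices sum to zero
    V /= np.linalg.norm(V, axis=1, keepdims=True)      # place the vertices on the unit hypersphere
    return V

V = regular_simplex(5)
print(np.allclose(V.sum(axis=0), 0))                   # property 1: vertices sum to zero
print(np.allclose(np.linalg.norm(V, axis=1), 1))       # property 2: unit norm
G = V @ V.T                                            # Gram matrix of pairwise dot products
print(np.allclose(G[~np.eye(5, dtype=bool)], -1 / 4))  # property 3: constant c_K = -1/(K-1)
```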

**Definition A2** (Downstream model)**.** *Once an encoder f*(*x*) *is learned, the downstream model consists of a linear classifier trained using the cross-entropy loss:*

$$\mathcal{L}(\mathcal{W}, \mathcal{D}) = \sum\_{\mathbf{x}\_{i} \in \mathcal{D}} -\log \frac{\exp \left( f(\mathbf{x}\_{i})^{\top} \mathbf{W}\_{h(\mathbf{x}\_{i})} \right)}{\sum\_{j=1}^{K} \exp \left( f(\mathbf{x}\_{i})^{\top} \mathbf{W}\_{j} \right)}. \tag{A1}$$

*Define* $\hat{W} := \arg\min\_{\|W\|\_2 \le 1} \hat{\mathcal{L}}(W, \mathcal{D})$*. Then, the end model's outputs are the probabilities*

$$\hat{p}(y|\mathbf{x}) = \hat{p}(y|f(\mathbf{x})) = \frac{\exp(f(\mathbf{x})^\top \hat{W}\_\mathbf{y})}{\sum\_{j=1}^K \exp(f(\mathbf{x})^\top \hat{W}\_j)},\tag{A2}$$

*and the generalization error is*

$$\mathcal{L}(\mathbf{x}, y, f) = \mathbb{E}\_{\mathbf{x}, y}[-\log \hat{p}(y|f(\mathbf{x}))].\tag{A3}$$
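As a concrete illustration of this downstream step, the PyTorch sketch below (ours; the encoder `f` and the tensors `X`, `y` are placeholders, and the norm constraint on $W$ is omitted for brevity) fits the linear classifier of (A1) on frozen embeddings.

```python
import torch
import torch.nn as nn

def train_linear_probe(f, X, y, num_classes: int, d: int, epochs: int = 100, lr: float = 1e-2):
    """Fit W by minimizing the cross-entropy loss in (A1) on frozen embeddings f(x)."""
    with torch.no_grad():
        Z = f(X)                               # frozen encoder outputs, shape (N, d)
    W = nn.Linear(d, num_classes, bias=False)  # rows of W.weight play the role of W_y
    opt = torch.optim.SGD(W.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()            # -log softmax(f(x)^T W)_{h(x)}
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(W(Z), y)
        loss.backward()
        opt.step()
    return W

# End model outputs, as in (A2):  probs = torch.softmax(W(f(X_new)), dim=-1)
```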

### **Appendix C. Additional Theoretical Results**

*Appendix C.1. Transfer Learning on* $(x, y')$

We now show an additional transfer learning result on new tasks $(x, y')$. Formally, recall that we learn the encoder $f$ on $(x, y) \sim \mathcal{P}$. We wish to use it on a new task with target distribution $(x, y') \sim \mathcal{P}'$. Based on the Infomax principle [7], we find that an injective encoder $f(x)$ is better suited to new distributions than collapsed embeddings.

**Observation 3.** *Define* $f\_c(y)$ *as the mapping to collapsed embeddings and* $f\_{1-1}(x)$ *as an injective mapping, both learned on* $\mathcal{P}$*. Construct a new variable* $y'$ *with joint distribution* $(x, y') \sim p(y'|x) \cdot p(x)$ *and suppose that* $y' \perp\!\!\!\perp y \mid x$*. Then, by the data processing inequality, it holds that* $I(y, y') \le I(x, y')$*, where* $I(\cdot, \cdot)$ *is the mutual information between two random variables. We apply* $f\_c$ *to* $y$ *and* $f\_{1-1}$ *to* $x$ *to get that*

$$I(f\_c(y), y') \le I(f\_{1-1}(x), y').$$

*Therefore,* $f\_{1-1}$ *obeys the Infomax principle [7] better on* $\mathcal{P}'$ *than* $f\_c$*. Via Fano's inequality, this statement implies that the Bayes risk for learning* $y'$ *from* $x$ *is lower using* $f\_{1-1}$ *than* $f\_c$*.*

*Appendix C.2. Probabilities of Strata* $z$, $z'$ *Appearing in Subsampled Dataset*

As discussed in Section 4.2, the distance between strata $z$ and $z'$ in embedding space depends on whether these strata appear in the subsampled dataset $\mathcal{D}\_t$ that the encoder was trained on. We define the exact probabilities of the three cases presented. Let $\Pr(z, z' \in \mathcal{D}\_t)$ be the probability that both strata are seen, $\Pr(z \in \mathcal{D}\_t, z' \notin \mathcal{D}\_t)$ be the probability that only $z$ is seen, and $\Pr(z, z' \notin \mathcal{D}\_t)$ be the probability that neither is seen.

First, the probability of neither stratum appearing in $\mathcal{D}\_t$ is easy to compute. In particular, we have that $\Pr(z, z' \notin \mathcal{D}\_t) = (1 - p(z) - p(z'))^{tN}$. This quantity decreases in $p(z)$ and $p(z')$, confirming that it is less likely for two common strata to not appear in $\mathcal{D}\_t$.

Second, the probability of $z$ being in $\mathcal{D}\_t$ and $z'$ not being in $\mathcal{D}\_t$ can be expressed as $\Pr(z \in \mathcal{D}\_t \mid z' \notin \mathcal{D}\_t) \cdot \Pr(z' \notin \mathcal{D}\_t)$. $\Pr(z' \notin \mathcal{D}\_t)$ is equal to $(1 - p(z'))^{tN}$, and $\Pr(z \in \mathcal{D}\_t \mid z' \notin \mathcal{D}\_t) = 1 - \Pr(z \notin \mathcal{D}\_t \mid z' \notin \mathcal{D}\_t) = 1 - (1 - p(z \mid z \in \mathcal{Z} \setminus z'))^{tN}$. Finally, note that $p(z \mid z \in \mathcal{Z} \setminus z') = \frac{p(z)}{1 - p(z')}$. Putting this together, we get that $\Pr(z \in \mathcal{D}\_t, z' \notin \mathcal{D}\_t) = (1 - p(z'))^{tN} - (1 - p(z) - p(z'))^{tN}$, and we can similarly construct $\Pr(z' \in \mathcal{D}\_t, z \notin \mathcal{D}\_t)$. This quantity depends on the difference between $p(z)$ and $p(z')$, so this case is common when one stratum is common and one is rare.

Lastly, the probability of both $z$ and $z'$ being in $\mathcal{D}\_t$ is thus $\Pr(z, z' \in \mathcal{D}\_t) = 1 - \Pr(z, z' \notin \mathcal{D}\_t) - \Pr(z \in \mathcal{D}\_t, z' \notin \mathcal{D}\_t) - \Pr(z' \in \mathcal{D}\_t, z \notin \mathcal{D}\_t) = 1 + (1 - p(z) - p(z'))^{tN} - (1 - p(z))^{tN} - (1 - p(z'))^{tN}$. This quantity increases in $p(z)$ and $p(z')$.
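The small Python sketch below (ours) evaluates these expressions for given strata frequencies and subsample size $tN$, and checks that the cases sum to one.

```python
def strata_probs(p_z: float, p_zp: float, tN: int):
    """Probabilities that neither, exactly one, or both of strata z, z' appear in D_t."""
    neither = (1 - p_z - p_zp) ** tN
    only_z = (1 - p_zp) ** tN - neither              # z appears, z' does not
    only_zp = (1 - p_z) ** tN - neither              # z' appears, z does not
    both = 1 + neither - (1 - p_z) ** tN - (1 - p_zp) ** tN
    return neither, only_z, only_zp, both

probs = strata_probs(p_z=0.05, p_zp=0.005, tN=200)   # one common stratum, one rare stratum
print(probs, sum(probs))                             # the cases partition the space, so they sum to 1
```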

*Appendix C.3. Performance of Collapsed Embeddings on Coarse-to-Fine Transfer and Original Task*

**Lemma A1.** *Denote* $f\_c$ *to be the encoder that collapses embeddings such that* $f\_c(x) = v\_y$ *for any* $(x, y) \sim \mathcal{P}$*. Then, the generalization error on the coarse-to-fine transfer task using* $f\_c$ *and a linear classifier learned using cross entropy loss is at least*

$$\mathcal{L}(\mathbf{x}, z, f\_c) \ge \log(m \exp(1) + (C - m) \exp(c\_K)) - 1,$$

*where cK is the dot product of any two different class-collapsed embeddings. The generalization error on the original task under the same setup is at least*

$$\mathcal{L}(x, y, f\_c) \ge \log(\exp(1) + (K - 1)\exp(c\_K)) - 1.$$

**Proof.** We first bound generalization error on the coarse-to-fine transfer task. For collapsed embeddings, $f(x) = v\_i$ when $h(x) = i$, where $h(x)$ is information available at training time that follows the distribution $p(y|x)$. We thus denote the embedding $f(x)$ as $v\_{h(x)}$. Therefore, we write the generalization error with an expectation over $h(x)$ and factorize the expectation according to our generative model.

$$\begin{split} \mathbb{E}\_{\mathbf{x},z,h(\mathbf{x})} \left[ -\log \hat{p}(z|f(\mathbf{x})) \right] &= -\sum\_{z=1}^{\mathbb{C}} \sum\_{h(\mathbf{x})=1}^{K} \int p(\mathbf{x},z,h(\mathbf{x})) \log \hat{p}(z|h(\mathbf{x})) d\mathbf{x} \\ &= -\sum\_{z=1}^{\mathbb{C}} \sum\_{h(\mathbf{x})=1}^{K} \int p(z)p(\mathbf{x}|z)p(h(\mathbf{x})|\mathbf{x}) \log \hat{p}(z|h(\mathbf{x})) d\mathbf{x} \\ &= -\sum\_{z=1}^{\mathbb{C}} \sum\_{h(\mathbf{x})=1}^{K} \int p(z)p(\mathbf{x}|z)p(h(\mathbf{x})|\mathbf{x}) \log \frac{\exp(f\_{h(\mathbf{x})}^{\top}W\_{z})}{\sum\_{i=1}^{\mathbb{C}} \exp(f\_{h(\mathbf{x})}^{\top}W\_{i})} d\mathbf{x} \\ &= \sum\_{z=1}^{\mathbb{C}} p(z) \mathbb{E}\_{\mathbf{x}\sim\mathcal{P}\_{z}} \left[ \sum\_{y=1}^{K} p(y|\mathbf{x}) \left( -v\_{y}^{\top}W\_{z} + \log \sum\_{i=1}^{\mathbb{C}} \exp(v\_{y}^{\top}W\_{i}) \right) \right]. \end{split}$$

Furthermore, since the $W$ learned over collapsed embeddings satisfies $W\_z = v\_y$ for $S(z) = y$, we have that $\log \sum\_{i=1}^{C} \exp(v\_y^\top W\_i) = \log(m \exp(1) + (C - m) \exp(c\_K))$ for any $y$, and our expected generalization error is

$$\sum\_{z=1}^{\mathbb{C}} p(z) \mathbb{E}\_{\mathbf{x} \sim \mathcal{P}\_{z}}[-p(y = S(z)|\mathbf{x}) - p(y \neq S(z)|\mathbf{x})c\_K + \log(m \exp(1) + (\mathbb{C} - m)\exp(c\_K))],$$

which equals

$$\log(m \exp(1) + (\mathbb{C} - m)\exp(c\_K)) - c\_K - (1 - c\_K)\sum\_{z=1}^{\mathbb{C}} p(z) \mathbb{E}\_{\mathbf{x} \sim \mathcal{P}\_{z}}[p(y = S(z)|\mathbf{x})].$$

This tells us that the generalization error is at most $\log(m \exp(1) + (C - m) \exp(c\_K)) - c\_K$ and at least $\log(m \exp(1) + (C - m) \exp(c\_K)) - 1$.

For the original task, we can apply this same approach to the case where $m = 1$, $C = K$ to get that the average generalization error is

$$\begin{aligned} \mathbb{E}\_{h(\mathbf{x})} \left[ \mathcal{L}(\mathbf{x}, \mathbf{y}, f\_c) \right] &= \log(\exp(1) + (K - 1) \exp(c\_K)) \\ &- c\_K - (1 - c\_K) \sum\_{z = 1}^{\mathcal{C}} p(z) \mathbb{E}\_{\mathbf{x} \sim \mathcal{P}\_z} [p(\mathbf{y} = S(z) | \mathbf{x})]. \end{aligned}$$

This is at least $\log(\exp(1) + (K - 1) \exp(c\_K)) - 1$ and at most $\log(\exp(1) + (K - 1) \exp(c\_K)) - c\_K$.

### **Appendix D. Proofs**

*Appendix D.1. Proofs for Theoretical Motivation*

We provide proofs for Section 3.1. First, we characterize the optimal linear classifier (for both the coarse-to-fine transfer task and the original task) learned on the collapsed embeddings. Note that this result appears similar to Corollary 1 of [2], but their result minimizes the cross entropy loss over both the encoder and downstream weights (i.e., in a classical supervised setting where only cross entropy is used in training).

**Lemma A2** (Downstream linear classifier for coarse-to-fine task)**.** *Suppose the dataset* $\mathcal{D}\_z$ *is class-balanced across* $z$*, and the embeddings satisfy* $f(x) = v\_i$ *if* $h(x) = i$*, where* $\{v\_i\}\_{i=1}^{K}$ *form the regular simplex. Then the optimal weight matrix* $W \in \mathbb{R}^{C \times d}$ *that minimizes* $\hat{\mathcal{L}}(W, \mathcal{D}\_z)$ *satisfies* $W\_z = v\_y$ *for* $y = S(z)$*.*

**Proof.** Formally, the convex optimization problem we are solving is

$$\text{minimize } -\sum\_{y=1}^{K} \sum\_{z \in S\_{\mathcal{Y}}} \log \frac{\exp(\boldsymbol{v}\_{\boldsymbol{y}}^{\top} \boldsymbol{W}\_{\boldsymbol{z}})}{\sum\_{j=1}^{C} \exp(\boldsymbol{v}\_{\boldsymbol{y}}^{\top} \boldsymbol{W}\_{j})} \tag{A4}$$

$$\text{s.t. } \|\mathcal{W}\_z\|\_2^2 \le 1 \; \forall z \in \mathcal{Z} \tag{A5}$$

The Lagrangian of this optimization problem is

$$\sum\_{y=1}^{K} \sum\_{z \in S\_{y}} -v\_{y}^{\top} W\_{z} + m \sum\_{y=1}^{K} \log \left( \sum\_{j=1}^{\mathcal{C}} \exp(v\_{y}^{\top} W\_{j}) \right) + \sum\_{i=1}^{\mathcal{C}} \lambda\_{i} (\|W\_{i}\|\_{2}^{2} - 1),$$

and the stationarity condition w.r.t. *Wz* is

$$-v\_{S(z)} + m \sum\_{y=1}^{K} \frac{v\_y \exp(v\_y^\top W\_z)}{\sum\_{j=1}^{C} \exp(v\_y^\top W\_j)} + 2\lambda\_z W\_z = 0. \tag{A6}$$

Substituting $W\_z = v\_{S(z)}$, we get

$$-v\_{S(z)} + m \sum\_{y=1}^{K} \frac{v\_y \exp(v\_y^\top v\_{S(z)})}{\sum\_{j=1}^{C} \exp(v\_y^\top v\_{S(j)})} + 2\lambda\_z v\_{S(z)} = 0.$$

Using the fact that $v\_i^\top v\_j = \delta$ for all $i \ne j$, this equals

$$-v\_{S(z)} + m \cdot \frac{v\_{S(z)} \exp(1) + \exp(\delta) \sum\_{y \ne S(z)} v\_y}{m \exp(1) + (C - m) \exp(\delta)} + 2\lambda\_z v\_{S(z)} = 0.$$

Next, recall that $\sum\_{i=1}^{K} v\_i = 0$. Then, $\lambda\_z = \frac{1}{2}\left(1 - \frac{m(\exp(1) - \exp(\delta))}{m \exp(1) + (C - m) \exp(\delta)}\right) \ge 0$, satisfying the dual constraint. We can further verify complementary slackness and primal feasibility, since $\|W\_z\|\_2^2 = 1$, to confirm that an optimal weight matrix satisfies $W\_z = v\_y$ for $y = S(z)$.

**Corollary A1.** *When we apply the above proof to the case when* $m = 1$*, we recover that the optimal weight matrix* $W \in \mathbb{R}^{K \times d}$ *that minimizes* $\hat{\mathcal{L}}(W, \mathcal{D})$ *for the original task on* $(x, y) \sim \mathcal{P}$ *satisfies* $W\_y = v\_y$ *for all* $y \in \mathcal{Y}$*.*

We now prove Observations 1 and 2. Then, we present an additional result on transfer learning on collapsed embeddings to general tasks of the form (*x*, *y*) ∼ P.

**Proof of Observation 1.** We write out the generalization error for the downstream task, $\mathcal{L}(x, z, f) = \mathbb{E}\_{x,z}[-\log \hat{p}(z|x)]$, using our conditions that $p(y = h(x)|x) = 1$ and $p(z|x) = \frac{1}{m}$.

$$\begin{split} \mathcal{L}(\mathbf{x}, z, f) &= -\int p(\mathbf{x}) \sum\_{z=1}^{\mathbb{C}} p(z|\mathbf{x}) \log \hat{p}(z|f(\mathbf{x})) d\mathbf{x} \\ &= -\int p(\mathbf{x}) \sum\_{z=1}^{\mathbb{C}} p(z|\mathbf{x}) \log \frac{\exp(f(\mathbf{x})^\top W\_z)}{\sum\_{i=1}^{\mathbb{C}} \exp(f(\mathbf{x})^\top W\_i)} d\mathbf{x} \\ &= -\sum\_{y=1}^{K} \int\_{\mathbf{x}: h(\mathbf{x}) = y} p(\mathbf{x}) \cdot \frac{1}{m} \sum\_{z \in S\_y} \log \frac{\exp(f(\mathbf{x})^\top W\_z)}{\sum\_{i=1}^{\mathbb{C}} \exp(f(\mathbf{x})^\top W\_i)}. \end{split}$$

To minimize this, $f(x)$ should be the same across all $x$ where $h(x)$ takes the same value, since $p(z|x)$ does not change for fixed $h(x)$, and thus varying $f(x)$ will not further decrease the value of this expression. Therefore, we rewrite $f(x)$ as $f\_{h(x)}$. Using the fact that $y$ is class balanced, our loss is now

$$\begin{split} \mathcal{L}(x, z, f) &= -\frac{1}{m} \sum\_{y=1}^{K} \sum\_{z \in S\_{y}} \int\_{x: h(x) = y} p(x) \log \frac{\exp(f\_{h(x)}^{\top} W\_{z})}{\sum\_{i=1}^{\mathbb{C}} \exp(f\_{h(x)}^{\top} W\_{i})} dx \\ &= -\frac{1}{\mathbb{C}} \sum\_{y=1}^{K} \sum\_{z \in S\_{y}} \log \frac{\exp(f\_{y}^{\top} W\_{z})}{\sum\_{i=1}^{\mathbb{C}} \exp(f\_{y}^{\top} W\_{i})}. \end{split}$$

We claim that $f\_y = v\_y$ and $W\_z = v\_y$ for all $S(z) = y$ minimizes this convex function. The corresponding Lagrangian is

$$\sum\_{y=1}^{K} \sum\_{z \in S\_{\mathcal{Y}}} -f\_{y}^{\top} \mathsf{W}\_{z} + m \sum\_{y=1}^{K} \log \left( \sum\_{i=1}^{\mathbb{C}} \exp(f\_{y}^{\top} \mathsf{W}\_{i}) \right) + \sum\_{y=1}^{K} \nu\_{\mathcal{Y}} (\|f\_{y}\|\_{2}^{2} - 1) + \sum\_{i=1}^{\mathbb{C}} \lambda\_{i} (\|\mathsf{W}\_{i}\|\_{2}^{2} - 1).$$

The stationarity condition with respect to *Wz* is the same as (A6), and we have already demonstrated that the feasibility constraints and complementary slackness are satisfied on *W*. The stationarity condition with respect to *fy* is

$$-\sum\_{z \in S\_{\mathcal{Y}}} \mathcal{W}\_{\boldsymbol{z}} + m \cdot \frac{\sum\_{i=1}^{\mathbb{C}} \mathcal{W}\_{i} \exp\left(f\_{\boldsymbol{\mathcal{Y}}}^{\top} \mathcal{W}\_{i}\right)}{\sum\_{i=1}^{\mathbb{C}} \exp(f\_{\boldsymbol{\mathcal{Y}}}^{\top} \mathcal{W}\_{i})} + 2\lambda\_{\boldsymbol{\mathcal{Y}}} f\_{\boldsymbol{\mathcal{Y}}} = 0.$$

Substituting in $W\_i = v\_{S(i)}$ and $f\_y = v\_y$, we get

$$-\sum\_{z \in S\_y} v\_y + m \cdot \frac{\sum\_{i=1}^{C} v\_{S(i)} \exp(v\_y^\top v\_{S(i)})}{\sum\_{i=1}^{C} \exp(v\_y^\top v\_{S(i)})} + 2\lambda\_y v\_y = 0.$$

From the regular simplex definition, this is

$$-m v\_y + m \cdot \frac{m v\_y \exp(1) - m v\_y \exp(\delta)}{m \exp(1) + (C - m) \exp(\delta)} + 2\lambda\_y v\_y = 0.$$

We thus have that $\lambda\_y = \frac{m}{2}\left(1 - \frac{m(\exp(1) - \exp(\delta))}{m \exp(1) + (C - m) \exp(\delta)}\right)$, and the feasibility constraints are satisfied. Therefore, $f\_y = W\_z = v\_y$ for $y = S(z)$ minimizes the generalization error $\mathcal{L}(x, z, f)$ when $p(y = h(x)|x) = 1$ and $p(z|x) = \frac{1}{m}$.

Note that $p(z|x) = \frac{1}{m}$ and $p(y = h(x)|x) = 1$, so $p(z) = \int\_{x: h(x) = S(z)} p(z, x)\, dx = \frac{1}{m} \int\_{x: h(x) = S(z)} p(x)\, dx = \frac{1}{mK} = \frac{1}{C}$. $p(z)$ being class balanced means that $p(x|z) = \frac{p(z|x) p(x)}{p(z)} = K p(x) = \frac{p(y|x) p(x)}{p(y)} = p(x|y)$. Therefore, this condition suggests that there is no distinction among the strata within a class.

**Proof of Observation 2.** This observation follows directly from Observation 1 by repeating the proof approach with *z* = *y*, *m* = 1.

Lastly, suppose it is not true that $p(y = h(x)|x) = 1$. Then, the generalization error on the original task is $\mathcal{L}(x, y, f) = -\int\_{\mathcal{X}} \sum\_{y=1}^{K} p(x) p(y|x) \log \hat{p}(y|f(x))\, dx$, which is minimized when $\hat{p}(y|f(x)) = p(y|x)$. Intuitively, a model constructed with label information, $\hat{p}(y|h(x))$, will not improve over one that uses $x$ itself to approximate $p(y|x)$.

### *Appendix D.2. Proofs for Theoretical Implications*

We provide proofs for Section 4.3.

**Proof of Lemma 1.** The generalization error is

$$\begin{split} \mathcal{L}(\boldsymbol{x}, \boldsymbol{z}, \boldsymbol{f}\_{1}) &= -\mathbb{E}\_{\boldsymbol{z}} \Big[ \mathbb{E}\_{\boldsymbol{x} \sim \mathcal{P}\_{\boldsymbol{z}}} \Big[ \log \frac{\exp(\boldsymbol{f}\_{1}(\boldsymbol{x})^{\top} \boldsymbol{W}\_{\boldsymbol{z}})}{\sum\_{i=1}^{\mathbb{C}} \exp(\hat{f}\_{1}(\boldsymbol{x})^{\top} \boldsymbol{W}\_{i})} \Big] \Big] \\ &= \mathbb{E}\_{\boldsymbol{z}} \Big[ \mathbb{E}\_{\boldsymbol{x} \sim \mathcal{P}\_{\boldsymbol{z}}} \Big[ -\hat{f}\_{1}(\boldsymbol{x})^{\top} \boldsymbol{W}\_{\boldsymbol{z}} + \log \sum\_{i=1}^{\mathbb{C}} \exp(\hat{f}\_{1}(\boldsymbol{x})^{\top} \boldsymbol{W}\_{i}) \Big] \Big] . \end{split}$$

Using the definition of the mean classifier,

$$\begin{split} \mathcal{L}(\boldsymbol{x}, \boldsymbol{z}, \boldsymbol{\hat{f}}\_{1}) &= \mathbb{E}\_{\boldsymbol{z}} \Big[ -1 + \mathbb{E}\_{\boldsymbol{x} \sim \mathcal{P}\_{\boldsymbol{z}}} \Big[ \log \sum\_{i=1}^{\mathsf{C}} \exp \left( \boldsymbol{\hat{f}}\_{1}(\boldsymbol{x})^{\top} \mathbb{E}\_{\boldsymbol{x} \sim \mathcal{P}\_{i}} [\boldsymbol{\hat{f}}\_{1}(\boldsymbol{x})] \right) \Big] \Big] \\ &= -1 + \mathbb{E}\_{\boldsymbol{z}} \Big[ \mathbb{E}\_{\boldsymbol{x} \sim \mathcal{P}\_{\boldsymbol{z}}} \Big[ \log \sum\_{i=1}^{\mathsf{C}} \exp \left( \boldsymbol{\hat{f}}\_{1}(\boldsymbol{x})^{\top} \mathbb{E}\_{\boldsymbol{i}} [\boldsymbol{\hat{f}}\_{1}(\boldsymbol{x})] \right) \Big] \Big]. \end{split}$$

Since $\hat{f}\_1(x)$ is bounded, there exists a constant $\lambda > 0$ such that

$$\mathbb{E}\_{\mathbf{x}\sim\mathcal{P}\_{\mathbf{z}}}\left[\log\sum\_{i=1}^{\mathcal{C}}\exp\left(\boldsymbol{\hat{f}}\_{1}(\mathbf{x})^{\top}\mathbb{E}\_{i}[\boldsymbol{\hat{f}}\_{1}(\mathbf{x})]\right)\right] \leq \log\left(\sum\_{i=1}^{\mathcal{C}}\exp\left(\lambda\mathbb{E}\_{\mathbf{z}}[\boldsymbol{\hat{f}}\_{1}(\mathbf{x})]^{\top}\mathbb{E}\_{i}[\boldsymbol{\hat{f}}\_{1}(\mathbf{x})]\right)\right).$$

We can also rewrite the dot product between mean embeddings per stratum in terms of the distance between them:

$$\begin{split} \mathcal{L}(\mathbf{x}, z, \hat{f}\_1) &\leq -1 + \mathbb{E}\_{\mathbf{z}} \Bigg[ \log \left( \sum\_{i=1}^{\mathbb{C}} \exp \left( \lambda \mathbb{E}\_{\mathbf{z}} [\hat{f}\_1(\mathbf{x})]^\top \mathbb{E}\_i [\hat{f}\_1(\mathbf{x})] \right) \right) \Bigg] \\ &= -1 + \mathbb{E}\_{\mathbf{z}} \Bigg[ \log \left( \sum\_{i=1}^{\mathbb{C}} \exp \left( -\frac{\lambda}{2} ||\mathbb{E}\_{\mathbf{z}}[\hat{f}\_1(\mathbf{x})] - \mathbb{E}\_i[\hat{f}\_1(\mathbf{x})]||^2 + \lambda \right) \right) \Bigg]. \end{split}$$

This directly gives us our desired bound.

**Proof of Lemma 2.** The generalization error is

$$\begin{split} \mathcal{L}(\boldsymbol{x}, \boldsymbol{y}, \boldsymbol{\hat{f}}\_{1}) &= -\mathbb{E}\_{\boldsymbol{z}} \Big[ \mathbb{E}\_{\boldsymbol{x} \sim \mathcal{P}\_{\boldsymbol{z}}} \Big[ \log \frac{\exp \left( \boldsymbol{\hat{f}}\_{1} (\boldsymbol{x})^{\top} \boldsymbol{W}\_{\boldsymbol{S}(\boldsymbol{z})} \right)}{\sum\_{i=1}^{K} \exp \left( \boldsymbol{\hat{f}}\_{1} (\boldsymbol{x})^{\top} \boldsymbol{W}\_{\boldsymbol{i}} \right)} \Big] \Big] \\ &= \mathbb{E}\_{\boldsymbol{z}} \Big[ \mathbb{E}\_{\boldsymbol{x} \sim \mathcal{P}\_{\boldsymbol{z}}} \Big[ -f\_{1}(\boldsymbol{x})^{\top} \boldsymbol{W}\_{\boldsymbol{S}(\boldsymbol{z})} + \log \sum\_{i=1}^{K} \exp \left( f\_{1}(\boldsymbol{x})^{\top} \boldsymbol{W}\_{\boldsymbol{i}} \right) \Big] \Big] . \end{split}$$

We substitute in the definition of the mean classifier to get

$$\begin{split} \mathcal{L}(\mathbf{x}, y, f\_1) &= \mathbb{E}\_{\mathbf{z}} \Big[ - \sum\_{z' \in S\_{S(z)}} p(z'|S(z)) \mathbb{E}\_{\mathbf{z}} [f\_1(\mathbf{x})]^\top \mathbb{E}\_{z'} [f\_1(\mathbf{x})] \\ &+ \mathbb{E}\_{\mathbf{x} \sim \mathcal{P}\_{\mathbf{z}}} \Big[ \log \sum\_{i=1}^K \exp \left( \sum\_{z' \in S\_i} p(z'|S\_i) \hat{f}\_1(\mathbf{x})^\top \mathbb{E}\_{z'} [\hat{f}\_1(\mathbf{x})] \right) \Big] \Big]. \end{split}$$

We can rewrite the dot product between mean embeddings per stratum in terms of the distance between them:

$$\begin{split} \mathcal{L}(x, y, \hat{f}\_1) &= \mathbb{E}\_z \left[ \sum\_{z' \in S\_{S(z)}} p(z'|S(z)) \cdot \left( \frac{1}{2} ||\mathbb{E}\_{\overline{z}}[\hat{f}\_1(\mathbf{x})] - \mathbb{E}\_{z'}[\hat{f}\_1(\mathbf{x})]||^2 - 1 \right) \\ &+ \mathbb{E}\_{\mathbf{x} \sim \mathcal{P}\_z} \left[ \log \sum\_{i=1}^K \exp \left( \sum\_{z' \in S\_i} p(z'|S\_i) \hat{f}\_1(\mathbf{x})^\top \mathbb{E}\_{\overline{z}'}[\hat{f}\_1(\mathbf{x})] \right) \right] \right]. \end{split}$$

We can write $\|\mathbb{E}\_z[\hat{f}\_1(\mathbf{x})] - \mathbb{E}\_{z'}[\hat{f}\_1(\mathbf{x})]\|$ in the above expression as $\delta(\hat{f}\_1, z, z')$, which we have analyzed:

$$\begin{split} \mathcal{L}(\mathbf{x}, y, f\_1) &= \mathbb{E}\_z \Big[\sum\_{z' \in \mathcal{S}\_{\mathcal{S}(z)}} p(z'|S(z)) \cdot \left(\frac{1}{2}\delta(f\_1, z, z')^2 - 1\right) \\ &+ \mathbb{E}\_{\mathbf{x} \sim \mathcal{P}\_z} \Big[\log \sum\_{i=1}^K \exp\left(\sum\_{z' \in \mathcal{S}\_i} p(z'|S\_i)\hat{f}\_1(\mathbf{x})^\top \mathbb{E}\_{z'}[\hat{f}\_1(\mathbf{x})] \right) \Big] . \end{split}$$

From our previous proof, there exists *λ* > 0 such that this is at most

$$\begin{split} \mathcal{L}(x, y, \hat{f}\_1) &\leq \mathbb{E}\_{\boldsymbol{z}} \Big[\sum\_{\boldsymbol{z}' \in S\_{S(\boldsymbol{z})}} p(\boldsymbol{z}' | S(\boldsymbol{z})) \cdot \Big(\frac{1}{2} \delta(\hat{f}\_1, z, \boldsymbol{z}')^2 - 1\Big) \\ &\quad + \log \left(\sum\_{i=1}^K \exp\left(\sum\_{\boldsymbol{z}' \in S\_i} p(\boldsymbol{z}' | S\_i) \lambda \mathbb{E}\_{\boldsymbol{z}} [\hat{f}\_1(\boldsymbol{x})]^\top \mathbb{E}\_{\boldsymbol{z}'} [\hat{f}\_1(\boldsymbol{x})]\right)\right) \Big] \\ &= \mathbb{E}\_{\boldsymbol{z}} \Big[\sum\_{\boldsymbol{z}' \in S\_{S(\boldsymbol{z})}} p(\boldsymbol{z}' | S(\boldsymbol{z})) \cdot \Big(\frac{1}{2} \delta(\hat{f}\_1, z, \boldsymbol{z}')^2 - 1\Big) \\ &\quad + \log \left(\sum\_{i=1}^K \exp\left(\sum\_{\boldsymbol{z}' \in S\_i} p(\boldsymbol{z}' | S\_i) \left(-\frac{\lambda}{2} \|\mathbb{E}\_{\boldsymbol{z}}[\hat{f}\_1(\boldsymbol{x})] - \mathbb{E}\_{\boldsymbol{z}'}[\hat{f}\_1(\boldsymbol{x})] \|^2 + \lambda\right)\right)\right) \Big]. \end{split}$$

We can write each weighted summation over $p(z'|S(z))$ and $p(z'|S\_i)$ as an expectation and use the definition of $\delta(\hat{f}\_1, z, z')$ to obtain our desired bound.

### **Appendix E. Additional Experimental Details**

*Appendix E.1. Datasets*

We first describe all the datasets in more detail:


In the Waterbirds dataset, 95% of the water birds appear on water backgrounds and 95% of the land birds on land backgrounds, but 5% of the water birds are on land backgrounds, and 5% of the land birds are on water backgrounds. These form the (imbalanced) hidden strata.


### *Appendix E.2. Hyperparameters*

For all model quality experiments for *Lspread*, we first fixed *τ* = 0.5 and swept *α* ∈ [0.16, 0.25, 0.33, 0.5, 0.67]. We then took the two best-performing values and swept *τ* ∈ [0.1, 0.3, 0.5, 0.7, 0.9]. For *LSC* and *LSS*, we swept *τ* ∈ [0.1, 0.3, 0.5, 0.7, 0.9]. Final hyperparameter values for (*<sup>τ</sup>*, *α*) for *Lspread* were (0.9, 0.67) for CIFAR10, (0.5, 0.16) for CIFAR10-coarse, (0.5, 0.33) for CIFAR100, (0.5, 0.25) for CIFAR100-Coarse, (0.5, 0.25) for CIFAR100-Coarse-U, (0.5, 0.5) for MNIST, (0.5, 0.5) for MNIST-coarse, (0.5, 0.5) for ISIC, and (0.5, 0.5) for waterbirds.

For coarse-to-fine transfer learning, we fixed *τ* = 0.5 for all losses and swept *α* ∈ [0.16, 0.25, 0.33, 0.5, 0.67]. Final hyperparameter values for *α* were 0.25 for CIFAR10-Coarse, 0.25 for CIFAR100-Coarse, 0.25 for CIFAR100-Coarse-U, and 0.5 for MNIST-Coarse.

### *Appendix E.3. Applications*

We describe additional experimental details for the applications.

### Appendix E.3.1. Robustness Against Worst-Group Performance

We follow the evaluation of [5]. First, we train a model on the standard class labels. We evaluate different loss functions for this step, including *Lspread*, *LSC*, and the cross entropy loss *LCE*. Then we project embeddings of the training set using a UMAP projection [45], and cluster points to discover unlabeled subgroups. Finally, we use the unlabeled subgroups in a Group-DRO algorithm to optimize worst-group robustness [14].
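The sketch below (our simplification, not the exact pipeline of [5]) illustrates the subgroup-discovery step: it projects frozen training embeddings with UMAP and clusters the projection within each class to obtain pseudo group labels, which a Group-DRO trainer can then consume. The clustering method and the number of clusters per class are assumptions.

```python
import numpy as np
import umap                      # umap-learn
from sklearn.cluster import KMeans

def discover_subgroups(embeddings: np.ndarray, labels: np.ndarray, clusters_per_class: int = 2):
    """Cluster a 2-D UMAP projection of the embeddings within each class to get pseudo groups."""
    proj = umap.UMAP(n_components=2).fit_transform(embeddings)
    groups = np.zeros(len(labels), dtype=int)
    next_id = 0
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        km = KMeans(n_clusters=clusters_per_class, n_init=10).fit(proj[idx])
        groups[idx] = km.labels_ + next_id
        next_id += clusters_per_class
    return groups                # pseudo group labels, e.g., for a Group-DRO objective
```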

### Appendix E.3.2. Robustness Against Noise

We use the same training setup as we use to evaluate model quality, and introduce symmetric noise into the labels for the contrastive loss head. We train the cross entropy head with a fraction of the full training set. In Section 5.3, we report results from training the cross entropy head with 20% of the labels. We report additional levels in Appendix F.

We detect noisy labels with a simple geometric heuristic: for each point, we compute the cosine similarity between the embedding of the point and the center of all the other points in the batch that have the same class. We compare this similarity value to the average cosine similarity with points in the batch from every other class, and rank the points by the difference between these two values. Points with incorrect labels have a small difference between these two values (they appear to be small strata, so they are far away from points of the same class). Given the noise level as an input, we rank the points by this heuristic and mark the corresponding fraction of the batch with the smallest scores as noisy. We then correct their labels by adopting the label of the closest cluster center.
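A minimal PyTorch sketch of this heuristic is below (ours; for simplicity the class center includes the point itself, and the comparison uses other-class centers rather than all other-class points). It scores each point by the gap between its similarity to its own-class center and its mean similarity to the other class centers, flags the lowest-scoring fraction as noisy, and relabels those points with the nearest center's class.

```python
import torch
import torch.nn.functional as F

def correct_noisy_labels(emb: torch.Tensor, labels: torch.Tensor, noise_frac: float):
    """emb: (B, d) embeddings for one batch; labels: (B,) possibly-noisy class labels."""
    emb = F.normalize(emb, dim=1)
    classes = labels.unique()                                 # sorted class ids present in the batch
    centers = F.normalize(torch.stack([emb[labels == c].mean(dim=0) for c in classes]), dim=1)
    sims = emb @ centers.T                                    # (B, num_classes) cosine similarities
    own = sims[torch.arange(len(labels)), torch.searchsorted(classes, labels)]
    other = (sims.sum(dim=1) - own) / (len(classes) - 1)      # mean similarity to other class centers
    score = own - other                                       # small gap -> likely mislabeled
    noisy_idx = score.argsort()[: int(noise_frac * len(labels))]
    corrected = labels.clone()
    corrected[noisy_idx] = classes[sims[noisy_idx].argmax(dim=1)]  # adopt the nearest center's label
    return corrected, noisy_idx
```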

### Appendix E.3.3. Minimal Coreset Construction

We use the publicly-available evaluation framework for coresets from [18] (https://github.com/mtoneva/example\_forgetting, accessed on 1 October 2021). We use the official repository from [19] (https://github.com/mansheej/data\_diet, accessed on 1 October 2021) to recreate their coreset algorithms.

Our coreset algorithm proceeds in two parts. First, we give each point a difficulty rating based on how likely we are to classify it correctly under partial training. Then we subsample the easiest points to construct minimal coresets.

First, we mirror the setup from our thought experiment and train with *Lspread* on random samples of *t*% of the CIFAR10 training set, taking three random samples for each of *t* ∈ [10, 20, 50] (and we train the cross entropy head with 1% labeled data). For each run, we record which points are classified correctly by the cross entropy head at the end of training, and bucket points in the training set by how often they were correctly classified. To construct a coreset of size *t*%, we iteratively remove points from the largest bucket in each class. Our strategy removes easy examples first for the largest coresets, but maintains a set of easy examples in the smallest coresets.
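A condensed sketch of the selection step is below (ours; the partial-training runs that produce the per-point correctness counts are assumed to have been run already). Points are bucketed per class by how many runs classified them correctly, and the coreset is shrunk by repeatedly dropping a point from each class's currently largest bucket.

```python
import numpy as np

def build_coreset(correct_counts: np.ndarray, labels: np.ndarray, keep_frac: float):
    """correct_counts[i]: number of partial-training runs that classified point i correctly."""
    n_keep = int(keep_frac * len(labels))
    keep = set(range(len(labels)))
    while len(keep) > n_keep:
        for c in np.unique(labels):
            if len(keep) <= n_keep:
                break
            idx = [i for i in keep if labels[i] == c]          # this class's remaining points
            if not idx:
                continue
            counts = np.array([correct_counts[i] for i in idx])
            largest = np.bincount(counts).argmax()             # most populated bucket for this class
            drop = next(i for i, cnt in zip(idx, counts) if cnt == largest)
            keep.remove(drop)                                  # drop one point from that bucket
    return sorted(keep)
```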

### **Appendix F. Additional Experimental Results**

In this section, we report three sets of additional experimental results: the performance of using *Lattract* on its own to train models, sample complexity of *Lspread* compared to *LSC*, and additional noisy label results (including a bonus de-noising algorithm).

### *Appendix F.1. Performance of Lattract*

In an early iteration of this project, we experienced success with using *Lattract* on its own to train models, before realizing the benefits of adding an additional term to prevent class collapse. As an ablation, we report on the performance of using *Lattract* on its own in Table A2. *Lattract* can outperform *LSC*, but *Lspread* outperforms both. We do not report the results here, but *Lattract* also performs significantly worse than *LSC* on downstream applications, since it more directly encourages class collapse.

**Table A2.** Performance of *Lspread* compared to *LSC* and using *Lattract* on its own. Best in bold.


### *Appendix F.2. Sample Complexity*

Figure A1 shows the performance of training ViT models with various amounts of labeled data for *Lspread*, *LSC*, and *LSS*. In these experiments, we train the cross entropy head with 1% labeled data to isolate the effect of training data on the contrastive losses themselves.

*Lspread* outperforms *LSC* and *LSS* throughout. At 10% labeled data, *Lspread* outperforms *LSS* by 13.9 points, and outperforms *LSC* by 0.5 points. By 100% labeled data (for the contrastive head), *Lspread* outperforms *LSS* by 25.4 points, and outperforms *LSC* by 10.3 points.

**Figure A1.** Performance of training ViT with *Lspread* compared to training with *LSC* and *LSS* on CIFAR10 at various amounts of labeled data. *Lspread* outperforms the baselines at each point. The cross entropy head here is trained with 1% labeled data to isolate the effect of training data on the contrastive losses.

### *Appendix F.3. Noisy Labels*

In Section 5.3, we reported results from training the contrastive loss head with noisy labels and the cross entropy loss with clean labels from 20% of the training data.

In this section, we first discuss a de-noising algorithm inspired by [23] that we initially developed to correct for noisy labels, but from which we did not observe strong empirical results. We hope that reporting this result inspires future work into improving contrastive learning.

We then report additional results with larger amounts of training data for the cross entropy head.

### Appendix F.3.1. Debiasing Noisy Contrastive Loss

First, we consider the triplet loss and show how to debias it in expectation under noise. Then we present an extension to supervised contrastive loss.

### Noise-Aware Triplet Loss

Consider the triplet loss:

$$L\_{\text{triplet}} = \mathbb{E}\_{\substack{\mathbf{x} \sim \mathcal{P}, \mathbf{x}^+ \sim p^+(\cdot | \mathbf{x}),\\ \mathbf{x}^- \sim p^-(\cdot | \mathbf{x})}} \left[ -\log \frac{\exp(\sigma(\mathbf{x}, \mathbf{x}^+))}{\exp(\sigma(\mathbf{x}, \mathbf{x}^+)) + \exp(\sigma(\mathbf{x}, \mathbf{x}^-))} \right]. \tag{A7}$$
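For reference, a direct PyTorch translation of (A7) for a batch of (anchor, positive, negative) triples is below (ours); taking $\sigma$ to be temperature-scaled cosine similarity is an assumption.

```python
import torch
import torch.nn.functional as F

def triplet_softmax_loss(anchor, positive, negative, temperature: float = 0.5):
    """-log exp(sigma(x, x+)) / (exp(sigma(x, x+)) + exp(sigma(x, x-))), as in (A7)."""
    sim = lambda a, b: F.cosine_similarity(a, b, dim=-1) / temperature
    pos, neg = sim(anchor, positive), sim(anchor, negative)
    # log-softmax over the two similarities is a numerically stable form of the expression above
    return -F.log_softmax(torch.stack([pos, neg], dim=-1), dim=-1)[..., 0].mean()
```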


Now suppose that we do not have access to true labels but instead have noisy labels produced by a weak classifier $\hat{h}$, denoted $\hat{y} := \hat{h}(x)$. We adopt a simple model of symmetric noise where $\hat{p} = \Pr(\text{noisy label is correct})$.

We use $\hat{y}$ to construct $\hat{\mathcal{P}}^+$ and $\hat{\mathcal{P}}^-$ as $p(x^+ \mid \hat{h}(x) = \hat{h}(x^+))$ and $p(x^- \mid \hat{h}(x) \neq \hat{h}(x^-))$. For simplicity, we start by looking at how the triplet loss in (A7) is impacted when *noise is not addressed* in the binary setting. Define $L\_{triplet}^{noisy}$ as $L\_{triplet}$ used with $\hat{\mathcal{P}}^+$ and $\hat{\mathcal{P}}^-$.

**Lemma A3.** *When class-conditional noise is uncorrected,* $L\_{triplet}^{noisy}$ *is equivalent to*

$$\begin{split} (\hat{p}^{3} + (1 - \hat{p})^{3}) L\_{triplet} + \hat{p} (1 - \hat{p}) \mathbb{E}\_{\begin{subarray}{c} \mathbf{x}^{\sim \mathcal{P}} \\ \mathbf{x}^{+}\_{1}, \mathbf{x}^{+}\_{2} \sim p^{+}(\cdot | \mathbf{x}) \end{subarray}} \left[ -\log \frac{\exp(\sigma(\mathbf{x}, \mathbf{x}^{+}\_{1}))}{\exp(\sigma(\mathbf{x}, \mathbf{x}^{+}\_{1})) + \exp(\sigma(\mathbf{x}, \mathbf{x}^{+}\_{2}))} \right] \\ &+ \hat{p} (1 - \hat{p}) \mathbb{E}\_{\begin{subarray}{c} \mathbf{x}^{-}\_{1}, \mathbf{x}^{-}\_{2} \sim p^{-}(\cdot | \mathbf{x}) \end{subarray}} \left[ -\log \frac{\exp(\sigma(\mathbf{x}, \mathbf{x}^{-}\_{1}))}{\exp(\sigma(\mathbf{x}, \mathbf{x}^{-}\_{1})) + \exp(\sigma(\mathbf{x}, \mathbf{x}^{-}\_{2}))} \right] \\ &+ \hat{p} (1 - \hat{p}) \mathbb{E}\_{\begin{subarray}{c} \mathbf{x}^{-}\_{-} \sim p^{+}(\cdot | \mathbf{x}) \\ \mathbf{x}^{+} \sim p^{-}(\cdot | \mathbf{x}) \end{subarray}} \left[ -\log \frac{\exp(\sigma(\mathbf{x}, \mathbf{x}^{-}))}{\exp(\sigma(\mathbf{x}, \mathbf{x}^{+})) + \exp(\sigma(\mathbf{x}, \mathbf{x}^{-}))} \right]. \end{split}$$

**Proof.** We split $L\_{triplet}^{noisy}$ depending on whether the noisy positive and negative pairs are truly positive and negative.

$$\begin{split} L\_{triplet}^{noisy} &= \mathbb{E}\_{\substack{\mathbf{x} \sim \mathcal{P},\ \hat{\mathbf{x}}^+ \sim \hat{p}^+(\cdot|\mathbf{x}),\\ \hat{\mathbf{x}}^- \sim \hat{p}^-(\cdot|\mathbf{x})}} \left[ -\log \frac{\exp(\sigma(\mathbf{x}, \hat{\mathbf{x}}^+))}{\exp(\sigma(\mathbf{x}, \hat{\mathbf{x}}^+)) + \exp(\sigma(\mathbf{x}, \hat{\mathbf{x}}^-))} \right] \\ &= p(h(\mathbf{x}) = h(\hat{\mathbf{x}}^+), h(\mathbf{x}) \neq h(\hat{\mathbf{x}}^-))\, \mathbb{E}\_{\substack{\mathbf{x} \sim \mathcal{P},\ \mathbf{x}^+ \sim p^+(\cdot|\mathbf{x}),\\ \mathbf{x}^- \sim p^-(\cdot|\mathbf{x})}} \left[ -\log \frac{\exp(\sigma(\mathbf{x}, \mathbf{x}^+))}{\exp(\sigma(\mathbf{x}, \mathbf{x}^+)) + \exp(\sigma(\mathbf{x}, \mathbf{x}^-))} \right] \\ &\quad + p(h(\mathbf{x}) = h(\hat{\mathbf{x}}^+), h(\mathbf{x}) = h(\hat{\mathbf{x}}^-))\, \mathbb{E}\_{\substack{\mathbf{x} \sim \mathcal{P},\\ \mathbf{x}\_1^+, \mathbf{x}\_2^+ \sim p^+(\cdot|\mathbf{x})}} \left[ -\log \frac{\exp(\sigma(\mathbf{x}, \mathbf{x}\_1^+))}{\exp(\sigma(\mathbf{x}, \mathbf{x}\_1^+)) + \exp(\sigma(\mathbf{x}, \mathbf{x}\_2^+))} \right] \\ &\quad + p(h(\mathbf{x}) \neq h(\hat{\mathbf{x}}^+), h(\mathbf{x}) \neq h(\hat{\mathbf{x}}^-))\, \mathbb{E}\_{\substack{\mathbf{x} \sim \mathcal{P},\\ \mathbf{x}\_1^-, \mathbf{x}\_2^- \sim p^-(\cdot|\mathbf{x})}} \left[ -\log \frac{\exp(\sigma(\mathbf{x}, \mathbf{x}\_1^-))}{\exp(\sigma(\mathbf{x}, \mathbf{x}\_1^-)) + \exp(\sigma(\mathbf{x}, \mathbf{x}\_2^-))} \right] \\ &\quad + p(h(\mathbf{x}) \neq h(\hat{\mathbf{x}}^+), h(\mathbf{x}) = h(\hat{\mathbf{x}}^-))\, \mathbb{E}\_{\substack{\mathbf{x} \sim \mathcal{P},\ \mathbf{x}^+ \sim p^+(\cdot|\mathbf{x}),\\ \mathbf{x}^- \sim p^-(\cdot|\mathbf{x})}} \left[ -\log \frac{\exp(\sigma(\mathbf{x}, \mathbf{x}^-))}{\exp(\sigma(\mathbf{x}, \mathbf{x}^+)) + \exp(\sigma(\mathbf{x}, \mathbf{x}^-))} \right]. \end{split}$$

Recall that $\hat{p} = p(\text{noisy label is correct})$. Note that

$$p(h(\mathbf{x}) = h(\hat{\mathbf{x}}^{+}),\, h(\mathbf{x}) \neq h(\hat{\mathbf{x}}^{-})) = \hat{p}^{3} + (1 - \hat{p})^{3},$$

(i.e., all three points are correct or all reversed, such that their relative pairings are correct). In addition, the other three probabilities above are all equal to $\hat{p}(1 - \hat{p})$.

We now show that there exists a weighted loss function that in expectation equals *Ltriplet*.

**Lemma A4.** *Define*

$$\begin{split} \tilde{L}\_{triplet} = \mathbb{E}\_{\substack{\mathbf{x} \sim \mathcal{P},\ \hat{\mathbf{x}}\_1^+, \hat{\mathbf{x}}\_2^+ \sim \hat{\mathcal{P}}^+(\cdot|\mathbf{x}),\\ \hat{\mathbf{x}}\_1^-, \hat{\mathbf{x}}\_2^- \sim \hat{\mathcal{P}}^-(\cdot|\mathbf{x})}} \Big[ & -w^+ \sigma(\mathbf{x}, \hat{\mathbf{x}}\_1^+) + w^- \sigma(\mathbf{x}, \hat{\mathbf{x}}\_1^-) + w\_1 \log \left( \exp(\sigma(\mathbf{x}, \hat{\mathbf{x}}\_1^+)) + \exp(\sigma(\mathbf{x}, \hat{\mathbf{x}}\_1^-)) \right) \\ & - w\_2 \log \left( (\exp(\sigma(\mathbf{x}, \hat{\mathbf{x}}\_1^+)) + \exp(\sigma(\mathbf{x}, \hat{\mathbf{x}}\_2^+))) \cdot (\exp(\sigma(\mathbf{x}, \hat{\mathbf{x}}\_1^-)) + \exp(\sigma(\mathbf{x}, \hat{\mathbf{x}}\_2^-))) \right) \Big], \end{split}$$

*where*

$$w^{+} = \frac{\hat{p}^{2} + (1 - \hat{p})^{2}}{(2\hat{p} - 1)^{2}}, \qquad w^{-} = \frac{2\hat{p}(1 - \hat{p})}{(2\hat{p} - 1)^{2}}, \qquad w\_{1} = \frac{\hat{p}^{2} + (1 - \hat{p})^{2}}{(2\hat{p} - 1)^{2}}, \qquad w\_{2} = \frac{\hat{p}(1 - \hat{p})}{(2\hat{p} - 1)^{2}}.$$

*Then* $\mathbb{E}[\tilde{L}\_{triplet}] = L\_{triplet}$*.*

**Proof.** We evaluate $\mathbb{E}\left[ -w^+ \sigma(\mathbf{x}, \hat{\mathbf{x}}\_1^+) + w^- \sigma(\mathbf{x}, \hat{\mathbf{x}}\_1^-) \right]$ and the other terms separately. Using the same probabilities as computed in Lemma A3,

$$\begin{split} & \mathbb{E}\left[ -w^+ \sigma(\mathbf{x}, \hat{\mathbf{x}}\_{1}^{+}) + w^- \sigma(\mathbf{x}, \hat{\mathbf{x}}\_{1}^{-}) \right] = -(\hat{p}^{2} + (1 - \hat{p})^{2})\, w^+\, \mathbb{E}\left[ \sigma(\mathbf{x}, \mathbf{x}\_{1}^{+}) \right] \\ & - 2\hat{p}(1 - \hat{p})\, w^+\, \mathbb{E}\left[ \sigma(\mathbf{x}, \mathbf{x}\_{1}^{-}) \right] + (\hat{p}^{2} + (1 - \hat{p})^{2})\, w^-\, \mathbb{E}\left[ \sigma(\mathbf{x}, \mathbf{x}\_{1}^{-}) \right] + 2\hat{p}(1 - \hat{p})\, w^-\, \mathbb{E}\left[ \sigma(\mathbf{x}, \mathbf{x}\_{1}^{+}) \right] \\ & = -\mathbb{E}\left[ \sigma(\mathbf{x}, \mathbf{x}\_{1}^{+}) \right]. \end{split}$$

We evaluate the remaining terms:

$$\begin{split} & \mathbb{E}\left[ w\_{1} \log\left( \exp\left( \sigma(\mathbf{x}, \hat{\mathbf{x}}\_{1}^{+}) \right) + \exp\left( \sigma(\mathbf{x}, \hat{\mathbf{x}}\_{1}^{-}) \right) \right) \right] = \\ & (\hat{p}^{2} + (1 - \hat{p})^{2})\, w\_{1}\, \mathbb{E}\left[ \log\left( \exp\left( \sigma(\mathbf{x}, \mathbf{x}\_{1}^{+}) \right) + \exp\left( \sigma(\mathbf{x}, \mathbf{x}\_{1}^{-}) \right) \right) \right] \\ & + \hat{p}(1 - \hat{p})\, w\_{1}\, \mathbb{E}\left[ \log\left( (\exp(\sigma(\mathbf{x}, \mathbf{x}\_{1}^{+})) + \exp(\sigma(\mathbf{x}, \mathbf{x}\_{2}^{+}))) \cdot (\exp(\sigma(\mathbf{x}, \mathbf{x}\_{1}^{-})) + \exp(\sigma(\mathbf{x}, \mathbf{x}\_{2}^{-}))) \right) \right], \end{split}$$

and

$$\begin{split} & \mathbb{E}\left[ w\_{2} \log\left( \exp\left( \sigma(\mathbf{x}, \hat{\mathbf{x}}\_{1}^{+}) \right) + \exp\left( \sigma(\mathbf{x}, \hat{\mathbf{x}}\_{2}^{+}) \right) \right) \right] + \mathbb{E}\left[ w\_{2} \log\left( \exp\left( \sigma(\mathbf{x}, \hat{\mathbf{x}}\_{1}^{-}) \right) + \exp\left( \sigma(\mathbf{x}, \hat{\mathbf{x}}\_{2}^{-}) \right) \right) \right] = \\ & (\hat{p}^{2} + (1 - \hat{p})^{2})\, w\_{2}\, \mathbb{E}\left[ \log\left( \exp\left( \sigma(\mathbf{x}, \mathbf{x}\_{1}^{+}) \right) + \exp\left( \sigma(\mathbf{x}, \mathbf{x}\_{2}^{+}) \right) \right) \right] \\ & + 4\hat{p}(1 - \hat{p})\, w\_{2}\, \mathbb{E}\left[ \log\left( \exp\left( \sigma(\mathbf{x}, \mathbf{x}\_{1}^{+}) \right) + \exp\left( \sigma(\mathbf{x}, \mathbf{x}\_{1}^{-}) \right) \right) \right] \\ & + ((1 - \hat{p})^{2} + \hat{p}^{2})\, w\_{2}\, \mathbb{E}\left[ \log\left( \exp\left( \sigma(\mathbf{x}, \mathbf{x}\_{1}^{-}) \right) + \exp\left( \sigma(\mathbf{x}, \mathbf{x}\_{2}^{-}) \right) \right) \right]. \end{split}$$

Examining the coefficients, we see that

$$\begin{aligned} (\hat{p}^2 + (1-\hat{p})^2) w\_1 - 4\hat{p}(1-\hat{p}) w\_2 &= \frac{(\hat{p}^2 + (1-\hat{p})^2)^2}{(2\hat{p}-1)^2} - \frac{4\hat{p}^2(1-\hat{p})^2}{(2\hat{p}-1)^2} = 1, \\ \hat{p}(1-\hat{p}) w\_1 - (\hat{p}^2 + (1-\hat{p})^2) w\_2 &= \frac{\hat{p}(1-\hat{p})(\hat{p}^2 + (1-\hat{p})^2)}{(2\hat{p}-1)^2} - \frac{(\hat{p}^2 + (1-\hat{p})^2)\hat{p}(1-\hat{p})}{(2\hat{p}-1)^2} = 0, \end{aligned}$$

which shows that only the term $\mathbb{E}\left[\log\left( \exp\left( \sigma(\mathbf{x}, \mathbf{x}\_1^+) \right) + \exp\left( \sigma(\mathbf{x}, \mathbf{x}\_1^-) \right) \right)\right]$ persists. This completes our proof.

We now show the general case for debiasing *Lattract*, which uses more negative samples.

**Proposition A1.** *Define* $m = n + 1$ *(the "batch size" in the denominator), and*

$$\tilde{L}\_{attract} = \mathbb{E}\_{\substack{\mathbf{x} \sim \mathcal{P},\\ \{\hat{\mathbf{x}}\_i^+\}\_{i=1}^{m},\ \{\hat{\mathbf{x}}\_j^-\}\_{j=1}^{m}}} \Bigg[ -w^+ \sigma(\mathbf{x}, \hat{\mathbf{x}}\_1^+) + w^- \sigma(\mathbf{x}, \hat{\mathbf{x}}\_1^-) \tag{A8}$$

$$+ \sum\_{k=0}^{m} w\_k \log\left( \sum\_{i=1}^{k} \exp\left( \sigma(\mathbf{x}, \hat{\mathbf{x}}\_i^+) \right) + \sum\_{j=1}^{m-k} \exp\left( \sigma(\mathbf{x}, \hat{\mathbf{x}}\_j^-) \right) \right) \Bigg]. \tag{A9}$$

$w^+$ *and* $w^-$ *are defined in the same way as before.* $\vec{w} = \{w\_0, \ldots, w\_m\} \in \mathbb{R}^{m+1}$ *is the solution to the system* $\mathbf{P}\vec{w} = \mathbf{e}\_2$*, where* $\mathbf{e}\_2$ *is the standard basis vector in* $\mathbb{R}^{m+1}$ *whose 2nd index is 1 and all others are 0. The* $i,j$*th element of* $\mathbf{P}$ *is* $\mathbf{P}\_{ij} = \hat{p}\, \mathbf{Q}\_{i,j} + (1 - \hat{p})\, \mathbf{Q}\_{m-i,j}$*, where*

$$\mathbf{Q}\_{i,j} = \begin{cases} \sum\_{k=0}^{\min\{j, m-i\}} \binom{j}{k} \binom{m-j}{i-j+k} (1-\hat{p})^{i-j+2k} \hat{p}^{m+j-i-2k} & j \le i \\\sum\_{k=0}^{\min\{i, m-j\}} \binom{m-j}{k} \binom{j}{j-i+k} (1-\hat{p})^{j-i+2k} \hat{p}^{m-j+i-2k} & j > i \end{cases}$$
*Then,* $\mathbb{E}\left[\tilde{L}\_{attract}\right] = L\_{attract}$*.*

We do not present the proof for Proposition A1, but the steps are very similar to the proof for the triplet loss case. We also note that a different form of $\tilde{L}\_{attract}$ must be computed for the multi-class case, which we do not present here (but can be derived through computation).
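To make the construction concrete, the NumPy sketch below (ours) builds $\mathbf{Q}$ and $\mathbf{P}$ from the formulas above and solves $\mathbf{P}\vec{w} = \mathbf{e}\_2$ for the weights. It also makes the instability discussed in Observation 4 below easy to see, since the solved weights typically grow rapidly as $\hat{p}$ approaches 0.5 or as $m$ increases.

```python
import numpy as np
from math import comb

def Q_entry(i: int, j: int, m: int, p: float) -> float:
    """Q_{i,j} as defined in Proposition A1, with p the probability a noisy label is correct."""
    if j <= i:
        return sum(comb(j, k) * comb(m - j, i - j + k)
                   * (1 - p) ** (i - j + 2 * k) * p ** (m + j - i - 2 * k)
                   for k in range(min(j, m - i) + 1))
    return sum(comb(m - j, k) * comb(j, j - i + k)
               * (1 - p) ** (j - i + 2 * k) * p ** (m - j + i - 2 * k)
               for k in range(min(i, m - j) + 1))

def debias_weights(m: int, p: float) -> np.ndarray:
    """Solve P w = e_2 for the (m+1)-dimensional weight vector of Proposition A1."""
    P = np.array([[p * Q_entry(i, j, m, p) + (1 - p) * Q_entry(m - i, j, m, p)
                   for j in range(m + 1)] for i in range(m + 1)])
    e2 = np.zeros(m + 1)
    e2[1] = 1.0                      # the "2nd index" of the standard basis vector
    return np.linalg.solve(P, e2)

print(debias_weights(m=4, p=0.9))
print(debias_weights(m=4, p=0.6))    # closer to random noise: weights are typically much larger
```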

**Observation 4.** *Note that the values of* $\mathbf{Q}\_{i,j}$ *have high variance in the noise rate as m increases. Additionally, note that the number of terms in the summation of* $\mathbf{Q}\_{i,j}$ *increases combinatorially with m. We found this de-noising algorithm to be very unstable as a result.*

### Appendix F.3.2. Additional Noisy Label Results

Now we report the performance of the denoising algorithms with additional amounts of labeled data for the cross entropy loss head. We also report the performance of using $\tilde{L}\_{attract}$ to debias noisy labels.

Figure A2 shows the results. Our geometric correction together with *Lspread* works the most consistently. Using the geometric correction with *LSC* can be unreliable, since *LSC* can memorize noisy labels early on in training. The expectation-based debiasing algorithm $\tilde{L}\_{attract}$ occasionally shows promise but is unreliable, and is very sensitive to having the correct noise rate as an input.

**Figure A2.** Performance of models under various amounts of label noise for the contrastive loss head, and various amounts of clean training data for the cross entropy loss.
