Contributions

We address RQ-2 by proposing a contrastive language model objective that *unifies supervised learning with self-supervised pretraining* to produce a *small model with strong long-tail retention* that is cheap to compute, thereby avoiding the need for compressing a large model. This takes inspiration from supervised contrastive learning, which is known to improve long-tail learning in NLP [8,14,15]. However, we add *self-supervised contrastive learning*, since its effect has not been studied in the context of language models for long-tail learning, especially not with the requirement of producing small models. We call this unified learning objective Contrastive Long-tail Efficient Self-Supervision, or CLESS. The method constructs pseudo-labels from input text tokens and uses them for contrastive self-supervised pretraining. During supervised fine-tuning on real (long-tail) labels, the model directly reuses the self-supervision task head to predict real, human-annotated text labels. Thus, we unify self-supervised and supervised learning regimes into a 'text-to-text' approach. This builds on ideas for large PLMs that use 'text-to-text' prediction, like T5 [16], and extends them to contrastive self-supervision to ensure long-tail retention in small language models that pretrain efficiently, even under strong data limitations. Using a 'text-to-text' prediction objective allows for modeling arbitrary NLP tasks by design, though in this work we focus exclusively on improving the under-studied field of long-tail language modeling.

We evaluate RQ-1 and RQ-2 by comparing RoBERTa against CLESS regarding long-tail prediction in Section 5.1. To address RQ-3, we study three long-tail learning performance aspects. (RQ-3.1) We study how well our contrastive self-supervised pretraining generalizes to long-tail label prediction without using labeled examples, i.e. zero-shot, long-tail prediction, in Section 5.2. (RQ-3.2) We evaluate how zero-shot performance is impacted by increased model size and pseudo-label amount during self-supervised pretraining (Section 5.2). (RQ-3.3) Finally, we investigate our models' few-shot learning capabilities during supervised long-tail fine-tuning and compare the results to the RoBERTa model in Section 5.3.

### **2. Related Work**

In this section, we summarize related work and how it influenced our method design and evaluation strategy.

### *2.1. Long-Tail Compression*

Works by Hooker et al. [1,6] raised awareness of the disproportionate loss of long-tail information during model compression and the undesirable rise in algorithmic bias and fairness issues this may cause. Other works such as Liu et al. [3] pointed out that real-world learning is always long-tailed and that few-shot and zero-shot learning settings naturally arise in tailed, real-world distributions. To make matters worse, real-world long-tail data is highly vulnerable to noise, which creates drastic learning and evaluation challenges, especially for self-supervised learning methods. For example, D'souza et al. [4] identify types of noise that especially impact long-tail data prediction, and Zhuang et al. [7] find that noise disproportionately affects long-tail metrics. In fact, all the aforementioned works show that top-k metrics hide long-tail performance losses. This means that we need long-tail sensitive evaluation, which inspired us to use Average Precision as a measure. In addition, we split tail analysis into 5 buckets that all contain an equal amount of positive labels, where each bucket contains increasingly more and rarer classes—see Section 4. These label imbalances in long-tail tasks make manual noise treatment very cumbersome, but fortunately, contrastive objectives are naturally robust to label noise, as we detail in the next subsection.

### *2.2. Contrastive Learning Benefits*

Contrastive objectives like Noise Contrastive Estimation (NCE) have been shown to be much more robust against label noise overfitting than the standard cross-entropy loss [17]. Additionally, Zimmermann et al. [18] found that contrastive losses can "recover the true data distribution even from very limited learning samples". Supervised contrastive learning methods like Chang et al. [8], Liu et al. [14], Pappas and Henderson [15], Zhang et al. [19] have repeatedly demonstrated improved long-tail learning. Finally, Jiang et al. [11] recently proposed contrastive long-tail compression into smaller models. However, this still leaves the research question (RQ-1) of whether large models learn the long tail well enough in the first place unanswered. These observations, learning properties, and open research questions inspired us to forgo large model training and subsequent compression, and to instead train small contrastive models extended with contrastive self-supervision, combining the benefits of language model pretraining and contrastive learning. This imbues a small (contrastive language) model with strong long-tail retention capabilities, as well as with data-efficient learning for better zero to few-shot learning—as detailed in the results Section 5.

### *2.3. Long-Tail Learning*

Long-tail learning has prolific subfields like extreme classification, which is concerned with supervised long-tail learning and top-line metric evaluation. The field provides varied approaches for different data input types like images [3], categorical data, or text classification, using small supervised models [14] or large PLMs fine-tuned with supervision, like Chang et al. [8], for supervised tail learning. However, these methods only explore *supervised* contrastive learning and limit their evaluation to *top-line metrics*, which, as mentioned above, mask long-tail performance losses. This naturally leads us to explore the effects of *self-supervised contrastive* learning (or pretraining), as one might expect such pretraining to enrich long-tail information before supervised tail learning. Additionally, as mentioned above, we use Average Precision over all classes, rather than over the top-k classes, to *unmask long-tail performance losses*.

### *2.4. Negative and Positive Generation*

As surveys like Musgrave et al. [20], Rethmeier and Augenstein [21] point out, traditional contrastive learning research focuses on generating highly informative (hard) **negative samples**, since most contrastive learning objectives only use *a single positive learning sample* and *b* (bad) negative samples—Musgrave et al. [20] give an excellent overview. However, if too many negative samples are generated, they can collide with positive samples, which degrades learning performance [22]. More recent computer vision works like Khosla et al. [23], Ostendorff et al. [24] propose generating multiple **positive** samples to boost *supervised contrastive learning* performance, while Wang and Isola [25] show that, when generating positive samples, the representations of positives should be close (related) to each other. Our method builds on these insights and extends them to *self-supervised contrastive learning* and to the language model domain using a straightforward extension to NCE. Instead of using only one positive example like the standard NCE by Mnih and Teh [26], our method uses *g* good (positive) samples (see Section 3). To ensure that positive samples are representationally close (related) during self-supervised contrastive pretraining, we use words from the current input text as positive 'pseudo-labels'—i.e., we draw self-supervision pseudo-labels from a related context. Negative pseudo-labels (words) are drawn from other in-batch text inputs, where negative sample words are not allowed to appear in the current text, to avoid the above-mentioned collision of positive and negative samples.

### *2.5. Data and Parameter Efficiency*

Using CNN layers can improve data and compute efficiency over self-attention layers, as found by various works [27–29]. Data-efficiency is paramount when pretraining on limited data, which, for (rare) long-tail information, is by definition always the case. Radford et al. [30] find that replacing a Transformer language encoder with a CNN backbone increases zero-shot data-efficiency threefold. We thus use a small CNN text encoder, while for more data-abundant or short-tail pretraining scenarios a self-attention encoder may be used instead. **Our method is designed to increase the self-supervision signal, i.e., by sampling more positives and negatives, to compensate for a lack of large pretraining data (signal)—since rare and long-tailed data is always limited**. Our goal is to skip compression and still train small models capable of long-tail prediction. Notably, CLESS pretraining does not require special learning rate schedules, residuals, normalization, warm-ups, or a modified optimizer, as do many BERT variations [13,31,32].

### *2.6. Label Denoising*

Dropout on discrete {0, 1} labels has been shown to increase label noise robustness [33]. We use dropout on both the dense text and label embeddings. This creates a 'soft', but dense, label noise during both self-supervised and supervised training, which is also similar to the sentence similarity pretraining by Gao et al. [34], who used text embedding dropout rather than label embedding dropout to generate augmentations for contrastive learning.

#### **3. CLESS: Unified Contrastive Self-supervised to Supervised Training and Inference**

As done in natural language usage, we express labels as words, or more specifically as word embeddings, rather than as {0, 1} label vectors. CLESS then learns to contrastively (mis-)match <text embedding, (pseudo/real) label embedding> pairs, as overviewed in Figure 1. For self-supervised pretraining, we in-batch sample *g* (good) positive and *b* (bad) negative <text, pseudo label> embedding pairs per text instance to then learn good and bad matches from them. Positive pseudo labels are a sampled subset of the words that appear in the current text instance. Negative pseudo labels are words sampled from the other texts within a batch. Crucially, negative words (pseudo labels) cannot be the same words as positive words (pseudo labels)—i.e. $\mathbf{w}\_i^{+} \cap \mathbf{w}\_j^{-} = \emptyset$.
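As an illustration, a minimal sketch of this in-batch sampling could look as follows; whitespace tokenization, the function name, and the sample sizes *g* and *b* are assumptions made for the sketch, not the paper's implementation:

```python
import random

def sample_pseudo_labels(batch_texts, idx, g=8, b=16, seed=0):
    """Sample g positive and b negative pseudo-labels for the idx-th text in a batch.

    Positives are words drawn from the current text itself; negatives are words
    drawn from the other in-batch texts, excluding any word that also occurs in
    the current text (avoiding positive/negative collisions).
    """
    rng = random.Random(seed)
    current_words = batch_texts[idx].split()
    current_vocab = set(current_words)

    positives = rng.sample(current_words, min(g, len(current_words)))
    candidates = [w for j, text in enumerate(batch_texts) if j != idx
                  for w in text.split() if w not in current_vocab]
    negatives = rng.sample(candidates, min(b, len(candidates)))
    return positives, negatives

# Toy in-batch example.
batch = ["measuring variable interaction effects",
         "median splits reduce statistical power",
         "bootstrapping the p value distribution"]
positives, negatives = sample_pseudo_labels(batch, idx=0, g=2, b=4)
```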

This deceptively simple sampling strategy ensures that we fulfill two important criteria for successful *self-supervised contrastive learning*. One, using multiple positive labels improves learning if we draw them from a similar (related) context, as Wang and Isola [25] proved. Two, we avoid collisions between positive and negative samples, which would otherwise degrade learning when using more negatives, as Saunshi et al. [22] find. Similarly, for supervised learning, we use *g* positive, real labels and undersample *b* negative labels to construct <text, positive/negative real label> pairs. A text-to-label classifier (step 5) learns to match <text, label> embedding pairs using a noise contrastive loss [35], which we extend to use *g* positives rather than just one. This unifies self-supervised and supervised learning as contrastive 'text embedding to (label) text embedding matching' and allows direct transfer, such as zero-shot prediction of real labels after pseudo label pretraining—i.e. without prior training on other real labels as required by methods like [15,19,36]. Below, we describe our approach and link specific design choices to insights from existing research in steps 1–6.

**Figure 1.** Contrastive <text, pseudo/real label> embedding pair matcher model: A word embedding layer *E* (step 1) embeds text and real/pseudo labels, where labels are word IDs. CLESS embeds a text ('measuring variable interaction'), real positive (R) or negative (p-value) labels, and positive (variable) or negative (median) pseudo labels. A sequence encoder *T* (step 2) embeds a single text, while a label encoder *L* (step 3) embeds *c* labels. Each text has multiple (pseudo) labels, so the text encoding $\mathbf{t}\_i$ is repeated for, and concatenated with, each label encoding $\mathbf{l}\_{i,l}^{\circ}$. The resulting batch of <text embedding, label embedding> pairs $[[\mathbf{t}\_i, \mathbf{l}\_{i,1}^{\circ}], \dots, [\mathbf{t}\_i, \mathbf{l}\_{i,c}^{\circ}]]$ (step 4) is fed into a 'matcher' classifier (step 5) that is trained in step 6 with a binary noise contrastive estimation loss $L\_B$ [35] over multiple label (mis-)matches {0, 1} per text instance $\mathbf{t}\_i$. Unlike older works, we add contrastive self-supervision over pseudo labels as a pretraining mechanism. Here, the word 'variable' is a positive self-supervision (pseudo) label for a text instance $\mathbf{t}\_i$, while words from other in-batch texts, e.g. 'median', provide negative pseudo labels.

We give the model a text instance *i* of words $\mathbf{w}\_i$ and a set of positive and negative label words $\mathbf{w}\_i^{\circ} = \mathbf{w}\_i^{+} \oplus \mathbf{w}\_j^{-} \in \mathbb{R}^{c=g+b}$. We also construct a label indicator $\mathbb{I}\_i$ as ground truth labels for the binary NCE loss in step 6. This label indicator contains a *g*-sized vector of ones $\mathbf{1} \in \mathbb{N}\_0^{g}$ to indicate positive (matching) <text, label> embedding pairs and a *b*-sized zero vector $\mathbf{0} \in \mathbb{N}\_0^{b}$ to indicate mismatching pairs, resulting in the indicator

$$\mathbb{I}\_i = \{ \mathbf{1} \oplus \mathbf{0} \} \in \mathbb{N}\_0^{c=g+b} \tag{1}$$

CLESS then encodes input text and labels in steps 1–3. First, both the input text (words) $\mathbf{w}\_i$ and the labels $\mathbf{w}\_i^{\circ}$ are passed through a shared embedding layer (step 1) to produce $E(\mathbf{w}\_i)$ as text embeddings and $E(\mathbf{w}\_i^{\circ})$ as label embeddings. Then, the text embeddings are encoded via a text encoder *T* (step 2), while the label embeddings are encoded by a label encoder *L* (step 3) as follows:

$$\mathbf{t}\_{i} = T(E(\mathbf{w}\_{i})) \tag{2}$$

$$\mathbf{L}\_{i}^{\circ} = L(E(\mathbf{w}\_{i}^{\circ})) = [\mathbf{l}\_{i,1}^{+}, \dots, \mathbf{l}\_{i,g}^{+}, \mathbf{l}\_{i,1}^{-}, \dots, \mathbf{l}\_{i,b}^{-}] \tag{3}$$

To make model learning more data-efficient, we initialize the embedding layer *E* with fastText word embeddings that we train on the 60MB of *in-domain text data*. Such word embedding training takes only a few seconds to compute, while allowing the text encoder architecture to be small, yet well initialized. The text encoder *T* consists of a single, k-max-pooled CNN layer followed by a fully connected layer, for computation speed and data-efficiency [30,37,38]. As a label encoder *L*, we average the embeddings of the words in a label and feed them through a fully connected layer—e.g. to encode the label 'p-value' we simply calculate the mean word embedding of the words 'p' and 'value'.
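The following is a minimal PyTorch sketch of such a text encoder (one CNN layer with k-max pooling plus a fully connected layer) and label encoder (mean word embedding plus a fully connected layer). All dimensions, filter counts, and names are illustrative assumptions rather than the exact CLESS configuration.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Single CNN layer with k-max pooling over time, followed by a fully connected layer."""
    def __init__(self, emb_dim=300, n_filters=128, kernel_size=3, k=5, out_dim=256):
        super().__init__()
        self.conv = nn.Conv1d(emb_dim, n_filters, kernel_size, padding=1)
        self.k = k
        self.fc = nn.Linear(n_filters * k, out_dim)

    def forward(self, emb):                              # emb: (batch, seq_len, emb_dim)
        h = torch.relu(self.conv(emb.transpose(1, 2)))   # (batch, n_filters, seq_len)
        h = h.topk(self.k, dim=-1).values                # k largest activations per filter
        return self.fc(h.flatten(1))                     # (batch, out_dim)

class LabelEncoder(nn.Module):
    """Mean of a label's word embeddings, followed by a fully connected layer."""
    def __init__(self, emb_dim=300, out_dim=256):
        super().__init__()
        self.fc = nn.Linear(emb_dim, out_dim)

    def forward(self, emb):                              # emb: (batch, c, label_words, emb_dim)
        return self.fc(emb.mean(dim=2))                  # (batch, c, out_dim)

# Shared embedding layer E; in CLESS it would be initialized from fastText vectors
# trained on the in-domain text (random initialization here, for illustration only).
E = nn.Embedding(num_embeddings=50_000, embedding_dim=300)
t_i = TextEncoder()(E(torch.randint(0, 50_000, (2, 40))))       # two texts, 40 tokens each
L_i = LabelEncoder()(E(torch.randint(0, 50_000, (2, 6, 3))))    # c=6 labels of up to 3 words
```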

To learn whether a text instance embedding $\mathbf{t}\_i$ matches any of the *c* label embeddings $\mathbf{l}\_{i,\cdot}^{\circ} \in \mathbf{L}\_i^{\circ}$, we repeat the text embedding $\mathbf{t}\_i$, *c* times, and concatenate text and label embeddings to get a matrix $\mathbf{M}\_i$ of <text, label> embedding pairs:

$$\mathbf{M}\_i = [[\mathbf{t}\_{i}, \mathbf{l}\_{i,1}^{+}], \dots, [\mathbf{t}\_{i}, \mathbf{l}\_{i,c}^{-}]] \tag{4}$$
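A small sketch of this pairing step, assuming batched PyTorch tensors (shapes and names are illustrative):

```python
import torch

def build_pair_matrix(text_emb, label_embs):
    """Pair one text embedding t_i with each of its c label embeddings.

    text_emb:   (batch, d)      encoded texts t_i
    label_embs: (batch, c, d)   encoded positive and negative labels L_i
    returns:    (batch, c, 2*d) <text, label> embedding pairs M_i
    """
    c = label_embs.size(1)
    repeated = text_emb.unsqueeze(1).expand(-1, c, -1)   # repeat t_i, c times
    return torch.cat([repeated, label_embs], dim=-1)
```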

This text-label pairing matrix $\mathbf{M}\_i$ is then passed to the matcher network *M* (step 5), which first applies dropout to each text-label embedding pair and then uses a three-layer MLP to produce a batch of *c* label match probabilities:

$$\mathbf{p}\_{i} = \{ \sigma(M(\mathbf{M}\_{i,1})), \dots, \sigma(M(\mathbf{M}\_{i,c})) \} \tag{5}$$

Here, applying dropout to label and text embeddings induces a *dense version of label noise*. Discrete {0,1} label dropout has been shown to improve robustness to label noise in Szegedy et al. [33], Lukasik et al. [39]. Because we always predict correct pseudo labels during pretraining, this forces the classifier to learn to correct the dropout-induced label noise.
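Below is a hedged PyTorch sketch of such a matcher, with dropout applied to the concatenated pair embeddings before a three-layer MLP; layer sizes and the dropout rate are illustrative assumptions, not the tuned CLESS values.

```python
import torch
import torch.nn as nn

class Matcher(nn.Module):
    """Dropout on each <text, label> embedding pair, then a three-layer MLP
    with a sigmoid output that scores every pair as a match probability."""
    def __init__(self, pair_dim=512, hidden=256, dropout=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Dropout(dropout),                 # dense 'label noise' on the pair embeddings
            nn.Linear(pair_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, pairs):                    # pairs: (batch, c, pair_dim), i.e. M_i
        return torch.sigmoid(self.net(pairs)).squeeze(-1)   # (batch, c) probabilities p_i
```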

Finally, we use a binary noise contrastive estimation loss as in [35], but extend it to use *g* positives, not one.

$$L\_B = -\frac{1}{c} \sum\_{l=1}^{c=g+b} \mathbb{I}\_{i,l} \cdot \log(\mathbf{p}\_{i,l}) + (1 - \mathbb{I}\_{i,l}) \cdot \log(1 - \mathbf{p}\_{i,l}) \tag{6}$$

Here, $L\_B$ is the mean binary cross-entropy loss over the *g* positive and *b* negative labels—i.e. it predicts $c = g + b$ label probabilities $\mathbf{p}\_i$, where the label indicators $\mathbb{I}\_i$ from Equation (1) are used as ground truth labels.
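A minimal sketch of this loss, assuming the *g* positive labels are ordered before the *b* negative ones so the indicator from Equation (1) can be built in place (names are illustrative):

```python
import torch
import torch.nn.functional as F

def cless_nce_loss(probs, g, b):
    """Mean binary NCE / cross-entropy loss over c = g + b match probabilities per text.

    probs: (batch, c) match probabilities p_i, with the g positive labels ordered first.
    """
    batch, c = probs.shape
    assert c == g + b
    # Label indicator I_i: g ones (matching pairs) followed by b zeros (mismatching pairs).
    indicator = torch.cat([torch.ones(batch, g, device=probs.device),
                           torch.zeros(batch, b, device=probs.device)], dim=1)
    return F.binary_cross_entropy(probs, indicator)
```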

Though we focus on evaluating CLESS for long-tail prediction in this work, other NLP tasks such as question answering or recognizing textual entailment can similarly be modeled as contrast pairs <*X* = '*text* 1 [*sep*] *text* 2', *Y* = '*is answer*'>. Unlike T5 language models [16], this avoids translating back and forth between discrete words and dense token embeddings. Not using T5's softmax objective also allows for predicting unforeseen (unlimited) test classes (labels). We provide details on hyperparameter tuning of CLESS for self-supervised and supervised learning in Appendix C.
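As a toy illustration (the examples below are invented and not taken from any dataset used here), such tasks reduce to the same <text, label text, match> triples that the matcher already consumes:

```python
# Hypothetical contrast pairs for a textual entailment task: the model would
# learn to (mis-)match <text, label text> pairs exactly as it does for
# <question, tag> pairs, so no task-specific output layer is needed.
entailment_pairs = [
    ("A man plays guitar [sep] A person is making music", "is entailment", 1),
    ("A man plays guitar [sep] The room is silent",       "is entailment", 0),
]
```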

**Figure 2. Head to long-tail as 5 balanced class bins:** We bin classes by label frequency. Each bin contains equally many active label occurrences. Classes within a bin are imbalanced and become few-shot or zero-shot towards the tail, especially after train/dev/test splitting. Class frequencies are given in log scale—task data details in Section 4.

#### **4. Data: Resource Constrained, Long-Tail, Multi-Label, Tag Prediction**

To study efficient, small model, long-tail learning for 'text-to-text' pretraining models, we choose a multi-label question tag prediction dataset as a testbed. We use the "Questions from Cross Validated" dataset, where machine learning concepts are tagged per question (https://www.kaggle.com/stackoverflow/statsquestions, accessed on 30 August 2021). This dataset is small (80MB of text) and entails solving a challenging 'text-to-text' long-tailed prediction task. The dataset has 85k questions with 244k positive labels, while we do not use answer texts. As with many real-world problems, labels are vague, since tagging was crowd-sourced. This means that determining the correct amount of tags per question (label density) is hard, even for humans. The task currently has no prior state-of-the-art. As seen in Figure 2, the dataset's class occurrence frequencies are highly long-tailed, i.e. the most frequent 20% of label occurrences stem from just 7 'head' classes, while the least frequent 20% of (rightmost) label occurrences cover 80% or 1061/1315 of classes. Tags are highly sparse—at most 4 out of 1315 tags are labeled per question. We pretrain fastText word embeddings on the unlabeled text data to increase learning efficiency, and because fastText embeddings only take a few seconds to pretrain. The full details regarding preprocessing can be found in Appendix A.
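The equal-mass frequency binning used for this head-to-tail analysis could be sketched as follows, assuming per-class positive label counts are available (a numpy sketch with illustrative names, not the paper's exact binning code):

```python
import numpy as np

def bin_classes_by_frequency(label_counts, n_bins=5):
    """Split classes into n_bins bins with roughly equal total positive label mass."""
    counts = np.asarray(label_counts)
    order = np.argsort(counts)[::-1]                 # most frequent classes first
    cumulative = np.cumsum(counts[order])
    targets = cumulative[-1] * np.arange(1, n_bins + 1) / n_bins
    edges = np.searchsorted(cumulative, targets)     # last class index of each bin
    bins, start = [], 0
    for edge in edges:
        bins.append(order[start:edge + 1])
        start = edge + 1
    return bins                                      # head bin first, tail bin last

# Toy example: 15 classes with long-tailed counts; tail bins contain ever more classes.
bins = bin_classes_by_frequency(
    [300, 250, 200, 150, 120, 90, 60, 40, 30, 20, 15, 10, 8, 5, 2], n_bins=5)
```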

**Long-tail evaluation metrics and challenges:**

Long-tail, multi-label classification is challenging to evaluate because (i) top-k quality measures mask performance losses on long-tailed minority classes, as Hooker et al. [6] point out. Furthermore, (ii) measures like *ROCAUC* overestimate performance under class imbalance [40,41], and (iii) discrete measures like the F-score are not scalable, as they require discretization threshold search under class imbalance. Fortunately, the Average Precision score $AP = \sum\_n (R\_n - R\_{n-1}) P\_n$ addresses issues (i-iii), where $P\_n$ and $R\_n$ are precision and recall at the *n*th threshold. We choose $AP\_{micro}$ weighting as this score variant is the hardest to improve.
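For reference, micro-averaged AP over a multi-label indicator matrix can be computed with scikit-learn; the arrays below are an illustrative toy example only:

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Toy multi-label example: 4 instances, 6 classes. y_true holds binary label
# indicators, y_score the model's per-class match probabilities p_i.
rng = np.random.default_rng(0)
y_true = np.array([[1, 0, 0, 1, 0, 0],
                   [0, 1, 0, 0, 0, 0],
                   [0, 0, 1, 0, 0, 1],
                   [1, 0, 0, 0, 1, 0]])
y_score = rng.random(y_true.shape)

# 'micro' pools all <instance, class> decisions before computing AP.
ap_micro = average_precision_score(y_true, y_score, average="micro")
print(f"micro-averaged AP: {ap_micro:.3f}")
```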
