**5. Results**

In this section, we analyze the three research questions: (RQ-1) Does RoBERTa learn long-tail tag prediction well? (RQ-2) Can a 12.5x smaller CLESS model achieve good long-tail prediction, and at what cost? (RQ-3) How does CLESS compare in zero- to few-shot prediction, and does its model size matter? We split the dataset 80/10/10 into training, development, and test sets. *Test scores or curves are reported for models that achieve the best development set average precision score APmicro over all 1315 classes.* RoBERTa has 125 million parameters and is pretrained on 160 GB of text data. CLESS has 8-10 million parameters and is pretrained on just 60 MB of in-domain text data. We use a ZeroR classifier, i.e., predicting the majority label per class, to establish imbalanced guessing performance. The ZeroR *APmicro* on this dataset is 0.002, since at most 4 of 1315 classes are active per instance, which underlines the challenge of the task.
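
To make the ZeroR baseline concrete, the following minimal sketch shows how such an imbalanced-guessing *APmicro* can be computed, assuming scikit-learn; the toy label matrix is illustrative and not the paper's dataset.

```python
# Minimal sketch: AP_micro of a ZeroR-style baseline on an extremely sparse
# multi-label problem. Toy data for illustration only, not the paper's tag set.
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
n_instances, n_classes = 1000, 1315

# Simulate the sparsity described above: at most 4 of 1315 classes are
# active per instance.
y_true = np.zeros((n_instances, n_classes), dtype=int)
for row in y_true:
    row[rng.choice(n_classes, size=rng.integers(1, 5), replace=False)] = 1

# ZeroR predicts the per-class majority label (always "inactive" here), so
# its constant scores rank all items equally and AP_micro collapses to the
# positive-label prevalence.
zeror_scores = np.full(y_true.shape, 0.5)
print(average_precision_score(y_true, zeror_scores, average="micro"))
# ~0.002, i.e., roughly (positives per instance) / 1315, as reported above.
```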

**Figure 3.** Long-tail performance (RQ-1, RQ-2) over all five head-to-tail class bins (see Figure 2). The tail class bin contains 80.7%, or 1061/1315, of classes. The non-pretrained CLESS (2) underperforms overall, while RoBERTa performs worst on the 80.7% of tail classes. The largest pretrained CLESS model (3.XL) outperforms RoBERTa in tail and mid class prediction, while performing nearly on par for the 7/1315 = 0.5% most common (head) classes.

### *5.1. (RQ-1+2): Long-Tail Capture of RoBERTa vs. CLESS*

Here we compare the long-tail prediction performance of RoBERTa (1) against CLESS setups that were either pretrained (3, 3.XL) or not pretrained (2). Plotting individual scores for 1315 classes is unreadable. Instead, we sort classes from frequent to rare and assign them to one of five bins, each covering 20% of the overall class frequency mass, such that all bins are balanced. This means all bins contain the same number of positive real labels (label occurrences) and are directly comparable. As seen in Figure 2, the head bin (left) thus contains the most frequent 7/1315 = 0.5% of classes, while the tail bin contains the most rarely occurring 1061/1315 = 80.7% of classes.
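
As an illustration, the binning can be sketched as follows; the `frequency_bins` helper and the Zipf-distributed toy counts are our own assumptions, not the authors' exact implementation.

```python
# Minimal sketch of the head-to-tail binning described above: sort classes
# by frequency, then fill five bins so that each holds ~20% of all positive
# label occurrences (equal label mass, not equal class counts).
import numpy as np

def frequency_bins(class_counts, n_bins=5):
    order = np.argsort(class_counts)[::-1]      # most to least frequent class
    target = class_counts.sum() / n_bins        # label mass per bin
    bins, current, mass = [], [], 0
    for c in order:
        current.append(c)
        mass += class_counts[c]
        if mass >= target and len(bins) < n_bins - 1:
            bins.append(current)
            current, mass = [], 0
    bins.append(current)                        # remaining rare classes: tail
    return bins

# Toy long-tail frequencies (Zipf-like), not the paper's actual statistics.
counts = np.random.default_rng(1).zipf(1.5, size=1315)
print([len(b) for b in frequency_bins(counts)])  # few head, many tail classes
```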

#### 5.1.1. RoBERTa: A Large Pretrained Model Does Not Guarantee Long-Tail Capture

Figure 3 shows how a RoBERTa model fine-tuned for tag prediction performs over the five class bins described above and in Section 4. RoBERTa learns the most common (0.5% head) classes well, but struggles with mid to tail classes. On the tail class bin, i.e., on 1061/1315 = 80.7% of classes, RoBERTa performs worse than a CLESS model that did not use contrastive pretraining (2). This allows multiple insights. One, *a large PLM should not implicitly be assumed to learn long-tail information well*. Two, large-scale pretraining data should not be expected to contain enough (rare) long-tail domain information for an arbitrary end-task, since tail data is by definition limited. Three, even a small supervised contrastive model, without pretraining, can improve long-tail retention (for 80.7% of classes). Together, these results indicate that compressing a large PLM may not be the optimal approach to training a small model capable of long-tail prediction.

#### 5.1.2. CLESS: Contrastive Pretraining Removes the Need for Model Compression

Models (3) and (3.XL) apply our contrastive pretraining to the end-task's 60 MB of unlabeled text data before supervised fine-tuning. We see that models with contrastive pretraining (3, 3.XL) noticeably outperform RoBERTa (1) and the non-pretrained contrastive model (2) on all non-head class bins, and especially on the 80.7% of tail classes. We also see that the pretrained model's parameter count affects CLESS performance, as the 10 million parameter model (3.XL) outperforms the 8 million parameter model (3) over all class bins, and especially the tail bin. These observations are especially encouraging, as they tell us that contrastive in-domain pretraining can produce small models capable of long-tail learning, without the need to compress large models. They also tell us that model capacity matters for long-tail information retention, though not in the commonly assumed sense that large PLMs are as useful here as they have proven to be for non-long-tail learning applications. This also means that contrastive self-supervised LM pretraining can help reduce the algorithmic bias caused by long-tail information loss in smaller models, the potential fairness impact of which was described by [1,2,6].

#### 5.1.3. Practical Computational Efficiency of Contrastive Language Modeling

Though the long-tail performance results of CLESS are encouraging, its computational burden should ideally be equal to or less than that of fine-tuning RoBERTa. When we analyzed training times, we found that RoBERTa took 126 GPU hours to fine-tune for 48 epochs when using 100% of fine-tuning labels. For the same task, CLESS (3.XL) took 7 GPU hours for self-supervised pretraining (without labels) and 5 GPU hours for supervised fine-tuning over 51 epochs. To bring CLESS to the same GPU compute load as RoBERTa (≈96%), we parallelized our data generation; otherwise, our training times double and the GPU load is only ≈45%. As a result, pretraining plus fine-tuning takes CLESS (3.XL) 12 GPU hours, compared to 126 for fine-tuning RoBERTa. This means that the proposed contrastive in-domain pretraining has both qualitative and computational advantages, while remaining applicable in scenarios where large collections of pretraining data are not available, which may benefit use cases like non-English or medical NLP. Additionally, both methods benefit from parameter search, but since CLESS unifies self-supervised pretraining and supervised fine-tuning into one objective, we can reuse pretraining hyperparameters during fine-tuning. A more in-depth account of computational trade-offs is given in Appendix B, while details of hyperparameter tuning are given in Appendix C.
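
For illustration, the parallelized data generation mentioned above might look like the following PyTorch-style sketch; the `ContrastiveSamples` dataset and all parameter values are hypothetical assumptions, not the authors' exact pipeline.

```python
# Minimal sketch of parallelized data generation: CPU workers build
# contrastive (text, pseudo-label) batches while the GPU trains, which is
# one common way to raise GPU utilization (here, from ~45% toward ~96%).
import torch
from torch.utils.data import DataLoader, Dataset

class ContrastiveSamples(Dataset):
    """Generates (text, positive/negative pseudo-label targets) on the fly."""
    def __init__(self, texts, n_pseudo_labels=150):
        self.texts, self.n = texts, n_pseudo_labels

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, i):
        # Placeholder sampling: a real pipeline would draw pseudo labels
        # (e.g., input words) as positives and random words as negatives.
        targets = torch.randint(0, 2, (self.n,)).float()
        return self.texts[i], targets

loader = DataLoader(
    ContrastiveSamples(["doc"] * 10_000),
    batch_size=32,
    num_workers=4,    # parallel workers generate batches ahead of the GPU
    pin_memory=True,  # faster host-to-GPU transfer
)
```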

It is of course possible to attempt to improve the long-tail performance of RoBERTa, e.g., via continued pretraining on the in-domain data [42] or by adding new tokens [43,44]. However, this further increases the computation and memory requirements of RoBERTa, while the model still has to be compressed, which requires even more computation. We also tried to further improve the embedding initialization of CLESS using the method described in [45], to boost its learning speed. While this helped in learning very small models (<2M parameters), it did not meaningfully impact the performance of contrastive pretraining or fine-tuning.

### *5.2. (RQ-3.1+2): Contrastive Zero-Shot Long-Tail Learning*

Thanks to the unified learning objective for self-supervised and supervised learning, CLESS enables zero-shot prediction of long-tail labels after self-supervised pretraining, i.e., without prior training on any labels. In this section, we therefore analyze the impact of using more model parameters (RQ-3.1), as well as more pseudo labels (RQ-3.2), during self-supervised contrastive pretraining.
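
As a rough illustration of why a unified matching objective permits zero-shot prediction, consider the sketch below; the placeholder `encode` function and the example labels are our own assumptions, not the CLESS architecture itself.

```python
# Sketch of zero-shot label scoring under a unified contrastive
# 'text embedding to text embedding' matching objective: because labels are
# themselves embedded as text, unseen real labels can be scored directly
# after self-supervised pretraining on pseudo labels.
import torch

def encode(texts, dim=128):
    # Placeholder: a real setup would run the pretrained text/label encoder.
    gen = torch.Generator().manual_seed(sum(len(t) for t in texts))
    return torch.nn.functional.normalize(
        torch.randn(len(texts), dim, generator=gen), dim=-1)

def zero_shot_scores(document, label_names):
    doc_emb = encode([document])       # (1, dim) document embedding
    label_embs = encode(label_names)   # (n_labels, dim) label-text embeddings
    # A sigmoid over each embedding match gives one independent probability
    # per label, which suits sparse multi-label long-tail prediction.
    return torch.sigmoid(doc_emb @ label_embs.T).squeeze(0)

print(zero_shot_scores("How do I parse JSON in Python?",
                       ["python", "json", "haskell"]))
```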

**Figure 4.** Zero-shot pretraining data-efficiency by model size, pseudo label amount, and pretraining text amount. Left: the zero-shot (data-efficiency) performance of the self-supervised pretraining base model (3) increases when adding more self-supervision pseudo labels (3.PL+) and when increasing model parameters (3.XL). Right: when only a portion of the pretraining input texts is used to pretrain model (3), its zero-shot learning slows down proportionally, but *still converges towards the 100% performance for all but the most extreme pretraining data reductions*.

#### 5.2.1. (RQ-3.1): More Self-Supervision and Model Size Improve Zero-Shot Long-Tail Capture

In Figure 4, we study how CLESS's zero-shot long-tail retention is impacted by: (left) using more pseudo labels (learning signal) during pretraining; and (right) using only portions of the unlabeled text data for pretraining. To do so, we pretrain CLESS variants on pseudo labels and evaluate each variant's zero-shot *APmicro* performance over all 1315 classes of the real-label test set from Section 4. As before, we show test score curves for the models with the best *APmicro* dev set performance.

The left plot of Figure 4 shows the effect of increasing the number of self-supervision pseudo labels and model parameters. The CLESS 8M model (3), pretrained with 8 million parameters and 150 pseudo labels, achieves around 0.10 *APmicro* on the real test labels as zero-shot long-tail performance. When increasing the pseudo label number to 500 in model (3.PL+), the model gains zero-shot performance (middle curve) without requiring more parameters. When additionally increasing the model parameters to 10M in (3.XL), the zero-shot performance increases substantially (top curve). Thus, both increasing the self-supervision signal amount and increasing the model size boost zero-shot performance.

#### 5.2.2. (RQ-3.2): Contrastive Pretraining Leads to Data-Efficient Zero-Shot Long-Tail Learning

Further, in the *right plot* of Figure 4, we see the CLESS 8M model (3) trained on increasingly smaller portions (100%, ..., 10%) of the pretraining text. For all *but the smallest pretraining data portions (<25%)*, the model still converges towards the original 100% performance. However, as expected, its convergence slows proportionally with smaller pretraining text portions, since each data reduction implies seeing less pseudo label self-supervision per epoch. As a result, the data-reduced setups need more training epochs, so we allowed 5*x* more waiting epochs for early stopping than in the *left* plot. Thus, our contrastive self-supervised objective can pretrain data-efficiently from very limited data. Similar data-efficiency gains from using contrastive objectives were previously only observed in computer vision applications by Zimmermann et al. [18], which confirms our initial intuition that contrastive self-supervision is generally useful for learning from limited data.
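
For concreteness, the patience scaling for the data-reduced runs could look like the sketch below; the `EarlyStopping` helper and the base patience value are hypothetical, and only the 5x factor comes from the text.

```python
# Minimal sketch of early stopping with scaled patience ("waiting epochs"):
# smaller pretraining text portions see less self-supervision per epoch, so
# patience is increased (here 5x) to give convergence enough time.
class EarlyStopping:
    def __init__(self, patience):
        self.patience, self.best, self.bad_epochs = patience, -float("inf"), 0

    def step(self, dev_ap_micro):
        # Track the best dev AP_micro; count epochs without improvement.
        if dev_ap_micro > self.best:
            self.best, self.bad_epochs = dev_ap_micro, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience  # True -> stop training

base_patience = 10                    # hypothetical default for 100% data
data_portion = 0.25                   # e.g., pretrain on 25% of the text
patience = base_patience * (5 if data_portion < 1.0 else 1)
stopper = EarlyStopping(patience)
```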

Methods like Pappas and Henderson [15], Jiang et al. [36], and Augenstein et al. [46] required supervised pretraining on real labels to later predict other, unseen labels in a zero-shot fashion. CLESS instead uses self-supervised pretraining to enable zero-shot prediction without training on real labels. This 'text-to-text' prediction approach is intentionally reminiscent of zero-shot prediction approaches in large PLMs like GPT-3 [47], but is instead designed to maximize zero-shot, long-tail prediction for use cases that strongly limit pretraining data amounts and model size. Hooker et al. [6] hypothesized that long-tail prediction depends on model capacity (parameter amount). Additionally, Brown et al. [47] found that zero-shot prediction performance depends on model capacity, but [48,49] experimentally showed or visualized how inefficiently model capacity is used by common models, especially after fine-tuning. From the above observations, we can confirm the impact of model size on the doubly challenging task of *long-tail, zero-shot prediction*, but we can also confirm that contrastive pretraining allows a model to use its capacity for long-tail capture much more efficiently, i.e., requiring 12.5x fewer parameters (capacity) than the common RoBERTa model. Perhaps more encouragingly, we also observed that cheap, contrastive in-domain pretraining boosts zero-shot prediction even when pretraining data is very limited, whether due to a lack of large domain text collections or due to data limitations caused by a long-tail distribution.

**Figure 5.** (RQ-3.3) Few-shot label-efficiency: (1) RoBERTa. (2) CLESS without pretraining. (3) CLESS with pretraining. (3.XL) CLESS pretrained with more pseudo labels and model parameters, as described in Section 5.2. *APmicro* test scores for few-shot portions: 100%, 50%, 10% of training samples with real labels. CLESS 10M outperforms RoBERTa, and retains 93.5% of its long-tail performance using only 10% of fine-tuning label texts.

### *5.3. (RQ-3.3): Few-Shot Long-Tail Learning*

Since CLESS models allow direct transfer (reuse) of the pretrained prediction head for supervised label prediction, one would also expect the models' few-shot long-tail prediction performance to benefit from self-supervised pretraining. We thus study the few-shot learning performance of both CLESS and RoBERTa to understand the differences between large pretrained language model (PLM) and small contrastive language model (CLM) pretraining in more detail. For the few-shot setup, we use 100%, 50%, and 10% of the labeled text instances for supervised training or fine-tuning of all models. This implies that labels which were common in the 100% setup become increasingly rare or few-shot in the 10% setup, since the smaller label sets are still long-tail distributed. We again use *APmicro* test set performance over all 1315 classes to compare models.
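
As an illustration, the few-shot portions might be drawn as in the sketch below; the `few_shot_portion` helper and the training set size are hypothetical assumptions, not the authors' exact procedure.

```python
# Minimal sketch of building the 100%/50%/10% few-shot training portions.
# Subsampling instances (rather than classes) preserves the long-tail label
# distribution, so rare tags naturally become few- or zero-shot at 10%.
import numpy as np

def few_shot_portion(train_indices, portion, seed=0):
    rng = np.random.default_rng(seed)
    n = int(round(len(train_indices) * portion))
    return rng.choice(train_indices, size=n, replace=False)

train_idx = np.arange(100_000)  # hypothetical training set size
splits = {p: few_shot_portion(train_idx, p) for p in (1.0, 0.5, 0.1)}
print({p: len(ix) for p, ix in splits.items()})
```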

In Figure 5, we see that when using full supervision (100%), all models perform similarly, with CLESS (3.XL) slightly outperforming RoBERTa (0.493 vs. 0.487 *APmicro* test). For few-shot learning (50%, 10%), we see that CLESS 3.XL retains 0.461/0.493 = 93.5% of its original performance when using only 10% of the fine-tuning labels, while RoBERTa and CLESS 8M each retain around 77%. This demonstrates that even a slightly larger contrastive pretraining model with an increased self-supervision signal (3.XL) not only improves zero-shot learning performance, as seen in Figure 4, but also markedly boosts few-shot performance. Noticeably, the only non-pretrained model (2) performs much worse than the others in the more restricted few-shot scenarios. Since models (2) and (3) use the same hyperparameters and only differ in being pretrained (3) or not (2), this demonstrates that contrastive self-supervised pretraining largely improves label-efficient learning.

**6. Conclusion**

We introduce CLESS, a contrastive self-supervised language model (CLM) that unifies self-supervised pretraining and supervised fine-tuning into a single contrastive 'text embedding to text embedding' matching objective. Through three research questions (RQ-1 to RQ-3), we demonstrate that this model achieves superior zero-shot, few-shot, and fully supervised long-tail retention in *small models, without needing to compress large models*. In RQ-1, we first show that a fine-tuned, large pretrained language model like RoBERTa should not implicitly be expected to learn long-tail information well. Then, in RQ-2, we demonstrate that our contrastive self-supervised pretraining objective enables very text data-efficient pretraining, which also results in markedly improved (label-efficient) few-shot and zero-shot long-tail learning. Finally, in RQ-3, we find that using more contrastive self-supervision signals and increasing model parameter capacity play important roles in boosting zero- to few-shot long-tail prediction performance when learning from very limited in-domain pretraining data. We also find that the very low compute requirements of our method make it a viable alternative to large pretrained language models, especially for learning from limited data or in long-tail learning scenarios, where tail data is naturally limited. In future work, we envision applying CLESS to low-data domains like medicine [27] and fact-checking [50], or to tasks where new labels emerge at test time, e.g., hashtag prediction [51]. Code and setup can be found at https://github.com/NilsRethmeier/CLESS (accessed on 30 August 2021).

**Author Contributions:** Conceptualization, N.R., I.A.; methodology, N.R.; software, N.R.; validation, N.R., I.A.; formal analysis, N.R., I.A.; investigation, N.R.; resources, N.R.; data curation, N.R.; writing— original draft preparation, N.R., I.A.; writing—review and editing, N.R., I.A.; visualization, N.R.; supervision, I.A.; project administration, N.R., I.A.; funding acquisition, N.R., I.A. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was conducted within the Cora4NLP project, which is funded by the German Federal Ministry of Education and Research (BMBF) under funding code 01IW20010.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The dataset used in this study can be found as described in the data section. The exact training data setup is available online and upon request to the corresponding author.

**Acknowledgments:** We thank Yanai Elazar, Pepa Atanasova, Christoph Alt, Leonard Hennig and Dustin Wright for providing feedback and discussion on earlier drafts of this work.

**Conflicts of Interest:** The authors declare no conflict of interest.
