**1. Introduction**

Long-tail information is disproportionately affected by model compression, which in turn has been linked to reduced algorithmic fairness for minority information [1,2]. Additionally, real-world data is subject to long-tail learning challenges such as class imbalance, few-shot learning, open-set recognition [3], and feature and label noise [4,5]. Crucially, Hooker et al. [6] and Zhuang et al. [7] find that common long-tail evaluation measures such as top-k metrics mask tail prediction performance losses. Current work on long-tail preservation in smaller models focuses on compressing large, supervised computer vision models [3,8–11], while general long-tail learning methods study only supervised contrastive learning.
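To make the masking effect concrete, the following minimal sketch (illustrative only, not from the cited works) contrasts overall accuracy with per-class accuracy on a hypothetical long-tail label distribution: the aggregate metric remains high even when most tail-class predictions fail.

```python
# Minimal sketch: overall accuracy can stay high even when tail classes are
# predicted poorly, because head classes dominate the average. Per-class
# accuracy exposes the loss. All numbers below are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical long-tail label distribution: class 0 is the head, 1-4 are the tail.
class_counts = np.array([950, 20, 15, 10, 5])
y_true = np.repeat(np.arange(5), class_counts)

# Hypothetical model: near-perfect on the head, wrong on 80% of tail examples.
y_pred = y_true.copy()
tail_idx = np.where(y_true > 0)[0]
y_pred[rng.choice(tail_idx, size=int(0.8 * len(tail_idx)), replace=False)] = 0

overall_acc = (y_pred == y_true).mean()
per_class_acc = [(y_pred[y_true == c] == c).mean() for c in range(5)]

print(f"overall accuracy: {overall_acc:.3f}")        # ~0.96, looks strong
print("per-class accuracy:", np.round(per_class_acc, 2))  # tail classes collapse
```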

In this work, we extend the field of 'long-tail preservation in compact models' to (self-supervised) *pretrained language models* (PLMs), and investigate whether contrastive language modeling (CLM) can be used to train a small, long-tail preserving model that requires neither compression nor large pretrained models. In this context, large PLMs are an important point of reference, since they are often assumed to serve as base models for arbitrary NLP downstream tasks, as a trade-off for their large pretraining costs. These models are pretrained over many text domains in the hope of achieving partial in-domain
pretraining that later overlaps with arbitrary downstream applications. This works well except in cases where fine-tuning data is limited [12]. Unfortunately, training data and sub-domains in the tail of a distribution are, by definition, limited and diverse, which foreseeably increases the domain distribution mismatch between large PLMs and long-tail distributed end-task data. Hence, in order to train long-tail preserving models, it is useful to study small-scale but in-domain pretraining, which, ideally, is similarly or more compute-efficient than fine-tuning a large PLM, while still achieving superior long-tail prediction performance. Thus, we first evaluate a large PLM in a challenging long-tail tag prediction setup (see Section 4) and then propose a small contrastive language model (CLM) to answer the following three research questions.
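As a rough illustration of what such a contrastive text-to-label objective could look like, the sketch below implements an InfoNCE-style loss over matched text and label embeddings. The encoder choice, embedding dimension, and temperature are illustrative assumptions; this is not the paper's exact CLM architecture.

```python
# Hedged sketch of a text-to-label contrastive objective (InfoNCE-style).
# Encoders, dimensions, and names are illustrative assumptions only.
import torch
import torch.nn.functional as F

def contrastive_loss(text_emb: torch.Tensor, label_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """text_emb, label_emb: (batch, dim) embeddings of matched text/label pairs."""
    text_emb = F.normalize(text_emb, dim=-1)
    label_emb = F.normalize(label_emb, dim=-1)
    # Similarity of every text to every label in the batch; the diagonal holds
    # the positive (matched) pairs, off-diagonal entries act as in-batch negatives.
    logits = text_emb @ label_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)

# Toy usage with random embeddings standing in for encoder outputs.
text_emb = torch.randn(8, 128)
label_emb = torch.randn(8, 128)
print(contrastive_loss(text_emb, label_emb))
```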

