**Appendix B**

Here we discuss the time and transfer complexity of CLESS versus self-attention models. We do so because time complexity alone is only meaningful if the data-efficiency of the two methods is comparable: it is the combination of convergence speed, computation speed, and end-task performance that makes a model effective and efficient.

**Table A1.** Per-layer time complexity, data-efficiency, number of trainable parameters, and number of all parameters. The data-efficiency of convolutions (\*) is reported in various works to be superior to that of self-attention models [28,30,52–56]. *d* is the input embedding size; increasing it slows down convolutions. *n* is the input sequence length; increasing it slows down self-attention the most [57]. Optimizations exist for both problems.


**Time complexity:** Our text encoder uses a single 1D CNN encoder layer, which has a complexity of *O*(*n* · *k* · *d* · *f*) vs. *O*(*n*<sup>2</sup> · *d*) for vanilla self-attention as outlined in Vaswani et al. [57]. Here *n* is the input sequence length, *k* is the convolution filter size, *d* is the input embedding dimension (*d* = 512 in [57] vs. *d* = 100 for us), and *f* is the number of convolution filters (at most *f* = 3 · 100 for our (3.XL) pretraining model). Since we use kernel sizes {1, 2, 3}, the largest configuration (3.XL) yields *O*(*n* · *k* · *d* · *f*) with *k* = 6 and *f* = 3*d*, i.e., ≈ *O*(*n* · 3*d*<sup>2</sup>) (treating the small kernel-size sum *k* = 6 as a constant), vs. *O*(*n*<sup>2</sup> · 5*d*) for a vanilla (2017) self-attention setup, where *d* = 512 ≈ 5 · 100. Furthermore, Transformer self-attention runs an *n*-way softmax computation at every layer (e.g., 16 layers), while we run *g* · *b* single-class predictions at the final output layer using a noise contrastive estimation (NCE) objective.

We use NCE to undersample both true negative labels (label = 0) and positive and negative pseudo labels (input words), as sketched below. If the goal is to learn a specific supervised end-task, more informed sampling of positive and negative pseudo labels can be devised. However, we did not intend to overfit the supervised task by adding such hand-crafted human biases. Instead, we use random sampling to pretrain a model for arbitrary downstream tasks (generalization), following a similar logic as random masking in masked language modeling.
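To make the comparison concrete, the following is a back-of-the-envelope sketch of the per-layer cost formulas quoted above. The sequence lengths are illustrative values chosen by us, not figures from the paper, and the cost functions count the dominant multiply terms only.

```python
# Back-of-the-envelope comparison of per-layer costs, using the formulas
# above: O(n * k * d * f) for a multi-kernel 1D CNN layer vs. O(n^2 * d)
# for vanilla self-attention. Variable names follow the text; sequence
# lengths below are illustrative, not from the paper.

def cnn_layer_cost(n: int, kernel_sizes=(1, 2, 3), d: int = 100,
                   filters_per_size: int = 100) -> int:
    """Cost of one multi-kernel 1D CNN layer: sum over kernel sizes k of n*k*d*f."""
    return sum(n * k * d * filters_per_size for k in kernel_sizes)

def self_attention_cost(n: int, d: int = 512) -> int:
    """Cost of one vanilla self-attention layer: n^2 * d (Vaswani et al. [57])."""
    return n ** 2 * d

for n in (32, 128, 512, 2048):      # illustrative sequence lengths
    cnn = cnn_layer_cost(n)         # CLESS-style: d = 100, f = 3 * 100
    attn = self_attention_cost(n)   # Transformer-style: d = 512
    print(f"n={n:5d}  cnn={cnn:>12,}  attn={attn:>14,}  ratio={attn / cnn:7.2f}")
```

As the printed ratios show, the quadratic *n*<sup>2</sup> term makes self-attention increasingly more expensive than the convolution layer as sequences grow.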
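Below is a minimal sketch of the random pseudo-label sampling described above, assuming input words serve as labels. Reading *g* as the number of positive and *b* as the number of negative samples per example is our interpretation for illustration, not a confirmed detail of CLESS, and the function name and vocabulary handling are hypothetical.

```python
import random

# Hypothetical sketch of random NCE-style undersampling: positives are drawn
# from words present in the input, negatives from words absent from it.
# The g/b split is our assumption, not the paper's implementation.

def sample_pseudo_labels(input_words, vocabulary, g=5, b=10, rng=random):
    """Return (word, target) pairs: g positives drawn from the input,
    b negatives drawn from vocabulary words absent from the input."""
    present = set(input_words)
    positives = rng.sample(sorted(present), min(g, len(present)))
    absent = [w for w in vocabulary if w not in present]
    negatives = rng.sample(absent, min(b, len(absent)))
    return [(w, 1) for w in positives] + [(w, 0) for w in negatives]

# Illustrative usage:
vocab = ["cat", "dog", "runs", "fast", "slow", "bird", "sits", "jumps"]
pairs = sample_pseudo_labels(["cat", "runs", "fast"], vocab, g=2, b=3)
print(pairs)  # e.g., [('cat', 1), ('fast', 1), ('dog', 0), ('bird', 0), ('sits', 0)]
```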

**Transfer complexity:** Traditional transfer NLP approaches like RoBERTa [13] need to initialize a new classification head per task, which requires either training a new model per task or a joint multi-task learning setup. CLESS, however, can train multiple tasks, even if they arrive sequentially over time, while reusing the same classifier head from prior pretraining or fine-tuning. Thus, there is no need to retrain a separate model each time, as in current Transformer transfer models. Once pretrained, a CLESS model can zero-shot transfer to any new task, since the match classifier is reused, as sketched below.
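The following schematic sketch (not the paper's implementation) contrasts the two transfer styles. The dimensions, the dot-product scorer, and the use of randomly initialized embeddings are illustrative assumptions; the point is only that a match classifier scores (text, label) embedding pairs and is therefore independent of any one task's label set.

```python
import torch

d = 100  # embedding size, as in the text

# RoBERTa-style: a new head per task, its output size fixed by that task's
# label set, so it must be re-initialized (and retrained) for each new task.
per_task_head = torch.nn.Linear(d, 7)  # e.g., a 7-class task

# CLESS-style: one binary match scorer over (text, label) embedding pairs.
def match_score(text_emb: torch.Tensor, label_emb: torch.Tensor) -> torch.Tensor:
    """Probability that each label embedding matches the text embedding."""
    return torch.sigmoid((text_emb * label_emb).sum(-1))

# A new task only supplies new label embeddings; the scorer is unchanged,
# which is what enables the zero-shot reuse described above.
text = torch.randn(d)
new_task_labels = torch.randn(4, d)  # 4 previously unseen classes
print(match_score(text, new_task_labels))  # 4 match probabilities
```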
