Article

Extracting Sentence Embeddings from Pretrained Transformer Models

by
Lukas Stankevičius
* and
Mantas Lukoševičius
Faculty of Informatics, Kaunas University of Technology, LT-51368 Kaunas, Lithuania
*
Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(19), 8887; https://doi.org/10.3390/app14198887
Submission received: 14 August 2024 / Revised: 18 September 2024 / Accepted: 23 September 2024 / Published: 2 October 2024

Abstract:

Pre-trained transformer models shine in many natural language processing tasks and therefore are expected to bear the representation of the input sentence or text meaning. These sentence-level embeddings are also important in retrieval-augmented generation. But do commonly used plain averaging or prompt templates sufficiently capture and represent the underlying meaning? After providing a comprehensive review of existing sentence embedding extraction and refinement methods, we thoroughly test different combinations and our original extensions of the most promising ones on pretrained models. Namely, given BERT's 110 M parameters, its hidden representations from multiple layers, and many tokens, we try diverse ways to extract optimal sentence embeddings. We test various token aggregation and representation post-processing techniques. We also test multiple ways of using a general Wikitext dataset to complement BERT's sentence embeddings. All methods are tested on eight Semantic Textual Similarity (STS), six short text clustering, and twelve classification tasks. We also evaluate our representation-shaping techniques on other static models, including random token representations. The proposed representation extraction methods improve the performance on STS and clustering tasks for all models considered. The improvements for static token-based models are very high; in particular, random embeddings on STS tasks almost reach the performance of BERT-derived representations. Our work shows that the representation-shaping techniques significantly improve sentence embeddings extracted from BERT-based and simple baseline models.

1. Introduction

Early work on learnable word-level representations [1] showed that semantic meaning can be embedded in numerical vector representations. Arithmetic operations such as king − man + woman ≈ queen were valid. But can similar or higher abilities also be achieved for whole sentences, not just individual words? They would enable better clustering, classification, and other tasks depending on the whole meaning of the word sequence.
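For illustration, such analogy arithmetic can be reproduced with publicly available Word2Vec vectors; a minimal sketch using the gensim library (the vector file path is hypothetical) could look as follows:

```python
# A minimal sketch of word-analogy arithmetic, assuming pretrained Word2Vec
# vectors are available locally (the file path is a placeholder).
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("word2vec.bin", binary=True)

# king - man + woman should land near "queen" in the embedding space.
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # e.g., [('queen', 0.7...)]
```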
At the core of recent advancements in artificial intelligence is the transformer architecture [2]. Compared to previous RNN-based models, it is fully parallelizable, allowing faster training and better generalization from the data. By their 10th birthday, a child is expected to have already encountered and understood the meaning of more than 100 million words [3], and large language models now surpass such scales by multiple orders of magnitude. In addition to improved throughput, transformer models are based on a self-attention mechanism [4] that allows the model to attend to relevant parts of the input sequence. This is similar to how humans understand and experience the meaning of words: using context. The success of the transformer architecture in Natural Language Processing (NLP) tasks, starting with the BERT model [5], was later repeated in other fields such as vision [6], speech [7], and reinforcement learning [8]. Transformer models enabled the solving of more sophisticated tasks such as sentiment analysis and question answering, and became the state of the art in almost every NLP task. Therefore, such a model should also possess a representation of the whole input text sequence.
Transformer models generally have at least 12 layers and their parameters are counted in hundreds of millions or much more. Condensing the sentence representation using these weights into a fixed 768-length vector (a common length for token vectors in base transformer models) is a challenge. It turns out that simply extracting features from a transformer model’s last layer activations yields even worse results than much simpler models [9]. There were multiple works that tried to optimize the architecture parameters (see [10], which evaluated multiple proposed modifications), and multiple investigative works probing the properties of different parts of the model (see a critical review [11] of such approaches). Only the following methods gave tangible results: (1) the plain embedding averaging of all tokens comprising the sequence; (2) engineering a prompt template to condensate sentence representation into a single token; and (3) using a specially dedicated fine-tuned model designed to produce such vectors.
Fine-tuning is by far the most effective option. Resulting models, such as InferSent [12] or Sentence-BERT [9], showed state-of-the-art performance in sentence-level tasks at the time. However, the success of fine-tuning depends on several factors. It requires a high quantity of high-quality target-domain data, as well as computational resources, which may not always be available. Sometimes, even existing target-domain data cannot be used, as they are very expensive to label. Other difficulties emerge if the data contain sensitive or private information, which risks surfacing during inference. Under such constraints, one has to resort to the first two options of using raw encoded features to produce a vector for a text sequence.
Using feature aggregation instead of fine-tuning also allows us to better explain the inner workings of the state-of-the-art black-box models. Different parts of a transformer model may be responsible for different levels of representation, which favor different tasks. It is also important which levels in the representation hierarchy are easier to shape or process. A better understanding of the inner workings could help address hallucinations or other problems that current large consumer-grade language models face.
One of the two pre-training tasks of the famous BERT model [5] was next sentence prediction. It was optimized through a special [CLS] token, which sought to capture the whole-sequence-level representation. But in later works, such as [9], it was revealed that such a representation is very poor, not better than the classic ones, and the authors opted for simple averaging of the last layer tokens instead. The authors of [13] proposed using averaging of tokens of the first and last layers, and the authors of [14] also included hidden token representations of the second layer. But is that really the best way to obtain a numerical representation of the text’s meaning?
Pre-trained models like BERT capture a lot of useful representations, yet it is not that trivial to extract them. The authors of [15] showed that further improvements can be achieved by removing the most frequent, sub-word, uppercase, and punctuation tokens before averaging. Furthermore, even larger gains can be achieved by using a prompt template, “This sentence: “[X]” means [MASK]”, where the target sentence is placed instead of [X] and the representation of the token [MASK] is used as the final representation of the whole sentence. Such a method is a good example of representation extraction without any specific fine-tuning.
Inspired by the above findings, we hypothesize that there may be more ways to distill relevant sentence-level embeddings without directly fine-tuning the pre-trained BERT model. More concretely, in our analysis, we try to find a function that shapes and extracts representations of the BERT base-uncased model along its layers and target tokens, aided by an additional corpus, so that the best performance in multiple tasks is achieved. We evaluate our approach on short text clustering, semantic textual similarity, and classification tasks.
Considering the difficulties of acquiring representations from the state-of-the-art transformer models and following the success of recent works demonstrating it, our approach offers these main contributions:
  • We provide an extensive and organized review of related work on producing sentence-level embeddings from transformer models.
  • We experimentally test how multiple combinations of the most promising token aggregation and sentence representation post-processing techniques impact the performance on three classes of tasks and the properties of representations for several models.
  • We propose two competitive and simple static token models as baselines: random embeddings and averaged representations (“Avg”).
  • We propose an improvement for BERT: the BERT + Avg combined model. We experimentally test many weight and layer configurations to determine how the representations of BERT and Avg can be most effectively mixed.
The rest of this paper is organized as follows. We provide a review of related work in the literature on composing word vectors and representation reshaping to obtain sentence- or text-level embeddings in Section 2. In Section 3, we outline the experimental setting and give a detailed background on our chosen approach, models, and datasets. In Section 4, we present the results. Finally, we summarize the findings of this work in Section 5.

2. Related Work

Taking the BERT model as an example, we are interested in how representations can be aggregated over tokens and layers, how they can be modified, and which models produce the best representations. Finally, we look at the evaluation options.

2.1. Composing Word Vectors

Thousands of works have been carried out on word-level vectors. Popular Word2Vec [16] and GloVe [17] embeddings are now easily available for download. They are lightweight and usually fit various word-level tasks perfectly. On the other hand, modeling sequences of words is considered much more challenging. Count-based approaches lose information on word order and are sparse. Learning higher-order n-gram vector space models, covering not just words but phrases or sentences, leads to sparsity, as frequencies vanish for target n-grams and contexts. The latter issue, in particular, is the main driving force for distributed representations. We will cover methods dedicated specifically to the sentence level in Section 2.3. Nevertheless, the easiest solution is to reuse the individual word vectors of the sequence.
The Principle of Semantic Compositionality (usually called Frege’s principle) states that “the meaning of an expression is a function of, and only of, the meanings of its parts together with the method by which those parts are combined” [18]. Many scholars use this as a guide to how a sentence/paragraph/document vector should be formed. According to the principle, given the much more easily acquired word vectors, only the method of combination needs to be found. Therefore, in this subsection, we review the most popular candidate combination methods that should give us a sequence vector from its multiple word vectors.

2.1.1. Formal Semantics

Historically, the first methods were formal and based on logic. Here, the meaning of a sentence lies in the conditions under which it is true. A semantic parser, such as Boxer [19], is used to produce semantic representations of the given raw text. One can easily imagine the parse tree as a result of this analysis. As the structure is converted to first-order logic, resulting in a formula, it can then be checked with a theorem prover or interpreted with respect to a model, which is an abstract representation of a situation or setting [20,21,22]. Formal representations ensure that both semantic and syntactic information is preserved.
Unfortunately, formal methods have many practical shortcomings. A logic-based system must explicitly maintain the lexical knowledge necessary for the inference. Therefore, expensive human labor is involved in the construction and maintenance of these knowledge resources. More importantly, the knowledge must be domain-specific. One cannot simply use all the knowledge bases of the entire community, as this would hinder the complex inferences of theorem provers and model building. There is an area of research on reducing processing time given a large amount of knowledge resources [23]. Due to this property, logic-based systems have been criticized for their lack of robustness and scalability; implemented systems tend to be small-scale and domain-specific [24]. Although white-box, theoretically clear, and promising, formal semantics methods are in practice often surpassed by unsupervised distributional approaches capable of utilizing huge amounts of data. As shown in [25], symbolic representations can at least provide additional features for neural approaches.

2.1.2. Tensor Products

In [24,26], it was proposed to combine two representations, each in vector form, using tensor products, i.e., every element of the first vector is multiplied by every element of the second vector, retaining the products of all the pair combinations. This way, two rank-1 tensors result in a rank-2 tensor. Combining more vectors results in higher-rank tensors. It allows the tensors to represent the relations and role-filler bindings in a distributed fashion. However, it raises problems due to the dimensionality growing exponentially in size as more constituents are composed [27]. Furthermore, as noted in [28], tensor-based models can only efficiently handle sentences of a fixed structure. Unfortunately, in most practical applications, this is not the case.
Some methods try to solve the mentioned problems by sticking to the original vector dimensionality. It can be accomplished using convolution methods [29,30]. For example, the circular convolution, as presented in [31], achieves compression by summing along the transdiagonal elements of the tensor product. The compression is lossy, but the noisy version of the original vector can be recovered using circular correlation [27] and matched to the original one by comparing with all known component vectors. Due to the same mathematical principles as light holography, these models are also referred to as holographic.
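For illustration, a minimal sketch of circular convolution binding and its approximate inverse via circular correlation, on random placeholder vectors, could look as follows:

```python
# A minimal sketch of composing two vectors by circular convolution (as in
# holographic reduced representations); the vectors are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
d = 300
a, b = rng.standard_normal(d), rng.standard_normal(d)

# Circular convolution compresses the d x d tensor product back to d dimensions.
bound = np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

# A noisy version of b can be recovered with circular correlation using a.
recovered = np.real(np.fft.ifft(np.conj(np.fft.fft(a)) * np.fft.fft(bound)))
print(bound.shape, np.corrcoef(recovered, b)[0, 1])  # positive correlation
```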
Some works related to tensor products stem from formal semantics in a way that syntax drives the compositional process. In particular, it was expected that one of the most important parts of the sentence is the pair of a noun and an adjective. Machine learning, in particular regression-based, can be used to find the composition method. The idea is that vectors for AN (adjective–noun) pairs can be learned in the same fashion as for regular words. Then, given the constituents and the final AN vector, a linear mapping is learned. In [32], the mapping is learned as a separate matrix for each adjective, while in [33], a generic “AN-slot” function is trained. These methods do not increase the dimensionality, yet it is not clear how they would work with more words than only the AN pairs. A later work [34] extended the idea of the adjective matrix of [32] to other types (in addition to adjectives and nouns). They emulate formal semantics by representing functions as tensors and arguments as vectors. This way, subjects and objects are rank-1 tensors, while verbs are rank-3. However, as the authors trained only nouns, verbs, subject–verb pairs, and subject–verb–object triplets, their method is still limited to phrases no longer than three words.
Although tensor products share some exotic mathematical properties with quantum mechanics, for example, quantum entanglement [30], the framework relies heavily on formal semantics (see [35]). Therefore, it also shares the same weaknesses and is outperformed by pure machine learning approaches.

2.1.3. Averaging

As neural vectors became more effective than traditional approaches [28], it became common to derive a vector for a sequence of words simply by summing or averaging the individual vectors. In 2009, the authors of [27] reported that for composing a phrase representation, “averaging is the most common form of vector combination”. Despite its simplicity, the basic rule of averaging word vectors was shown to work very well [36,37,38]. Later works, such as [39,40,41], even showed that simple pooling methods, such as basic vector addition or averaging, match or outperform much more sophisticated methods for encoding the meaning of a text sequence. The authors of [42] reported that in out-of-domain scenarios, simple architectures such as word averaging outperform complex Long Short-Term Memory (LSTM) models and are even competitive with systems tuned for particular tasks. A later work [43] evaluated various compositional models and found that word vector averaging performed reasonably well in most supervised benchmarks. According to the authors of [9], even for the currently trending BERT transformer model, the most popular approach to mapping a sentence to a single vector is to average the word embeddings of the BERT output layer.
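For illustration, a minimal sketch of this most common recipe, mean pooling over the BERT output layer with padding tokens masked out, could look as follows (using the Hugging Face transformers library; the example sentences are arbitrary):

```python
# A minimal sketch of averaging the output-layer token embeddings of BERT,
# masking out padding tokens.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["A dog chased the ball.", "The weather is nice today."]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state          # (batch, tokens, 768)

mask = batch["attention_mask"].unsqueeze(-1).float()   # (batch, tokens, 1)
sentence_embeddings = (hidden * mask).sum(1) / mask.sum(1)
print(sentence_embeddings.shape)                        # torch.Size([2, 768])
```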
As averaging works well for sentence representations, some methods build it into their architecture to optimize it further. One such example is the deep averaging network (DAN) [44]. It takes an averaged text sequence as input and passes it through one or more feedforward layers to finally perform linear classification. Another, the C-PHRASE model of [45], is trained to predict the contexts of phrases from the additive combination of their elements. Such a design results in a useful property of C-PHRASE: summing word vectors yields the sequence representation. The authors of [46] found that one should use an average of all LSTM hidden states rather than only the representation from the last one. In a similar fashion, the averaging or summing is optimized in the Siamese CBOW model [47], the Sent2Vec model [48], and other works [49,50,51].
SBERT, a model derived from BERT and specially designed for sentence embeddings, also uses averaging to pool the words of two sequences into a pair of fixed-size sentence embeddings. The pair and its element-wise difference are then passed to a softmax layer for classification and regression tasks. The authors of [52] tried to improve this configuration by replacing the averaging with a convolutional neural network. However, the improvement was limited to better scores for the SALBERT model, which still performed worse than the averaging-based SBERT. Therefore, calculating the mean embedding from multiple word vectors remains a solid building block of current sentence models.
Averaging is used in various scenarios. Some authors employ it to derive static embeddings from contextualized models. In [49], word vectors are learned by predicting them from the average of all contextual embeddings of words (except the target word) returned by the BERT encoder. Other authors of [53] derived static word embeddings from contextualized transformer representations by averaging word embeddings for each word in 100k different contexts. Such derived static versions are of better quality than classic Word2Vec and share their benefits, such as having tens of millions of times lower computational cost than using standard contextual embedding models [49]. The other scenario is to compose a new more expressive meta-embedding from different models and domains. As a means of composition, the authors of [54] proposed concatenation with some trainable methods. However, a later work [55] found that averaging word vectors and padding them with zeros to compensate for dimensionality mismatches is a surprisingly effective method.
Despite its popularity, averaging has inherent deficiencies. The most prominent is the loss of word order information. For this reason, it is sometimes called a bag-of-words representation (that is, imagine that the words lie in a bag in no specific order). As shown in [44], models based on averaging are weaker in tasks that require syntax information. The mean pooled static representations of the words “dog chased human” and “human chased dog” will be the same. In addition, simple summation can cause destructive interference, affecting valuable information. This is especially relevant in long documents. In addition to the neglected sentence structure and word order, the authors of [43] also note that individual word identities are lost and noninformative words are represented more prominently than essential ones.
However, to some extent, drawbacks can still be overcome. Most tasks do not usually rely on word order, and the number of occurrences of order-sensitive elements, such as double negations, is generally low. Word vectors have a rich representation and some related knowledge can still be found in them. For example, the authors of [56] showed that the average of the word vectors retains information about the original sequence length. It is even more the case with contextualized transformer representations, where starting from the second layer, each hidden vector is expected to “know” every other token vector and, therefore, to be aware of the word order.

2.1.4. Weighted Average

Not all constituent words are equally important; therefore, the corresponding vectors should be weighted. Just as, in Information Theory, a surprising event has higher information content than an expected one [57], some specific words should also contribute more weight to the sentence vector than others.
In an early work [27], we can already find different weights for the members of adjective–noun, noun–noun, or verb–object pairs. Later work moved from formal semantics to statistics-based measures, in particular tf-idf scores. These can be used as features on their own, as in [58], but it is better to use them to weight neural word vectors. For example, the authors of [59] proposed the Composite Document Vector as a concatenation of idf-weighted Word2Vec vectors and tf-idf features. The authors of [60] proposed another, better weighting approach, called Smooth Inverse Frequency (SIF), in which higher-frequency words are down-weighted smoothly. The authors of [43] evaluated various compositional models and found that weighted averaging, in particular SIF, performed best in unsupervised similarity tasks, outperforming all other models. The author of [61] presented the uSIF method, an improvement over SIF that omits hyperparameter tuning and constructs the weighting from both word frequency and word vector length information. The authors of [62] also showed that the mean pooled output of transformers using tf-idf weights is better for clustering than the plain averaged output.
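For illustration, a minimal sketch of SIF-style weighted averaging, with toy word vectors and corpus counts standing in for real data (the full SIF method additionally removes the first principal component), could look as follows:

```python
# A sketch of SIF-style weighting: each word vector is scaled by a / (a + p(w)),
# where p(w) is the word's relative corpus frequency; vectors and counts are toy data.
import numpy as np

word_vectors = {"the": np.random.randn(300), "cat": np.random.randn(300),
                "sat": np.random.randn(300)}
counts = {"the": 1_000_000, "cat": 5_000, "sat": 3_000}
total = sum(counts.values())
a = 1e-3  # smoothing parameter

def sif_embedding(tokens):
    weights = [a / (a + counts[t] / total) for t in tokens]
    vectors = [word_vectors[t] for t in tokens]
    return np.average(vectors, axis=0, weights=weights)

print(sif_embedding(["the", "cat", "sat"]).shape)  # (300,)
```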
A drastic case of weighting, the total removal of some words, has also been shown to improve results. The authors of an older work [37] removed the stop words from the sequence before averaging. Meanwhile, as shown in [15,63], the current performance of the BERT transformer model is also significantly improved on Semantic Textual Similarity (STS) tasks if the most frequent tokens are removed. In other cases, the task itself focuses only on a few words, and the others become redundant. The authors of [64] found that for the relation extraction task, some sentence embedding methods work better with shorter spans of words than with the original sentences. They performed sentence segmentation so that a (sub)sentence would cover the identified entity mentions. They also noted that SentenceBERT and Quickthought embeddings of spans or short segments containing two entity mentions are more clusterable than those of the original sentences. These works highlight the unequal contribution of each word to the final sentence representation vector.
Some works use more complex word weighting schemes. For example, the authors of [46] proposed a form of attention to weight each hidden state of an LSTM. The authors of [65] proposed the idea that each word brings a novel orthogonal basis to the sentence. Therefore, the length of the projection in this direction can be converted into the word’s weight for use in the averaging. They compute the final weight from three scores: novelty, significance, and corpus-wise uniqueness. The authors of [64] found that, for the methods analyzed, such weighting of GloVe embeddings made them the most clusterable. Similar work by [66] incorporated alignment and novelty scores and applied them to the contextualized representations of SBERT. Here, the alignment measures how well a word aligns with its neighbors, i.e., a well-aligned word is less informative and should be weighted less. The novelty, similarly to [65], is expressed as the magnitude of the orthogonal component of the word with respect to the subspace of neighboring words. Furthermore, the authors of [66] perform the weighted averaging through all SBERT layers and also weight each layer-aggregated vector via its l1-normalized variance. This means that words that evolve faster across layers receive higher weights, since they have greater variance. Although complex to implement, such methods are reported to provide some minor performance improvements.
Surprisingly, the norm of word vectors has a large dispersion, as observed by the authors of [67,68], and during the average pooling operation it acts as a weight of the word vector. The authors of [48] observed that word vectors trained by their method down-weight frequent tokens on their own, and that this weighting follows the hypothesis of Luhn [69], a well-known information retrieval paradigm stating that mid-rank terms are the most significant for discriminating content. Using these insights, the authors of [70] used the norm of word vectors as a proxy for the importance of words.

2.1.5. Clustering

As shown by [71], averaging word vectors leads to a loss of information. This is particularly significant for longer text sequences, such as documents. Therefore, the idea was developed to aggregate vectors in a smoother and more information-preserving manner. The authors of [72] clustered the word vectors into k groups using k-means. Then, for a document, each cluster vector is obtained as the sum of its constituent vectors. The final representation is the concatenation of the cluster vectors and the inverse cluster frequency (icf) values, which are calculated using the idf values of the words present in the document. The authors of [73] proposed a similar method but viewed it as one that reduces the dimensionality from words to concepts. Concepts (as clusters) are created by clustering word vectors generated by Word2Vec, and the frequencies of these concept clusters are used to represent document vectors. Finally, similarly to tf-idf applied to bag-of-words, a weighting scheme, concept frequency-inverse document frequency (cf-idf), is applied to the acquired bag-of-concepts. The authors of [74] proposed the concepts of word containers and document containers to explain how such procedures bring benefits. Similarly to other works, they cluster words into distinct clusters. Vectors belonging to the same cluster are averaged, while the resulting representations of different clusters are concatenated. Using standard contextualized pre-trained transformer models with frozen weights, the authors of [75], instead of simple mean pooling, proposed training a categorical variational autoencoder and reported performance improvements on some STS tasks. In this way, similar to clustering, the lower intermediate compressed representation carries the essential representation of the sequence. Overall, clustering independent word vectors allows the document representation to be expressed in a smaller space of concepts, concepts that would otherwise be erased due to destructive interference during global averaging.
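For illustration, a simplified sketch of this cluster-then-average-then-concatenate idea (not the exact procedure of [72], and with random vectors standing in for real word embeddings) could look as follows:

```python
# A sketch of cluster-based composition: word vectors are grouped with k-means,
# averaged within each cluster, and the per-cluster averages are concatenated.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
vocab_vectors = rng.standard_normal((1000, 300))      # toy vocabulary embeddings
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(vocab_vectors)

def document_embedding(word_vectors):
    labels = kmeans.predict(word_vectors)
    parts = []
    for k in range(kmeans.n_clusters):
        members = word_vectors[labels == k]
        parts.append(members.mean(axis=0) if len(members) else np.zeros(300))
    return np.concatenate(parts)                       # (5 * 300,)

doc = rng.standard_normal((40, 300))                   # 40 "words" of a document
print(document_embedding(doc).shape)                   # (1500,)
```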
The authors of [76] improved the aforementioned method of [72] by using soft clustering and allowing a single word vector to participate in multiple clusters. More specifically, each word is represented as a K × d dimensional embedding, where the k-th row corresponds to the original word vector weighted by its probability of belonging to the k-th cluster. The embedding of each word is weighted with the inverse document frequency of the word and summed with the embeddings of the other words to form a document vector. As many values in such a vector were observed to be close to zero, sparseness is induced with a given minimal value threshold. The authors of [76] named their method the Sparse Composite Document Vector (SCDV).
The success of SCDV was followed by numerous other contributions. The authors of [77], instead of Gaussian Mixture Models (GMMs), used K-SVD (an algorithm for designing overcomplete dictionaries for sparse representation) [78] for the topic modeling. Furthermore, they used a newer word vector algorithm, Doc2VecC, together with the modern SIF weighting and the removal of the top principal component of [60]. The authors of [79], with their SCDV-MS method, additionally performed word sense disambiguation to dissect polysemous words into distinct vectors. In this way, the quality of the clusters improved. To induce sparsity, the authors also applied hard thresholding at an earlier stage, in the word cluster assignments, and trained word vectors with Doc2VecC. Next, with the emergence of transformer models, the authors of [80] utilized contextualized representations in their SCDV + BERT (ctxd) method. First, the corpus is contextualized following the technique of [81]. That is, each corpus word is clustered among its individual occurrence contexts into distinct clusters corresponding to the different meanings of the same word. After this procedure is repeated for all words, the vector of each word is the centroid vector of the closest cluster. The authors of [80] then repeated the original SCDV procedure, and the result was an average improvement on STS tasks. Therefore, the idea of turning global averaging into local inside-cluster averaging did not change; only newer representation techniques and models were employed.

2.1.6. Spectral Methods

A text sequence comprising word embeddings can be interpreted as a multidimensional signal over time. Therefore, temporal summarization techniques can be employed.
One such method is the Discrete Cosine Transform (DCT), originally presented in [82]. This invertible function maps an input sequence of N real numbers to the coefficients of N orthogonal cosine basis functions. The DCT components are arranged in order of significance, with the first one being proportional to the simple average of the sequence. DCT is used in data compression by preserving only the most important coefficients.
In [83,84], DCT was shown to outperform simple averaging of word embeddings. First, this can be attributed to the fact that DCT is structure-sensitive, as it captures signal dynamics. Second, the DCT has multiple coefficients, with the first one already corresponding to averaging; thus, it should capture more information and improve the performance of probing classifiers. The authors of [83] apply the DCT along the sequence of words for each embedding dimension. Then, they retain the lower-order coefficients and concatenate them to obtain overall feature patterns in the word sequence. One drawback of this technique is that short sentences must be padded with zeroed vectors. The following work [84] also showed good results for DCT in multilingual and cross-lingual settings.
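A minimal sketch of this DCT recipe, with random vectors standing in for token embeddings and the number of retained coefficients chosen arbitrarily, could look as follows:

```python
# A sketch of DCT-based composition: the DCT is applied along the token axis
# for each embedding dimension, and the first K coefficients are concatenated.
import numpy as np
from scipy.fft import dct

def dct_embedding(token_vectors, k=4):
    # token_vectors: (num_tokens, dim); pad short sentences with zero vectors.
    if token_vectors.shape[0] < k:
        pad = np.zeros((k - token_vectors.shape[0], token_vectors.shape[1]))
        token_vectors = np.vstack([token_vectors, pad])
    coeffs = dct(token_vectors, norm="ortho", axis=0)  # (num_tokens, dim)
    return coeffs[:k].ravel()                          # (k * dim,)

tokens = np.random.randn(12, 768)                      # e.g., BERT token outputs
print(dct_embedding(tokens).shape)                     # (3072,)
```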
Besides the DCT, other spectral decompositions have been used: the authors of [85] proposed EigenSent, which utilizes Higher-Order Dynamic Mode Decomposition (HODMD). The method summarizes transitions in a sequence of words into one representative sequence embedding. The authors found that the best performance is achieved when such an embedding is concatenated with a simple word vector average. This way, information on both the dynamics and the scale of the sequence is captured.

2.1.7. Using Special Tokens

Modern transformer models have several special tokens that can play an important role during representation aggregation. For example, BERT has [CLS], designed for the next sentence prediction task, which is supposed to carry the meaning of the input text sequence. The authors of [86] tried to use this property and trained the model to learn to aggregate everything into this [CLS] token. Other special tokens are [SEP], used to separate two text inputs, [MASK], used to concentrate the representation of a masked word, padding, and various sentinel tokens (as in T5). Despite their intended purpose, special tokens are generally dismissed, as in [14,87,88], and a simple average is used instead. However, such tokens are important: the authors of [89], after fine-tuning BERT, found a clear tendency for earlier layers to pay attention to [CLS] and for later layers to pay attention to [SEP].
Recently, a prompt-based learning paradigm has emerged (see the survey [90]). Here, instead of fine-tuning the model to the downstream task, one reformulates it to look like the one that was being solved during the model pre-training process. Therefore, the main effort goes to prompt engineering, i.e., finding the most appropriate prompt for the given task. Such a strategy is especially suitable for very large, even colossal language models that are difficult to train.
The authors of [15] successfully used prompt engineering to better capture sentence embeddings with the BERT model. During the manual prompt search, the best prompt found was “This sentence: “[X]” means [MASK].”, where [X] is replaced by the input sentence, and the output vector of the token [MASK] is used as the final representation of the input sequence. They then further optimized the template by fine-tuning it on Natural Language Inference (NLI) data with a contrastive objective while the BERT model weights were frozen. The authors used the continuous template technique of [91], where each template token is treated as a vector and optimized by gradient descent. The final template was shown to outperform multiple untrained baselines. The authors also showed that different templates can be used effectively to represent the same sentence from different points of view during supervised contrastive learning. The following work by [92] adopted the idea of using a prompt instead of mean pooling and also used this representation extraction during contrastive learning.
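A minimal sketch of such prompt-based extraction (without the trained continuous template), assuming the Hugging Face transformers library and the bert-base-uncased checkpoint, could look as follows:

```python
# A sketch of prompt-based extraction: the sentence is inserted into a template
# and the last-layer hidden state of the [MASK] token is taken as the sentence
# embedding (no fine-tuning involved).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def prompt_embedding(sentence):
    text = f'This sentence: "{sentence}" means [MASK].'
    batch = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state      # (1, tokens, 768)
    mask_pos = (batch["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0]
    return hidden[0, mask_pos.item()]                  # (768,)

print(prompt_embedding("A dog chased the ball.").shape)  # torch.Size([768])
```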

2.1.8. Aggregating through Layers

One special aggregation that is possible with newer contextualized models is through deep neural network layers. For simplicity, we refer to transformer model blocks as layers.
One can imagine that “each layer will increasingly magnify small but significant differences” [44]. Therefore, the early layers capture more fundamental and low-level information [89], which dominates the learning of shallow lexical and meaning-related knowledge [93]. The authors of [75] reported that the word embeddings of the lower layers of BERT perform better than their upper layers on a word analogy task, and other authors of [53] showed that the first quarter of the models’ layers perform best in lexical semantic understanding. In the middle layers of the transformers, the hidden states are the most transferable [94] and contain the most relevant information [95]. The authors of [96,97] report that the middle layers of the multilingual transformers are more multilingually aligned. According to the authors of [93], with the inclusion of context in the upper layers, the encoded concepts evolve into a linguistic hierarchy where the middle and upper layers have a better representation of the core linguistic and semantic concepts. Finally, as the author of [98] explains, upper-layer representations become more context-specific.
A general tendency, as shown by [99], is that the STS performance increases until the middle layers before decreasing toward the final ones. The authors of [99] even report that the final layer of most transformer models produces the worst-performing representations. Most scholars agree that this is due to the overspecialization of the last layers to the pre-training task [100]. Therefore, the final layers of BERT are the most task-specific [101] (but not as much as LSTM, as analyzed in [94]). Even during the fine-tuning phase, the authors of [89] determined that the last two layers encoded the highest share of task-specific features attributed to the score gain. Therefore, the last layers are specific to the training objective and cause problems if it differs a lot from the downstream task objective.
However, some works [102,103] found it beneficial to apply post-processing to the final layers. This enabled them to restore the apparent drop in representational quality in later layers and gave large performance improvements. It is thought that anisotropy, known as the narrow cone of embeddings in vector space, which in [98] was found to increase from the earlier to the later layers, may be to blame. However, successful post-processing reveals that the last layers also possess relevant knowledge; it is simply obscured in the raw outputs.
The question is how to compose a universal representation from the relevant pieces in multiple layers. One way is to try various combinations. The authors of [14] analyzed all possible two-layer combinations for BERT. They found that combining the top and bottom layers is better than using only the top layers. They tried to expand it to more layers but remained on a combination of three layers (L1 + L2 + L12). From the older works, we can mention the AdaSent model [104]. It operates on a hierarchy of representations derived from a pyramid-like multilayer network structure. The gating network is trained to adaptively select the most appropriate representation in the hierarchy for the given task. A similar gating system for dynamically deciding which intermediate transformer layers to use was proposed by the authors of [105].
The contribution of each layer can also be accounted for by a weighted average. The authors of ELMo [106], a contextualized model based on LSTM, present this configuration with additional weight parameters for each layer. Other authors investigating representations [94,95,107] call this weighting a scalar mix technique. The authors of [66], instead of learning layer weights for each word, compute them using complex alignment and novelty measures. For a few-layer model like ELMo, the authors of [108] found it useful to concatenate representations across layers while averaging across tokens, rather than using only the top layer. The authors of [86] extract relevant information from intermediate hidden representations of BERT by treating them as positive samples during contrastive learning, i.e., samples to which the final sentence embedding of the tuned model should be close.
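A minimal sketch of the scalar mix idea, i.e., a learnable softmax-normalized weight per layer (the module and variable names here are ours, not from the cited works), could look as follows:

```python
# A minimal scalar-mix sketch: one learnable softmax-normalized weight per layer
# combines all hidden states into a single representation (the weights could
# also be fixed, e.g., to average only the first and last layers).
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    def __init__(self, num_layers):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, layer_states):
        # layer_states: sequence of (batch, tokens, dim) tensors, one per layer
        w = torch.softmax(self.weights, dim=0)
        return sum(wi * h for wi, h in zip(w, layer_states))

# Usage with a Hugging Face model called with output_hidden_states=True:
# hidden_states = model(**batch, output_hidden_states=True).hidden_states
# mixed = ScalarMix(len(hidden_states))(hidden_states)
```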
In conclusion, there is no generally accepted method for taking advantage of representations across all layers in an unsupervised setting. In the supervised case, by contrast, the whole network is fine-tuned, and the whole hierarchy of representations is taught to surface the features relevant to the task at hand.

2.1.9. Other Means of Composition

Many other composition functions can be used to derive sentence embeddings. In an early work [27] on phrase composition, the authors also found that multiplicative and dilation models perform well. However, there are caveats; element-wise multiplication of sparse representations results in a loss of information (this effect is especially strong for more than two words), while dilation, just like the weighted average, requires additional parameters. It should also be mentioned that max pooling is quite a popular operation for aggregating the output of an LSTM encoder, as shown in [12,109], and also by the authors of [86], who found it to work best over all BERT layers for the STS benchmark task. According to the authors of [110], max pooling is also competitive with boundary-based methods (such as those that use embeddings of words at the boundaries) for representing a text span.
The input text sequence can be combined hierarchically. The authors of [111] used a recursive autoencoder to contract two child words to a single parent one repeatedly, to arrive at a final single common parent representation of the whole sequence. Later, the authors of [112] extended this combination to specifically follow the structure of the parse tree. Such a recursive model allowed them to compose word vectors in a bottom-up approach up to a sentence level and outperform the simple averaging method. However, a drawback is that such a composition is limited to only sentences. Other authors of [41] used similar ideas for combinations, but did not learn any composition function. Instead, given a specific window length, the vectors of words in these windows are averaged, and the resulting embeddings are then max-pooled. Such a setup better preserves the spatial information.
The common average operation can be generalized as a power mean, as proposed for information retrieval in [113]. The authors of [114] introduced this idea for combining word vectors. They showed that using the document vector as a concatenation of power means with different power values, particularly p = ±∞ (maximum and minimum), p = 1 (the regular arithmetic mean), and other p > 0, substantially improves the representation of the document. The authors of [115] also proposed a long vector representation. They employ self-attention to replace the max (or average) pooling of either RNN hidden states or convolved n-grams, resulting in a sentence embedding of matrix form. However, as the authors of [116] point out, concatenating representations is problematic. According to them, bigger embeddings will always increase the performance of the linear model on top of them; therefore, embeddings of the same size should always be compared. In light of this, concatenation is an undesirable operation.
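A minimal sketch of power-mean concatenation, here restricted to p = -inf, 1, +inf (i.e., the per-dimension minimum, arithmetic mean, and maximum), could look as follows:

```python
# A sketch of power-mean composition: per-dimension power means for several
# p values (here p = -inf, 1, +inf) are concatenated into one long vector.
import numpy as np

def power_mean_embedding(token_vectors):
    # token_vectors: (num_tokens, dim)
    return np.concatenate([
        token_vectors.min(axis=0),    # p = -inf
        token_vectors.mean(axis=0),   # p = 1
        token_vectors.max(axis=0),    # p = +inf
    ])

tokens = np.random.randn(10, 300)
print(power_mean_embedding(tokens).shape)  # (900,)
```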

2.2. Reshaping Representation Spaces

Word and sentence vectors obtained by the older Word2Vec [1] and GloVe [17] techniques, as well as by current state-of-the-art transformer techniques, are not ideal. Therefore, additional processing techniques are applied to eliminate side effects and improve the performance of downstream tasks. In this subsection, we describe anisotropy, a phenomenon believed to be responsible for these deficiencies, the reasons for its appearance, and the techniques that attempt to improve the representation space.

2.2.1. Isotropy

Isotropy is defined as uniformity in all orientations. In this work, we talk about embedding spaces of words and sentences. According to the authors of [117], “distribution is isotropic when the variance is uniformly distributed across all dimensions”.
The anisotropic properties of independent word embeddings were first observed by the authors of [118]. They analyzed common embedding models and found that, in all cases, word vectors share a nonzero common vector; therefore, they are not zero-centered. The authors also found that the variance ratios of the first few components derived by PCA decay nearly exponentially. Furthermore, the variances explained by the leading components “encode the frequency of the word to a significant degree”. Therefore, they proposed eliminating the common mean vector and then removing the top principal components, computed on representations from the entire vocabulary. A similar procedure was also shown in an earlier work [60] to be a very good baseline. There, words were combined by a frequency-weighted average, and then just the first principal component (dataset-specific, as computed on the entire dataset) was removed. As shown by the All-But-The-Top (ABTT) method of [118], such a simple procedure, eliminating the common mean vector and a few top dominating directions from the word vectors, greatly improves both the isotropy and the downstream task performance (for sentence tasks, a simple average of the preprocessed word vectors is used). As noted in [119], the isotropy of the target embedding is known to be critical for the alignment of embeddings, which is important for areas such as domain adaptation, word embedding evaluation, and machine translation. Nevertheless, the work [118] became the main stimulus for the research direction that attempts to improve the isotropy of word embeddings.
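A minimal sketch of the ABTT post-processing, assuming scikit-learn for the PCA step and an arbitrary number of removed directions, could look as follows:

```python
# A sketch of All-But-The-Top post-processing: subtract the common mean vector,
# then remove projections onto the top D principal components (computed on the
# whole set of embeddings).
import numpy as np
from sklearn.decomposition import PCA

def all_but_the_top(embeddings, d_top=3):
    # embeddings: (n, dim)
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    components = PCA(n_components=d_top).fit(centered).components_  # (d_top, dim)
    return centered - centered @ components.T @ components

vectors = np.random.randn(5000, 300)
print(all_but_the_top(vectors).shape)  # (5000, 300)
```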
Word representation anisotropy is also observed in the latest contextualized models. The authors of [120] first observed this effect on the transformer machine translation model [2], while the author of [98] found that this is the case for all ELMo [106], BERT [5], and GPT-2 [121] layers. All representations for randomly sampled words from these contextualized anisotropic spaces were observed to be highly similar, far from zero average cosine similarity. Furthermore, the embeddings of any two words are positively correlated. Finally, it was observed that all word vectors in the representation space tend to occupy a narrow cone.
The research community offered some explanations for why anisotropy may occur. We will cover them in more detail below.

Word Frequencies

The authors of [122] were among the first to observe that word embeddings encode a surprising degree of information about word frequencies. In [60], it was shown that simply representing sentences as a weighted average of words by their frequencies is a very competitive method. In particular, the authors computed the weighted average of the word vectors in the sentence (the weight of a word w is a/(a + p(w)), with a being a parameter and p(w) the frequency) and then removed the projections of the average vectors on their first singular vector (“common component removal”). Later, the authors of [118] suggested removing more than one component. They found that the top PCA directions encode word frequency information; therefore, the method was proposed to null these directions. As this bias is very strong in representation space, it is inevitably related to the anisotropy phenomena.
Similar encoding of frequency information is also evident in contextualized embeddings of modern transformer models. The authors of [123] investigated BERT embeddings and found that high-frequency words are all close to the origin and densely concentrated, while low-frequency words are far away and scattered sparsely. The work in [124] highlighted that anisotropy is the most pronounced in rare words. The authors of [125] also showed that the same applies to multilingual BERT embeddings; they have a biased structure towards word frequency.
This bias toward word frequency reveals a weakness. The authors of [126] constructed a dataset using semantic relations extracted from WordNet [127] to test the semantic properties of words. Using it, they showed that the ability of BERT to understand words depends highly on their frequency, and therefore rare words are neglected. It was shown in [63] that removing the 34 most frequent tokens before averaging can greatly boost the performance on semantic textual similarity tasks. The authors of [124] conducted a simple experiment on the CNN News corpus with the popular WordPiece tokenization and found that 30% of the corpus can be accounted for by the 13 most frequent tokens, while to cover at least 98% of the corpus, nearly 15,000 tokens are needed. Therefore, rare words are a constituent part of modern NLP pipelines and, unfortunately, induce deficiencies.
The authors of [120] proposed an explanation for generation models that related observations of the narrow cone form in the representation space and the frequencies of words. They argue that the main culprit is the way models are trained on the language modeling task. During the training process, the ground-truth word embedding will be pushed toward the direction of the hidden state to obtain a larger likelihood. Meanwhile, the embeddings of most words in the vocabulary (with non-appearing and rarely appearing frequencies) will be pushed towards similar directions negatively correlated with most hidden states and thus end up grouped together in the local region of the embedding space. The following work [124] complemented the given explanation with the “common enemies effect”—the effect of the target words producing gradients of the same direction for all of the non-target words at each step of training with cross-entropy loss, and rare tokens are the most affected by it. The authors also found that the embeddings learned by GPT-2, BERT, and RoBERTa do not degenerate into a narrow cone (they only appear as a cone when projected to a lower-dimensional space), but instead drift in one shared direction. Therefore, the “common enemies effect” is argued to be the main culprit behind the anisotropy of the representation space.

Outliers

There is evidence from multiple authors that contextualized representations of transformer models (in particular, the BERT family) contain undesirable outliers (interestingly, abnormal dimensions were also previously observed for GloVe vectors in [128]). These are certain positions in word vectors with unusually high values. The authors of [129] called them outlier dimensions; in [130], they are referred to as outlier neurons, while the authors of [102] call them rogue dimensions. All of these works agree that they are a significant contributor to the anisotropy of the representation space.
There is debate about what causes outliers. The authors of [129] found that outliers are essential for good downstream performance. They regard it as a distinct model property that emerges during training and even encourage one to consider it during weight initialization. In contrast, in [102], it was found that the behavior of the model is not driven by outliers; rather, only a small subset of linguistic abilities is handled there (such as positional information; its relationship with outliers was revealed in [130]). Even more strangely, the authors of [125] found that the embedding space of the multilingual BERT model does not have outliers as the English BERT does, but is still anisotropic.
There are several options to account for outliers. The most obvious solution is to remove them. The authors of [130] showed that this brings improvements for tasks that directly use the geometry of the embeddings, such as semantic textual similarity; however, ref. [129] showed that disabled outliers significantly degrade both Masked Language Modeling (MLM) loss and downstream task performance. The authors of [102] suggest that rogue dimensions be accounted for through the standardization transformation. Both options aim to reduce the dominating effect of the highest-value dimensions during the similarity measure, such as cosine similarity or Pearson correlation. On the other hand, one can argue that the similarity measure we use, not the representation space, needs improvements.

Criticism

Many works try to improve the isotropy of representation spaces, but is that truly the right way to seek performance improvements? Some recent publications, such as [124], expressed doubts about the role of isotropy in model performance, and the authors of [131] even observed a negative relation between isotropy calibration and downstream performance. The works in [117,124] ask whether there is truly a “narrow cone” in the embedding space. Other authors of [15] argue that the main problems are various biases unrelated to isotropy. Furthermore, the authors of [117] criticized multiple popular measures of isotropy and warned that they are misleading and inaccurate.
The authors of [132] analyzed the contextual embedding space and found isolated clusters and low-dimensional manifolds. They concluded that although the embeddings are globally anisotropic, local isotropy exists. The authors of [133] continued this line of observation and applied the known preprocessing techniques of [118] not globally but locally, in clustered regions of the representation space. They observed that sense-level information is shadowed by structural, syntactic, and tense biases (for verbs), which their method of dominant direction removal helps to reduce. Another work [134] showed that during the fine-tuning phase the anisotropy becomes even worse, while, in contrast, the STS performance increases. They found that removing the top dominant directions from fine-tuned representations becomes detrimental, as the most essential information resides there. These findings indicate that the representation space of transformer models is quite complicated and that a simple fixation on isotropy can destroy its native features. These features include biases whose usefulness may not be obvious to us but which the models found helpful during pre-training. In the end, brute-forcing plain isotropy may harm the intricate semantic details that we seek to capture.

2.2.2. Post-Processing Methods

Many post-processing methods have been proposed to improve the embeddings. Whether they improve the isotropy of the representation space or help to surface relevant information [135], multiple authors show the benefits of using them. However, these methods generally favor unsupervised tasks, such as semantic textual similarity, with limited improvements for downstream tasks. If enough data on the target task are available, supervised training will inevitably be a better choice, as the transformation relevant to the task is learned. However, post-processing may be the only option if training samples are scarce.
The simplest is the centering operation. It is especially important for the STS task that involves cosine similarities. The authors of [124] found that it restores a nearly perfectly isotropic distribution. Subtracting the mean is also the first step in several other post-processing methods. In some sense, such operations as centering can be viewed as fine-tuning to the target domain, as all target embeddings participate in providing the mean coordinates.
We already mentioned some operations related to isotropy enhancement in the previous section, such as zeroing target outliers or ABTT. In the following, we will mention the remaining important ones.

Z-Score Normalization

The Z-score describes the position of a point in terms of its distance from the mean, measured in units of standard deviation (in the distribution of the target samples). Z-score normalization, also called standardization, transforms the distribution of embeddings to have a zero mean and a unit standard deviation in all dimensions. The works in [102,103] recommend considering it as a post-processing step. The authors of [116] advise using normalization, such as the z-score, in supervised settings as a binary hyperparameter, because it can lead to rank changes.
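A minimal sketch of this standardization step over a set of embeddings could look as follows:

```python
# A minimal sketch of z-score normalization (standardization): the set of
# embeddings gets zero mean and unit standard deviation per dimension.
import numpy as np

def standardize(embeddings, eps=1e-8):
    mean = embeddings.mean(axis=0, keepdims=True)
    std = embeddings.std(axis=0, keepdims=True)
    return (embeddings - mean) / (std + eps)

print(standardize(np.random.randn(1000, 768)).std())  # approximately 1.0
```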

All-But-The-Top (ABTT)

Many works are based on the influential ABTT algorithm [118]. It has a hyperparameter that specifies the number of dominant directions to remove. Principal components are either completely removed or left intact. As a result, the main weakness is finding the right number of top components to remove, which can either cause information loss or eliminate an insufficient amount of noise. The authors of [136] proposed post-processing through variance normalization (PVN) to normalize the variance of the leading principal components to the same level instead of disposing of the components entirely. The authors of [137] performed the removal in a softer way by employing matrix conceptors [138] in an unsupervised manner. A similar technique that employs conceptors can also target specific components, such as those that incorporate social biases, and diminish them, as in [139]. The authors of [140] proposed learning weights for each dominant direction removal. All these approaches try to preserve the useful information residing in the top principal components while reducing the targeted noise.
The authors of [141] combined ABTT with dimensionality reduction. They found that the best pipeline is to perform ABTT twice, with dimensionality reduction in between. In another line of work, the authors of [125,133], instead of global post-processing, remove local dominant directions in separate clusters of the representation space. In this way, the structure of the representation space is accounted for. Similarly, the authors of [65], building on the predecessor of ABTT [60], proposed eliminating the sentence-dependent principal component, whereby the top principal vectors are re-ranked based on their correlation with each sentence. This individual removal of dominant directions was shown to improve performance on the STSB task.

Whitening

The authors of [13] proposed the whitening approach to alleviate the anisotropy problem of sentence embedding. The whitening operation involves centering embeddings at the origin and making different dimensions have a unit variance and be uncorrelated, turning their covariance matrix into the identity matrix. It was also found to be useful in earlier work for the alignment of bilingual embeddings [142]. As a side effect, the authors of [13] showed that whitening can also be used with a dimension reduction operation. A concurrent work [14] also used a whitening algorithm to improve performance on STS tasks. In that work, the authors combined the first and last layers of BERT embeddings and then normalized them with whitening. Despite the initial success, subsequent work criticized the whitening operation. The authors of [143] showed that whitened representations greatly improve uniformity, but also suffer degeneration in the alignment property (for alignment and uniformity properties, see [144]). To make matters worse, the authors of [145] called whitening a “trick” that only helps with similarity tasks (by partially overfitting) and harms downstream task performance.

2.2.3. Retrofitting

This word vector space specialization method can incorporate semantic knowledge from external resources into word embeddings. The authors of [146] were the first to use the term “retrofitting” for their post-processing step of vector space word representations. Their idea was to incorporate rich relational information from semantic lexicons into word vectors that are trained in a data-driven fashion on plain texts. The authors showed that the new vectors remain similar to their purely distributional counterparts while moving closer to related word types. In addition, they showed improvements in several benchmark tasks.
Retrofitting can be useful for fitting a desired domain that is different from the original one used to pre-train the word vectors. An example is [147] where the authors retrofitted to linkage information in biomedical taxonomy. Another related field is the alignment of cross-lingual word embeddings. The authors of [148] used retrofitting to align the vectors of the source and target languages in the dictionaries. This way, translation pairs are pulled closer while minimizing deviation from the original embeddings and preserving the existing representation. In all these cases, the retrofitting is based on external relational data.
According to the authors of [149], the weakness of retrofitting is that only the vectors of words present in the external constraints (resources) are modified. Therefore, they proposed an explicit retrofitting model in which external knowledge relations are turned into supervised training examples. Here, the distance between synonyms is supposed to be as low as possible, and between antonyms as high as possible, while between remaining words not present in the external knowledge base, it should remain the same as in the original representation space. This way, on top of word vectors, a deep feedforward neural network is trained to retrofit all the word embeddings.
Some work also found success in applying retrofitting to contextualized representations. The authors of [108] observed that for ELMo [106] representations, the distance between representations of a word shared by two paraphrases can be even greater than the distance between “large” and “small” in random contexts. Therefore, to improve the representation capabilities, they minimized the difference between contextualized representations of the shared word in paraphrased contexts while differentiating it from those in other contexts. This retrofitting resulted in improved performance of downstream applications. Other authors of [150] proposed a two-step process: first train static vectors from contextualized ones, and then perform retrofitting. They showed that, compared to baselines, such an approach gives the best results in a range of intrinsic and extrinsic tasks.

2.2.4. Other Methods

Some methods are proposed to reduce anisotropy during the training process. In particular, the authors of [120] added a specific regularization term. According to them, the aperture of the narrow cone of representations can be widened by minimizing the cosine similarities between any two word embeddings. Such regularization encourages the vectors to be more evenly spread and expressive. This method is highly related to the now-popular contrastive learning approach (see Section 2.3.3). Other authors of [151] improve the isotropy of the output embedding matrix using a spectrum control method. They guide the singular value distribution of the embedding matrix throughout the training process and control the decay rate of these singular values.
The authors of [123] suggest learning a transformation of the embedding space of the transformer, sometimes called a flow technique. Their method transforms the BERT sentence embedding distribution into a smooth and isotropic Gaussian distribution. During unsupervised training, only the flow network is optimized, while the BERT parameters remain unchanged. Although BERT-flow showed performance improvements on multiple tasks, the technique is criticized in the literature. First, it needs a specialized implementation and has multiple additional parameters. The authors of [131] report that BERT-flow requires on average 4.2 times more time per training epoch. Furthermore, for the STS benchmark task, the authors of [145] report that BERT-flow with the $\ell_2$ similarity metric performs even worse than the BERT-average baseline.
Different post-processing can help surface different information residing in the embeddings. The authors of [135] proposed a simple unsupervised singular value decomposition to reassign the feature weights. By changing the similarity order of their transformation, they tailored word embeddings along the semantics/syntax (tasks focusing on sing–chant or sing–singing) and similarity/relatedness (tasks solving car–automobile or car–road relationships) axes. Other authors of [152] used spectral filters to dissect BERT representations at different temporal scales. They showed that a low-pass filter yields the highest probing accuracy in topic classification, a high-pass filter in part-of-speech tagging, and a middle-pass filter in dialog speech act tasks.

2.2.5. Similarity Measures

Some of the most basic tasks for sentences require measuring the distances between the corresponding embeddings. Usually, cosine similarity is employed. However, if the continuous representation space is curved or a text sequence is treated as a set of words, different metrics or representation space manipulations can be beneficial. The authors of [153] proposed Mutual Information (MI), well known in information theory and statistics, as a candidate similarity measure. They managed to successfully estimate MI for continuous random variables using the Kraskov–Stögbauer–Grassberger (KSG) estimator [154], which is based on elementary nearest-neighbor statistics. The authors of [155] advised comparing sentence embeddings composed of averaged word vectors using rank correlation, such as Kendall’s τ. They argued that such measures enable mean-pooled representations to rival modern deep ones used with cosine similarity. The authors of [156] proposed treating sentences as fuzzy sets of words and showed good performance with the specially adapted DynaMax similarity measure. They showed that word vectors alone are sufficient to achieve excellent performance on semantic textual similarity tasks when sentence embeddings and similarity measures are derived using ideas from fuzzy set theory. Other authors of [157] suggest processing each sentence with respect to its k-nearest neighbors. They then find an optimal Euclidean subspace of the sentence manifold where cosine similarity works best. The authors of [158] also project the sentences onto a fixed-dimensional manifold with the objective of preserving local neighborhoods in the original space. All these works employ various strategies to enhance the comparison of document representations, tailoring their approaches to the given task.
Several similarity measures between sets of word vectors that employ the earth mover’s distance have been proposed. The authors of [159] were the first to frame the similarity of documents as a transportation problem. The idea is that the distances between similar but different words in two documents should be small. Their Word Mover’s Distance (WMD) is the cost of transporting a set of word vectors in an embedding space. This approach was effective, as it managed to exploit similarities between different Word2Vec word vectors, such as the relation of analogies. However, the main deficiency of WMD was that the distances were expensive to compute. Moreover, its output, a single number expressing the distance between two given documents, can only be combined with k-nearest neighbors or k-means, while applications usually require a whole feature vector. The authors of a subsequent work [160] managed to derive vectors for documents using WMD. They constructed a positive-definite word mover’s kernel using a feature map given by the WMD to random documents and then derived document embeddings via a Random Features approximation of the kernel. Other authors of [70] proposed the Word Rotator’s Distance (WRD). It is designed so that the norm and angle of word vectors correspond to the probability mass and transportation cost in the earth mover’s distance, respectively. Therefore, the norm of vectors, which is associated with a vector’s significance, does not interfere with the calculation of transportation cost, as it does in WMD. Finally, the authors of [161] attempted to include structural information absent in WMD’s distance estimation between two sets of words. They represented the sentence vector as a weighted average of substructure vectors at a lower level in a recursive way, while a transport plan at a different level explains how the different substructures align.

2.3. Learning Sentence Embeddings Directly

We already mentioned some models that learn the aggregation of tokens or their combination rules. However, in this section, we will look at more direct approaches to learning a representation vector for a given text.

2.3.1. Paragraph Vectors

The method presented in [162], and sometimes called Doc2Vec, is one of the first successful Word2Vec [16] adaptations to sequences of tokens. The authors presented two methods of how document representation could be trained.
In the Distributed Memory model of Paragraph Vectors (PV-DM), every paragraph is mapped to a unique vector, which, together with context words, participates in predicting the next word. This way, such a unique vector acts as a memory that stores the topic of the paragraph and bears its representation. The second is the Distributed Bag Of Words version of the Paragraph Vector (PV-DBOW) model. It leaves only a unique paragraph vector in the input and predicts words randomly sampled from that paragraph in the output. However, unlike the first method, it does not account for word ordering.
In the original article [162], authors represented sentences as a concatenation of DM and DBOW vectors, as they saw such a setup as more consistent. Later works [163,164] also reported that individual model performance had only marginal differences, yet deviations were usually task-dependent.
The authors of [165] proposed a modification to Paragraph Vectors and called it Doc2VecC. Differently from Doc2Vec, a document vector is derived not as a unique vector, but as an average of sampled constituent word embeddings. The other notable contribution is the added data-dependent regularization that favors informative or rare words while forcing the embeddings of common and non-discriminative ones to be close to zero. At the time, the authors showed Doc2VecC to match state-of-the-art in multiple tasks.

2.3.2. To RNN- and Transformer-Based Models

Shallow Doc2Vec-like models soon met competition from more complicated Recurrent Neural Networks (RNNs). In a sequence-to-sequence [166] (like machine translation) setting, such a model usually has two parts: an encoder and a decoder. The encoder, going through each token one by one, encodes the input sequence into a fixed hidden representation, which is later used by the decoder to generate the target sequence (the translation in machine translation) token by token. Meanwhile, only the encoder part is needed to produce a sentence representation. Words of the sentence are sequentially fed as input to the RNN, and the final hidden state is interpreted as its representation.
The benefit of RNN models is that the sequential consumption of tokens allows the final representation to account for the word order and process the arbitrary number of tokens. However, the effectiveness of this mechanism is questionable. It turns out that information has a hard time propagating all the way to the final hidden state. As compensation for that, a bidirectional [167] setting was used, which connects two hidden layers of opposite directions to the same output, simultaneously getting forward and backward information. The second solution was to use an attention mechanism [4], which, during decoder token generation, takes a weighted average of the encoder hidden states from all the input tokens. But it was only helpful for sequence-to-sequence tasks, like translation, as such a mechanism removed the last hidden state bottleneck, which was required to compress a long sequence of tokens into a single vector for a whole text representation.
One of the first famous recurrent models for sentence representation is SkipThought [168]. This model is trained to predict neighboring sentences from the source one. The center sentence is encoded by a bidirectional GRU (Gated Recurrent Unit [169]) that concatenates the last hidden state of a forward GRU and the last hidden state of a backward GRU, and then decodes it into the two target sentences. This way, the encoder is trained to map a sentence to a single representative vector. The authors showed that the model is robust and performed well on all tasks considered. The main drawbacks of this model are the month-long training, the huge vector size of 4800 (resulting from the concatenation of 2400-dimensional vectors from two separate models), support of only sentence-level embeddings, and the need for training text with a coherent inter-sentence narrative.
As recurrent GRU or LSTM [170] architectures became the standard, the most recent advances in text representation were due to how data and training objectives were chosen. Here, we have to mention the famous InferSent [12] model. The authors showed that high-quality supervised data, although in low quantity, can hugely increase performance. They trained an encoder based on a bi-directional LSTM architecture with max pooling on the Stanford Natural Language Inference (SNLI) [171] dataset. The resulting model outperformed SkipThought after less than a day of training on a single GPU. Other authors of [172] modified the SkipThought model into QuickThoughts by framing the training task as classification. Instead of generating the target sentence, the model encodes the likely candidates and chooses one of them. The authors of [173] leveraged several data sources with multiple training objectives. Their GenSen model training tasks included Neural Machine Translation, Constituency Parsing, Natural Language Inference, and SkipThought-like training. Finally, the authors of [174] proposed even greater utilization of training data by constructing a discourse marker prediction task, predicting tokens such as because, and, if, etc. Such framing of the task allowed the authors to mine vast amounts of text pairs together with the connecting markers. Their DisSent model performed similarly well to InferSent on various transfer tasks. Overall, one can see that the way the model is trained and the quality and quantity of data play a huge role in the resulting text representation performance.
Many drawbacks of the recurrent models were solved by the transformer architecture [2]. The authors decided to get rid of recurrence and keep only the attention mechanism [4] itself. The resulting model (1) is highly parallelizable; (2) after only the first layer, each token has already attended to every other token; and (3) is faster than RNN models when the sequence length is smaller than the representation dimensionality. These properties allowed transformer models to be trained on staggering amounts of text much faster, and they became state-of-the-art models on multiple tasks.
Previous advances in sentence representation using RNNs were quickly applied to the new architecture. The authors of [175] presented a Universal sentence encoder model. It used only the encoder part of the original transformer model trained with multiple tasks: Skip-Thought-like training, SNLI, and conversational response prediction [176]. After the appearance of the pre-trained BERT model, the authors of [9] proposed SBERT: a pre-trained BERT model further fine-tuned with NLI data. It was a huge success, similar to the earlier InferSent model, yet this time both the new architecture and the general pre-training were the new factors for additional performance advancement. Other authors of [177] tried to further utilize the unsupervised data. Their PAUSE model learned sentence embeddings from a partially labeled dataset and showed that this way, the same performance can be achieved with a smaller fraction of labeled NLI data. Here, we see the same tendency as with the RNN models: new efforts to construct better fine-tuning tasks and better utilize existing data.

2.3.3. Contrastive Learning Approaches

The search for better training tasks and data utilization culminated in a new area of contrastive learning.
Distance-based contrastive loss [178] is more attractive for learning sentence-level embeddings than the more conventional error prediction losses. It allows for a simple construction of self-supervised learning using pairs of positive and negative examples. Embeddings of positive pairs are encouraged to be close (by utilizing a loss on a distance function between the text vectors), while those of negative pairs are encouraged to be far apart.
In this way, the burden of expensive labeling can be relieved. Moreover, it can be used as an intermediate step between pre-training and fine-tuning to inexpensively align the model to the target task domain. These properties are especially useful for the “data-hungry” transformer models.
The models in this class differ mainly in how they construct the positive example pairs. This can be achieved in many different ways. We will describe them below.

Feature/Vector/Embedding-Level Augmentations

These are the modifications to the sentence in its vector representation space.
Dropout [179] is one of the most popular feature-level augmentation techniques. It was originally used as a regularization of neural networks to increase the robustness to noise from the vector space. However, the same can be applied to derive an augmented text sequence representation.
Popular BERT and RoBERTa pre-trained transformer models already have dropout layers and require almost no modifications to employ the dropout augmentation. Two positives are acquired by passing the same sample through the dropout-enabled network twice [143,180,181]. This does not require additional preprocessing and can be applied on the fly during the training.
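To make the procedure concrete, below is a minimal sketch of dropout-based positive pairs in Python with the Hugging Face transformers library; the checkpoint name, mean pooling, and example sentences are our assumptions rather than the exact setup of the cited works.
```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumption: any BERT-like checkpoint with dropout layers can be used here.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.train()  # keep dropout active so that the two passes differ

sentences = ["A man is playing a guitar.", "The weather is nice today."]
batch = tokenizer(sentences, padding=True, return_tensors="pt")

def embed(batch):
    # Mean-pool the last hidden layer over non-padding tokens.
    out = model(**batch).last_hidden_state        # (b, t, d)
    mask = batch["attention_mask"].unsqueeze(-1)  # (b, t, 1)
    return (out * mask).sum(1) / mask.sum(1)

z1 = embed(batch)  # first dropout-perturbed view
z2 = embed(batch)  # second view of the same sentences; (z1[i], z2[i]) form positive pairs
```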
Despite the effectiveness reported, dropout augmentation has the disadvantage of being biased toward sentence length. This was observed in [181]. Two augmented versions of the same sentence are of the same length and can be easily distinguished from randomly drawn negatives of varying lengths. Therefore, instead of learning a general sentence representation to match the same sentence samples, the model just takes a shortcut by comparing their lengths. To counteract this, the authors of [181] added token-level augmentations in addition to dropout so that the lengths of positives would differ. Other authors of [182] also managed to avoid this problem by using negatives produced with a much higher dropout rate than that used for positives.
Another feature-level (also regarded as token-level) augmentation technique is the shuffle of tokens. Usually in pre-trained language models, positional information of each sequence element is brought about by the addition of special positional embeddings to the existing token ones. As a result, the shuffling of positions can be implemented simply as the shuffling of position IDs and the corresponding positional vectors.
Similarly to dropout, there is a cutoff augmentation [183]. Dropout can be considered as random erasing of individual elements in the L × d text sequence embedding matrix of L tokens, each represented by a d-dimensional vector. The token cutoff erases randomly selected rows, while the feature cutoff erases columns of the L × d matrix. In the work on ConSERT [63], it was found that the shuffle and token cutoff are the two most effective augmentation strategies, significantly outperforming the feature cutoff and dropout.
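For illustration only, the two cutoff variants can be sketched as zeroing rows or columns of the token embedding matrix; the cutoff rate and function names below are our assumptions.
```python
import torch

def token_cutoff(E: torch.Tensor, rate: float = 0.1) -> torch.Tensor:
    """Zero out randomly chosen rows (tokens) of an L x d embedding matrix E."""
    keep = (torch.rand(E.size(0), 1) > rate).float()
    return E * keep

def feature_cutoff(E: torch.Tensor, rate: float = 0.1) -> torch.Tensor:
    """Zero out randomly chosen columns (feature dimensions) of E."""
    keep = (torch.rand(1, E.size(1)) > rate).float()
    return E * keep

E = torch.randn(12, 768)           # 12 tokens, each a 768-dimensional vector
view = token_cutoff(E, rate=0.15)  # one augmented "view" of the same sentence
```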
There are more sophisticated approaches to alter embeddings than a dropout randomly replacing weights with zero values. Specifically, adversarial attacks were shown to be successful. They aim to add worst-case perturbations to the input samples. The authors of [63] used the Fast Gradient Value (FGV) [184] method, which unfortunately relies on supervised loss to compute adversarial perturbations. Other authors of [185] employed the Fast Gradient Sign Method (FGSM) [186]. Perturbation is obtained by applying the sign function to the derivative of the model with respect to the input by contrastive loss. The authors of [185] composed their best augmentation mix as a dropout with FGSM.

Token-Level Augmentations

Language gives us many possibilities to convey almost the same meaning through similar or very different sequences of words. This can be easily utilized for augmentation purposes.
Using synonym replacement, random swapping of two consecutive words, and random insertion and deletion, the so-called Easy Data Augmentation (EDA), the authors of [187] managed to halve the required dataset size for the same performance in multiple classification tasks. In contrastive learning experiments for the biomedical relation extraction task, the authors of [188] found that synonym replacement outperformed random swap and random deletion. Meanwhile, one of the first contrastive works for text [189] showed that back-translation [190] augmentation outperformed the four mentioned techniques. The authors took a pre-trained BERT [5] and pre-trained it further on the input text of GLUE [191] tasks in a contrastive fashion, to be later fine-tuned and evaluated on GLUE. The resulting CERT model achieved slight improvements over the regular BERT. CLEAR [192] was also an attempt to improve the regular BERT and RoBERTa pre-training by jointly using an MLM loss and a contrastive loss. The authors found that different combinations of augmentation strategies (in particular, substitution, span-deletion, and reordering) favor different improvements for GLUE and various SentEval tasks. A recent work [185] tried token-level augmentations such as typos, synonym replacement, paraphrasing, and back-translation but found that they underperformed feature-level ones.
As the popular BERT model is pre-trained with a masked language modeling task, i.e., predicting the true masked token behind the special input [MASK] token, the same [MASK] can also be used to induce augmentations by randomly masking tokens. The mirror-BERT model of [180] used such random span masking for input augmentation, which, together with feature augmentation, showed gains over off-the-shelf models in both lexical and sentence-level tasks, across different domains and different languages.
Token-level augmentations can greatly contribute to existing vector-level ones. Feature-level augmentations do not affect the length of the text sequence and are therefore susceptible to length bias. ESimCSE [181] showed that simple word repetition augmentation can effectively counteract it. The authors also tested the insertion of the [MASK] token with small improvements, while the insertion of stop words slightly decreased the effect.
A better distinction of negatives can also be obtained by token augmentations. The authors of [92] proposed Bidirectional Margin Loss (BML) to incorporate soft-negative samples that are generated using a simple rule-based method. According to its dependency syntax tree, the positive sentence is converted to its negation with correct syntax and clear semantics. The proposed setup saw improvements in the semantic textual similarity tasks from SentEval.
One drawback of token-level augmentations is the chance of producing false positives. It is especially risky in stochastic modifications, such as back-translation, deletion, or insertion, where exact output cannot be controlled. Furthermore, token-level augmentations are more complicated than feature-level ones, and thus they cannot be performed on-the-fly during the training and must be prepared in advance.

Positives by Relative Placement

Similar to the older idea of SkipThought [168], the placement of sentences in a text can be exploited. In such a setting, text segments near or overlapping each other in a long text should also be near each other in the sentence embedding space. Therefore, text parts near the same anchor in the text can be regarded as positive pairs, while those further away act as negative pairs for contrastive learning. DeCLUTR [193] samples multiple anchor spans for each document in a mini-batch. The corresponding positives are further sampled for each anchor and can be partially overlapping with, adjacent to, or subsumed by the anchor. This is accomplished by sampling the anchor span length to be mostly longer than the positive spans. At the time, DeCLUTR showed improvements in SentEval [194] transfer tasks in an unsupervised setting. Location exploitation is a simple approach, but it has some drawbacks. As the authors used a dataset consisting of documents at least 2048 tokens long, such placement-based methods clearly have limited applicability to short-text domains. Moreover, the random spans of DeCLUTR are prone to semantic fragmentation.

Positives by Two Networks

The augmented version of a sample can also be constructed using two encoders. Different weight initialization and training data order are the most common causes of fluctuations in neural network performance. Although the performance is generally comparable, each differently trained model represents a different local minimum. Such a subtle difference can act as an augmentation. One of the first such setups was the CT model [99]. Its simple objective required two models to retain similar representations for identical sentences while distinguishing their representations from those of different ones.
The authors of [86] presented an approach of two BERT models, one fixed and the other tunable. The authors imposed a contrastive tension between the [CLS] token of the tunable model and a holistic representation of the fixed one. The latter is aggregated from all BERT layers, thus encouraging the tunable model to learn to concentrate all the relevant information to the [CLS] token.
The authors of [195] proposed using multiple diverse positives instead of the usual two. The authors claim that this increases the probability of “at-least-one” effective positive during training. To implement diverse positives, the agreement of two similarity distributions of samples between two encoder models had to be maximized.
Another setup incorporating two networks can be used for Knowledge Distillation (KD). In [196], after the regular KD step with the original pre-training task, an additional contrastive pre-training was added. The pair of positives for a single text segment consisted of a representation from the teacher model and another from the student. After the third stage of fine-tuning on semantic textual similarity tasks, [196] showed that the 110 M model outperformed the one with 11B parameters.
MoCoSE [185] combines both feature-level augmentations and two branches based on asymmetric BERT encoders. While the online branch is updated through the loss gradient, the second, the target branch, is updated by the Exponential Moving Average (EMA). These discrepancies between the two branches prevented the model from collapsing and allowed the achievement of competitive results in SentEval.
Although two-network approaches are conceptually similar, they require extra memory or time, which is a big drawback. Transformer models already require the latest state-of-the-art GPU with the largest available memory. Techniques such as gradient accumulation are often used to process large mini-batches sequentially, but in the contrastive learning approach, the requirement for large in-batch negatives gives additional complexity.

Positives and Negatives from Supervised Data

Up to now, we have described various augmentation techniques that modify existing unlabeled samples for use in contrastive learning. However, there are several labeled datasets that can also be turned into positives or negatives by reusing the label information.
One of the first works to implement this idea was SimCSE [143]. The authors analyzed six candidate datasets. They found that the Natural Language Inference (NLI) datasets, SNLI [171] and MNLI [197], together performed the best. The combined dataset contains sentence pairs in the form of (premise, hypothesis, label), labeled as entailment, neutral, or contradiction, with 314k pairs for each class. The SimCSE authors constructed positive pairs from entailment samples and negatives using contradiction ones. The resulting supervised approach achieved state-of-the-art results at the time on SentEval tasks. The authors also tried to incorporate neutral pairs as less-weighted negatives but did not observe improvements.
The authors of [198] note that using NLI labels to construct positive and negative pairs can contradict apparent semantic information: elements of a negative pair may not be negative in the semantic space. To address this issue, the authors train the PairSupCon model by incorporating an instance discrimination objective, which is claimed to have an implicit grouping effect. The objective discriminates both the hypothesis and the premise sentences of a positive pair separately from all other sentences in the batch. The authors also incorporated importance weighting on negatives to strengthen the effect of hard ones. In total, the overall loss consisted of two instance discrimination terms (one for the hypothesis and the other for the premise) with negative weighting and a third cross-entropy term predicting NLI labels. Significant improvements were shown for clustering tasks, while for semantic textual similarity tasks in SentEval, only moderate gains were observed.
A novel use of NLI dataset labels was implemented in the PairSCL [199] model. First, the pair of hypothesis and premise sentences from the NLI dataset is passed through an encoder, and the unified representation of such pair is aggregated by a cross-attention module. Then a supervised contrastive loss is applied between positive and negative samples. What is interesting is that positives are regarded as hypothesis and premise pairs from the same class, i.e., whether contradiction, neutral, or entailment. Pairs labeled in different classes are considered negatives. In addition to such supervised contrastive loss, an additional cross-entropy loss is applied that predicts the actual class label. This setup showed improvements in both the NLI and SentEval transfer tasks.
ST5 [200] adopted a two-stage training strategy. In the first stage, 2 billion mined question–answer pairs from community QA sites were used, with each question and its answer framed as a positive pair. During the second stage, NLI data were used, similarly to SimCSE [143]. Such data-rich training allowed the authors to outperform previous approaches on the SentEval tasks. Furthermore, using larger models with up to 11B parameters brought even further gains.
Supervised datasets are difficult to produce. If the domain of the task in question differs from publicly available NLI datasets, the easiest solution is to use unsupervised methods. Otherwise, labeled sources are indispensable. Exploiting multiple datasets and label information can make the benefits even more obvious.

Direction of Contrastive Learning for NLP

The classic setup of contrastive learning is becoming more nuanced. The usual setup of two positives and multiple in-batch negatives (just any other sentence) is not perfect. As a result, the exact distance to the negatives or between the positives varies, and learning produces only a coarse approximation. The distinction between soft, hard, and weighted negatives begins to arise [92,143,198], as well as the use of multiple positives [195]. The classic NT-Xent (also called InfoNCE) loss is thus often modified or additional losses added to incorporate finer training signals. The desire to scale models and datasets is also observed, as in [200], yet we think that more memory-efficient approaches, as in [196], should be prioritized. Nonetheless, contrastive learning currently produces state-of-the-art sentence embeddings.
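For reference, a minimal sketch of the standard NT-Xent (InfoNCE) loss with in-batch negatives is given below; the temperature value is an assumed hyperparameter, and real implementations often add the refinements discussed above.
```python
import torch
import torch.nn.functional as F

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """z1, z2: (batch, dim) embeddings of two positive views of the same sentences."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    sim = z1 @ z2.T / temperature        # (batch, batch) scaled cosine similarities
    labels = torch.arange(z1.size(0))    # positives lie on the diagonal
    return F.cross_entropy(sim, labels)  # every other sentence in the batch is a negative
```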

3. Methods

Our extensive literature review allowed us to see the big picture of sentence embedding research.
Currently, the evolution of models for sentence embeddings and related NLP tasks has settled on transformers. In this work, we use existing models to source the raw, token-level embeddings. In Section 3.2, we describe in detail the main model that we use, some baselines used for comparison, as well as some of our original extensions. In particular, in the T1–T4 models, we extend the original prompting templates by incorporating more than one [MASK] token. Next, we present a new Avg. model, where we derive sentence embeddings by first averaging each token’s representations over its different contexts and then averaging the resulting static token embeddings over the sentence. Finally, we present our BERT + Avg. model, which combines both contextual and multiple-context-averaged representations, all derived from the same BERT transformer model.
In addition to these extensions, we found two main directions that can be used to improve sentence embeddings from transformer models: token aggregation and post-processing techniques. Note, however, that our main contribution here is not the methods, as we reuse most of them from the existing works, but the combinations of them on the transformer model and extensive evaluation on multiple tasks. We thoroughly describe the techniques used (and minor extensions) for token aggregation in Section 3.3 and post-processing of embeddings in Section 3.4.
We have noticed that most works confine themselves to a small subset of evaluation tasks, which limits the comparability of their results to others. Papers from top conferences usually add classification tasks on top of semantic textual similarity, which is otherwise often the only evaluation. In this work, we evaluate sentence embeddings on three different types of tasks: semantic textual similarity, downstream classification, and clustering. We present these tasks in Section 3.5.
We now proceed with the formal definition of the problem we address in this work.

3.1. Problem Formulation

We are given a text sequence $s_0$, which was tokenized into $N$ individual pieces (i.e., tokens)
$$t_1\, t_2\, \ldots\, t_N,$$
and a model representing each token as a $d$-length vector $v \in \mathbb{R}^d$ in each of its $L$ layers. We also use static token embeddings denoted as the $-1$st layer and input embeddings denoted as the 0th layer. They are sums of token, segment, and position vectors, with layer normalization applied on top. Therefore, we obtain the following 3-dimensional representation space $\mathbb{R}^{(L+2) \times N \times d}$:
$$\begin{pmatrix} v_1^{-1} & v_2^{-1} & \cdots & v_N^{-1} \\ v_1^{0} & v_2^{0} & \cdots & v_N^{0} \\ v_1^{1} & v_2^{1} & \cdots & v_N^{1} \\ \vdots & \vdots & \ddots & \vdots \\ v_1^{L} & v_2^{L} & \cdots & v_N^{L} \end{pmatrix}.$$
We want to find an aggregation function $f$ such that the $(L+2) \times N \times d$-dimensional output of a chosen transformer model is reduced to as meaningful a $d$-dimensional vector in $\mathbb{R}^d$ as possible.
As a baseline for the bert-base-uncased model with $L = 12$ layers, we consider the average over the $N$ tokens in the text and over the first and last layers:
$$f_{\text{first+last}}(s_0) = \sum_{n=1}^{N} \frac{v_n^{1} + v_n^{12}}{2N}.$$
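A minimal sketch of this baseline with the Hugging Face transformers library is shown below; handling of padding tokens via the attention mask is our assumption, since the equation above is written for a single unpadded sequence.
```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()

def first_last_avg(sentences):
    batch = tokenizer(sentences, padding=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch, output_hidden_states=True).hidden_states
    # hidden[0] is the input (0th) layer, hidden[1] the 1st, and hidden[12] the last layer.
    layers = (hidden[1] + hidden[12]) / 2         # (b, N, d)
    mask = batch["attention_mask"].unsqueeze(-1)  # ignore padding tokens
    return (layers * mask).sum(1) / mask.sum(1)   # (b, d) sentence embeddings

embeddings = first_last_avg(["A quick example sentence."])
```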

Additional Context Data

Generally, a corpus contains $S$ text samples. Therefore, other text sequences can also be used to derive the representation of the target sentence. In this way, our aggregation function $f$ operates on an $S \times (L+2) \times N \times d$-dimensional output from the model. $S$ can also be enlarged by using additional datasets; in particular, we used Wikitext-2 from [201]. This is useful for various post-processing techniques, such as centering, standardization, and PCA, where transformations are learned on corpora other than the target corpus.

3.2. Models

We use multiple text representation methods, focusing mainly on BERT-based ones. Prompting method T4, Averaged BERT (Avg.), BERT + Avg., B2S-100, and Random embeddings (RE) are our proposed models or modifications, while BERT, T0, and B2S are plain adaptations of existing ones. We tested token aggregation and sentence representation post-processing techniques on all eight models. We will describe them in more detail below.

3.2.1. BERT

BERT is a very popular transformer model. When first introduced in [5], it spectacularly outperformed multiple other models on a wide range of tasks.
BERT is based on the original transformer architecture of [2]. The main difference is that instead of sequence-to-sequence workings and encoder–decoder structure, it is composed only of an encoder side. This allows it to solve multiple classification and regression tasks, maintaining the same pre-trained base while only changing the classifier heads on top.
In this work, we use the BERT version called bert-base-uncased, a 110 M parameter model containing 12 layers (blocks) and working with lowercase text. We employ the Hugging Face implementation [202] of BERT. It was pre-trained on BooksCorpus (800 M words) [203] and English Wikipedia (2500 M words) datasets, which take 13 GB of plain text combined. The model was trained for 40 epochs (or passes through the corpus). Back then, the training time, model size, and data used were considered to be very large, yet now they are only a small fraction of what the current state-of-the-art models use.

Input

Input to the BERT encoder consists of token, positional, and token type embeddings. BERT uses WordPiece tokenization [204]. Splitting less frequent words into sub-words (e.g., “transformer” → “transform” and “##er”) rather than splitting everything on word boundaries allows one to reach a manageable vocabulary size of 30,000 tokens. BERT, like other transformer models, perceives input as a set; therefore, order information must be supplied in addition. To achieve this, additional positional embeddings are used, with a unique value for each position in a token sequence and feature dimension. Token type embeddings allow the model to distinguish between the two (if there are two) separate sentences (e.g., 〈 Question, Answer 〉) in one token sequence. Overall, token, positional, and token type embeddings are added, and then layer normalization [205] and dropout [179] are applied. This results in the input to the BERT model blocks (for simplicity, we will call them layers).
There are 3 special tokens. [CLS] starts every token sequence, [SEP] ends token sequences and acts as a separator in a pair of sentences, and [MASK] is used during pre-training to mask some percentage of the input tokens at random for the model to predict (known as a “Cloze” task [206]). The [CLS] token is also used for the next sentence prediction task—predicting whether the second sentence in a pair actually follows the first one in the training dataset or is a random one—the second unsupervised pre-training task of BERT.

Transformer Block

Each transformer block (or layer) has two sublayers. The first is a multi-head self-attention mechanism and the second is a position-wise fully connected feed-forward network. The output of each sublayer Sublayer(x) is added to the input x that bypasses the sublayer through residual connection [207] and the resulting signal ends in layer normalization [205], LayerNorm(x + Sublayer(x)) becoming the input to the second sublayer or the next layer.

Multi-Head Self-Attention

An input tensor $X \in \mathbb{R}^{b \times t \times f}$ with $b$ samples in a batch, $t$ tokens, and $f$ features is first linearly projected into queries $Q \in \mathbb{R}^{b \times t_Q \times h \times f_Q}$, keys $K \in \mathbb{R}^{b \times t_K \times h \times f_K}$, and values $V \in \mathbb{R}^{b \times t_V \times h \times f_V}$ tensors with $h$ heads using learned weights. Note that in the standard case, $t = t_Q = t_K = t_V$ and $f = h f_Q = h f_K = h f_V$. Then, for each sample text in a batch and each head, dot products between the feature vectors of every combination of query and key tokens are calculated, resulting in an attention tensor $(QK^{\top}) \in \mathbb{R}^{b \times h \times t_Q \times t_K}$. It is then scaled by dividing it by $\sqrt{f_K}$, and softmax is applied along $t_K$ so that all the dot products for any given query token and all the key tokens sum to 1. A dot product is then calculated between the probability-like scores and the values tensor, resulting in a weighted selection of $V$ with shape $\mathbb{R}^{b \times h \times t_Q \times f_V}$. Finally, the heads are concatenated back and a linear transformation $W^O \in \mathbb{R}^{h f_V \times f}$ is applied. The resulting tensor has the same shape as the input one: $\mathrm{MultiHead}(X) \in \mathbb{R}^{b \times t \times f}$.
For a more detailed explanation of the BERT and transformer architecture, please read the original papers [2,5].

3.2.2. Prompting Method (T0, T4)

The classical use of the BERT model involves two steps: (1) a general pre-training on a very large corpora with unsupervised tasks, and (2) a supervised fine-tuning step on a small target task dataset. The first step is expensive, it requires a lot of data, time, and computing resources, but it is carried out only once. Weights that are produced in an unsupervised way contain a lot of useful representations for the fine-tuning to only perfect them.
Prompt-based methods take advantage of the first step and can completely avoid the fine-tuning stage. The idea is to frame the target task in the original pre-training format, for which the model is essentially optimized. To extract a general sentence representation, the authors of [15] proposed using “This sentence: “[X]” means [MASK]” template (which we will call T0), where the target sentence is placed instead of [X] and the final layer representation of [MASK] is used. This way, the model itself tries to predict the meaning of a given sentence, and we can extract prediction weights before they are turned into probabilities over tokens.
During initial experiments, we manually searched for other, more elaborate templates than T0; they are presented in Table 1. We will also use the T4 template, which is much longer and has 3 [MASK] tokens that have to be averaged.
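A minimal sketch of the prompt-based extraction with the T0 template is given below; the exact spacing and punctuation of the template, and the averaging over several [MASK] tokens (needed for T4), are our assumptions.
```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()

def prompt_embedding(sentence: str) -> torch.Tensor:
    # T0-style template; the target sentence replaces [X].
    text = f'This sentence: "{sentence}" means [MASK].'
    batch = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state[0]            # (t, d)
    is_mask = batch["input_ids"][0] == tokenizer.mask_token_id  # locate [MASK] position(s)
    return hidden[is_mask].mean(0)                              # average if several [MASK]s

emb = prompt_embedding("A man is playing a guitar.")
```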

3.2.3. Averaged BERT (Avg.)

We follow the idea of [53] to average the representations of a token in its different contexts to acquire its static embedding. Yet, we take it even further and construct vectors for sentences by again averaging such static token embeddings over the sentences.
We construct static tokens using the Wikitext-103 [201] dataset and use the same tokenization as the BERT version bert-base-uncased. For each of the 28,807 tokens occurring in Wikitext-103, we sum all the vectors produced by the BERT model (which differ due to different contexts) and divide the sum by their count. Note that some tokens occur very often, while others are rare (for example, the most frequent, “the”, is repeated 6,470,356 times). The process is repeated for every BERT layer, as well as for combinations of layers, so that we can observe the dependence of performance on this factor as well.
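The construction can be sketched as a running sum per vocabulary entry; the toy two-line corpus, the last-layer-only choice, and the lack of batching are simplifications of our actual setup.
```python
import torch
from collections import defaultdict
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()

# Stand-in corpus; in our experiments, all of Wikitext-103 is processed.
corpus = ["First example sentence.", "Another short line of text."]

sums = defaultdict(lambda: torch.zeros(model.config.hidden_size))
counts = defaultdict(int)

for line in corpus:
    batch = tokenizer(line, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state[0]  # (t, d), here the last layer only
    for tok_id, vec in zip(batch["input_ids"][0].tolist(), hidden):
        sums[tok_id] += vec
        counts[tok_id] += 1

# Static embedding of each seen token: the average over all of its contexts.
static = {tok_id: sums[tok_id] / counts[tok_id] for tok_id in sums}
```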

Combining Averaged BERT and Regular BERT (BERT + Avg.)

We also wanted to see how sentence embeddings derived from static averaged BERT tokens can contribute to the original BERT representations. Therefore, we averaged sentence embeddings from the two methods mentioned above.

3.2.4. BERT2Static (B2S, B2S-100)

There are more advanced distillation methods than simple BERT token averaging. For example, the authors of [49] trained static word vectors using BERT’s contextualized representations. They adapted the Sent2Vec [48] model to words and trained it to predict a word from its context, as given by the contextual representation of the BERT model. This way, the authors obtained vectors for the 750,000 most frequent words. They also showed that this approach results in vectors that perform better than existing static embeddings trained from scratch, while still enjoying a small memory and computational footprint.
In our experiments, we used the bert_12layer_para model (downloaded from https://github.com/epfml/X2Static, provided in [49], accessed on 1 May 2023), which was trained from the contexts of a paragraph rather than only a sentence. We evaluated two versions of BERT2Static: a regular one (B2S) and one with the 100 most frequent words filtered out (B2S-100).

3.2.5. Random Embeddings (RE)

Instead of using a complex model to derive embeddings for words, we also tested random vectors as token embeddings. More specifically, we assign each token a random vector drawn from a normal (Gaussian) distribution centered at 0 with a standard deviation of 0.1. To facilitate better comparison with the BERT model, we use the same tokenizer from the BERT version bert-base-uncased and make the vectors the same length of 768. The whole sentence representation is then computed as the average of its constituent token vectors.
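A minimal sketch of this baseline is shown below; the fixed random seed is an arbitrary choice.
```python
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
rng = np.random.default_rng(0)

# One fixed random vector per vocabulary entry: N(0, 0.1), 768 dimensions.
table = rng.normal(loc=0.0, scale=0.1, size=(tokenizer.vocab_size, 768))

def random_embedding(sentence: str) -> np.ndarray:
    ids = tokenizer(sentence, add_special_tokens=False)["input_ids"]
    return table[ids].mean(axis=0)  # average of the constituent token vectors

emb = random_embedding("A man is playing a guitar.")
```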

3.3. Aggregating Tokens

Every text contains multiple tokens, each with a corresponding embedding vector. To obtain one vector for the whole text out of the many, usually, a simple average is calculated, as discussed in Section 2.1. We use it as a baseline here for multiple methods.
We investigate different methods of token weighting and filtering based on their frequencies. In classical bag-of-words approaches, frequency is usually accounted for with tf-idf weighting. We, however, use only idf weighting, because the same tokens in BERT cannot simply be counted due to different contextualization. Given a dataset with $N$ documents and the document frequency $\mathrm{df}_t$, defined as the number of documents in the given dataset that contain a token $t$, we calculate $\mathrm{idf}_t$ for a token $t$ as
$$\mathrm{idf}_t = \log \frac{N}{\mathrm{df}_t}.$$
To account for differences between long and short texts, we rescale the $\mathrm{idf}_t$ token weights for each text so that they sum to 1. This way, to calculate the average embedding for a given text, one only needs to sum the weighted vectors. We calculate two inverse document frequencies: $\mathrm{idf}_t^{W}$ for the Wikitext-2 dataset from [201] and $\mathrm{idf}_t^{T}$ for all samples from the given target task.
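The weighting can be sketched as follows; the function names are ours, and the handling of tokens unseen in the idf corpus (weight 0) is an assumption.
```python
import math
import numpy as np

def idf_weights(docs_tokens):
    """docs_tokens: a list of token-id lists, one per document in the chosen corpus."""
    N = len(docs_tokens)
    df = {}
    for tokens in docs_tokens:
        for t in set(tokens):
            df[t] = df.get(t, 0) + 1
    return {t: math.log(N / df_t) for t, df_t in df.items()}

def idf_average(token_vectors, token_ids, idf):
    """Weighted average of one text's token vectors; weights are rescaled to sum to 1."""
    w = np.array([idf.get(t, 0.0) for t in token_ids])
    w = w / w.sum()
    return (np.asarray(token_vectors) * w[:, None]).sum(axis=0)
```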
We also adopt the method in [15,63] to drop the most frequent tokens. We choose 33 tokens, which were depicted in the appendix of [15]. Following that article, we also investigate dropping punctuation and subword tokens. We found that dropping all three parts—the most frequent, punctuation, and subword tokens—has the best effect, and following the original work, we name such token aggregation as with removed biases (“-biases”).

3.4. Post-Processing Embeddings

It is common in machine learning to standardize datasets, as most methods are designed to work best with normally distributed data: Gaussian with zero mean and unit variance. Yet it is not that trivial, as data may not follow a smooth distribution and may contain disturbing outliers. In the context of the best practices reviewed in Section 2.2, we investigate the following processing methods of embeddings.

3.4.1. Z-Score Normalization

The most basic is the standard score, also called the z-score. It is the number of standard deviations $\sigma$ by which the value $x$ of a raw score is above or below the mean value $\mu$ of the raw scores of all samples. We standardize each dimension of our vectors using the z-score:
$$z = \frac{x - \mu}{\sigma}.$$
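A minimal sketch (equivalent to scikit-learn’s StandardScaler) is:
```python
import numpy as np

def z_score(embeddings: np.ndarray) -> np.ndarray:
    """Standardize each dimension of a (samples, dim) embedding matrix."""
    mu = embeddings.mean(axis=0)
    sigma = embeddings.std(axis=0)
    return (embeddings - mu) / sigma  # assumes no dimension has zero variance
```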

3.4.2. Quantile Normalization to Uniform Distribution (quantile-u)

This technique works by making two distributions identical in statistical properties, thus reshaping given data values according to some known distribution function. In our case, we found that the uniform distribution worked very well as a reference. For a more detailed description of the technique, see [208].
We also tried other methods that are more robust to outliers; however, we found their performance marginally below quantile-u. This includes quantile normalization using a normal distribution and RobustScaler (as it is called in the scikit-learn preprocessing library [209], which we used). The latter method removes the median value instead of the mean and scales the data according to the selected quantile range.
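A minimal sketch of quantile-u with the scikit-learn preprocessing library is shown below; the random matrix stands in for actual sentence embeddings.
```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

X = np.random.randn(1000, 768)  # stand-in for a (sentences, features) embedding matrix
qt = QuantileTransformer(output_distribution="uniform")
X_quantile_u = qt.fit_transform(X)  # each dimension mapped to an approximately uniform [0, 1] range
```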

3.4.3. Whitening

It is a transformation that produces uncorrelated components, each with a variance of 1.
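One common SVD-based way to implement whitening is sketched below (this is our illustrative variant, not necessarily the exact implementation of [13,14]); dimensionality reduction can be added by keeping only the top singular vectors.
```python
import numpy as np

def whiten(X: np.ndarray) -> np.ndarray:
    """Center a (samples, dim) matrix and decorrelate its dimensions to unit variance."""
    X = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    W = Vt.T / s * np.sqrt(X.shape[0] - 1)  # whitening matrix
    return X @ W                            # covariance of the result is the identity
```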

3.4.4. All-But-The-Top (ABTT)

It is a method introduced in [118]. We start with the given embedding matrix $A \in \mathbb{R}^{b \times f}$ of $b$ sentences, each with $f$ features. First, it is centered by its mean $\mu \in \mathbb{R}^{f}$ into $\tilde{A} \in \mathbb{R}^{b \times f}$. Using the centered $\tilde{A}$, and given the number $d$ of top principal components to remove, we then calculate the PCA components $U \in \mathbb{R}^{d \times f}$. Now, we project our data onto these components and back to acquire $A^{\mathrm{PCA}} \in \mathbb{R}^{b \times f}$:
$$A^{\mathrm{PCA}} = \tilde{A}\, U^{\top} U.$$
The final processed embedding matrix is then
$$A' = \tilde{A} - A^{\mathrm{PCA}}.$$
The only difference between our approach and that of the original authors of [118] is that instead of post-processing word embeddings, we use all-but-the-top to post-process document embeddings.
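A minimal sketch of the procedure, following the equations above, is:
```python
import numpy as np
from sklearn.decomposition import PCA

def all_but_the_top(A: np.ndarray, d: int = 2) -> np.ndarray:
    """Remove the top-d principal components from a (sentences, features) matrix."""
    A_centered = A - A.mean(axis=0)
    U = PCA(n_components=d).fit(A_centered).components_  # (d, features)
    A_pca = A_centered @ U.T @ U                         # projection onto the top-d directions
    return A_centered - A_pca
```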

3.4.5. Normalization

We also experiment with normalization: scaling individual sentence vectors to have a unit norm.

3.4.6. Learning Post-Processing

Some target task datasets can also be too scarce to calculate accurate statistics, such as the mean, standard deviation, or others, used in post-processing. Therefore, we also experiment with learning these weights on the Wikitext-2 dataset instead. We indicate such techniques with the superscript $W$.

3.5. Evaluation

We evaluate the investigated methods on multiple clustering, semantic textual similarity, and classification tasks.

3.5.1. Clustering Tasks

We assess the performance on 6 benchmark datasets for short text clustering. Compared to the usual ones, short datasets impose a challenge due to the weak signal caused by sparsity, which is a big problem for classic count-based approaches such as bag-of-words or tf-idf. Table 2 provides an overview of the main statistics, and the details of each dataset are as follows.

Agnews

It is a subset of news titles [210], which contains 4 topics selected by [211].

Biomedical

It is a subset of PubMed data distributed by BioASQ (http://participants-area.bioasq.org/, accessed on 1 May 2023), where 20,000 paper titles from 20 groups are randomly selected by [212].

GoogleTS

It contains titles and snippets of 11,109 news articles related to 152 events [213]. We use the full version of the dataset, which includes both titles and snippets, named GoogleNews-TS in [211].

Searchsnippets

It is extracted from web search snippets and contains 12,340 snippets associated with 8 groups [214].

Stackoverflow

It is a subset of the challenge data published by Kaggle (https://www.kaggle.com/competitions/predict-closed-questions-on-stack-overflow/data, accessed on 1 May 2023), where ref. [212] selected 20,000 question titles associated with 20 different categories.

Tweet

It consists of 89 categories with 2472 tweets in total [213].
We perform clustering by running k-means with the scikit-learn [209] package and report the clustering accuracy, computed using the Hungarian algorithm [215] and averaged over 10 independent runs (we used the codebase from https://github.com/amazon-science/sentence-representations and downloaded the clustering datasets from https://github.com/rashadulrakib/short-text-clustering-enhancement/tree/master/data, both accessed on 1 May 2023).
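For clarity, a minimal sketch of the accuracy computation via the Hungarian algorithm is given below; the variable names and the use of scipy’s linear_sum_assignment are our choices, not necessarily those of the cited codebase.
```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans

def clustering_accuracy(labels_true: np.ndarray, labels_pred: np.ndarray) -> float:
    """Accuracy under the best one-to-one cluster-to-class mapping (Hungarian algorithm)."""
    k = max(labels_true.max(), labels_pred.max()) + 1
    cost = np.zeros((k, k), dtype=int)
    for t, p in zip(labels_true, labels_pred):
        cost[p, t] += 1
    rows, cols = linear_sum_assignment(cost.max() - cost)  # maximize matched counts
    return cost[rows, cols].sum() / len(labels_true)

# Usage sketch, assuming X holds sentence embeddings and y the ground-truth topic labels:
# labels_pred = KMeans(n_clusters=len(set(y)), n_init=10).fit_predict(X)
# acc = clustering_accuracy(np.asarray(y), labels_pred)
```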

3.5.2. Semantic Textual Similarity (STS) Tasks

STS assesses the degree to which two sentences are semantically equivalent to each other. A single sample consists of two sentences and a score ranging from 0 for no meaning overlap to 5 for meaning equivalence. The semantic textual similarity shared task was held annually from 2012 to 2017 [216,217,218,219,220,221] and formed the STS12–STS17 datasets. A total of 8628 carefully collected samples from these contests formed the STS benchmark [221]. Table 3 shows details of the STS datasets, including the SICK-Relatedness [222] and STS-B test sets, which we also use. Similarly to [145], we also add STR [223], a recent semantic relatedness dataset created using comparative annotations.
STS is a very popular choice for evaluating textual embeddings in an unsupervised way. Without any fine-tuning, one can calculate the distance (usually cosine) between two vectors of two sentences, which should correlate to the target equivalence score. It is so widely adopted that almost all work on semantic representations assesses the model performance on this task.
We follow the STS evaluation settings from [143]. First, the evaluation is kept unsupervised by not applying any additional regressors; cosine similarity between embeddings in a pair is taken directly as a model score for similarity. To find the degree of correlation between the annotated and model-given labels, Spearman’s rank correlation is used because it measures the rankings instead of the actual scores. Finally, for annual STS challenges, we concatenate all subsets and report the general Spearman correlation for that year (referred to in [143] as “all”). As we used the SentEval toolkit [194], we had to implement concatenation ourselves since it only had “mean” and “wmean” settings available.
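A minimal sketch of this evaluation protocol for one dataset is shown below; it assumes the paired sentence embeddings and gold scores are already available as arrays.
```python
import numpy as np
from scipy.stats import spearmanr

def sts_spearman(emb1: np.ndarray, emb2: np.ndarray, gold: np.ndarray) -> float:
    """Spearman correlation between pairwise cosine similarities and gold scores."""
    emb1 = emb1 / np.linalg.norm(emb1, axis=1, keepdims=True)
    emb2 = emb2 / np.linalg.norm(emb2, axis=1, keepdims=True)
    cosine = (emb1 * emb2).sum(axis=1)  # one similarity per sentence pair
    rho, _ = spearmanr(cosine, gold)
    return rho
```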

3.5.3. Downstream Classification Tasks

Differently from STS, these tasks are evaluated in a supervised way. Following the SentEval [194] benchmark suite, the commonly used evaluation protocol is to train a logistic regression or an MLP classifier with a cross-validation setup on top of the frozen representations, and the testing accuracy is used as a measure of the representation quality. We opted for the logistic regression classifier and the 10-fold cross-validation scheme, the setting most commonly reported in the literature. We evaluated various binary, ternary, and fine-grained classification tasks, as well as regression tasks. A more detailed description of each is presented below, while statistics are presented in Table 3.
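A minimal sketch of this protocol with scikit-learn is given below; the regularization settings are library defaults rather than the exact SentEval configuration.
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def classification_accuracy(X: np.ndarray, y: np.ndarray) -> float:
    """10-fold cross-validated accuracy of logistic regression on frozen embeddings."""
    clf = LogisticRegression(max_iter=1000)
    scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
    return scores.mean()
```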

Binary Classification

It includes sentiment prediction from Stanford Sentiment Treebank dataset SST2 [112], movie reviews MR [224], and customer product reviews CR [225]. In SUBJ [224], binary subjectivity status is labeled for sentences from movie reviews and plot summaries, and in MPQA [226,227], phrase-level opinion polarity from news articles is predicted. The last is MRPC [228], the Microsoft Research Paraphrase Corpus, from parallel news sources for the paraphrase detection task.

Ternary Classification

SCICITE [229] is a domain-specific classification task that assigns one of three intent labels (“background information”, “method”, “result comparison”) to sentences collected from scientific articles citing other articles. Meanwhile, the SICK-E [222] dataset has labels for sentence pairs as “entailed”, “contradiction”, or “neutral”.

Fine-Grained Classification

It includes SST5 [112], a 5-level sentiment analysis dataset, and TREC [230], comprising classification of 6 types of questions. We also evaluate regular semantic textual relatedness and similarity tasks SICK-R [222] and STS-B [221] with classification, by splitting the real-valued similarity targets into 5 discrete class labels. For example, a [0, 5] score of 3.6 goes to class 3 with weight 0.4 and also to class 4 with weight 0.6.

3.5.4. Isotropy

We use the IsoScore [117] metric to measure the uniformity of the utilization of the embedded space. As shown by the authors of [117], IsoScore has stronger properties than other isotropy measures.
For each sentence embedding method, we calculated the IsoScore on the Wikitext-2 dataset from [201]. We split the dataset into individual sentences (now samples) and omitted texts with fewer than 10 characters.

3.5.5. Alignment and Uniformity

Explaining the success of contrastive methods, the authors of [144] proposed using alignment and uniformity properties to better quantify the quality of representations. Alignment is calculated between semantically related positive pairs and therefore is an expected distance between embeddings of the paired instances:
$$\mathcal{L}_{\mathrm{align}}(f) = \mathbb{E}_{(x,y)\sim p_{\mathrm{pos}}}\left[ \lVert f(x) - f(y) \rVert^{2} \right].$$
Uniformity is computed using representations of the whole space:
$$\mathcal{L}_{\mathrm{uniform}}(f) = \log \mathbb{E}_{(x,y)\overset{\mathrm{i.i.d.}}{\sim} p_{\mathrm{data}}}\left[ e^{-2 \lVert f(x) - f(y) \rVert^{2}} \right].$$
Smaller values of both alignment and uniformity indicate better quality of the representations.
As a distribution of positive pairs $p_{\mathrm{pos}}$, we used STS-B task training split pairs with a similarity of 5.0, and for $p_{\mathrm{data}}$, we used sentences from all pairs.
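For clarity, a minimal sketch of how these two quantities can be computed from matrices of sentence embeddings is given below; it is an assumed implementation that simply follows the definitions above (with the factor of 2 in the uniformity exponent taken from the formula), and the function names are ours.

```python
# Alignment over positive pairs and uniformity over all sentences, as defined above.
import torch

def alignment(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """x, y: embeddings of positive pairs, shape (n, d); smaller is better."""
    return (x - y).norm(dim=1).pow(2).mean()

def uniformity(x: torch.Tensor, t: float = 2.0) -> torch.Tensor:
    """x: embeddings of all sentences, shape (n, d); smaller is better."""
    return torch.pdist(x, p=2).pow(2).mul(-t).exp().mean().log()
```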

4. Results

Here, we present the results of our various experiments outlined in Section 3.

4.1. Token Aggregation and Post-Processing Techniques

The results of the best-performing token aggregation and embedding post-processing techniques are presented in Table 4. The results here are averaged over all tasks in each task class. See Appendix A for the individual results.
We see that both aggregation and post-processing methods have a great impact on STS and clustering performance. For semantic textual similarity tasks, the best model was improved from 62.3 to 71.6 average Spearman correlation, while for clustering, the best model was improved from 59.2 to 64.8 average accuracy.
Our techniques can even improve the dedicated SimCSE model [143], which was fine-tuned on NLI data. Its main strength lies in semantic textual similarity tasks, where it leads by over 10%. However, for clustering tasks, its average accuracy of 59.8% is similar to the other evaluated models and improves up to 64.0% if we apply the best-performing techniques. This showcases a general tendency that the top-performing models are very good only in a narrow subset of tasks and highlights the importance of our more general methods.
For maximum performance, both aggregation and post-processing techniques must usually be combined. As an example, sentences placed within the T4 template with no techniques applied have a 62.3 average Spearman correlation. A better aggregation method of $\mathrm{idf}_t^T$ pushes it up to 69.2, while using only quantile-u post-processing gives 67.3 (see Table 5 with only post-processing techniques applied). However, the combined effort of both $\mathrm{idf}_t^T$ and quantile-u gives an average Spearman correlation of 71.6.
Still, the improvements from both aggregation and post-processing do not exactly add up linearly. This indicates that the improvements to representations of this type may saturate below the perfect scores.
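For illustration, the following is a minimal sketch (our assumed formulation, not the exact experimental code) of idf-weighted token aggregation: token vectors are averaged with inverse-document-frequency weights, so frequent and less informative tokens contribute less to the sentence embedding. Different corpora can be used to estimate the idf statistics, which is presumably what the superscripts of the idf variants in the result tables denote.

```python
# idf-weighted token aggregation into a single sentence embedding.
import math
from collections import Counter
import numpy as np

def idf_statistics(tokenized_corpus):
    """tokenized_corpus: list of token lists; returns {token: idf}."""
    n_docs = len(tokenized_corpus)
    doc_freq = Counter(tok for doc in tokenized_corpus for tok in set(doc))
    return {tok: math.log(n_docs / df) for tok, df in doc_freq.items()}

def idf_weighted_embedding(tokens, token_vectors, idf, default_idf=1.0):
    """Weighted average of per-token vectors; rarer tokens get larger weights."""
    weights = np.array([idf.get(t, default_idf) for t in tokens])
    vectors = np.stack([token_vectors[t] for t in tokens])
    return (weights[:, None] * vectors).sum(axis=0) / weights.sum()
```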
We considered many post-processing techniques for sentence vectors and show the performance of the most popular and best-performing ones separately in Table 5. Indeed, almost every method improves the average Spearman correlation for STS tasks to some degree. For all-but-the-top (abtt), we varied the number of removed components up to one hundred but settled on removing only the first two, as it was slightly better than the rest. Although highly credited in the literature, abtt-2 still obtained smaller scores than the other methods. With plain averaging of the token representations, the highest scores for both STS and clustering tasks were achieved using quantile-uniform normalization. We would also like to mention the simple normalization of vectors to unit length, which was often beneficial for clustering tasks, and whitening, which was learned on the Wikitext-2 dataset and was favorable for the random embeddings model. We trained the other post-processing methods on Wikitext-2 too, yet they resulted in similar or slightly smaller scores.
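As an illustration of the post-processing operations discussed above, the sketch below shows quantile-uniform normalization (here via scikit-learn's QuantileTransformer), unit-length normalization, and z-scoring. It is an assumed approximation rather than the exact implementation used in our experiments; the transformer can be fitted either on the evaluated sentences themselves or on an external corpus such as Wikitext-2.

```python
# Assumed approximations of three post-processing operations on a matrix of
# sentence embeddings (rows are sentences, columns are embedding dimensions).
import numpy as np
from sklearn.preprocessing import QuantileTransformer

def quantile_uniform(embeddings, fit_on=None):
    """Map every embedding dimension to a uniform distribution; `fit_on` may be
    an external corpus (e.g., Wikitext-2 embeddings) or the embeddings themselves."""
    qt = QuantileTransformer(output_distribution="uniform")
    qt.fit(embeddings if fit_on is None else fit_on)
    return qt.transform(embeddings)

def unit_normalize(embeddings):
    """Scale each embedding to unit L2 length."""
    return embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

def zscore(embeddings):
    """Standardize each dimension to zero mean and unit variance."""
    return (embeddings - embeddings.mean(axis=0)) / embeddings.std(axis=0)
```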
Unlike the unsupervised tasks, classification does not benefit from the two representation-shaping techniques. There is only a negligible improvement for classification tasks when the biases (punctuation, most frequent, and subword tokens) are eliminated. Otherwise, performance only decreases.
This result is logical because similarity or clustering tasks are carried out with the resulting text representations directly, while the classification is a supervised task learned on top. In the first case, the most informative components of the representations must be present in the largest principal components (i.e., constitute most of the variance in the data) for high scores, while the supervised logistic regression classifier can learn to extract them from the small principal components or ill-shaped representations on its own.

4.1.1. Avg. versus B2S and B2S-100

An interesting result, as seen in Table 4, is the comparison of simply averaged BERT tokens from many contexts (the Avg. model) and word vectors from a specially trained Word2Vec-style model BERT2Static (the B2S model) using the same BERT contexts, as described in Section 3.2.4. If no token aggregation and post-processing techniques are used, B2S performs similarly to the Avg. model in STS tasks (average Spearman correlation of 56.7 versus 56.4), worse at clustering (average clustering accuracy of 53.8 versus 55.2), and comparably in classification tasks (average accuracy of 76.9 for B2S versus 77.3). As we can see, the task performance differences are very small.
Apart from training, the second main difference between the Avg. and B2S models is tokenization. The Avg. model uses sub-word tokens, the same as BERT (whose tokens it averages), while B2S uses a vocabulary of the most frequent full words, 20 times the size of BERT's. Due to this discrepancy, token aggregations for the BERT and B2S models are applied differently, as the idf statistics for the two tokenizations differ; also, B2S has no equivalent of BERT's biased tokens. However, it is very straightforward for B2S to filter out (i.e., not use during averaging) a portion of the most frequent words. During our experiments, we varied this number and found that removing the 100 most frequent words works best; it improved the average Spearman correlation of STS tasks from 56.7 to 60.6 and the average accuracy of clustering tasks from 53.8 to 55.3 compared to the full B2S, both with plain token averaging and no post-processing applied.
Despite the additional option for B2S to remove the 100 most frequent tokens, using additional token aggregation and post-processing techniques allows our simple Avg. model to surpass both B2S and B2S-100 (see Table 4). Note that the removal of the most frequent words for B2S works in a similar way to weighted token aggregation or post-processing, so the gains are not additive. With the techniques applied, the best average STS Spearman correlation becomes 69.5 for Avg. versus 66.4 for B2S, and the best average clustering accuracy 62.2 for Avg. versus 57.2 for B2S-100. The additional techniques here helped the much simpler Avg. model outperform the more complex, specially trained B2S.
The success of such a simple Avg. model inspired us to seek further improvement by combining it with the parent BERT model. For the combined BERT + Avg. model results, please see Section 4.3.

4.1.2. BERT versus Random Embeddings

The most dramatic increase in performance due to the two techniques is observed for the model of random embeddings. For STS tasks, the average Spearman correlation rises from 50.3 to 66.4, while for clustering tasks, it increases from 36.3 to 49.5. Although the accuracy for the latter tasks is still very low, for STS, the result is a mere 3.4 points below what BERT managed to accomplish with both techniques applied.
If we were to look at the detailed performance on individual tasks in Appendix A.1, we would find several where random embeddings with the help of the two shaping techniques score higher than the also-shaped BERT representations. For STS14 (Table A3), it is 68.8 versus 68.3 Spearman correlation, while for the googleTS (Table A11), stackoverflow (Table A13), and tweet (Table A14) datasets, the clustering accuracy is, respectively, 69.5 vs. 68.5, 70.6 vs. 59.9, and 58.5 vs. 55.1. This may indicate a smaller complexity of these particular datasets, where the task can largely be solved based on a small set of keywords. Also, if the texts do not contain the natural language of the type the BERT was pre-trained on (e.g., they contain code), the model cannot properly contextualize the tokens, and the random embeddings without contexts work better. Yet it is important to note that it is achieved only with the help of aggregation and post-processing methods on top of the random embeddings.
The largest clustering accuracy difference between BERT and random embeddings is for the agnews and searchsnippets datasets, with the best scores of 86.8 versus 43.4 and 82.9 versus 36.6, respectively. We think that the observed performance gap may be related to the vocabulary and text sizes of the datasets. The stackoverflow and tweet datasets, where random embeddings are ahead of BERT, have the smallest vocabulary sizes of 7332 and 8091 (see Table 2), while agnews and searchsnippets have the largest vocabulary sizes of 16,140 and 16,334 unique tokens, respectively. Because random embeddings rely only on the distinctness of tokens, as the vocabulary grows, the probability that two texts share exactly the same keywords drops. BERT embeddings help in this case, because non-identical but semantically similar tokens still receive similar representations.
Having shorter text lengths (average text sizes of 12 and 11 for stackoverflow and tweet, respectively, versus 26 and 24 for agnews and searchsnippets, respectively) may also help the random embeddings, because they do not average away so easily, and there is less context to consider.
In contrast, other static models have much more comparable scores. For example, the Avg. clustering accuracy is 83.8 (agnews) and 74.2 (searchsnippets).

4.1.3. Isotropy

Many post-processing methods have previously been proposed to improve the isotropy of representations. It was argued that representations from the BERT model fall into a narrow cone and are therefore anisotropic. Thus, by raising isotropy, one can supposedly improve regular task performance.
We calculate the IsoScore isotropy metric on the Wikitext-2 dataset after applying various post-processing and token aggregation methods for multiple models. We exclude whitening post-processing, as it always produces representations with the IsoScore at the maximum of 100%, just due to its working principle. Then, we compare the isotropy score to the STS, clustering, and classification task performance by calculating the Pearson correlation coefficient. For semantic textual similarity tasks, results are shown in Figure 1, and for clustering tasks, results are shown in Figure 2.
For each method, the baseline IsoScore is very low: always below 5%. After applying weighted token aggregation techniques or post-processing, it always increases. The smallest improvements, up to a score of 7–8%, are observed for the templated models T0 and T4, while the highest, over 80%, is for random embeddings. For the other models, the IsoScore reaches around 17%.
Why is the IsoScore of random embeddings improved so much compared to the other models? The answer may be related to the inner workings of RE. Once we generate the random embeddings, these token-level representations have the maximum possible isotropy. The RE model represents text sequences as averages of these vectors; thus, isotropy is reduced when going from the token level to the document level. However, token aggregation and post-processing techniques mitigate this isotropy loss, allowing document embeddings to regain most of it from the token-level ones. The other models start with already low isotropy of the token-level representations and thus have less room to improve it.

IsoScore Correlation with Task Performance

Six out of eight models have a moderate correlation (more than 54%) between IsoScore and the average Spearman correlation of semantic textual similarity tasks (see Figure 1). For clustering (see Figure 2), it is less apparent, with only four models reaching a moderate Pearson correlation. However, for both STS and clustering tasks, the best score for each model is always reached with some improvement of the IsoScore, compared to the isotropy of the plain averaging setup.
In contrast, the classification does not improve with token aggregation and post-processing techniques; therefore, we do not observe correlation (and do not show it here).

4.1.4. Alignment and Uniformity

We observed that token pooling and post-processing techniques do not improve the alignment and uniformity properties of the representations. Let us remind the reader that alignment is calculated with only those STS-B pairs that have a similarity score of 5, while uniformity is calculated with all pairs, as defined in Section 3.5.5; smaller values are better for both.
We can observe in Figure 3 that the z-score post-processing always makes alignment worse, while normalization of embeddings almost always does the same for the uniformity of the representations. Excluding the random embeddings model, the best alignment and uniformity properties are almost always with plain averaging and no post-processing applied.
Both alignment and uniformity are sensitive to the scaling of the embedding vectors. Depending on the scaling produced by the post-processing method, either one or the other is increased. However, decreasing both of them using these methods turns out to be difficult. This finding reinforces the reputation of these two metrics as ones that only the training of the transformer model, as shown in [143], is capable of improving.
As we mentioned, the outlier in this respect is the RE model, which improves its uniformity with different weighting of tokens or quantile-uniform post-processing. Indeed, the latter post-processing is the least harmful for all models considered, disturbing the alignment and uniformity properties the least.

4.2. Using Prompts

The normal use of prompts, as presented in [15], is to place the sentence [X] inside a template such as “This sentence: “[X]” means [MASK]” and use only the vector of the [MASK] token from the final layer as the full representation of [X]. Our experiments show that, in general, using prompts is beneficial; however, we found some ways to improve the performance even more.
First, extracting only the vector of [MASK] is not necessarily the best option. We found that a simple average of all tokens, including the ones from the prompt template and the [MASK] token itself, is still a valid approach. Moreover, it outperforms the [MASK]-only approach for clustering and classification tasks, with 55.0 and 79.9 average accuracies, respectively, versus 54.2 and 76.8 when using only the [MASK] token (see the T4 model results in Table 4).
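The two extraction strategies compared here can be sketched as follows, assuming the Hugging Face Transformers API and the bert-base-uncased checkpoint; the exact template string, tokenization, and handling of special tokens in our experiments may differ in details.

```python
# A sketch of [MASK]-only versus all-token-average extraction from a templated sentence.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def templated_embeddings(sentence, template='This sentence: "{}" means [MASK].'):
    inputs = tokenizer(template.format(sentence), return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (sequence_length, 768)
    mask_position = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
    mask_only = hidden[mask_position]                   # [MASK]-token representation
    averaged = hidden.mean(dim=0)                       # average over all templated tokens
    return mask_only, averaged
```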
Wrapping the target text in an additional template enriches its representation. Let us compare the performance of 12th-layer averaged token representations with and without a template (see Figure 4, T0 avg. and T4 avg. with a template versus BERT avg. without). For all three groups of tasks (semantic textual similarity, clustering, and classification), the T0 template achieves a 61.3 average Spearman correlation, 53.5 clustering accuracy, and 80.5 classification accuracy; the T4 template reaches 62.3, 55.0, and 79.9, while regular, non-templated text obtains just 53.2, 50.6, and 79.0, respectively. However, this effect peaks at the last, 12th layer, and the achieved performance is similar to that of the first + last layer combination of the regular non-templated vectors.
The performance of using only the [MASK] token also peaks at the 12th layer, so we investigated its influence on the enrichment of the templated text representation. Could this token be the main culprit for the better performance of averaged templated text representations? To answer this question, we tried omitting the [MASK] token from averaging (see Figure 4, T0 avg. no [MASK] and T4 avg. no [MASK]). For the T4 template and 12th-layer representations, averaging all tokens yields a 62.3 average Spearman correlation, 55.0 clustering accuracy, and 79.9 classification accuracy; dropping the [MASK] token from averaging yields 58.7, 53.3, and 79.8; the non-templated performance is 53.2, 50.6, and 79.0, respectively. We see that omitting the [MASK] token from averaging indeed hurts the performance. On the other hand, the results show that it is responsible for only approximately half of the improvement, with the other half coming from the other tokens of the templated text.
One good reason to use averaging instead of only the [MASK] token is that token weighting can then also be applied. As we already showed in Table 4, the T4 template together with $\mathrm{idf}_t^T$ token weighting and quantile-uniform post-processing allowed us to reach an average Spearman correlation of 71.6, the best among the tried methods. We also observed that post-processing of text representations taken from the [MASK] token was not as effective as of the averaged ones. This again suggests that all tokens in the templated texts carry rich representations.

4.3. BERT + Avg. Model

One of the ways we sought to improve BERT transformer model representations is to combine the embeddings of the regular BERT with the ones averaged over multiple contexts. That is, for each token in each BERT layer, we collected many different contexts, and the averaged vector became the vector of the Avg. model. We then combined the BERT and Avg. models according to the parameter w, which sets the fraction of the Avg. model representation in the resulting vector v:
$$v = v_{\mathrm{BERT}} \, (1 - w) + v_{\mathrm{Avg.}} \, w. \qquad (10)$$
We also varied the w parameter into negative values in (10) to see if subtracting (instead of adding) the context-averaged representations from the context-aware ones helps.
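In code, the combination in (10) is a simple linear interpolation; the sketch below, with hypothetical function and variable names, is only for illustration.

```python
# Linear interpolation between a contextual BERT vector and its context-average (Avg.) counterpart.
import numpy as np

def combine(v_bert: np.ndarray, v_avg: np.ndarray, w: float) -> np.ndarray:
    # w = 0.0 gives plain BERT, w = 1.0 the Avg. model; negative w subtracts the context average.
    return (1.0 - w) * v_bert + w * v_avg
```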
The impact of the w parameter and the choice of the layer to source the representations (same layer for both BERT and Avg.) is presented in Figure 5.
One can see that for clustering and classification tasks, the combination of both models in equal portions of w = 0.5 is better than either model alone ( w = 0.0 corresponds to BERT, w = 1.0 to the single Avg. model). If we look at the best scores with the two techniques applied in Table 4, for the average clustering accuracy this translates to an improvement of 1.4%, from 63.4% for BERT to 64.8% for BERT + Avg.
Meanwhile, as we see in the same table, with no additional techniques applied and the first and last layers combined, the average classification accuracy is similar for BERT alone (79.9%) and BERT + Avg. (79.7%). However, as Figure 5 shows, unlike the first + last optimal layer setting for BERT, BERT + Avg. has a sweet spot at the 10th layer with an average classification accuracy of 80.5%, surpassing that of BERT by 0.6%.
These results suggest that for clustering and classification tasks, combining a regular BERT token representation with the same token averaged over multiple contexts is beneficial.
As we already mentioned, STS tasks on average preferred the regular BERT ( w = 0.0 ) to BERT + Avg. However, for several individual semantic textual similarity datasets, the best weights turned out to be even negative. This is the case for STS15 ( w = −0.25 ) and STS16 ( w = −0.75 ). A preference for the same negative weight of w = −0.5 was also observed for the searchsnippets clustering dataset. Although the Spearman correlation for three STS tasks was only up to several percent higher, for searchsnippets, the clustering accuracy increased to a staggering 80.2% from 72.2% at w = 0.0 . This shows that the determined values of w = 0.0 for STS and w = 0.5 for clustering and classification are not universal, and for a small percentage of tasks, they can differ. For more task-wise details on BERT + Avg., see the Appendix A.2 figures for STS (Figure A1), clustering (Figure A2), and classification (Figure A3) performance.

Layers

Figure 5 also depicts the task performance across BERT layers. Our first observation is that for STS and clustering tasks, there is a strong preference for the first layers. As a result, the best combination of layers also involves the first ones; for semantic textual similarity tasks, it is the average of representations from layers one, two, and twelve, and for clustering, from layers zero, one, two, and twelve. Even the individual SICK-E, SICK-R, and STS-B classification tasks, which originate from semantic textual similarity ones, have strong first layers (see Figure A3 in Appendix A.2), although on average, the best layer for classification tasks is the tenth. This shows that much of the performance on tasks that work with similarities between texts (STS and clustering) depends on the first layers, which, according to [231], have more generalized token representations. However, the last layer, which receives the recreated token identities [231], is also important, as the best combinations (such as 1 + 2 + 12) also involve it. This first + last cooperation of layers can be distinctly observed as a U shape of the clustering performance across layers (see Figure 5 and also Figure 4).
For most classification tasks (excluding those originating from STS), the U shape is inverted. This is very clearly seen for the binary classification tasks (see Figure A3 in Appendix A.2), where the highest classification accuracy is concentrated between the ninth and the eleventh layers. This aligns with previous work [100], which argues that the last layer is over-specialized for the training objective.

5. Conclusions

We empirically evaluated the effects of various aggregation and post-processing techniques of token representations in a trained transformer and other models to form good text-, paragraph-, or sentence-level embeddings. The embeddings were evaluated on three classes of downstream text-level tasks: Semantic Textual Similarity (STS), clustering, and classification.
We found the techniques to benefit all models studied for the unsupervised STS (the best model average Spearman correlation increased from 62.3% to 71.6%) and clustering (the best model average clustering accuracy increased from 59.2% to 64.8%), while it had no positive effect on the supervised classification tasks (see Table 4).
We present a strong and very simple baseline model of Random Embeddings (RE), where every token is assigned a random vector as its embedding. Combined with token aggregation and post-processing techniques, it almost matches the average STS performance of the BERT model with the techniques applied, with a 66.4% versus 69.8% average Spearman correlation. It also shows very high performance on some tasks, like stackoverflow clustering, where BERT token contextualization may not work well on the code samples in the texts (see Section 4.1.2 for more details). We encourage future work to use RE as a baseline, due to its mid-level performance, simple implementation, and ability to separate the contribution of the learned contexts from that of the aggregation and post-processing techniques.
We found that the aggregation and post-processing techniques tried typically increase the isotropy of the representations, and the isotropy for most models is positively correlated (up to 69% Pearson correlation) with the Spearman correlation of STS tasks (Section 4.1.3). The highest isotropy improvement is observed for our Random Embeddings model, since its token representations have the maximal isotropy to start with. We did not find the token aggregation and post-processing techniques to improve the alignment and uniformity properties of representations.
We question the use of prompts (adding a sentence into a certain text template) for retrieving the representation of the sentence from only the [MASK] token. Our experiments show that averaging all the templated text tokens, with idf weighting and post-processing for the unsupervised tasks, is better. Meanwhile, the average increase in performance due to the added template is only obvious for the STS tasks, while it gives no improvement in clustering and is very slight in classification (see Table 4, Section 4.2).
We presented a static vector model Avg., which simply contains BERT tokens averaged over multiple different contexts. Our experiments show that it outperforms the more complex BERT2Static [49], also a static word-level model but one specially trained on BERT contexts. With the best post-processing and token aggregation techniques, the advantage for unsupervised STS tasks is 69.5 versus 66.4, and for clustering, it is 62.2 versus 57.2, with a negligible difference for classification tasks. Moreover, we show that combining Avg. with the parent BERT model can bring even further improvements. In particular, BERT + Avg. reached the highest average clustering accuracy of 64.8 out of all our considered models, as well as a classification accuracy of 80.5 (Section 4.3). We encourage future work to also use Avg. as a baseline, both due to its upper-level performance and its simple implementation.
In our work, we also analyzed the prompt and BERT + Avg. models layer-wise. We found that for the STS tasks, taking the representations from the first layers performs better, while for the clustering tasks, the performance profile forms a “U” shape with peaks at the first and last layers. Therefore, for these two task groups, we mostly use the average of the first + last layers, harnessing both peaks for the best performance. On the other hand, classification tasks have an inverted “U” shape with the peak at the 10th layer (Figure 5). We did not find token aggregation and post-processing techniques to change the curvature of these profiles.
In this research, we used a pre-trained BERT as a manageable representative of transformer models, but the findings should be transferable to other types of transformers, including large language models. We also specifically did not perform task-specific fine-tuning of the model to keep it universal. This enables the same model to be used in multiple ways. For example, given a prompt, its text-level embedding can be extracted from the model using the techniques investigated here, and this embedding can be used to search an external database for related information to add to the prompt of the same model, i.e., we can use the same model for both query encoding and generation in retrieval-augmented generation [232]. Alternatively, one of the simpler baseline models proposed here could be used as the query encoder.

Author Contributions

Conceptualization, L.S. and M.L.; methodology, L.S. and M.L.; software, L.S.; validation, L.S.; formal analysis, L.S. and M.L.; investigation, L.S. and M.L.; resources, M.L.; data curation, L.S.; writing—original draft preparation, L.S.; writing—review and editing, M.L. and L.S.; visualization, L.S.; supervision, M.L.; project administration, M.L.; funding acquisition, M.L. and L.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were used in this study. The details and links to the datasets and preparation codes are presented in Section 3.5.1, Section 3.5.2 and Section 3.5.3 for the clustering, STS, and classification tasks, respectively.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Detailed Performance on Individual Tasks

Appendix A.1. Token Aggregation and Post-Processing

Table A1. Spearman correlation dependence on token aggregation and post-processing techniques for the semantic textual similarity STS12 task. The best result for each model is bolded, while the underlined result is the best across all models.
Model: T0 | T4 | BERT | BERT + Avg. | Avg. | B2S | B2S-100 | RE
Layer: Last | First + Last
avg.48.753.945.148.748.547.254.234.7
 + normalize48.753.945.148.748.547.254.234.7
 + quantile-u52.156.248.751.951.545.852.440.1
 + quantile- u W 50.255.146.650.750.344.652.536.9
 + whiten41.942.943.244.546.045.044.644.1
 + whiten W 46.250.447.352.857.754.755.054.2
 + zscore51.554.150.253.453.149.154.743.6
idf t W 56.256.857.558.758.357.656.955.4
 + normalize56.256.857.558.758.357.756.955.4
 + quantile-u56.256.556.457.957.554.553.652.0
 + quantile- u W 57.057.557.158.758.455.454.452.6
 + whiten43.743.846.146.747.245.344.744.6
 + whiten W 52.653.357.558.858.757.556.855.5
 + zscore56.155.957.458.758.256.956.455.6
idf t T 59.264.457.359.059.047.254.255.1
 + normalize59.264.457.359.059.047.254.255.1
 + quantile-u58.562.656.558.258.145.852.452.3
 + quantile- u W 59.163.957.058.858.944.652.552.9
 + whiten43.845.346.046.747.445.044.644.6
 + whitenW52.656.356.958.258.454.755.055.1
 + zscore56.959.757.759.359.349.154.755.6
-biases55.158.258.662.262.945.151.957.5
 + normalize55.158.258.662.262.845.151.957.5
 + quantile-u56.059.657.960.961.244.050.354.6
 + quantile-uW54.759.158.261.561.642.750.355.1
 + whitenW51.854.057.260.261.653.253.358.1
 + whiten42.543.245.146.247.644.244.246.2
 + zscore55.157.557.860.460.647.452.758.2
[MASK]55.260.6
 + normalize55.260.6
 + quantile-u56.060.8
 + quantile- u W 56.661.7
 + whiten47.748.0
 + whitenW52.557.7
 + zscore55.459.8
Table A2. Spearman correlation dependence on token aggregation and post-processing techniques for the semantic textual similarity STS13 task. The best result for each model is bolded, while the underlined result is the best across all models.
Model: T0 | T4 | BERT | BERT + Avg. | Avg. | B2S | B2S-100 | RE
Layer: Last | First + Last
avg.62.664.764.363.456.661.466.248.8
 + normalize62.664.764.363.456.661.466.248.8
 + quantile-u68.470.968.369.065.560.468.954.8
 + quantile-uW64.066.565.766.864.059.267.253.3
 + whiten76.076.677.678.378.076.476.175.1
 + whitenW62.266.564.865.866.260.660.863.7
 + zscore70.772.570.470.767.162.868.955.9
idf t W 75.474.777.279.177.877.577.372.5
 + normalize75.474.777.279.177.877.577.372.5
 + quantile-u76.776.077.779.378.777.176.373.4
 + quantile-uW75.674.777.179.379.177.276.774.2
 + whiten76.876.078.378.677.677.176.774.6
 + whitenW68.168.174.175.275.473.173.972.7
 + zscore76.776.478.279.779.377.578.073.0
idf t T 70.873.774.876.576.061.466.268.3
 + normalize70.873.774.876.576.061.466.268.3
 + quantile-u74.777.076.077.276.760.468.971.9
 + quantile-uW72.074.674.876.476.059.267.271.5
 + whiten77.178.078.178.377.376.476.174.0
 + whitenW67.271.070.370.470.160.660.868.1
 + zscore75.977.776.276.975.862.868.969.8
-biases66.568.768.469.968.156.360.561.3
 + normalize66.568.768.469.968.156.360.561.3
 + quantile-u70.973.570.871.970.355.063.266.2
 + quantile-uW67.470.368.669.968.253.961.364.4
 + whitenW63.167.165.265.765.956.656.763.0
 + whiten76.276.877.978.578.173.373.174.7
 + zscore73.074.972.572.970.658.564.064.6
[MASK]63.476.2
 + normalize63.476.2
 + quantile-u68.177.3
 + quantile-uW65.276.0
 + whiten73.177.6
 + whitenW61.770.9
 + zscore69.476.9
Table A3. Spearman correlation dependence on token aggregation and post-processing techniques for the semantic textual similarity STS14 task. The best result for each model is bolded, while the underlined result is the best across all models.
Model: T0 | T4 | BERT | BERT + Avg. | Avg. | B2S | B2S-100 | RE
Layer: Last | First + Last
avg.53.954.554.655.451.755.961.248.2
 + normalize53.954.554.655.451.755.961.248.2
 + quantile-u57.860.158.160.058.253.562.952.3
 + quantile-uW55.657.157.059.458.153.562.252.1
 + whiten64.065.066.268.169.467.768.568.3
 + whitenW53.557.057.661.164.560.761.364.6
 + zscore58.661.358.761.259.655.761.353.5
idf t W 65.565.965.267.767.867.267.867.6
 + normalize65.565.965.267.767.867.267.867.6
 + quantile-u66.766.766.569.069.367.268.166.4
 + quantile-uW66.566.666.268.969.367.168.067.0
 + whiten65.465.467.668.568.868.368.767.9
 + whitenW59.259.464.767.068.266.667.468.0
 + zscore65.765.366.268.869.266.868.067.8
idf t T 61.663.964.366.767.155.961.265.5
 + normalize61.663.964.366.767.155.961.265.5
 + quantile-u64.266.565.867.868.053.562.965.3
 + quantile-uW62.965.465.367.567.853.562.265.7
 + whiten65.266.867.568.368.567.768.567.4
 + whitenW58.061.663.064.665.360.761.365.3
 + zscore64.566.865.167.167.255.761.365.7
-biases60.762.063.866.966.953.158.164.7
 + normalize60.762.063.866.966.953.158.164.7
 + quantile-u63.465.965.267.967.850.859.865.2
 + quantile-uW61.964.364.867.767.650.959.265.3
 + whitenW57.360.462.865.567.358.558.966.2
 + whiten64.966.268.369.670.465.866.468.8
 + zscore64.166.364.867.467.253.358.665.3
[MASK]53.963.9
 + normalize53.963.9
 + quantile-u56.165.1
 + quantile-uW54.864.7
 + whiten60.866.3
 + whitenW51.561.3
 + zscore56.765.1
Table A4. Spearman correlation dependence on token aggregation and post-processing techniques for the semantic textual similarity STS15 task. The best result for each model is bolded, while the underlined result is the best across all models.
Model: T0 | T4 | BERT | BERT + Avg. | Avg. | B2S | B2S-100 | RE
Layer: Last | First + Last
avg.70.270.670.570.064.269.874.262.1
 + normalize70.270.670.570.064.269.874.262.1
 + quantile-u71.873.772.774.372.468.973.661.4
 + quantile-uW71.572.572.374.172.669.474.462.7
 + whiten62.662.964.065.969.369.368.867.9
 + whitenW70.271.871.974.977.074.775.274.7
 + zscore71.973.072.674.372.469.373.964.3
idf t W 75.474.475.875.471.573.774.474.4
 + normalize75.474.475.875.471.573.774.474.4
 + quantile-u75.174.275.375.773.773.471.968.5
 + quantile-uW76.575.676.376.974.874.673.771.1
 + whiten62.761.464.365.266.366.866.765.0
 + whitenW72.271.675.076.576.675.475.574.4
 + zscore74.273.175.276.275.073.874.473.2
idf t T 74.175.276.276.875.169.874.273.8
 + normalize74.175.276.276.875.169.874.273.8
 + quantile-u75.277.176.076.875.768.973.669.5
 + quantile-uW75.477.076.977.876.769.474.471.6
 + whiten63.164.264.566.068.569.368.867.2
 + whitenW72.974.675.476.275.874.775.273.1
 + zscore75.076.075.976.875.869.373.972.7
-biases72.973.376.277.375.267.370.874.3
 + normalize72.973.376.277.375.267.370.874.3
 + quantile-u74.176.176.377.576.266.370.169.6
 + quantile-uW73.775.376.878.477.366.970.971.7
 + whitenW72.673.775.477.077.372.672.674.8
 + whiten64.264.865.266.569.167.466.466.8
 + zscore74.475.175.776.975.767.070.773.6
[MASK]67.074.1
 + normalize67.074.1
 + quantile-u68.074.7
 + quantile-uW68.375.7
 + whiten61.065.1
 + whitenW67.774.5
 + zscore68.072.9
Table A5. Spearman correlation dependence on token aggregation and post-processing techniques for the semantic textual similarity STS16 task. The best result for each model is bolded, while the underlined result is the best across all models.
Model: T0 | T4 | BERT | BERT + Avg. | Avg. | B2S | B2S-100 | RE
Layer: Last | First + Last
avg.69.169.767.965.256.455.258.555.5
 + normalize69.169.767.965.256.455.158.555.5
 + quantile-u70.172.470.371.067.055.959.554.8
 + quantile-uW69.771.569.870.065.354.657.955.1
 + whiten65.465.967.168.969.965.063.067.1
 + whitenW64.767.269.971.571.163.162.868.3
 + zscore69.371.072.073.269.860.262.360.4
idf t W 70.169.173.973.069.568.868.071.9
 + normalize70.169.173.973.069.568.868.071.9
 + quantile-u70.569.973.774.272.968.165.868.9
 + quantile-uW70.569.874.074.472.767.966.269.6
 + whiten64.664.068.068.969.165.663.467.3
 + whitenW70.069.175.676.074.869.668.772.4
 + zscore72.271.775.676.474.870.369.572.2
idf t T 72.973.172.872.069.755.258.569.1
 + normalize72.973.172.872.069.755.258.569.1
 + quantile-u73.975.072.873.071.855.959.567.3
 + quantile-uW73.774.672.772.771.254.657.967.5
 + whiten67.068.668.068.769.565.063.067.8
 + whitenW69.071.073.272.871.363.162.869.3
 + zscore73.173.174.474.473.160.262.370.1
-biases70.171.972.272.168.452.856.365.4
 + normalize70.171.972.272.168.452.856.365.4
 + quantile-u71.173.772.373.371.653.557.564.7
 + quantile-uW70.673.272.172.770.452.355.864.4
 + whitenW65.268.071.072.171.561.160.968.5
 + whiten64.966.267.368.569.363.461.467.4
 + zscore71.172.573.374.172.058.260.466.7
[MASK]67.271.0
 + normalize67.271.0
 + quantile-u68.271.8
 + quantile-uW67.771.7
 + whiten65.570.0
 + whitenW64.169.4
 + zscore66.668.9
Table A6. Spearman correlation dependence on token aggregation and post-processing techniques for the semantic textual similarity STS-B task. The best result for each model is bolded, while the underlined result is the best across all models.
Model: T0 | T4 | BERT | BERT + Avg. | Avg. | B2S | B2S-100 | RE
Layer: Last | First + Last
avg.60.461.059.059.954.256.160.746.5
 + normalize60.461.059.059.954.256.160.746.5
 + quantile-u67.770.963.665.261.756.961.152.4
 + quantile-uW62.865.061.363.559.754.760.450.5
 + whiten68.270.068.770.069.666.764.668.1
 + whitenW62.165.862.667.369.465.164.167.5
 + zscore68.771.165.066.763.058.862.554.6
idf t W 69.068.270.369.666.468.066.969.8
 + normalize69.068.270.369.666.468.066.969.8
 + quantile-u69.468.669.369.667.866.764.764.4
 + quantile-uW69.468.870.270.368.167.065.365.7
 + whiten65.665.067.968.067.265.864.166.5
 + whitenW67.066.671.272.171.569.868.270.4
 + zscore70.769.971.071.770.169.067.870.0
idf t T 68.870.769.669.367.356.160.767.0
 + normalize68.870.769.669.367.356.160.767.0
 + quantile-u72.475.169.269.367.756.961.164.2
 + quantile-uW70.473.269.769.667.754.760.465.0
 + whiten69.672.268.468.467.766.764.667.0
 + whitenW67.871.069.669.968.765.164.167.1
 + zscore72.474.770.170.368.658.862.567.4
-biases64.765.470.271.067.852.957.366.6
 + normalize64.765.470.271.067.852.957.366.6
 + quantile-u70.173.469.971.068.954.258.162.8
 + quantile-uW66.169.270.371.569.051.757.163.9
 + whitenW66.168.569.671.370.961.860.868.8
 + whiten68.169.969.569.969.364.662.466.5
 + zscore71.173.270.271.168.956.159.467.4
[MASK]68.573.3
 + normalize68.573.3
 + quantile-u70.874.8
 + quantile-uW69.374.4
 + whiten70.574.3
 + whitenW65.370.9
 + zscore70.573.7
Table A7. Spearman correlation dependence on token aggregation and post-processing techniques for the semantic textual similarity SICK-R task. The best result for each model is bolded, while the underlined result is the best across all models.
Model: T0 | T4 | BERT | BERT + Avg. | Avg. | B2S | B2S-100 | RE
Layer: Last | First + Last
avg.64.464.263.863.160.360.261.253.1
 + normalize64.464.263.863.160.360.261.253.1
 + quantile-u66.166.364.964.762.561.061.854.8
 + quantile-uW64.964.764.364.161.659.761.053.1
 + whiten61.060.860.459.155.155.355.453.3
 + whitenW63.163.263.763.961.561.059.958.4
 + zscore66.366.765.064.962.862.163.156.3
idf t W 62.562.061.759.757.561.960.157.4
 + normalize62.562.061.759.757.561.960.257.4
 + quantile-u63.062.661.860.458.560.859.452.9
 + quantile-uW62.962.561.860.559.061.159.554.7
 + whiten57.757.557.155.953.854.554.752.0
 + whitenW60.961.161.960.859.061.260.156.7
 + zscore63.763.463.262.360.962.961.357.3
idf t T 64.465.062.861.159.260.261.356.8
 + normalize64.465.062.861.159.260.261.256.8
 + quantile-u66.066.363.261.659.761.061.854.3
 + quantile-uW65.065.663.061.359.559.761.055.2
 + whiten59.960.358.156.854.455.355.452.5
 + whitenW63.564.262.461.058.961.059.956.3
 + zscore66.266.563.762.460.862.163.157.0
-biases65.165.965.665.163.859.059.759.7
 + normalize65.165.965.665.163.859.059.759.7
 + quantile-u66.967.766.065.764.759.960.256.5
 + quantile-uW65.666.565.865.564.458.659.357.3
 + whitenW64.865.464.763.961.359.858.658.0
 + whiten61.361.160.358.955.454.254.153.1
 + zscore67.167.866.366.365.661.162.060.4
[MASK]64.865.0
 + normalize64.865.0
 + quantile-u67.166.0
 + quantile-uW65.665.6
 + whiten61.761.8
 + whitenW63.364.2
 + zscore67.166.2
Table A8. Spearman correlation dependence on token aggregation and post-processing techniques for the semantic textual similarity STR task. The best result for each model is bolded, while the underlined result is the best across all models.
Model: T0 | T4 | BERT | BERT + Avg. | Avg. | B2S | B2S-100 | RE
Layer: Last | First + Last
avg.60.959.865.465.659.448.248.953.3
 + normalize60.959.865.465.659.448.248.953.3
 + quantile-u65.767.868.169.265.447.347.953.6
 + quantile-uW63.664.167.368.865.348.048.653.7
 + whiten68.969.670.169.767.051.448.862.1
 + whitenW65.267.766.768.768.651.349.263.2
 + zscore66.668.966.767.564.550.150.156.0
idf t W 68.267.871.970.966.353.851.562.6
 + normalize68.267.871.970.966.353.851.562.6
 + quantile-u69.769.572.572.770.052.449.759.7
 + quantile-uW69.269.072.572.870.453.350.861.2
 + whiten69.568.870.168.966.251.949.561.1
 + whitenW68.267.870.870.667.752.449.961.7
 + zscore69.568.971.471.268.754.052.262.4
idf t T 67.167.571.870.866.948.248.963.3
 + normalize67.167.571.870.866.948.248.963.3
 + quantile-u71.073.272.171.969.347.347.959.3
 + quantile-uW69.871.672.072.069.748.048.660.6
 + whiten70.471.170.469.066.351.448.861.4
 + whitenW69.071.370.470.067.651.349.262.1
 + zscore70.873.271.471.369.350.150.163.1
-biases62.862.570.770.164.946.045.860.3
 + normalize62.862.570.770.164.946.045.860.3
 + quantile-u66.769.672.272.168.645.745.058.0
 + quantile-uW65.267.171.971.968.446.245.558.9
 + whitenW67.869.369.269.266.449.447.060.1
 + whiten69.169.569.668.565.049.646.959.6
 + zscore68.370.869.769.165.948.547.660.0
[MASK]67.070.1
 + normalize67.070.1
 + quantile-u70.773.5
 + quantile-uW69.173.2
 + whiten70.572.3
 + whitenW68.172.6
 + zscore70.773.1
Table A9. Clustering accuracy dependence on token aggregation and post-processing techniques for the agnews dataset. The best result for each model is bolded, while the underlined result is the best across all models.
Model: T0 | T4 | BERT | BERT + Avg. | Avg. | B2S | B2S-100 | RE
Layer: Last | First + Last
avg.85.480.781.284.479.085.085.028.0
 + normalize85.580.685.686.179.685.785.927.2
 + quantile-u85.286.984.085.179.785.285.227.5
 + quantile-uW85.486.785.385.780.685.485.427.2
 + whiten34.331.331.331.530.529.929.928.8
 + whitenW75.275.072.772.858.670.669.028.4
 + zscore86.081.281.383.879.884.885.227.7
idf t W 79.581.480.684.078.083.683.739.9
 + normalize84.985.684.785.178.385.084.941.4
 + quantile-u85.885.781.784.478.684.284.241.2
 + quantile-uW85.685.881.884.779.184.684.743.4
 + whiten30.932.031.331.830.730.730.127.6
 + whitenW74.578.173.874.470.972.572.138.4
 + zscore85.981.880.484.078.983.783.741.0
idf t T 80.182.180.982.579.385.085.028.7
 + normalize85.282.285.184.881.785.785.928.5
 + quantile-u85.987.082.481.980.985.285.228.6
 + quantile-uW85.987.082.084.482.185.385.528.4
 + whiten30.232.630.931.329.929.929.828.0
 + whitenW72.168.473.071.945.570.669.028.9
 + zscore85.982.681.182.179.784.885.228.8
-biases85.786.481.485.683.484.083.828.5
 + normalize85.786.686.886.883.885.385.528.4
 + quantile-u85.486.785.785.483.484.784.728.5
 + quantile-uW85.687.186.186.083.684.684.929.4
 + whitenW73.573.472.472.155.568.867.628.2
 + whiten31.131.533.030.430.329.529.828.3
 + zscore86.087.181.784.883.584.484.528.1
[MASK]74.881.9
 + normalize75.182.2
 + quantile-u63.182.1
 + quantile-uW76.184.2
 + whiten30.931.8
 + whitenW52.948.3
 + zscore76.181.7
Table A10. Clustering accuracy dependence on token aggregation and post-processing techniques for the biomedical dataset. The best result for each model is bolded, while the underlined result is the best across all models.
Model: T0 | T4 | BERT | BERT + Avg. | Avg. | B2S | B2S-100 | RE
Layer: Last | First + Last
avg.32.732.434.836.130.733.137.019.4
 + normalize32.832.035.136.330.333.837.620.7
 + quantile-u33.632.537.039.133.132.936.721.0
 + quantile-uW33.432.535.437.934.533.037.021.1
 + whiten29.430.227.223.218.212.713.215.9
 + whitenW34.834.740.242.736.535.034.429.4
 + zscore33.533.036.137.432.132.436.120.0
idf t W 31.030.032.332.328.134.534.532.1
 + normalize31.230.232.532.628.734.734.835.2
 + quantile-u32.531.333.633.629.833.935.032.4
 + quantile-uW31.730.731.933.630.733.734.433.4
 + whiten27.327.626.420.617.111.812.416.7
 + whitenW32.832.137.038.735.437.237.230.8
 + zscore32.030.832.632.528.234.334.331.8
idf t T 33.433.636.036.232.633.137.030.0
 + normalize33.634.136.037.133.933.837.632.2
 + quantile-u35.435.336.937.234.132.936.731.3
 + quantile-uW35.035.736.637.534.733.037.032.2
 + whiten28.330.820.919.416.312.713.215.2
 + whitenW36.837.440.439.933.135.034.429.2
 + zscore34.735.436.936.832.432.436.129.5
-biases35.132.239.039.237.231.635.929.2
 + normalize35.232.539.439.437.732.936.930.8
 + quantile-u36.033.039.839.637.232.336.229.9
 + quantile-uW36.634.639.639.937.332.436.330.9
 + whitenW38.638.741.740.334.434.433.428.7
 + whiten29.831.622.017.113.512.512.013.2
 + zscore35.632.739.439.537.030.935.628.7
[MASK]24.034.8
 + normalize25.135.3
 + quantile-u27.435.9
 + quantile-uW25.134.7
 + whiten21.928.4
 + whitenW23.732.8
 + zscore26.435.0
Table A11. Clustering accuracy dependence on token aggregation and post-processing techniques for the googleTS dataset. The best result for each model is bolded, while the underlined result is the best across all models.
Model: T0 | T4 | BERT | BERT + Avg. | Avg. | B2S | B2S-100 | RE
Layer: Last | First + Last
avg.63.564.366.166.165.166.265.861.2
 + normalize62.764.666.267.166.268.367.767.1
 + quantile-u64.865.167.167.467.366.366.963.5
 + quantile-uW63.964.766.468.066.766.967.263.6
 + whiten59.260.357.557.354.558.158.453.1
 + whitenW62.462.065.666.165.265.865.861.7
 + zscore63.665.066.767.165.966.066.661.8
idf t W 63.562.665.266.265.664.565.863.8
 + normalize64.363.166.066.566.066.566.769.5
 + quantile-u64.065.466.666.567.066.466.164.1
 + quantile-uW64.464.967.567.866.667.465.963.4
 + whiten57.959.958.356.555.258.958.651.4
 + whitenW61.961.466.267.766.765.566.763.3
 + zscore63.563.366.166.465.765.965.663.9
idf t T 63.264.966.467.465.966.266.061.3
 + normalize64.764.767.768.665.968.367.767.4
 + quantile-u65.265.568.567.967.566.667.063.7
 + quantile-uW64.866.267.767.367.367.167.262.9
 + whiten58.859.258.759.055.657.758.555.1
 + whitenW60.160.465.166.764.865.865.861.2
 + zscore63.865.066.467.666.366.066.362.1
-biases63.264.166.066.365.765.765.362.0
 + normalize62.663.466.366.865.867.066.867.1
 + quantile-u64.865.166.768.067.765.765.464.0
 + quantile-uW64.465.567.167.466.766.566.764.1
 + whitenW60.961.666.066.465.263.965.563.2
 + whiten61.360.359.259.257.657.357.156.9
 + zscore63.864.965.766.766.165.365.062.7
[MASK]45.556.0
 + normalize45.956.5
 + quantile-u47.856.2
 + quantile-uW46.956.4
 + whiten53.453.1
 + whitenW43.053.3
 + zscore47.056.0
Table A12. Clustering accuracy dependence on token aggregation and post-processing techniques for the searchsnippets dataset. The best result for each model is bolded, while the underlined result is the best across all models.
Model: T0 | T4 | BERT | BERT + Avg. | Avg. | B2S | B2S-100 | RE
Layer: Last | First + Last
avg.67.373.372.270.863.672.072.023.3
 + normalize68.273.881.773.165.776.476.320.4
 + quantile-u69.872.474.173.066.875.272.922.3
 + quantile-uW69.773.581.674.768.675.775.121.7
 + whiten33.836.132.331.726.322.924.821.2
 + whitenW47.749.258.361.856.358.858.324.0
 + zscore67.973.272.771.363.772.372.122.8
idf t W 67.167.071.970.463.069.969.929.2
 + normalize68.469.481.873.065.281.481.436.6
 + quantile-u70.371.673.872.565.974.472.230.7
 + quantile-uW70.872.580.674.067.174.880.732.1
 + whiten32.734.833.031.723.727.324.120.5
 + whitenW47.851.964.566.161.967.668.530.0
 + zscore69.870.072.270.462.770.170.229.7
idf t T 55.175.372.070.559.372.072.026.1
 + normalize63.275.681.276.960.976.476.324.8
 + quantile-u68.878.574.676.159.175.272.926.1
 + quantile-uW71.278.781.377.061.075.775.125.8
 + whiten30.635.730.829.825.022.924.821.3
 + whitenW46.846.162.763.052.958.858.326.7
 + zscore68.679.173.372.459.072.372.125.8
-biases66.570.482.673.071.770.571.922.8
 + normalize67.070.582.974.873.875.280.922.6
 + quantile-u68.880.380.773.872.373.573.823.8
 + quantile-uW69.880.682.676.574.275.672.722.9
 + whitenW48.446.064.564.454.957.153.623.9
 + whiten31.636.333.229.322.723.323.922.0
 + zscore67.679.882.172.671.472.672.623.3
[MASK]61.169.9
 + normalize61.069.6
 + quantile-u60.369.6
 + quantile-uW61.469.8
 + whiten31.134.0
 + whitenW39.445.3
 + zscore61.869.8
Table A13. Clustering accuracy dependence on token aggregation and post-processing techniques for the stackoverflow dataset. The best result for each model is bolded, while the underlined result is the best across all models.
Model: T0 | T4 | BERT | BERT + Avg. | Avg. | B2S | B2S-100 | RE
Layer: Last | First + Last
avg.25.030.935.943.541.914.820.239.2
 + normalize25.430.736.942.842.315.020.739.0
 + quantile-u27.832.738.345.945.114.720.538.6
 + quantile-uW25.732.438.446.947.915.020.440.0
 + whiten39.839.734.217.813.09.812.612.2
 + whitenW28.933.949.457.362.021.723.556.6
 + zscore26.733.137.344.243.614.019.839.2
idf t W 28.629.830.637.739.116.117.555.4
 + normalize29.030.331.137.238.016.818.662.7
 + quantile-u29.831.132.038.140.716.817.557.7
 + quantile-uW30.030.932.139.940.816.318.058.8
 + whiten30.630.126.115.412.29.79.712.5
 + whitenW29.831.440.050.355.816.217.156.8
 + zscore30.230.631.137.438.716.116.854.1
idf t T 36.239.054.263.460.514.820.261.6
 + normalize36.938.655.166.662.415.020.770.6
 + quantile-u38.740.757.164.363.114.720.563.7
 + quantile-uW38.640.057.765.663.015.020.465.2
 + whiten43.544.632.119.716.69.812.616.7
 + whitenW35.736.759.963.158.721.723.561.6
 + zscore38.340.654.562.058.914.019.860.7
-biases28.534.544.856.759.714.318.253.9
 + normalize28.435.345.257.058.714.619.463.6
 + quantile-u31.036.844.556.759.314.019.058.0
 + quantile-uW30.835.546.459.960.614.819.357.4
 + whitenW35.136.155.462.861.021.020.952.2
 + whiten42.638.721.415.711.38.011.611.3
 + zscore31.536.945.257.557.913.418.553.0
[MASK]26.435.5
 + normalize26.536.1
 + quantile-u26.037.7
 + quantile-uW24.937.0
 + whiten22.741.0
 + whitenW18.032.2
 + zscore27.337.4
Table A14. Clustering accuracy dependence on token aggregation and post-processing techniques for the tweet dataset. The best result for each model is bolded, while the underlined result is the best across all models.
Model: T0 | T4 | BERT | BERT + Avg. | Avg. | B2S | B2S-100 | RE
Layer: Last | First + Last
avg.47.048.251.854.150.751.751.746.5
 + normalize46.248.353.053.252.453.755.256.4
 + quantile-u46.648.551.953.951.750.852.948.2
 + quantile-uW46.948.053.653.453.051.451.147.8
 + whiten15.916.315.917.018.518.117.217.6
 + whitenW44.944.151.151.251.750.549.949.0
 + zscore47.549.053.053.351.450.050.546.4
idf t W 48.647.250.552.148.549.450.845.4
 + normalize48.347.851.952.748.550.350.451.5
 + quantile-u49.047.952.751.948.849.849.946.4
 + quantile-uW48.347.952.752.848.349.449.646.0
 + whiten15.315.115.716.218.317.417.015.9
 + whitenW43.042.350.050.850.149.849.345.5
 + zscore49.346.752.251.448.048.949.146.3
idf t T 50.850.555.055.254.551.751.749.0
 + normalize49.650.655.154.753.653.755.258.5
 + quantile-u49.751.053.554.353.750.852.950.5
 + quantile-uW50.052.353.454.853.351.451.151.7
 + whiten17.618.015.516.518.218.117.217.6
 + whitenW44.446.651.953.254.050.549.948.8
 + zscore49.750.053.555.453.850.050.547.8
-biases47.149.652.452.351.749.649.648.0
 + normalize47.349.354.055.753.451.351.355.1
 + quantile-u47.950.453.055.653.448.650.748.2
 + quantile-uW48.850.053.454.453.550.049.047.4
 + whitenW45.946.052.253.053.347.347.547.4
 + whiten16.316.916.417.418.718.717.718.5
 + zscore47.349.452.053.553.248.349.047.4
[MASK]41.147.0
 + normalize41.047.1
 + quantile-u42.047.7
 + quantile-uW41.947.4
 + whiten14.916.6
 + whitenW38.441.7
 + zscore41.646.7

Appendix A.2. BERT + Avg. Model in Different Layers

Figure A1. BERT + Avg. model individual task STS performance dependence on the weight w of the Avg. model and the layer from which (for both models) representations are used. To the right of the black line on the horizontal axis, average aggregation of multiple layers is also shown. Tokens are simply averaged and no post-processing is used. The horizontal line with w = 0.0 corresponds to a regular BERT (B) model, w = 0.5 is B + Avg., and w = 1.0 is the Avg. model. The white × marks the maximum value.
Figure A2. BERT + Avg. model individual task clustering performance dependence on the weight w of the Avg. model and the layer from which (for both models) representations are used. To the right of the black line on the horizontal axis, average aggregation of multiple layers is also shown. Tokens are simply averaged and no post-processing is used. The horizontal line with w = 0.0 corresponds to a regular BERT (B) model, w = 0.5 is B + Avg., and w = 1.0 is the Avg. model. The white × marks the maximum value.
Figure A3. BERT + Avg. model individual task supervised classification performance dependence on the weight w of the Avg. model and the layer from which (for both models) representations are used. To the right of the black line on the horizontal axis, average aggregation of multiple layers is also shown. Tokens are simply averaged and no post-processing is used. The horizontal line with w = 0.0 corresponds to a regular BERT (B) model, w = 0.5 is B + Avg., and w = 1.0 is the Avg. model. The white × marks the maximum value.
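The combination varied in Figures A1–A3 can be sketched as an interpolation between contextual BERT token vectors from a chosen layer and static, context-free token vectors (the Avg. model), followed by plain token averaging. The sketch below is illustrative only: the Hugging Face model name, the placeholder avg_vectors tensor, and the helper sentence_embedding are our assumptions, and in practice the Avg. model vectors would be precomputed as corpus-averaged token representations.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

# Placeholder for the static "Avg." model: one vector per vocabulary token,
# in practice precomputed by averaging each token's representations over a corpus.
avg_vectors = torch.zeros(tokenizer.vocab_size, model.config.hidden_size)


def sentence_embedding(text, layer=1, w=0.5):
    # w = 0.0 -> plain BERT (B), w = 0.5 -> B + Avg., w = 1.0 -> Avg. model only.
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        contextual = model(**enc).hidden_states[layer][0]  # (seq_len, hidden)
    static = avg_vectors[enc["input_ids"][0]]              # (seq_len, hidden)
    mixed = (1.0 - w) * contextual + w * static
    return mixed.mean(dim=0)                               # simple token averaging

For instance, sentence_embedding("A quick example.", layer=1, w=0.5) would, under these assumptions, correspond to one cell of the w = 0.5 row in the heatmaps, up to how the real Avg. vectors are computed.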

References

  1. Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems; Burges, C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2013; Volume 26. [Google Scholar]
  2. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.U.; Polosukhin, I. Attention is All you Need. In Advances in Neural Information Processing Systems; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30. [Google Scholar]
  3. Hart, B.; Risley, T. The early catastrophe: The 30 million word gap by age 3. Am. Educ. 2003, 27, 4–9. [Google Scholar]
  4. Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  5. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar] [CrossRef]
  6. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 3–7 May 2021. [Google Scholar]
  7. Gulati, A.; Qin, J.; Chiu, C.C.; Parmar, N.; Zhang, Y.; Yu, J.; Han, W.; Wang, S.; Zhang, Z.; Wu, Y.; et al. Conformer: Convolution-augmented Transformer for Speech Recognition. arXiv 2020, arXiv:2005.08100. [Google Scholar] [CrossRef]
  8. Chen, L.; Lu, K.; Rajeswaran, A.; Lee, K.; Grover, A.; Laskin, M.; Abbeel, P.; Srinivas, A.; Mordatch, I. Decision Transformer: Reinforcement Learning via Sequence Modeling. In Advances in Neural Information Processing Systems; Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2021. [Google Scholar]
  9. Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 3982–3992. [Google Scholar] [CrossRef]
  10. Narang, S.; Chung, H.W.; Tay, Y.; Fedus, L.; Fevry, T.; Matena, M.; Malkan, K.; Fiedel, N.; Shazeer, N.; Lan, Z.; et al. Do Transformer Modifications Transfer Across Implementations and Applications? In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online and Punta Cana, Dominican Republic, 7–11 November 2021; pp. 5758–5773. [Google Scholar] [CrossRef]
  11. Belinkov, Y. Probing Classifiers: Promises, Shortcomings, and Advances. Comput. Linguist. 2022, 48, 207–219. [Google Scholar] [CrossRef]
  12. Conneau, A.; Kiela, D.; Schwenk, H.; Barrault, L.; Bordes, A. Supervised Learning of Universal Sentence Representations from Natural Language Inference Data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, 7–11 September 2017; pp. 670–680. [Google Scholar] [CrossRef]
  13. Su, J.; Cao, J.; Liu, W.; Ou, Y. Whitening Sentence Representations for Better Semantics and Faster Retrieval. arXiv 2021, arXiv:2103.15316. [Google Scholar] [CrossRef]
  14. Huang, J.; Tang, D.; Zhong, W.; Lu, S.; Shou, L.; Gong, M.; Jiang, D.; Duan, N. WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2021, Punta Cana, Dominican Republic, 16–20 November 2021; pp. 238–244. [Google Scholar] [CrossRef]
  15. Jiang, T.; Jiao, J.; Huang, S.; Zhang, Z.; Wang, D.; Zhuang, F.; Wei, F.; Huang, H.; Deng, D.; Zhang, Q. PromptBERT: Improving BERT Sentence Embeddings with Prompts. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 8826–8837. [Google Scholar]
  16. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space. arXiv 2013, arXiv:1301.3781. [Google Scholar] [CrossRef]
  17. Pennington, J.; Socher, R.; Manning, C.D. GloVe: Global Vectors for Word Representation. In Proceedings of the Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543. [Google Scholar]
  18. Pelletier, F.J. The principle of semantic compositionality. Topoi 1994, 13, 11–24. [Google Scholar] [CrossRef]
  19. Bos, J. Wide-Coverage Semantic Analysis with Boxer. In Proceedings of the Semantics in Text Processing, STEP 2008 Conference Proceedings; College Publications: Marshalls Creek, PA, USA, 2008; pp. 277–286. [Google Scholar]
  20. Montague, R. Universal grammar. Theoria 1970, 36, 373–398. [Google Scholar] [CrossRef]
  21. Bos, J. A Survey of Computational Semantics: Representation, Inference and Knowledge in Wide-Coverage Text Understanding. Lang. Linguist. Compass 2011, 5, 336–366. [Google Scholar] [CrossRef]
  22. Erk, K. Vector Space Models of Word Meaning and Phrase Meaning: A Survey. Lang. Linguist. Compass 2012, 6, 635–653. [Google Scholar] [CrossRef]
  23. Yoshikawa, M.; Mineshima, K.; Noji, H.; Bekki, D. Combining Axiom Injection and Knowledge Base Completion for Efficient Natural Language Inference. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 7410–7417. [Google Scholar] [CrossRef]
  24. Clark, S.; Pulman, S.G. Combining Symbolic and Distributional Models of Meaning. In Proceedings of the AAAI Spring Symposium: Quantum Interaction, Stanford, CA, USA, 26–28 March 2007. [Google Scholar]
  25. Bjerva, J.; Bos, J.; van der Goot, R.; Nissim, M. The Meaning Factory: Formal Semantics for Recognizing Textual Entailment and Determining Semantic Similarity. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), Dublin, Ireland, 23–24 August 2014; pp. 642–646. [Google Scholar] [CrossRef]
  26. Smolensky, P. Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artif. Intell. 1990, 46, 159–216. [Google Scholar] [CrossRef]
  27. Mitchell, J.; Lapata, M. Composition in Distributional Models of Semantics. Cogn. Sci. 2010, 34, 1388–1429. [Google Scholar] [CrossRef] [PubMed]
  28. Milajevs, D.; Kartsaklis, D.; Sadrzadeh, M.; Purver, M. Evaluating Neural Word Representations in Tensor-Based Compositional Settings. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 708–719. [Google Scholar] [CrossRef]
  29. Jones, M.N.; Mewhort, D.J. Representing word meaning and order information in a composite holographic lexicon. Psychol. Rev. 2007, 114, 1. [Google Scholar] [CrossRef]
  30. Widdows, D. Semantic Vector Products: Some Initial Investigations. In Proceedings of the Second AAAI Symposium on Quantum Interaction, Stanford, CA, USA, 26–28 March 2007; College Publications: Marshalls Creek, PA, USA, 2008. [Google Scholar]
  31. Plate, T. Holographic reduced representations. IEEE Trans. Neural Netw. 1995, 6, 623–641. [Google Scholar] [CrossRef] [PubMed]
  32. Baroni, M.; Zamparelli, R. Nouns are Vectors, Adjectives are Matrices: Representing Adjective-Noun Constructions in Semantic Space. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, Cambridge, MA, USA, 9–10 October 2010; pp. 1183–1193. [Google Scholar]
  33. Guevara, E. A Regression Model of Adjective-Noun Compositionality in Distributional Semantics. In Proceedings of the 2010 Workshop on GEometrical Models of Natural Language Semantics, Uppsala, Sweden, 16 July 2010; pp. 33–37. [Google Scholar]
  34. Grefenstette, E.; Dinu, G.; Zhang, Y.; Sadrzadeh, M.; Baroni, M. Multi-Step Regression Learning for Compositional Distributional Semantics. In Proceedings of the 10th International Conference on Computational Semantics (IWCS 2013)–Long Papers, Potsdam, Germany, 19–22 March 2013; pp. 131–142. [Google Scholar]
  35. Baroni, M.; Bernardi, R.; Zamparelli, R. Frege in Space: A Program for Composition Distributional Semantics. Linguist. Issues Lang. Technol. 2014, 9, 241–346. [Google Scholar] [CrossRef]
  36. Xing, C.; Wang, D.; Zhang, X.; Liu, C. Document classification with distributions of word vectors. In Proceedings of the Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2014 Asia-Pacific, Siem Reap, Cambodia, 9–12 December 2014; pp. 1–5. [Google Scholar] [CrossRef]
  37. Yu, L.; Hermann, K.M.; Blunsom, P.; Pulman, S. Deep Learning for Answer Sentence Selection. arXiv 2014, arXiv:1412.1632. [Google Scholar] [CrossRef]
  38. Lev, G.; Klein, B.; Wolf, L. In Defense of Word Embedding for Generic Text Representation. In Natural Language Processing and Information Systems; Biemann, C., Handschuh, S., Freitas, A., Meziane, F., Métais, E., Eds.; Springer International Publishing: Cham, Switzerland, 2015; pp. 35–50. [Google Scholar]
  39. Ritter, S.; Long, C.; Paperno, D.; Baroni, M.; Botvinick, M.; Goldberg, A. Leveraging Preposition Ambiguity to Assess Compositional Distributional Models of Semantics. In Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics, Denver, CO, USA, 4–5 June 2015; pp. 199–204. [Google Scholar] [CrossRef]
  40. White, L.; Togneri, R.; Liu, W.; Bennamoun, M. How Well Sentence Embeddings Capture Meaning. In Proceedings of the 20th Australasian Document Computing Symposium (ADCS ’15), Association for Computing Machinery, Parramatta, NSW, Australia, 8–9 December 2015. [Google Scholar] [CrossRef]
  41. Shen, D.; Wang, G.; Wang, W.; Min, M.R.; Su, Q.; Zhang, Y.; Li, C.; Henao, R.; Carin, L. Baseline Needs More Love: On Simple Word-Embedding-Based Models and Associated Pooling Mechanisms. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, 15–20 July 2018; pp. 440–450. [Google Scholar] [CrossRef]
  42. Wieting, J.; Bansal, M.; Gimpel, K.; Livescu, K. Towards Universal Paraphrastic Sentence Embeddings. arXiv 2015, arXiv:1511.08198. [Google Scholar] [CrossRef]
  43. Aldarmaki, H.; Diab, M. Evaluation of Unsupervised Compositional Representations. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, NM, USA, 20–26 August 2018; pp. 2666–2677. [Google Scholar]
  44. Iyyer, M.; Manjunatha, V.; Boyd-Graber, J.; Daumé, H., III. Deep Unordered Composition Rivals Syntactic Methods for Text Classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Beijing, China, 26–31 July 2015; pp. 1681–1691. [Google Scholar] [CrossRef]
  45. Pham, N.T.; Kruszewski, G.; Lazaridou, A.; Baroni, M. Jointly optimizing word representations for lexical and sentential tasks with the C-PHRASE model. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Beijing, China, 26–31 July 2015; pp. 971–981. [Google Scholar] [CrossRef]
  46. Wieting, J.; Gimpel, K. Revisiting Recurrent Networks for Paraphrastic Sentence Embeddings. arXiv 2017, arXiv:1705.00364. [Google Scholar] [CrossRef]
  47. Kenter, T.; Borisov, A.; de Rijke, M. Siamese CBOW: Optimizing Word Embeddings for Sentence Representations. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, 7–12 August 2016; pp. 941–951. [Google Scholar] [CrossRef]
  48. Pagliardini, M.; Gupta, P.; Jaggi, M. Unsupervised Learning of Sentence Embeddings Using Compositional n-Gram Features. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, LA, USA, 1–6 July 2018; pp. 528–540. [Google Scholar] [CrossRef]
  49. Gupta, P.; Jaggi, M. Obtaining Better Static Word Embeddings Using Contextual Embedding Models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, 1–6 August 2021; pp. 5241–5253. [Google Scholar] [CrossRef]
  50. Hill, F.; Cho, K.; Korhonen, A.; Bengio, Y. Learning to Understand Phrases by Embedding the Dictionary. Trans. Assoc. Comput. Linguist. 2016, 4, 17–30. [Google Scholar] [CrossRef]
  51. Joulin, A.; Grave, E.; Bojanowski, P.; Mikolov, T. Bag of Tricks for Efficient Text Classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, Valencia, Spain, 3–7 April 2017; pp. 427–431. [Google Scholar]
  52. Choi, H.; Kim, J.; Joe, S.; Gwon, Y. Evaluation of BERT and ALBERT Sentence Embedding Performance on Downstream NLP Tasks. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 5482–5487. [Google Scholar] [CrossRef]
  53. Bommasani, R.; Davis, K.; Cardie, C. Interpreting Pretrained Contextualized Representations via Reductions to Static Embeddings. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 4758–4781. [Google Scholar] [CrossRef]
  54. Yin, W.; Schütze, H. Learning Word Meta-Embeddings. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, 5–10 July 2016; pp. 1351–1360. [Google Scholar] [CrossRef]
  55. Coates, J.; Bollegala, D. Frustratingly Easy Meta-Embedding—Computing Meta-Embeddings by Averaging Source Word Embeddings. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), New Orleans, LA, USA, 1–6 June 2018; pp. 194–198. [Google Scholar] [CrossRef]
  56. Adi, Y.; Kermany, E.; Belinkov, Y.; Lavi, O.; Goldberg, Y. Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. In Proceedings of the 5th International Conference on Learning Representations, Toulon, France, 24–26 April 2017. [Google Scholar]
  57. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423. [Google Scholar] [CrossRef]
  58. Riedel, B.; Augenstein, I.; Spithourakis, G.P.; Riedel, S. A simple but tough-to-beat baseline for the Fake News Challenge stance detection task. arXiv 2017, arXiv:1707.03264. [Google Scholar] [CrossRef]
  59. Singh, P.; Mukerjee, A. Words are not Equal: Graded Weighting Model for Building Composite Document Vectors. In Proceedings of the 12th International Conference on Natural Language Processing, Trivandrum, India, 11–14 December 2015; pp. 11–19. [Google Scholar]
  60. Arora, S.; Liang, Y.; Ma, T. A simple but tough-to-beat baseline for sentence embeddings. In Proceedings of the 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, 24–26 April 2017. [Google Scholar]
  61. Ethayarajh, K. Unsupervised Random Walk Sentence Embeddings: A Strong but Simple Baseline. In Proceedings of the Third Workshop on Representation Learning for NLP, Melbourne, Australia, 20 July 2018; pp. 91–100. [Google Scholar] [CrossRef]
  62. Stankevičius, L.; Lukoševičius, M. Testing pre-trained transformer models for Lithuanian news clustering. In Proceedings of the CEUR Workshop Proceeding: IVUS 2020, CEUR-WS, Kaunas, Lithuania, 23 April 2020; Volume 2698, pp. 46–53. [Google Scholar]
  63. Yan, Y.; Li, R.; Wang, S.; Zhang, F.; Wu, W.; Xu, W. ConSERT: A Contrastive Framework for Self-Supervised Sentence Representation Transfer. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, 1–6 August 2021; pp. 5065–5075. [Google Scholar] [CrossRef]
  64. An, Y.; Kalinowski, A.; Greenberg, J. Clustering and Network Analysis for the Embedding Spaces of Sentences and Sub-Sentences. In Proceedings of the 2021 Second International Conference on Intelligent Data Science Technologies and Applications (IDSTA), Tartu, Estonia, 15–17 November 2021; pp. 138–145. [Google Scholar] [CrossRef]
  65. Yang, Z.; Zhu, C.; Chen, W. Parameter-free Sentence Embedding via Orthogonal Basis. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 638–648. [Google Scholar] [CrossRef]
  66. Wang, B.; Kuo, C.C.J. SBERT-WK: A Sentence Embedding Method by Dissecting BERT-Based Word Models. IEEE ACM Trans. Audio Speech Lang. Proc. 2020, 28, 2146–2157. [Google Scholar] [CrossRef]
  67. Schakel, A.M.J.; Wilson, B.J. Measuring Word Significance using Distributed Representations of Words. arXiv 2015, arXiv:1508.02297. [Google Scholar] [CrossRef]
  68. Arefyev, N.; Ermolaev, P.; Panchenko, A. How much does a word weigh? Weighting word embeddings for word sense induction. arXiv 2018, arXiv:1805.09209. [Google Scholar] [CrossRef]
  69. Luhn, H.P. The Automatic Creation of Literature Abstracts. IBM J. Res. Dev. 1958, 2, 159–165. [Google Scholar] [CrossRef]
  70. Yokoi, S.; Takahashi, R.; Akama, R.; Suzuki, J.; Inui, K. Word Rotator’s Distance. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 2944–2960. [Google Scholar] [CrossRef]
  71. Amiri, H.; Mohtarami, M. Vector of Locally Aggregated Embeddings for Text Representation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 1408–1414. [Google Scholar] [CrossRef]
  72. Gupta, V.; Karnick, H.; Bansal, A.; Jhala, P. Product Classification in E-Commerce using Distributional Semantics. In Proceedings of the COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Osaka, Japan, 11–16 December 2016; pp. 536–546. [Google Scholar]
  73. Kim, H.K.; Kim, H.; Cho, S. Bag-of-concepts: Comprehending document representation through clustering words in distributed representation. Neurocomputing 2017, 266, 336–352. [Google Scholar] [CrossRef]
  74. Guo, S.; Yao, N. Document Vector Extension for Documents Classification. IEEE Trans. Knowl. Data Eng. 2021, 33, 3062–3074. [Google Scholar] [CrossRef]
  75. Li, M.; Bai, H.; Tan, L.; Xiong, K.; Li, M.; Lin, J. Latte-Mix: Measuring Sentence Semantic Similarity with Latent Categorical Mixtures. arXiv 2020, arXiv:2010.11351. [Google Scholar] [CrossRef]
  76. Mekala, D.; Gupta, V.; Paranjape, B.; Karnick, H. SCDV: Sparse Composite Document Vectors using soft clustering over distributional representations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, 7–11 September 2017; pp. 659–669. [Google Scholar] [CrossRef]
  77. Gupta, V.; Saw, A.; Nokhiz, P.; Netrapalli, P.; Rai, P.; Talukdar, P. P-sif: Document embeddings using partition averaging. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 7863–7870. [Google Scholar]
  78. Aharon, M.; Elad, M.; Bruckstein, A. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans. Signal Process. 2006, 54, 4311–4322. [Google Scholar] [CrossRef]
  79. Gupta, V.; Saw, A.; Nokhiz, P.; Gupta, H.; Talukdar, P. Improving Document Classification with Multi-Sense Embeddings. In Proceedings of the European Conference on Artificial Intelligence, Santiago de Compostela, Spain, 29 August–8 September 2020. [Google Scholar]
  80. Gupta, A.; Gupta, V. Unsupervised Contextualized Document Representation. In Proceedings of the Second Workshop on Simple and Efficient Natural Language Processing, Virtual, 10 November 2021; pp. 166–173. [Google Scholar] [CrossRef]
  81. Mekala, D.; Shang, J. Contextualized Weak Supervision for Text Classification. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 323–333. [Google Scholar] [CrossRef]
  82. Ahmed, N.; Natarajan, T.; Rao, K. Discrete Cosine Transform. IEEE Trans. Comput. 1974, 100, 90–93. [Google Scholar] [CrossRef]
  83. Almarwani, N.; Aldarmaki, H.; Diab, M. Efficient Sentence Embedding using Discrete Cosine Transform. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 3672–3678. [Google Scholar] [CrossRef]
  84. Almarwani, N.; Diab, M. Discrete Cosine Transform as Universal Sentence Encoder. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Online, 1–6 August 2021; pp. 419–426. [Google Scholar] [CrossRef]
  85. Kayal, S.; Tsatsaronis, G. EigenSent: Spectral sentence embeddings using higher-order Dynamic Mode Decomposition. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 4536–4546. [Google Scholar] [CrossRef]
  86. Kim, T.; Yoo, K.M.; Lee, S.g. Self-Guided Contrastive Learning for BERT Sentence Representations. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, 1–6 August 2021; pp. 2528–2540. [Google Scholar] [CrossRef]
  87. Tanaka, H.; Shinnou, H.; Cao, R.; Bai, J.; Ma, W. Document Classification by Word Embeddings of BERT. In Computational Linguistics; Nguyen, L.M., Phan, X.H., Hasida, K., Tojo, S., Eds.; Springer: Singapore, 2020; pp. 145–154. [Google Scholar]
  88. Ma, X.; Wang, Z.; Ng, P.; Nallapati, R.; Xiang, B. Universal Text Representation from BERT: An Empirical Study. arXiv 2019, arXiv:1910.07973. [Google Scholar] [CrossRef]
  89. Kovaleva, O.; Romanov, A.; Rogers, A.; Rumshisky, A. Revealing the Dark Secrets of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 4365–4374. [Google Scholar] [CrossRef]
  90. Liu, P.; Yuan, W.; Fu, J.; Jiang, Z.; Hayashi, H.; Neubig, G. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. arXiv 2021, arXiv:2107.13586. [Google Scholar] [CrossRef]
  91. Zhong, Z.; Friedman, D.; Chen, D. Factual Probing Is [MASK]: Learning vs. Learning to Recall. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 5017–5033. [Google Scholar] [CrossRef]
  92. Wang, H.; Li, Y.; Huang, Z.; Dou, Y.; Kong, L.; Shao, J. SNCSE: Contrastive Learning for Unsupervised Sentence Embedding with Soft Negative Samples. arXiv 2022, arXiv:2201.05979. [Google Scholar] [CrossRef]
  93. Dalvi, F.; Khan, A.R.; Alam, F.; Durrani, N.; Xu, J.; Sajjad, H. Discovering Latent Concepts Learned in BERT. In Proceedings of the International Conference on Learning Representations, Online, 25–29 April 2022. [Google Scholar]
  94. Liu, N.F.; Gardner, M.; Belinkov, Y.; Peters, M.E.; Smith, N.A. Linguistic Knowledge and Transferability of Contextual Representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 1073–1094. [Google Scholar] [CrossRef]
  95. Tenney, I.; Xia, P.; Chen, B.; Wang, A.; Poliak, A.; McCoy, R.T.; Kim, N.; Durme, B.V.; Bowman, S.; Das, D.; et al. What do you learn from context? Probing for sentence structure in contextualized word representations. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  96. Muller, B.; Elazar, Y.; Sagot, B.; Seddah, D. First Align, then Predict: Understanding the Cross-Lingual Ability of Multilingual BERT. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Online, 19–23 April 2021; pp. 2214–2231. [Google Scholar] [CrossRef]
  97. Hämmerl, K.; Libovický, J.; Fraser, A. Combining Static and Contextualised Multilingual Embeddings. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, 22–27 May 2022; pp. 2316–2329. [Google Scholar] [CrossRef]
  98. Ethayarajh, K. How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 55–65. [Google Scholar] [CrossRef]
  99. Carlsson, F.; Gyllensten, A.C.; Gogoulou, E.; Hellqvist, E.Y.; Sahlgren, M. Semantic re-tuning with contrastive tension. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  100. Chung, H.W.; Fevry, T.; Tsai, H.; Johnson, M.; Ruder, S. Rethinking Embedding Coupling in Pre-trained Language Models. In Proceedings of the International Conference on Learning Representations, Virtual, 3–7 May 2021. [Google Scholar]
  101. Rogers, A.; Kovaleva, O.; Rumshisky, A. A Primer in BERTology: What We Know About How BERT Works. Trans. Assoc. Comput. Linguist. 2020, 8, 842–866. [Google Scholar] [CrossRef]
  102. Timkey, W.; van Schijndel, M. All Bark and No Bite: Rogue Dimensions in Transformer Language Models Obscure Representational Quality. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online and Punta Cana, Dominican Republic, 7–11 November 2021; pp. 4527–4546. [Google Scholar] [CrossRef]
  103. Sajjad, H.; Alam, F.; Dalvi, F.; Durrani, N. Effect of Post-processing on Contextualized Word Representations. arXiv 2021, arXiv:2104.07456. [Google Scholar] [CrossRef]
  104. Zhao, H.; Lu, Z.; Poupart, P. Self-Adaptive Hierarchical Sentence Model. In Proceedings of the 24th International Conference on Artificial Intelligence (IJCAI’15), Buenos Aires, Argentina, 25–31 July 2015; pp. 4069–4076. [Google Scholar]
  105. Yang, J.; Zhao, H. Deepening Hidden Representations from Pre-trained Language Models. arXiv 2021, arXiv:1911.01940. [Google Scholar]
  106. Peters, M.E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; Zettlemoyer, L. Deep Contextualized Word Representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, LA, USA, 1–6 June 2018; pp. 2227–2237. [Google Scholar] [CrossRef]
  107. Michael, J.; Botha, J.A.; Tenney, I. Asking without Telling: Exploring Latent Ontologies in Contextual Representations. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 6792–6812. [Google Scholar] [CrossRef]
  108. Shi, W.; Chen, M.; Zhou, P.; Chang, K.W. Retrofitting Contextualized Word Embeddings with Paraphrases. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 1198–1203. [Google Scholar] [CrossRef]
  109. Artetxe, M.; Schwenk, H. Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond. Trans. Assoc. Comput. Linguist. 2019, 7, 597–610. [Google Scholar] [CrossRef]
  110. Toshniwal, S.; Shi, H.; Shi, B.; Gao, L.; Livescu, K.; Gimpel, K. A Cross-Task Analysis of Text Span Representations. In Proceedings of the 5th Workshop on Representation Learning for NLP, Online, 9 July 2020; pp. 166–176. [Google Scholar] [CrossRef]
  111. Socher, R.; Pennington, J.; Huang, E.H.; Ng, A.Y.; Manning, C.D. Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, Edinburgh, Scotland, UK, 27–31 July 2011; pp. 151–161. [Google Scholar]
  112. Socher, R.; Perelygin, A.; Wu, J.; Chuang, J.; Manning, C.D.; Ng, A.; Potts, C. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, WA, USA, 18–21 October 2013; pp. 1631–1642. [Google Scholar]
  113. Salton, G.; Fox, E.A.; Wu, H. Extended Boolean Information Retrieval. Commun. ACM 1983, 26, 1022–1036. [Google Scholar] [CrossRef]
  114. Rücklé, A.; Eger, S.; Peyrard, M.; Gurevych, I. Concatenated Power Mean Word Embeddings as Universal Cross-Lingual Sentence Representations. arXiv 2018, arXiv:1803.01400. [Google Scholar]
  115. Lin, Z.; Feng, M.; dos Santos, C.N.; Yu, M.; Xiang, B.; Zhou, B.; Bengio, Y. A Structured Self-Attentive Sentence Embedding. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017. [Google Scholar]
  116. Eger, S.; Rücklé, A.; Gurevych, I. Pitfalls in the Evaluation of Sentence Embeddings. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), Florence, Italy, 2 August 2019; pp. 55–60. [Google Scholar] [CrossRef]
  117. Rudman, W.; Gillman, N.; Rayne, T.; Eickhoff, C. IsoScore: Measuring the Uniformity of Embedding Space Utilization. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, 22–27 May 2022; pp. 3325–3339. [Google Scholar] [CrossRef]
  118. Mu, J.; Viswanath, P. All-but-the-Top: Simple and Effective Postprocessing for Word Representations. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  119. Zhou, T.; Sedoc, J.; Rodu, J. Getting in Shape: Word Embedding SubSpaces. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, Macao, China, 10–16 August 2019; pp. 5478–5484. [Google Scholar] [CrossRef]
  120. Gao, J.; He, D.; Tan, X.; Qin, T.; Wang, L.; Liu, T. Representation Degeneration Problem in Training Natural Language Generation Models. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  121. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language Models are Unsupervised Multitask Learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
  122. Schnabel, T.; Labutov, I.; Mimno, D.; Joachims, T. Evaluation methods for unsupervised word embeddings. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015; pp. 298–307. [Google Scholar] [CrossRef]
  123. Li, B.; Zhou, H.; He, J.; Wang, M.; Yang, Y.; Li, L. On the Sentence Embeddings from Pre-trained Language Models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 9119–9130. [Google Scholar] [CrossRef]
  124. Biś, D.; Podkorytov, M.; Liu, X. Too Much in Common: Shifting of Embeddings in Transformer Language Models and its Implications. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 5117–5130. [Google Scholar] [CrossRef]
  125. Rajaee, S.; Pilehvar, M.T. An Isotropy Analysis in the Multilingual BERT Embedding Space. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, 22–27 May 2022; pp. 1309–1316. [Google Scholar] [CrossRef]
  126. Schick, T.; Schütze, H. Rare Words: A Major Problem for Contextualized Embeddings and How to Fix it by Attentive Mimicking. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 8766–8774. [Google Scholar] [CrossRef]
  127. Miller, G.A. WordNet: A Lexical Database for English. Commun. ACM 1995, 38, 39–41. [Google Scholar] [CrossRef]
  128. Lee, Y.Y.; Ke, H.; Huang, H.H.; Chen, H.H. Less is More: Filtering Abnormal Dimensions in GloVe. In Proceedings of the 25th International Conference Companion on World Wide Web, WWW ’16 Companion, Republic and Canton of Geneva, Switzerland, 11–15 April 2016; pp. 71–72. [Google Scholar] [CrossRef]
  129. Kovaleva, O.; Kulshreshtha, S.; Rogers, A.; Rumshisky, A. BERT Busters: Outlier Dimensions that Disrupt Transformers. In Proceedings of the Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Online, 1–6 August 2021; pp. 3392–3405. [Google Scholar] [CrossRef]
  130. Luo, Z.; Kulmizev, A.; Mao, X. Positional Artefacts Propagate Through Masked Language Model Embeddings. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, 1–6 August 2021; pp. 5312–5327. [Google Scholar] [CrossRef]
  131. Ding, Y.; Martinkus, K.; Pascual, D.; Clematide, S.; Wattenhofer, R. On Isotropy Calibration of Transformer Models. In Proceedings of the Third Workshop on Insights from Negative Results in NLP, Dublin, Ireland, 26 May 2022; pp. 1–9. [Google Scholar] [CrossRef]
  132. Cai, X.; Huang, J.; Bian, Y.; Church, K. Isotropy in the Contextual Embedding Space: Clusters and Manifolds. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 4 May 2021. [Google Scholar]
  133. Rajaee, S.; Pilehvar, M.T. A Cluster-based Approach for Improving Isotropy in Contextual Embedding Space. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Online, 1–6 August 2021; pp. 575–584. [Google Scholar] [CrossRef]
  134. Rajaee, S.; Pilehvar, M.T. How Does Fine-tuning Affect the Geometry of Embedding Space: A Case Study on Isotropy. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2021, Punta Cana, Dominican Republic, 16–20 November 2021; pp. 3042–3049. [Google Scholar] [CrossRef]
  135. Artetxe, M.; Labaka, G.; Lopez-Gazpio, I.; Agirre, E. Uncovering Divergent Linguistic Information in Word Embeddings with Lessons for Intrinsic and Extrinsic Evaluation. In Proceedings of the 22nd Conference on Computational Natural Language Learning, Brussels, Belgium, 31 October–1 November 2018; pp. 282–291. [Google Scholar] [CrossRef]
  136. Wang, B.; Chen, F.; Wang, A.; Kuo, C.C.J. Post-Processing of Word Representations via Variance Normalization and Dynamic Embedding. In Proceedings of the 2019 IEEE International Conference on Multimedia and Expo (ICME), Shanghai, China, 8–12 July 2019; pp. 718–723. [Google Scholar] [CrossRef]
  137. Liu, T.; Ungar, L.; Sedoc, J.A. Unsupervised Post-Processing of Word Vectors via Conceptor Negation. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence (AAAI’19/IAAI’19/EAAI’19), Honolulu, HI, USA, 27 January–1 February 2019. [Google Scholar] [CrossRef]
  138. Jaeger, H. Controlling Recurrent Neural Networks by Conceptors. Technical Report No. 31, Jacobs University Bremen. arXiv 2014, arXiv:1403.3369. [Google Scholar] [CrossRef]
  139. Karve, S.; Ungar, L.; Sedoc, J. Conceptor Debiasing of Word Representations Evaluated on WEAT. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, Florence, Italy, 2 August 2019; pp. 40–48. [Google Scholar] [CrossRef]
  140. Liang, Y.; Cao, R.; Zheng, J.; Ren, J.; Gao, L. Learning to Remove: Towards Isotropic Pre-Trained BERT Embedding. In Proceedings of the Artificial Neural Networks and Machine Learning—ICANN 2021: 30th International Conference on Artificial Neural Networks, Proceedings, Part V, Bratislava, Slovakia, 14–17 September 2021; Springer: Berlin/Heidelberg, Germany, 2021; pp. 448–459. [Google Scholar] [CrossRef]
  141. Raunak, V.; Gupta, V.; Metze, F. Effective Dimensionality Reduction for Word Embeddings. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), Florence, Italy, 2 August 2019; pp. 235–243. [Google Scholar] [CrossRef]
  142. Artetxe, M.; Labaka, G.; Agirre, E. Generalizing and Improving Bilingual Word Embedding Mappings with a Multi-Step Framework of Linear Transformations. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar] [CrossRef]
  143. Gao, T.; Yao, X.; Chen, D. SimCSE: Simple Contrastive Learning of Sentence Embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online and Punta Cana, Dominican Republic, 7–11 November 2021; pp. 6894–6910. [Google Scholar] [CrossRef]
  144. Wang, T.; Isola, P. Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere. In Proceedings of the 37th International Conference on Machine Learning, Virtual, 13–18 July 2020; Volume 119, pp. 9929–9939. [Google Scholar]
  145. Wang, B.; Kuo, C.C.J.; Li, H. Just Rank: Rethinking Evaluation with Word and Sentence Similarities. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; pp. 6060–6077. [Google Scholar] [CrossRef]
  146. Faruqui, M.; Dodge, J.; Jauhar, S.K.; Dyer, C.; Hovy, E.; Smith, N.A. Retrofitting Word Vectors to Semantic Lexicons. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, CO, USA, 31 May–5 June 2015; pp. 1606–1615. [Google Scholar] [CrossRef]
  147. Yu, Z.; Cohen, T.; Wallace, B.; Bernstam, E.; Johnson, T. Retrofitting Word Vectors of MeSH Terms to Improve Semantic Similarity Measures. In Proceedings of the Seventh International Workshop on Health Text Mining and Information Analysis, Austin, TX, USA, 5 November 2016; pp. 43–51. [Google Scholar] [CrossRef]
  148. Zhang, M.; Fujinuma, Y.; Paul, M.J.; Boyd-Graber, J. Why Overfitting Isn’t Always Bad: Retrofitting Cross-Lingual Word Embeddings to Dictionaries. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 2214–2220. [Google Scholar] [CrossRef]
  149. Glavaš, G.; Vulić, I. Explicit Retrofitting of Distributional Word Vectors. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, 15–20 July 2018; pp. 34–45. [Google Scholar] [CrossRef]
  150. Zheng, J.; Wang, Y.; Wang, G.; Xia, J.; Huang, Y.; Zhao, G.; Zhang, Y.; Li, S. Using Context-to-Vector with Graph Retrofitting to Improve Word Embeddings. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; pp. 8154–8163. [Google Scholar] [CrossRef]
  151. Wang, L.; Huang, J.; Huang, K.; Hu, Z.; Wang, G.; Gu, Q. Improving Neural Language Generation with Spectrum Control. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  152. Tamkin, A.; Jurafsky, D.; Goodman, N. Language Through a Prism: A Spectral Approach for Multiscale Language Representations. In Advances in Neural Information Processing Systems; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 5492–5504. [Google Scholar]
  153. Zhelezniak, V.; Savkov, A.; Hammerla, N. Estimating Mutual Information Between Dense Word Embeddings. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 8361–8371. [Google Scholar] [CrossRef]
  154. Kraskov, A.; Stögbauer, H.; Grassberger, P. Estimating mutual information. Phys. Rev. E 2004, 69, 066138. [Google Scholar] [CrossRef] [PubMed]
  155. Zhelezniak, V.; Savkov, A.; Shen, A.; Hammerla, N. Correlation Coefficients and Semantic Textual Similarity. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 951–962. [Google Scholar] [CrossRef]
  156. Zhelezniak, V.; Savkov, A.; Shen, A.; Moramarco, F.; Flann, J.; Hammerla, N.Y. Don’t Settle for Average, Go for the Max: Fuzzy Sets and Max-Pooled Word Vectors. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  157. Min, C.; Chu, Y.; Yang, L.; Xu, B.; Lin, H. Locality Preserving Sentence Encoding. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2021, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 3050–3060. [Google Scholar] [CrossRef]
  158. Kayal, S. Unsupervised Sentence-embeddings by Manifold Approximation and Projection. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics, Main Volume, Online, 19–23 April 2021; pp. 1–11. [Google Scholar] [CrossRef]
  159. Kusner, M.; Sun, Y.; Kolkin, N.; Weinberger, K. From Word Embeddings To Document Distances. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 6–11 July 2015; Volume 37, pp. 957–966. [Google Scholar]
  160. Wu, L.; Yen, I.E.H.; Xu, K.; Xu, F.; Balakrishnan, A.; Chen, P.Y.; Ravikumar, P.; Witbrock, M.J. Word Mover’s Embedding: From Word2Vec to Document Embedding. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 4524–4534. [Google Scholar] [CrossRef]
  161. Wang, Z.; Zhang, Y.; Wu, H. Structural-Aware Sentence Similarity with Recursive Optimal Transport. arXiv 2020, arXiv:2002.00745. [Google Scholar] [CrossRef]
  162. Le, Q.; Mikolov, T. Distributed Representations of Sentences and Documents. In Proceedings of the 31st International Conference on Machine Learning, Bejing, China, 21–26 June 2014; Volume 32, pp. 1188–1196. [Google Scholar]
  163. Lau, J.H.; Baldwin, T. An Empirical Evaluation of Doc2Vec with Practical Insights into Document Embedding Generation. In Proceedings of the 1st Workshop on Representation Learning for NLP, Berlin, Germany, 11 August 2016; pp. 78–86. [Google Scholar] [CrossRef]
  164. Hill, F.; Cho, K.; Korhonen, A. Learning Distributed Representations of Sentences from Unlabelled Data. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA, USA, 12–17 June 2016; pp. 1367–1377. [Google Scholar] [CrossRef]
  165. Chen, M. Efficient Vector Representation for Documents through Corruption. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017. [Google Scholar]
  166. Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems; Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., Weinberger, K., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2014; Volume 27. [Google Scholar]
  167. Schuster, M.; Paliwal, K. Bidirectional recurrent neural networks. IEEE Trans. Signal Process. 1997, 45, 2673–2681. [Google Scholar] [CrossRef]
  168. Kiros, R.; Zhu, Y.; Salakhutdinov, R.R.; Zemel, R.; Urtasun, R.; Torralba, A.; Fidler, S. Skip-Thought Vectors. In Advances in Neural Information Processing Systems; Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2015; Volume 28. [Google Scholar]
  169. Cho, K.; van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1724–1734. [Google Scholar] [CrossRef]
  170. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  171. Bowman, S.R.; Angeli, G.; Potts, C.; Manning, C.D. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015; pp. 632–642. [Google Scholar] [CrossRef]
  172. Logeswaran, L.; Lee, H. An efficient framework for learning sentence representations. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  173. Subramanian, S.; Trischler, A.; Bengio, Y.; Pal, C.J. Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  174. Nie, A.; Bennett, E.; Goodman, N. DisSent: Learning Sentence Representations from Explicit Discourse Relations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 4497–4510. [Google Scholar] [CrossRef]
  175. Cer, D.; Yang, Y.; Kong, S.Y.; Hua, N.; Limtiaco, N.; St. John, R.; Constant, N.; Guajardo-Cespedes, M.; Yuan, S.; Tar, C.; et al. Universal Sentence Encoder for English. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Brussels, Belgium, 31 October–4 November 2018; pp. 169–174. [Google Scholar] [CrossRef]
  176. Henderson, M.; Al-Rfou, R.; Strope, B.; hsuan Sung, Y.; Lukacs, L.; Guo, R.; Kumar, S.; Miklos, B.; Kurzweil, R. Efficient Natural Language Response Suggestion for Smart Reply. arXiv 2017, arXiv:1705.00652. [Google Scholar]
  177. Cao, L.; Larsson, E.; von Ehrenheim, V.; Cavalcanti Rocha, D.D.; Martin, A.; Horn, S. PAUSE: Positive and Annealed Unlabeled Sentence Embedding. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online and Punta Cana, Dominican Republic, 7–11 November 2021; pp. 10096–10107. [Google Scholar] [CrossRef]
  178. Hadsell, R.; Chopra, S.; LeCun, Y. Dimensionality Reduction by Learning an Invariant Mapping. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), New York, NY, USA, 17–22 June 2006; Volume 2, pp. 1735–1742. [Google Scholar] [CrossRef]
  179. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958. [Google Scholar]
  180. Liu, F.; Vulić, I.; Korhonen, A.; Collier, N. Fast, Effective, and Self-Supervised: Transforming Masked Language Models into Universal Lexical and Sentence Encoders. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online and Punta Cana, Dominican Republic, 7–11 November 2021; pp. 1442–1459. [Google Scholar] [CrossRef]
  181. Wu, X.; Gao, C.; Zang, L.; Han, J.; Wang, Z.; Hu, S. ESimCSE: Enhanced Sample Building Method for Contrastive Learning of Unsupervised Sentence Embedding. arXiv 2021, arXiv:2109.04380. [Google Scholar] [CrossRef]
  182. Klein, T.; Nabi, M. SCD: Self-Contrastive Decorrelation for Sentence Embeddings. arXiv 2022, arXiv:2203.07847. [Google Scholar]
  183. Shen, D.; Zheng, M.; Shen, Y.; Qu, Y.; Chen, W. A Simple but Tough-to-Beat Data Augmentation Approach for Natural Language Understanding and Generation. arXiv 2020, arXiv:2009.13818. [Google Scholar] [CrossRef]
  184. Rozsa, A.; Rudd, E.M.; Boult, T.E. Adversarial diversity and hard positive generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 25–32. [Google Scholar]
  185. Cao, R.; Wang, Y.; Liang, Y.; Gao, L.; Zheng, J.; Ren, J.; Wang, Z. Exploring the Impact of Negative Samples of Contrastive Learning: A Case Study of Sentence Embedding. arXiv 2022, arXiv:2202.13093. [Google Scholar] [CrossRef]
  186. Goodfellow, I.J.; Shlens, J.; Szegedy, C. Explaining and Harnessing Adversarial Examples. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  187. Wei, J.; Zou, K. EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 6382–6388. [Google Scholar] [CrossRef]
  188. Su, P.; Peng, Y.; Vijay-Shanker, K. Improving BERT Model Using Contrastive Learning for Biomedical Relation Extraction. In Proceedings of the 20th Workshop on Biomedical Language Processing, Association for Computational Linguistics, Online, 11 June 2021; pp. 1–10. [Google Scholar] [CrossRef]
  189. Fang, H.; Wang, S.; Zhou, M.; Ding, J.; Xie, P. CERT: Contrastive Self-supervised Learning for Language Understanding. arXiv 2020, arXiv:2005.12766. [Google Scholar] [CrossRef]
  190. Edunov, S.; Ott, M.; Auli, M.; Grangier, D. Understanding Back-Translation at Scale. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 489–500. [Google Scholar] [CrossRef]
  191. Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; Bowman, S. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Brussels, Belgium, 1 November 2018; pp. 353–355. [Google Scholar] [CrossRef]
  192. Wu, Z.; Wang, S.; Gu, J.; Khabsa, M.; Sun, F.; Ma, H. CLEAR: Contrastive Learning for Sentence Representation. arXiv 2020, arXiv:2012.15466. [Google Scholar]
  193. Giorgi, J.; Nitski, O.; Wang, B.; Bader, G. DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, 1–6 August 2021; pp. 879–895. [Google Scholar] [CrossRef]
  194. Conneau, A.; Kiela, D. SentEval: An Evaluation Toolkit for Universal Sentence Representations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 7–12 May 2018. [Google Scholar]
  195. Wu, Q.; Tao, C.; Shen, T.; Xu, C.; Geng, X.; Jiang, D. PCL: Peer-Contrastive Learning with Diverse Augmentations for Unsupervised Sentence Embeddings. arXiv 2022, arXiv:2201.12093. [Google Scholar] [CrossRef]
  196. Wu, X.; Gao, C.; Wang, J.; Zang, L.; Wang, Z.; Hu, S. DisCo: Effective Knowledge Distillation For Contrastive Learning of Sentence Embeddings. arXiv 2021, arXiv:2112.05638. [Google Scholar] [CrossRef]
  197. Williams, A.; Nangia, N.; Bowman, S. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, LA, USA, 1–6 June 2018; pp. 1112–1122. [Google Scholar] [CrossRef]
  198. Zhang, D.; Li, S.W.; Xiao, W.; Zhu, H.; Nallapati, R.; Arnold, A.O.; Xiang, B. Pairwise Supervised Contrastive Learning of Sentence Representations. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online and Punta Cana, Dominican Republic, 7–11 November 2021; pp. 5786–5798. [Google Scholar] [CrossRef]
  199. Li, S.; Hu, X.; Lin, L.; Wen, L. Pair-Level Supervised Contrastive Learning for Natural Language Inference. arXiv 2022, arXiv:2201.10927. [Google Scholar] [CrossRef]
  200. Ni, J.; Ábrego, G.H.; Constant, N.; Ma, J.; Hall, K.B.; Cer, D.; Yang, Y. Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models. arXiv 2021, arXiv:2108.08877. [Google Scholar] [CrossRef]
  201. Merity, S.; Xiong, C.; Bradbury, J.; Socher, R. Pointer Sentinel Mixture Models. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017. [Google Scholar]
  202. Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, 16–20 November 2020; pp. 38–45. [Google Scholar] [CrossRef]
  203. Zhu, Y.; Kiros, R.; Zemel, R.; Salakhutdinov, R.; Urtasun, R.; Torralba, A.; Fidler, S. Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Los Alamitos, CA, USA, 7–13 December 2015; pp. 19–27. [Google Scholar] [CrossRef]
  204. Wu, Y.; Schuster, M.; Chen, Z.; Le, Q.V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; et al. Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv 2016, arXiv:1609.08144. [Google Scholar]
  205. Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer Normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar]
  206. Taylor, W.L. “Cloze Procedure”: A New Tool for Measuring Readability. J. Q. 1953, 30, 415–433. [Google Scholar] [CrossRef]
  207. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  208. Bolstad, B.; Irizarry, R.; Åstrand, M.; Speed, T. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 2003, 19, 185–193. [Google Scholar] [CrossRef]
  209. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
  210. Zhang, X.; LeCun, Y. Text Understanding from Scratch. arXiv 2015, arXiv:1502.01710. [Google Scholar] [CrossRef]
  211. Rakib, M.R.H.; Zeh, N.; Jankowska, M.; Milios, E. Enhancement of Short Text Clustering by Iterative Classification. In Natural Language Processing and Information Systems; Métais, E., Meziane, F., Horacek, H., Cimiano, P., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 105–117. [Google Scholar]
  212. Xu, J.; Xu, B.; Wang, P.; Zheng, S.; Tian, G.; Zhao, J.; Xu, B. Self-Taught convolutional neural networks for short text clustering. Neural Netw. 2017, 88, 22–31. [Google Scholar] [CrossRef] [PubMed]
  213. Yin, J.; Wang, J. A model-based approach for text clustering with outlier detection. In Proceedings of the 2016 IEEE 32nd International Conference on Data Engineering (ICDE), Helsinki, Finland, 16–20 May 2016; pp. 625–636. [Google Scholar] [CrossRef]
  214. Phan, X.H.; Nguyen, L.M.; Horiguchi, S. Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-Scale Data Collections. In Proceedings of the 17th International Conference on World Wide Web; Association for Computing Machinery, WWW ’08, New York, NY, USA, 17–22 May 2008; pp. 91–100. [Google Scholar] [CrossRef]
  215. Munkres, J. Algorithms for the Assignment and Transportation Problems. J. Soc. Ind. Appl. Math. 1957, 5, 32–38. [Google Scholar] [CrossRef]
  216. Agirre, E.; Cer, D.; Diab, M.; Gonzalez-Agirre, A. SemEval-2012 Task 6: A Pilot on Semantic Textual Similarity. In Proceedings of the *SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), Montréal, QC, Canada, 7–8 June 2012; pp. 385–393. [Google Scholar]
  217. Agirre, E.; Cer, D.; Diab, M.; Gonzalez-Agirre, A.; Guo, W. *SEM 2013 shared task: Semantic Textual Similarity. In Proceedings of the Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, Atlanta, GA, USA, 13–14 June 2013; pp. 32–43. [Google Scholar]
  218. Agirre, E.; Banea, C.; Cardie, C.; Cer, D.; Diab, M.; Gonzalez-Agirre, A.; Guo, W.; Mihalcea, R.; Rigau, G.; Wiebe, J. SemEval-2014 Task 10: Multilingual Semantic Textual Similarity. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), Dublin, Ireland, 23–24 August 2014; pp. 81–91. [Google Scholar] [CrossRef]
  219. Agirre, E.; Banea, C.; Cardie, C.; Cer, D.; Diab, M.; Gonzalez-Agirre, A.; Guo, W.; Lopez-Gazpio, I.; Maritxalar, M.; Mihalcea, R.; et al. SemEval-2015 Task 2: Semantic Textual Similarity, English, Spanish and Pilot on Interpretability. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), Denver, CO, USA, 4–5 June 2015; pp. 252–263. [Google Scholar] [CrossRef]
  220. Agirre, E.; Banea, C.; Cer, D.; Diab, M.; Gonzalez-Agirre, A.; Mihalcea, R.; Rigau, G.; Wiebe, J. SemEval-2016 Task 1: Semantic Textual Similarity, Monolingual and Cross-Lingual Evaluation. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), San Diego, CA, USA, 16–17 June 2016; pp. 497–511. [Google Scholar] [CrossRef]
  221. Cer, D.; Diab, M.; Agirre, E.; Lopez-Gazpio, I.; Specia, L. SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Vancouver, BC, Canada, 3–4 August 2017; pp. 1–14. [Google Scholar] [CrossRef]
  222. Marelli, M.; Menini, S.; Baroni, M.; Bentivogli, L.; Bernardi, R.; Zamparelli, R. A SICK cure for the evaluation of compositional distributional semantic models. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland, 26–31 May 2014; pp. 216–223. [Google Scholar]
  223. Abdalla, M.; Vishnubhotla, K.; Mohammad, S.M. What Makes Sentences Semantically Related: A Textual Relatedness Dataset and Empirical Study. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Dubrovnik, Croatia, 2–6 May 2023. [Google Scholar]
  224. Pang, B.; Lee, L. Seeing Stars: Exploiting Class Relationships for Sentiment Categorization with Respect to Rating Scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), Ann Arbor, MI, USA, 25–30 June 2005; pp. 115–124. [Google Scholar] [CrossRef]
  225. Hu, M.; Liu, B. Mining and Summarizing Customer Reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 22–24 August 2004; KDD ’04. pp. 168–177. [Google Scholar] [CrossRef]
  226. Wiebe, J.; Wilson, T.; Cardie, C. Annotating expressions of opinions and emotions in language. Lang. Resour. Eval. 2005, 39, 165–210. [Google Scholar] [CrossRef]
  227. Deng, L.; Wiebe, J. MPQA 3.0: An entity/event-level sentiment corpus. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, CO, USA, 31 May–5 June 2015; pp. 1323–1328. [Google Scholar]
  228. Dolan, B.; Quirk, C.; Brockett, C. Unsupervised Construction of Large Paraphrase Corpora: Exploiting Massively Parallel News Sources. In Proceedings of the COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics, Geneva, Switzerland, 23–27 August 2004; pp. 350–356. [Google Scholar]
  229. Cohan, A.; Ammar, W.; van Zuylen, M.; Cady, F. Structural Scaffolds for Citation Intent Classification in Scientific Publications. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 3586–3596. [Google Scholar] [CrossRef]
  230. Li, X.; Roth, D. Learning Question Classifiers. In Proceedings of the COLING 2002: The 19th International Conference on Computational Linguistics, Taipei, Taiwan, 26–30 August 2002. [Google Scholar]
  231. Voita, E.; Sennrich, R.; Titov, I. The Bottom-up Evolution of Representations in the Transformer: A Study with Machine Translation and Language Modeling Objectives. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 4396–4406. [Google Scholar] [CrossRef]
  232. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.T.; Rocktäschel, T.; et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020; NIPS ’20. Curran Associates Inc.: Red Hook, NY, USA, 2020. [Google Scholar]
Figure 1. Relation between average Spearman correlation for STS tasks and IsoScore of Wikitext representations for each model. Pearson correlation coefficients are shown.
Figure 2. Relation between average clustering accuracy and IsoScore of Wikitext representations for each model. Pearson correlation coefficients are shown.
Figure 3. Alignment and uniformity of representations in relation to various token pooling and post-processing techniques. Lower values are better.
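The alignment and uniformity measures in Figure 3 are commonly computed following the formulation of Wang and Isola. The sketch below assumes that standard formulation (not necessarily the exact script behind the figure) and uses random unit vectors as hypothetical stand-ins for embeddings of positive sentence pairs.

import torch

def alignment(x: torch.Tensor, y: torch.Tensor, alpha: int = 2) -> torch.Tensor:
    # Mean distance between embeddings of positive pairs; lower is better.
    return (x - y).norm(p=2, dim=1).pow(alpha).mean()

def uniformity(x: torch.Tensor, t: int = 2) -> torch.Tensor:
    # Log of the mean Gaussian potential over all embedding pairs; lower is better.
    return torch.pdist(x, p=2).pow(2).mul(-t).exp().mean().log()

# Toy usage: random unit vectors standing in for embeddings of paired sentences.
a = torch.nn.functional.normalize(torch.randn(128, 768), dim=1)
b = torch.nn.functional.normalize(torch.randn(128, 768), dim=1)
print(alignment(a, b).item(), uniformity(torch.cat([a, b])).item())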
Figure 4. Layer-wise performance of templated models T0 (a,d,g) and T4 (b,e,h), as well as BERT versus RE with no weighting or post-processing (c,f,i). The average performance on STS (a–c), clustering (d–f), and classification (g–i) tasks is shown by the lines, while the shaded areas correspond to the standard deviation. We also show first + last aggregation over layers as the last tick, 1 + 12, on the horizontal axis.
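A minimal sketch of the "first + last" layer aggregation shown in the last tick of Figure 4, assuming the Hugging Face transformers API; the pooling details (plain masked token averaging, and treating hidden_states[1] as the "first" layer) are illustrative assumptions rather than the paper's exact configuration.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

def first_plus_last_embedding(sentence: str) -> torch.Tensor:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden_states = model(**inputs).hidden_states  # embedding output + 12 layers
    # Treat hidden_states[1] as the first transformer layer and [-1] as the last.
    layer_avg = (hidden_states[1] + hidden_states[-1]) / 2
    mask = inputs["attention_mask"].unsqueeze(-1)
    return (layer_avg * mask).sum(dim=1) / mask.sum(dim=1)  # masked token average

print(first_plus_last_embedding("A quick example sentence.").shape)  # torch.Size([1, 768])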
Figure 5. BERT + Avg. model performance as a function of the weight w of the Avg. model and of the layer from which (for both models) representations are taken. To the right of the black line on the horizontal axis, average aggregation over multiple layers is also shown. Tokens are simply averaged, and no post-processing is used. The horizontal line at w = 0.0 corresponds to the regular BERT (B) model, w = 0.5 to B + Avg., and w = 1.0 to the Avg. model. The white × marks the maximum value.
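A minimal sketch of the weighted combination varied in Figure 5, assuming it is a convex interpolation between the two models' sentence vectors; bert_embedding and avg_model_embedding are hypothetical callables returning equally sized vectors, not functions from the paper's code.

import numpy as np

def combine(sentence: str, w: float, bert_embedding, avg_model_embedding) -> np.ndarray:
    # w = 0.0 -> regular BERT (B), w = 0.5 -> B + Avg., w = 1.0 -> Avg. model only.
    return (1.0 - w) * bert_embedding(sentence) + w * avg_model_embedding(sentence)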
Table 1. Manual template search by adding additional text and [MASK] ([M]) tokens. Target sentences are inserted in place of [X]. Only the average of the 12th-layer [M] representations is used. Bolded results are the best.
No. | Template | STS | Clust. | Class.
T0 | This sentence: “[X]” means [M]. | 63.4 | 45.5 | 78.7
T1 | This sentence: “[X]” means [M][M]. | 63.4 | 46.6 | 78.6
T2 | This sentence: “[X]” means “[M][M]” and is about [M]. | 70.4 | 52.8 | 78.3
T3 | This sentence from the paraphrase dictionary: “[X]” means “[M]”, which is about [M]. | 69.6 | 54.2 | 77.4
T4 | This sentence from the dictionary: “[X]” means “[M]” and is about [M], which is a synonym for [M]. | 69.3 | 54.2 | 76.8
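To illustrate how such templates can be applied, the sketch below wraps a target sentence in template T0 and averages the 12th-layer hidden states at the [MASK] position(s), assuming the Hugging Face transformers API; it is an illustrative reconstruction, not the authors' exact pipeline.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

TEMPLATE = 'This sentence: "{}" means {}.'  # template T0 with a single [M] slot

def template_embedding(sentence: str) -> torch.Tensor:
    text = TEMPLATE.format(sentence, tokenizer.mask_token)
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        layer12 = model(**inputs).hidden_states[12]          # (1, seq_len, 768)
    mask_positions = inputs["input_ids"][0] == tokenizer.mask_token_id
    return layer12[0, mask_positions].mean(dim=0)            # average over [M] tokens

print(template_embedding("A man is playing a guitar.").shape)  # torch.Size([768])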
Table 2. Dataset statistics for the short text clustering datasets. N is the number of text samples, C is the number of clusters, L/S is the imbalance number defined as the size of the largest class divided by that of the smallest class, V is vocabulary size, Len is the average number of tokens in each text sample, Alpha is the percent of tokens that are alphabetic (all token characters must be defined in the Unicode character database as “Letter”, while tokens with numbers, punctuation, or BERT continuation tokens such as “##ing” are excluded). Statistics with plain word tokenization are marked with “W” and with bert-base-uncased model tokenizer as “B”.
Dataset | N | C | L/S | V (W) | V (B) | Len (W) | Len (B) | Alpha (W), % | Alpha (B), %
agnews | 8000 | 4 | 1 | 21,062 | 16,140 | 23 | 26 | 100 | 86
biomedical | 20,000 | 20 | 1 | 18,888 | 9326 | 13 | 20 | 98 | 64
googleTS | 11,109 | 152 | 143 | 19,508 | 14,763 | 28 | 33 | 100 | 85
searchsnippets | 12,340 | 8 | 7 | 30,643 | 16,334 | 19 | 24 | 93 | 77
stackoverflow | 20,000 | 20 | 1 | 22,909 | 7332 | 8 | 12 | 87 | 71
tweet | 2472 | 89 | 249 | 5098 | 5091 | 9 | 11 | 100 | 81
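As a rough illustration of the Alpha statistic described in the caption of Table 2, the sketch below counts purely alphabetic word pieces produced by the bert-base-uncased tokenizer; the authors' exact counting script may differ.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def alpha_percent(texts: list[str]) -> float:
    tokens = [tok for text in texts for tok in tokenizer.tokenize(text)]
    # A token counts as alphabetic if it is not a "##" continuation piece and
    # every character is a Unicode letter (str.isalpha checks the Letter category).
    alphabetic = [tok for tok in tokens
                  if not tok.startswith("##") and tok.isalpha()]
    return 100.0 * len(alphabetic) / len(tokens)

print(round(alpha_percent(["2 cats sat, unsurprisingly, on 1 mat."]), 1))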
Table 3. Dataset statistics for STS and classification tasks. N is the number of text samples, C is the number of clusters, L/S is the imbalance number, defined as the size of the largest class divided by that of the smallest class (for STS tasks, the two classes are the 1-point-wide label bins at the highest and lowest ends of the score range), V is vocabulary size, Len is the average number of tokens in each text sample, Alpha is the percent of tokens that are alphabetic. Statistics with plain word tokenization are marked with “W” and with the bert-base-uncased model tokenizer with “B”. For MRPC, SICK-R/E, STS-B, and STS tasks, statistics are calculated for tokenized and then concatenated sentences in each pair.
Dataset | Split | N | C | L/S | V (W) | V (B) | Len (W) | Len (B) | Alpha (W), % | Alpha (B), %
STS tasks
STS12 | – | 3108 | – | 5.2 | 8127 | 7802 | 25 | 28 | 83 | 77
STS13 | – | 1500 | – | 0.7 | 5152 | 5141 | 20 | 21 | 88 | 82
STS14 | – | 3750 | – | 1.6 | 9117 | 8613 | 21 | 23 | 86 | 80
STS15 | – | 3000 | – | 0.9 | 7364 | 7185 | 23 | 24 | 89 | 85
STS16 | – | 1186 | – | 1.2 | 3971 | 4175 | 26 | 28 | 87 | 83
STR | – | 5500 | – | 1.0 | 22,392 | 12,883 | 25 | 32 | 83 | 76
Binary classification
MR | – | 10,662 | 2 | 1.0 | 20,325 | 13,802 | 22 | 26 | 84 | 76
CR | – | 3775 | 2 | 1.8 | 5675 | 5221 | 20 | 22 | 84 | 80
SUBJ | – | 10,000 | 2 | 1.0 | 22,636 | 15,912 | 25 | 28 | 85 | 78
MPQA | – | 10,606 | 2 | 2.2 | 6239 | 6248 | 3 | 3 | 97 | 88
SST2 | train | 67,349 | 2 | 1.3 | 14,816 | 11,570 | 9 | 11 | 88 | 78
SST2 | dev | 872 | 2 | 1.0 | 4339 | 4542 | 20 | 23 | 85 | 76
SST2 | test | 1821 | 2 | 1.0 | 7053 | 6824 | 19 | 23 | 85 | 76
MRPC | train | 4076 | 2 | 2.1 | 16,112 | 12,061 | 44 | 50 | 81 | 75
MRPC | test | 1725 | 2 | 2.0 | 10,092 | 8471 | 44 | 50 | 82 | 75
Fine-grained classification
SST5 | train | 8544 | 5 | 2.1 | 16,579 | 12,395 | 19 | 23 | 84 | 75
SST5 | dev | 1101 | 5 | 2.1 | 5038 | 5168 | 19 | 23 | 85 | 76
SST5 | test | 2210 | 5 | 2.3 | 7929 | 7478 | 19 | 23 | 85 | 76
TREC | train | 5452 | 6 | 14.5 | 9437 | 8492 | 10 | 11 | 87 | 81
TREC | test | 500 | 6 | 15.3 | 1100 | 1247 | 7 | 8 | 85 | 79
SCICITE | train | 7320 | 3 | 4.4 | 27,775 | 13,603 | 31 | 40 | 100 | 77
SCICITE | dev | 916 | 3 | 4.4 | 7625 | 6918 | 31 | 40 | 100 | 78
SCICITE | test | 1861 | 3 | 3.8 | 12,609 | 9217 | 31 | 41 | 100 | 77
SICK-E/R | train | 4500 | 3/… | 3.8/3.6 | 2258 | 2277 | 19 | 20 | 99 | 96
SICK-E/R | dev | 500 | 3/… | 3.8/5.0 | 1122 | 1172 | 20 | 20 | 99 | 96
SICK-E/R | test | 4927 | 3/… | 3.9/3.9 | 2271 | 2291 | 19 | 20 | 99 | 96
STS-B | train | 5749 | – | 1.3 | 12,430 | 10,792 | 22 | 25 | 86 | 80
STS-B | dev | 1500 | – | 0.7 | 6542 | 6511 | 26 | 28 | 85 | 80
STS-B | test | 1379 | – | 1.1 | 4888 | 4921 | 22 | 24 | 85 | 81
Table 4. Effect of different token aggregation and post-processing methods on multiple text embedding models. Both correlation and accuracy scores range from 0 to 100 (worst to best). We also include results for the SimCSE-BERT model [143], which is fine-tuned with supervision from NLI data. The best result for each model is bolded, while underlined results are the best across all the models.
Model | T0 | T4 | BERT | BERT + Avg. | Avg. | B2S | B2S-100 | RE
Layer | Last | Last | First + Last | First + Last | First + Last | First + Last | First + Last | First + Last
Avg. Spearman correlation (%) of STS tasks
avg. | 61.3 | 62.3 | 61.3 | 61.4 | 56.4 | 56.7 | 60.6 | 50.3
idf_t^W | 67.8 | 67.4 | 69.2 | 69.3 | 66.9 | 66.1 | 65.4 | 66.4
 + zscore | 68.6 | 68.1 | 69.8 | 70.6 | 69.5 | 66.4 | 65.9 | 66.4
idf_t^T | 67.4 | 69.2 | 68.7 | 69.0 | 67.5 | 56.7 | 60.6 | 64.9
 + quantile-u | 69.5 | 71.6 | 68.9 | 69.5 | 68.4 | 56.2 | 61.0 | 63.0
-biases | 64.7 | 66.0 | 68.2 | 69.3 | 67.3 | 54.1 | 57.5 | 63.7
[MASK] | 63.4 | 69.3 | – | – | – | – | – | –
 + quantile-u | 65.6 | 70.5 | – | – | – | – | – | –
SimCSE performance: 81.5
Avg. accuracy (%) of clustering tasks
avg. | 53.5 | 55.0 | 57.0 | 59.2 | 55.2 | 53.8 | 55.3 | 36.3
idf_t^W | 53.0 | 53.0 | 55.2 | 57.1 | 53.7 | 53.0 | 53.7 | 44.3
 + normalize | 54.3 | 54.4 | 58.0 | 57.9 | 54.1 | 55.8 | 56.1 | 49.5
idf_t^T | 53.1 | 57.6 | 60.7 | 62.5 | 58.7 | 53.8 | 55.3 | 42.8
 + quantile-uW | 57.6 | 60.0 | 63.1 | 64.4 | 60.2 | 54.6 | 56.0 | 44.4
 + normalize | 55.5 | 57.6 | 63.4 | 64.8 | 59.8 | 55.5 | 57.2 | 47.0
-biases | 54.3 | 56.2 | 61.1 | 62.2 | 61.6 | 52.6 | 54.1 | 40.7
 + normalize | 54.4 | 56.3 | 62.4 | 63.4 | 62.2 | 54.4 | 56.8 | 44.6
[MASK] | 45.5 | 54.2 | – | – | – | – | – | –
 + quantile-u | 44.4 | 54.9 | – | – | – | – | – | –
SimCSE performance: 59.8 and 64.0 (avg. + quantile-uW, first + last layers)
Avg. accuracy (%) of classification tasks
avg. | 80.5 | 79.9 | 79.9 | 79.7 | 77.3 | 76.9 | 75.1 | 69.5
idf_t^W | 73.1 | 67.1 | 78.3 | 77.9 | 75.3 | 76.8 | 74.1 | 67.6
idf_t^T | 80.1 | 79.7 | 78.9 | 78.4 | 76.1 | 76.9 | 75.1 | 68.1
-biases | 80.1 | 79.9 | 80.1 | 79.8 | 77.2 | 76.3 | 74.2 | 70.3
[MASK] | 78.7 | 76.8 | – | – | – | – | – | –
SimCSE performance: 86.5
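The idf_t weighting and z-score rows in Table 4 correspond to two generic steps: idf-weighted token averaging and per-dimension standardization of the resulting sentence vectors. The sketch below illustrates both under simple assumptions (idf weights from a hypothetical tokenized corpus, unseen tokens given the maximum weight); it is not the paper's exact implementation.

import math
from collections import Counter

import numpy as np

def idf_weights(corpus_tokens: list[list[str]]) -> dict[str, float]:
    # Inverse document frequency computed from a (hypothetical) tokenized corpus.
    n_docs = len(corpus_tokens)
    doc_freq = Counter(tok for doc in corpus_tokens for tok in set(doc))
    return {tok: math.log(n_docs / df) for tok, df in doc_freq.items()}

def idf_weighted_embedding(token_vectors: np.ndarray, tokens: list[str],
                           idf: dict[str, float]) -> np.ndarray:
    # token_vectors: (num_tokens, dim) hidden states of one sentence.
    default = max(idf.values())  # unseen tokens get the largest (rarest-word) weight
    w = np.array([idf.get(tok, default) for tok in tokens])
    return (w[:, None] * token_vectors).sum(axis=0) / w.sum()

def zscore(sentence_vectors: np.ndarray) -> np.ndarray:
    # Standardize each embedding dimension across the evaluated sentence set.
    mu = sentence_vectors.mean(axis=0)
    sigma = sentence_vectors.std(axis=0) + 1e-12
    return (sentence_vectors - mu) / sigma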
Table 5. Performance of post-processing with plain token averaging. abtt-2 is the ABTT method with top 2 principal components removed. The best result for each model is bolded, while underlined results are the best across all the models.
Model | T0 | T4 | BERT | BERT + Avg. | Avg. | B2S | B2S-100 | RE
Layer | Last | Last | First + Last | First + Last | First + Last | First + Last | First + Last | First + Last
Avg. Spearman correlation (%) of STS tasks
avg. | 61.3 | 62.3 | 61.3 | 61.4 | 56.4 | 56.7 | 60.6 | 50.3
 + zscore | 65.4 | 67.3 | 65.1 | 66.5 | 64.1 | 58.5 | 62.1 | 55.6
 + quantile-u | 65.0 | 67.3 | 64.3 | 65.7 | 63.0 | 56.2 | 61.0 | 53.0
 + quantile-uW | 62.8 | 64.6 | 63.0 | 64.7 | 62.1 | 55.5 | 60.5 | 52.2
 + abtt-2 | 64.7 | 66.7 | 64.3 | 66.3 | 64.3 | 58.3 | 62.0 | 54.4
 + normalize | 61.3 | 62.3 | 61.3 | 61.4 | 56.4 | 56.7 | 60.6 | 50.3
 + whiten | 63.5 | 64.2 | 64.7 | 65.6 | 65.5 | 62.1 | 61.2 | 63.2
 + whitenW | 60.9 | 63.7 | 63.1 | 65.8 | 67.0 | 61.4 | 61.0 | 64.3
Avg. accuracy (%) of clustering tasks
avg. | 53.5 | 55.0 | 57.0 | 59.2 | 55.2 | 53.8 | 55.3 | 36.3
 + zscore | 54.2 | 55.7 | 57.8 | 59.5 | 56.1 | 53.3 | 55.0 | 36.3
 + quantile-u | 54.6 | 56.3 | 58.7 | 60.7 | 57.3 | 54.2 | 55.8 | 36.8
 + quantile-uW | 54.2 | 56.3 | 60.1 | 61.1 | 58.6 | 54.5 | 56.0 | 36.9
 + abtt-2 | 53.4 | 54.8 | 57.0 | 59.0 | 55.8 | 53.6 | 55.0 | 36.5
 + normalize | 53.4 | 55.0 | 59.7 | 59.7 | 56.1 | 55.5 | 57.2 | 38.5
 + whiten | 35.4 | 35.7 | 33.1 | 29.8 | 26.8 | 25.3 | 26.0 | 24.8
 + whitenW | 49.0 | 49.8 | 56.2 | 58.7 | 55.0 | 50.4 | 50.1 | 41.5
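The abtt-2 rows in Table 5 apply the ABTT ("all-but-the-top") post-processing. A minimal sketch of its common formulation follows: centre the sentence embeddings and remove their projections onto the top k = 2 principal components; implementation details beyond that are assumptions.

import numpy as np

def abtt(embeddings: np.ndarray, k: int = 2) -> np.ndarray:
    # Centre the embeddings, then project out the top-k principal components.
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top = vt[:k]                          # (k, dim) principal directions
    return centered - centered @ top.T @ top

# Toy usage: 1000 hypothetical 768-dimensional sentence vectors.
print(abtt(np.random.randn(1000, 768), k=2).shape)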
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
