Article
Peer-Review Record

Extracting Sentence Embeddings from Pretrained Transformer Models

Appl. Sci. 2024, 14(19), 8887; https://doi.org/10.3390/app14198887
by Lukas Stankevičius * and Mantas Lukoševičius
Reviewer 1:
Reviewer 2:
Reviewer 3: Anonymous
Reviewer 4: Anonymous
Submission received: 14 August 2024 / Revised: 18 September 2024 / Accepted: 23 September 2024 / Published: 2 October 2024

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This paper explores the extraction of sentence embeddings from pre-trained transformer models, focusing on BERT. The authors investigate various token aggregation and representation post-processing techniques to enhance the quality of sentence embeddings. The evaluation is conducted on short text clustering, semantic textual similarity, and classification tasks. The paper presents several findings, including the effectiveness of simple representation extraction methods, the proposal of competitive static token models, and the introduction of an improved BERT+Avg combined model.

My concerns about this paper:

1) While the paper presents a comprehensive evaluation, the proposed methods themselves may be seen as incremental improvements over existing techniques. The core ideas of token aggregation and post-processing have been explored in previous work.

2) The paper primarily focuses on the BERT model, and it is unclear how well the proposed methods would generalize to other transformer architectures, particularly larger and more recent models.

3) The paper could benefit from a more in-depth theoretical analysis of why certain token aggregation and post-processing techniques work well. The current explanations are mostly based on empirical observations.

Author Response

The general response to all reviewers:

We thank the reviewers for timely, constructive, and generally positive reviews. In addition to the suggested changes (discussed in detail below), we also improved the article in some other places, including better phrasing, abstract, language use, and explanations. All the changes are highlighted in the provided revision.


********** REVIEWER 1 ****************

> This paper explores the extraction of sentence embeddings from pre-trained transformer models, focusing on BERT. The authors investigate various token aggregation and representation post-processing techniques to enhance the quality of sentence embeddings. The evaluation is conducted on short text clustering, semantic textual similarity, and classification tasks. The paper presents several findings, including the effectiveness of simple representation extraction methods, the proposal of competitive static token models, and the introduction of an improved BERT+Avg combined model.

> My concerns about this paper:

> 1) While the paper presents a comprehensive evaluation, the proposed methods themselves may be seen as incremental improvements over existing techniques. The core ideas of token aggregation and post-processing have been explored in previous work.

We agree that our proposed methods may be seen as incremental improvements. However, we extensively tested many combinations and substantially extended the techniques we found in the literature. Our paper is also novel in that we showed that "simple baselines with representation shaping techniques reach or even outperform more complex BERT-based models" (line 16).

> 2) The paper primarily focuses on the BERT model, and it is unclear how well the proposed methods would generalize to other transformer architectures, particularly larger and more recent models.

We started our experiments with BERT, as it was the best-known and most popular model in the research community. We considered extending our work to other models but found it infeasible due to the vast number of experiments and time-consuming computations. We leave this direction for future work.

> 3) The paper could benefit from a more in-depth theoretical analysis of why certain token aggregation and post-processing techniques work well. The current explanations are mostly based on empirical observations.

We agree that a theoretical analysis would be beneficial. However, in our case, this would require analyzing how the transformer model weights behave, which is not common practice due to the black-box nature of deep neural networks. Therefore, our main contributions are based on empirical results and often include discussions of the most likely causes.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

Attached.

Comments for author File: Comments.pdf

Comments on the Quality of English Language

Attached

Author Response

The general response to all reviewers:

We thank the reviewers for timely, constructive, and generally positive reviews. In addition to the suggested changes (discussed in detail below), we also improved the article in some other places, including better phrasing, abstract, language use, and explanations. All the changes are highlighted in the provided revision.

******** REVIEWER 2 ********

> I reviewed the study “Extracting Sentence Embeddings from Pretrained Transformer Models”. 
> My comments are as follows:
> 1. Some sentences are complex and could benefit from simplification to improve readability. For example, in the sentence, "On the 10th birthday, a child is expected to have already experienced the meaning of more than 100 million words", the use of "experienced the meaning" is vague.

We agree that some phrasing may lack clarity and readability. The phrase "experienced the meaning" was deliberately chosen to convey a nuanced understanding of language acquisition that goes beyond mere exposure to words. It encompasses both the comprehension and contextual integration of words, which is a more complex cognitive process than just "reading" or "hearing" words. We have now rephrased it as "encountered and understood the meaning", which we think is a better substitute.

> 2. The paper uses various terms for similar concepts, which can cause confusion. For example, the terms "sentence-level embeddings," "sentence representations," and "text embeddings" are used interchangeably. Choose one term and stick to it throughout the paper to maintain consistency.

We used these terms because the embeddings we analyze are applicable to texts from sentence to paragraph length. As can be seen in Table 2 and Table 3, the average text length in our evaluation tasks varies from 7 to 44 words. To maintain consistency, we have edited the paper to use only the more popular term "sentence embeddings".

> 3. The explanation of some methodologies lacks sufficient detail, which might make it hard for readers to fully grasp the approach. The discussion on the token aggregation methods could be expanded.

To improve the clarity of the Models subsection, we added the following distinction at the beginning of the section:

    We use multiple text representation methods, focusing mainly on BERT-based ones. The prompting method T4, Averaged BERT (Avg.), BERT+Avg., B2S-100, and Random embeddings (RE) are our proposed models or modifications, while BERT, T0, and B2S are plain adaptations of existing ones. We tested token aggregation and sentence representation post-processing techniques on all eight models. We describe them in more detail below.

In the token aggregation methods section, we have now added a reference to Section 2.1, "Composing word vectors", where the different aggregation methods from the literature are discussed in detail.
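To make the simplest of these aggregation methods concrete, below is a minimal sketch (ours, for illustration; not code from the paper) of mean pooling BERT token embeddings into a single sentence vector, using the HuggingFace transformers library:

    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    def mean_pooled_embedding(sentence: str) -> torch.Tensor:
        inputs = tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**inputs)
        # Average the last-layer token vectors; the attention mask keeps
        # the computation correct for padded batches as well.
        mask = inputs["attention_mask"].unsqueeze(-1).float()  # (1, seq, 1)
        hidden = outputs.last_hidden_state                     # (1, seq, 768)
        return (hidden * mask).sum(dim=1) / mask.sum(dim=1)    # (1, 768)

    embedding = mean_pooled_embedding("Extracting sentence embeddings from BERT.")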

> 4. The results section lacks visual elements, which could make it harder for readers to quickly grasp the findings. The text mentions "very high improvements for static token-based models," but this is not represented.

Our results depend on models, layers, token aggregations, post-processing types, and evaluation tasks, so visualizing them is not trivial. After multiple trials, we chose the design of Table 4. All five dimensions find their place there, and one can clearly see how the static token-based models B2S, B2S-100, and RE are improved. We agree that this table is not as visual as graphs or charts and takes a bit of time to grasp. To make our article more visually appealing, we also incorporated line plots (Figure 4) and heat maps (Figure 5) presenting different views of the data.

> 5. The literature review could be more comprehensive. The related work briefly mentions several methods but does not deeply engage with recent developments. Expand the literature review to include more recent works and discuss their limitations or how your work advances the field.

We agree that we lack the most recent developments, as the field of Large Language Models develops rapidly. However, we have already dedicated Section 2.3.1, "Contrastive learning approaches", to this, where we describe the most influential contributions of recent years. There we summarize that contrastive learning currently produces state-of-the-art sentence embeddings, which is still true at the time of this writing.

> 6. The conclusion section is somewhat brief and does not fully capture the broader implications of the work. The conclusion only mentions that simple baselines can outperform complex models, but this is a significant finding that deserves more emphasis.

In our experiments, this finding was most pronounced for the stackoverflow dataset. We have added greater emphasis on it in the results:
    Also, if the texts do not contain natural language of the type BERT was pre-trained on (e.g., they contain code), the model cannot properly contextualize the tokens.
and conclusions:
    It also shows very high performance for some tasks like stackoverflow classification, where BERT token contextualization may not work well on code samples in the texts.

It is not very surprising that the token contexts produced by BERT sometimes do not provide an advantage when the type of language is quite different from what it has been pre-trained on. 

> 7. There are a few minor grammatical errors and awkward phrasing throughout the paper. The phrase "surface it enough?" on Page 1, Line 5 is colloquial.

The phrase "surface it enough?" was intended to succinctly question whether the methods in question (plain averaging or prompt templates) are sufficient to bring out the full potential of sentence-level embeddings. The choice of this phrasing was aimed at engaging the reader and drawing attention and, indeed, it is a bit too informal or colloquial for an academic paper. To maintain a formal tone and improve clarity, we changed it to "sufficiently capture and represent the underlying meaning".

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

Dear Authors,

Your article "Extracting Sentence Embeddings from Pretrained Transformer Models" has am interesting topic. I have some improvement recommendations and you can find them below.


First of all, the article is too long. I recommend you discuss this with the journal editor and consider storing Appendix F1 and Appendix F2 in a public repository (for example, FigShare), so that readers can still access the appendix data while the article itself becomes smaller.


In the Introduction section, I recommend you try to describe the research question of your article more explicitly, because at the moment it is a bit diffuse.


Within chapter "2. Related work" you present different models: composing word vectors, formal semantics, tensor products, averaging, weighted average, clustering, spectral methods, special tokens, aggregating through layers, etc. This chapter contains valuable information, but I suggest you decrease its size. Please try to be more concise.


At the end of the chapter "2. Related Work" I recommend you define and describe the research hypotheses of your article. At this moment, there is no research hypothesis to be tested and validated within your research.


The size of the chapter "Method" should be reduced. I recommend you try to present the general method in a more synthetic manner.


In the "Results" chapter, you should present the results in correspondence with the research hypotheses, so that the readers understand the logical flow of your article: the research hypotheses, the method, the results, the validation of the hypotheses.


Before the Conclusions section, I recommend you include a Discussion chapter. Here you should present your research results and compare them to others from the extant scientific literature. This way, readers will understand your contribution to science.


As a final remark, due to its size, at this point the article looks more like a book chapter than a scientific article. I recommend you restructure the content so that the size fits that of a usual scientific article.


Best Regards!

Author Response

The general response to all reviewers:

We thank the reviewers for timely, constructive, and generally positive reviews. In addition to the suggested changes (discussed in detail below), we also improved the article in some other places, including better phrasing, abstract, language use, and explanations. All the changes are highlighted in the provided revision.

************ REVIEWER 3 *****************

> Your article "Extracting Sentence Embeddings from Pretrained Transformer Models" has an interesting topic. I have some improvement recommendations, which you can find below.

> First of all, the article is too long. I recommend you discuss this with the journal editor and consider storing Appendix F1 and Appendix F2 in a public repository (for example, FigShare), so that readers can still access the appendix data while the article itself becomes smaller.

We agree that the article is quite long. However, much work went into it, and the results are interesting and valuable enough not to discard. We moved the detailed results to the appendices so as not to clutter the main text; hosted separately and externally, they could easily get lost.

The article can be printed excluding the appendices to conserve paper if that is a concern.

> In the Introduction section, I recommend you try to describe the research question of your article more explicitly, because at the moment it is a bit diffuse.

We describe our hypothesis (lines 86-91) and the main contributions (lines 92-104) of our work in the Introduction section. 

We also provide a mathematical formulation of the problem we are solving in Section 3.1, at the beginning of the Methods section.

> Within chapter "2. Related work" you present different models: composing word vectors, formal semantics, tensor products, averaging, weighted average, clustering, spectral methods, special tokens, aggregating through layers, etc. This chapter contains valuable information, but I suggest you decrease its size. Please try to be more concise.

We believe that the extensive review of all the related work is one of the strong contributions of this article. We now highlight it more in the text.


> At the end of the chapter "2. Related Work" I recommend you define and describe the research hypotheses of your article. At this moment, there is no research hypothesis to be tested and validated within your research.

In Section 3.1, "Problem formulation", we have already described that the goal is to find "an aggregation function" that takes as input all activations and outputs of a transformer model and reduces them to as meaningful a vector as possible.
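Restated formally for the reviewer's convenience (the notation here is ours, for this response), the goal is a function

    \mathbf{s} = f(\mathbf{H}), \qquad \mathbf{H} \in \mathbb{R}^{(L+1) \times T \times d}, \qquad \mathbf{s} \in \mathbb{R}^{d},

where \mathbf{H} stacks the hidden states of the input embedding layer and all L transformer layers for the T input tokens, and f aggregates them into a single d-dimensional sentence vector (d = 768 for BERT-base).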

> The size of the chapter "Method" should be reduced. I recommend you try to present the general method in a more synthetic manner.

The chapter "Method" consists of problem formulation, models, techniques and evaluation sections. These are the main parts that we think should be within the "Method" section. We also do not want to reduce the contents, as we targeted the details to be of similar granularity as in related articles in our field.

> In the "Results" chapter, you should present the results in correspondence with the research hypotheses, so that the readers understand the logical flow of your article: the research hypotheses, the method, the results, the validation of the hypotheses.

We start the "Results" chapter with the direct result of our problem formulation (Section 3.1): to find "an aggregation function", which takes as input all activations and outputs from a transformer model and reduces it to as meaningful as possible vector. We show an effect our techniques and their combinations have when applied to transformer-type and other models. The following sections in "Results" chapter only try to explain these results, by depicting comparisons or more fine-grained calculations of the technique in question.

> Before the Conclusions section, I recommend you include a Discussion chapter. Here you should present your research results and compare them to others from the extant scientific literature. This way, readers will understand your contribution to science.

We did not add a new Discussion chapter, as to some degree both the Results and Conclusions chapters already have the relevant content.

Regarding the comparison to others from the extant scientific literature, we updated Table 4 and added the following paragraph:

    Our techniques can even improve the dedicated SimCSE model \cite{gao-etal-2021-simcse}, which was fine-tuned on NLI data. Its main strength lies in semantic textual similarity tasks, where it leads with an over 10\% difference. However, for clustering tasks, its average accuracy is similar to the other evaluated models at 59.8\% and improves up to 64.0\% if we apply the best-performing techniques. This showcases a general tendency that the top-performing models are very good only in a narrow subset of tasks and highlights the importance of our more general methods.

> As a final remark, due to its size, at this point the article looks more like a book chapter than a scientific article. I recommend you restructure the content so that the size fits that of a usual scientific article.

Yes, we agree that the article is quite long, which has some downsides. We went through the text and simplified the phrasing in many places. However, it is not feasible to significantly reduce the length without losing valuable information. We specifically chose a journal that has no page limit.

Author Response File: Author Response.pdf

Reviewer 4 Report

Comments and Suggestions for Authors

This manuscript investigates various methods for extracting sentence-level embeddings from pretrained transformer models, specifically focusing on the BERT model. The authors propose new techniques, and an experimental evaluation is given to show their performance.

Comments:

1. The authors introduce techniques for extracting sentence embeddings, such as BERT+Avg, but do not provide enough motivation or theoretical backing for why these specific combinations were chosen. The selection of layers and the rationale behind the weighting schemes, for instance, are not thoroughly explained. Besides, the authors integrate operations such as averaging or weighted summation to aggregate token representations from different layers of the BERT model. While these could be computationally efficient, they might oversimplify the complex interactions between layers and tokens in transformer models.

2. I strongly suggest reorganizing the "Related Work" section. Unless this paper aims to be a survey, Section 2 is overly redundant. Focus on clearly describing the landscape of Sentence Embeddings, Large Language Models (LLMs), and Unsupervised Learning, highlighting how they are interconnected. The current 20 pages for related work are excessive; consider moving some details to an appendix if necessary. Additionally, the paper could benefit from including references to research on text representation learning from a categorical-sequence view, such as https://doi.org/10.1109/TKDE.2020.2992529, https://doi.org/10.1016/j.eswa.2022.116637 and https://doi.org/10.1007/978-3-030-01298-4_2, which would strengthen the argument and provide readers with additional resources for further exploration.

3. For Section "Method", it is hard to distinguish between the existing models being introduced and the authors' own contributions. I had to carefully go back through the text to identify them. I recommend clearly separating the description of current models from the novel contributions to avoid confusion.

4. While the paper presents a broad range of experimental results, it falls short in analyzing why certain methods perform better or worse across different tasks (e.g., STS, clustering, classification). The discussion lacks depth regarding the factors that drive these performance variations. Furthermore, the paper does not include comparisons with the latest SOTA methods.

Comments on the Quality of English Language

Moderate editing of English language required.

Author Response

The general response to all reviewers:

We thank the reviewers for timely, constructive, and generally positive reviews. In addition to the suggested changes (discussed in detail below), we also improved the article in some other places, including better phrasing, abstract, language use, and explanations. All the changes are highlighted in the provided revision.

************* REVIEWER 4 ******************

> This manuscript investigates various methods for extracting sentence-level embeddings from pretrained transformer models, specifically focusing on the BERT model. The authors propose new techniques, and an experimental evaluation is given to show their performance.

> Comments:

> 1. The authors introduce techniques for extracting sentence embeddings, such as BERT+Avg, but do not provide enough motivation or theoretical backing for why these specific combinations were chosen. The selection of layers and the rationale behind the weighting schemes, for instance, are not thoroughly explained. Besides, the authors integrate operations such as averaging or weighted summation to aggregate token representations from different layers of the BERT model. While these could be computationally efficient, they might oversimplify the complex interactions between layers and tokens in transformer models.

Regarding the BERT+Avg model, as we wrote in line 1203: "We also wanted to see how sentence embeddings derived from static averaged BERT tokens can contribute to the original BERT representations. Therefore, we averaged sentence representations from the two methods mentioned above". Also, a natural way to assess the contributions of the individual models was to try different weights in the weighted averaging.
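To illustrate the weighted averaging referred to above, here is a minimal sketch (our illustration for this response; the vectors are random stand-ins, not the paper's actual pipeline):

    import torch

    def combine_bert_and_avg(e_bert: torch.Tensor, e_avg: torch.Tensor,
                             w: float = 0.5) -> torch.Tensor:
        # Weighted average of a contextual BERT sentence vector and a
        # static averaged-token vector; w = 0.5 gives the plain mean.
        return w * e_bert + (1.0 - w) * e_avg

    # Random 768-dimensional stand-ins for the two representations:
    e_bert, e_avg = torch.randn(768), torch.randn(768)
    # Sweeping w probes how much each model contributes to the combination.
    for w in (0.0, 0.25, 0.5, 0.75, 1.0):
        combined = combine_bert_and_avg(e_bert, e_avg, w)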

Regarding the selection of layers, we tried every single layer individually, promising combinations of layers found in the literature, and also experimented with our own combinations of layers. Table 4 in the Results incorporates only the best and most popular layer selections, while Figures 4, 5, and A6-A8 show detailed dependencies on the individual layers.

We agree that some of our proposed techniques may oversimplify the complex interactions between layers and tokens in transformer models. We found works in the literature that tried all possible combinations of two and up to three layers, with the latter showing diminishing returns. Trying more combinations is too computationally expensive and not promising at this point. Therefore, we opted to explore existing beneficial layer combinations (and single layers) together with other token aggregation and embedding post-processing techniques.
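As an illustration of what such a combination looks like in practice, below is a sketch of the widely reported first+last-layer averaging, followed by one common post-processing step from the literature, z-score standardization over a corpus (our example; the paper's exact layer choices and post-processing may differ):

    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    inputs = tokenizer("Layer choice matters for sentence embeddings.",
                       return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)

    # hidden_states holds 13 tensors for bert-base: the input embeddings
    # plus one tensor per transformer layer, each of shape (1, seq, 768).
    first, last = outputs.hidden_states[1], outputs.hidden_states[-1]
    per_token = (first + last) / 2                           # 2-layer combination
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    sentence_vec = (per_token * mask).sum(1) / mask.sum(1)   # (1, 768)

    # Example post-processing: z-score standardize every dimension
    # across a corpus of embeddings (random stand-ins here).
    corpus = torch.randn(100, 768)
    standardized = (corpus - corpus.mean(0)) / corpus.std(0)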

> 2. I strongly suggest reorganizing the "Related Work" section. Unless this paper aims to be a survey, Section 2 is overly redundant. Focus on clearly describing the landscape of Sentence Embeddings, Large Language Models (LLMs), and Unsupervised Learning, highlighting how they are interconnected. The current 20 pages for related work are excessive; consider moving some details to an appendix if necessary. Additionally, the paper could benefit from including references to research on text representation learning from a categorical-sequence view, such as https://doi.org/10.1109/TKDE.2020.2992529, https://doi.org/10.1016/j.eswa.2022.116637 and https://doi.org/10.1007/978-3-030-01298-4_2, which would strengthen the argument and provide readers with additional resources for further exploration.

We agree that our related work section is deeper than that of a usual paper. Our goal was to review many works, find the most promising techniques, and then extend and combine them. Large Language Models are the natural direction of our field, in particular the ones employing contrastive learning, as we analyze in Section 2.3.1.

We think that the provided references would not complement our article, as our scope is vectors for sentence- to paragraph-length texts, while the referenced papers focus on problems where the texts are at most several words long.


> 3. For Section "Method", it is hard to distinguish between the existing models being introduced and the authors' own contributions. I had to carefully go back through the text to identify them. I recommend clearly separating the description of current models from the novel contributions to avoid confusion.

We are sorry for the confusion. The structure of the "Models" subsection is made to be identical to the columns of Table 4 and Table 5, as our first contribution of testing multiple combinations of various token aggregation and sentence representation post-processing techniques applies to all models. However, as you noticed, only several of these models are proposed entirely by us. To make this clear, we have now added the following distinction at the beginning of the section:

    We use multiple text representation methods, focusing mainly on BERT-based ones. The prompting method T4, Averaged BERT (Avg.), BERT+Avg., B2S-100, and Random embeddings (RE) are our proposed models or modifications, while BERT, T0, and B2S are plain adaptations of existing ones. We tested token aggregation and sentence representation post-processing techniques on all eight models. We describe them in more detail below.

> 4. While the paper presents a broad range of experimental results, it falls short in analyzing why certain methods perform better or worse across different tasks (e.g., STS, clustering, classification). The discussion lacks depth regarding the factors that drive these performance variations. Furthermore, the paper does not include comparisons with the latest SOTA methods.

We added a comparison to the SimCSE model in Table 4; it is a well-known model trained with a contrastive objective on NLI data that matches the size and architecture of the other evaluated models. In the Results section, we added the following paragraph:

    Our techniques can even improve the dedicated SimCSE model [143], which was fine-tuned on NLI data. Its main strength lies in semantic textual similarity tasks, where it leads the STS tasks with an over 10\% difference. However, for clustering tasks, its average accuracy is similar to the other evaluated models at 59.8\% and improves up to 64.0\% if we apply the best-performing techniques. This showcases a general tendency that the top-performing models are usually very good only in a narrow subset of tasks and highlights the importance of our more general methods.
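For context, SimCSE embeddings of the kind compared in Table 4 can be obtained as sketched below (the checkpoint name is the publicly released supervised SimCSE model on the HuggingFace hub; we assume it here, and the paper's exact evaluation setup may differ):

    import torch
    from transformers import AutoModel, AutoTokenizer

    name = "princeton-nlp/sup-simcse-bert-base-uncased"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name)

    inputs = tokenizer(["A sentence.", "Another sentence."],
                       padding=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # A common choice is the [CLS] token's last-layer vector (SimCSE
    # variants differ in whether an extra pooler MLP is applied).
    embeddings = outputs.last_hidden_state[:, 0]              # (2, 768)
    similarity = torch.cosine_similarity(embeddings[0], embeddings[1], dim=0)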

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

Acceptable in the present form

Author Response

Thank you!

Reviewer 3 Report

Comments and Suggestions for Authors

Dear Authors,

I have read the proposed revised version of the article. It is almost unchanged. As for my constructive recommendations in the previous report, you have avoided implementing them.

Whatever your arguments, the article is too big and it doesn't seem to be a scientific article, but a book chapter.

Kind Regards!

Author Response

The general response to all reviewers:

We thank the reviewers for timely, constructive, and generally positive reviews. In addition to the suggested changes (discussed in detail below), we also improved the article in some other places, including the abstract and explanations of our contributions. All the changes are highlighted in the provided revision.

>Dear Authors,

>I have read the proposed revised version of the article. It is almost unchanged. As for my constructive recommendations in the previous report, you have avoided implementing them.

>Whatever your arguments, the article is too big and it doesn't seem to be a scientific article, but a book chapter.

>Kind Regards!

We agree once again that the article is long, which has its downsides. However, we do not believe that deleting large parts of it will make it better or more valuable at this stage. 

The difference between a book chapter and a long-form journal article with an extensive literature review can be subjective. We agree that the style and length of the work can resemble a book chapter. But that is not necessarily a bad thing. The work also has all the usual parts of a research paper, which book chapters typically do not. 

We have also highlighted our main contributions better in this new revision.

The total number of pages can also be a bit misleading since almost half of them are for the reference list (235 references, 13 pages) and appendices (16 pages) alone.

Author Response File: Author Response.pdf

Reviewer 4 Report

Comments and Suggestions for Authors

My main concern remains with the structure and contributions of the paper. The authors dedicate an extensive portion of the manuscript to related work (Section 2, over 20 pages), while the "Methods" section (Section 3) largely focuses on providing a detailed background of existing methods. I would like to know whether this paper is intended as a survey work, as it currently reads like one. Additionally, are the authors primarily evaluating the performance of existing methods rather than introducing novel approaches? Clarifying this would help to better understand the paper's contributions.

Author Response

The general response to all reviewers:

We thank the reviewers for timely, constructive, and generally positive reviews. In addition to the suggested changes (discussed in detail below), we also improved the article in some other places, including the abstract and explanations of our contributions. All the changes are highlighted in the provided revision.

> My main concern remains with the structure and contributions of the paper. The authors dedicate an extensive portion of the manuscript to related work (Section 2, over 20 pages), while the "Methods" section (Section 3) largely focuses on providing a detailed background of existing methods. I would like to know whether this paper is intended as a survey work, as it currently reads like one. Additionally, are the authors primarily evaluating the performance of existing methods rather than introducing novel approaches? Clarifying this would help to better understand the paper's contributions.

Thank you, we agree that our main contributions were not highlighted and separated well enough.

We have now clarified our contributions in the abstract:
    Methods: After providing a comprehensive review of existing sentence embedding extraction and refinement methods, we thoroughly test different combinations and our original extensions of the most promising ones. Namely,

    ...
    Conclusions: Our work shows that the representation-shaping techniques significantly improve sentence embeddings extracted from BERT-based and simple baseline models.

updated the second item in the list of the main contributions at the end of the Introduction:

   • We experimentally test how multiple combinations of the various most promising token aggregation and sentence representation post-processing techniques impact the performance on three different classes of tasks and the properties of the representations, across several models.

and the beginning of the Methods section:
    Our extensive literature review allowed us to see the big picture of the sentence embedding research.

    Currently, the evolution of models for sentence embeddings and related NLP tasks is settled at transformers. In this work, we use existing models to source the raw, token-level embeddings. In Section 3.2 we describe in detail the main model that we use, some baselines we used for comparisons, as well as some of our original extensions. In particular, in the T1-T4 models we extend the original prompting templates by incorporating more than one [MASK] token. Next, we present a new Avg. model, where we derive sentence embeddings first by averaging tokens in different contexts and then by averaging the tokens themselves. Finally, we present our BERT+Avg. model, which combines both contextual and multiple-context averaged representations, all derived from the same BERT transformer model.

    In addition to these extensions, we found two main directions that can be used to improve sentence embeddings from transformer models: token aggregation and post-processing techniques. Note, however, that our main contribution here is not the methods, as we reuse most of them from the existing works, but the combinations of them on the transformer model and extensive evaluation on multiple tasks. We thoroughly describe the techniques used (and minor extensions) for token aggregation in Section 3.3 and post-processing of embeddings in Section 3.4.

    We have noticed that most works confine themselves to a small subset of evaluation tasks, which limits the comparability of their results to others. Papers from top conferences always include classification tasks on top of semantic textual similarity, which is usually the only evaluation. In this work, we evaluate sentence embeddings on three different types of tasks: semantic textual similarity, downstream classification, and clustering. We present these tasks in Section 3.5.
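To make the prompting extension mentioned in the quoted text above more tangible, here is a minimal sketch of prompt-based embedding extraction (the template below is an illustrative stand-in from this line of literature, not necessarily the authors' exact T1-T4 templates):

    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    def prompt_embedding(sentence: str) -> torch.Tensor:
        # Wrap the sentence in a template and read the hidden state(s)
        # at the [MASK] position(s); templates with several [MASK] tokens
        # can be averaged, as in the T1-T4 extensions described above.
        text = f'This sentence: "{sentence}" means [MASK].'
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**inputs)
        positions = (inputs["input_ids"][0]
                     == tokenizer.mask_token_id).nonzero().squeeze(-1)
        return outputs.last_hidden_state[0, positions].mean(dim=0)  # (768,)

    embedding = prompt_embedding("Extracting sentence embeddings from BERT.")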

Author Response File: Author Response.pdf

Round 3

Reviewer 4 Report

Comments and Suggestions for Authors

The manuscript is acceptable in its current version. I recommend acceptance.

Comments on the Quality of English Language

Minor editing of English language required.
