This section briefly describes the vectorization forms used in the study and the methods used to evaluate the results.
4.1. Vectorization
The quality of vectorization plays a crucial role in vector-based retrieval. The vectorization models tested are presented in
Table 4. The models were selected to cover a wide range, from an older word-embedding-based model to state-of-the-art transformer-based embedding models. We first give a brief overview and then provide an in-depth introduction to these models in the following paragraphs.
As a baseline we chose the fasttext model, since it is the oldest of the tested models and is based on word embeddings. Since our dataset was a Hungarian legal dataset, we selected the original Hungarian BERT model, huBERT [36]. We tested three different models based on huBERT: a non-fine-tuned one (hubert), one fine-tuned for semantic similarity (sbert_hubert) and one fine-tuned for Q&A on a small dataset (danieleff).
The current trend in vector-based retrieval points towards multilingual embedding models fine-tuned for several tasks, including information retrieval and Q&A. We selected models with smaller context windows (mcontriever, e5_large, e5_base and cohere; 512 tokens) and ones with significantly wider context windows (
openai_ada,
openai_3_large,
bge-m3 and
jina_v3; 8192 tokens). Many of these models are at the top of the MTEB leaderboard’s [
37] non-English Retrieval and Semantic Textual Similarity (STS) sections. The models are described in detail below.
The
cohere,
openai_3_large,
openai_ada and
jina_v3 models were accessed through an API. The fasttext [3] model was trained on 157,000 Hungarian legal decisions using the official fasttext Python package. The model is a skip-gram model [6]; during training, negative sampling was set to 10, the learning rate was set to 0.05, and the model produced 100-dimensional vectors. All other parameters were left at their default settings.
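To make this training setup concrete, the following sketch shows how such a model can be trained with the official fasttext Python package; the corpus file name and output path are illustrative, not the exact files used in the study.

    import fasttext

    # Sketch of the skip-gram training described above; one preprocessed
    # decision per line is assumed in the (illustrative) corpus file.
    model = fasttext.train_unsupervised(
        "hungarian_legal_decisions.txt",
        model="skipgram",
        neg=10,    # negative sampling
        lr=0.05,   # learning rate
        dim=100,   # vector dimensionality
    )
    model.save_model("fasttext_legal_hu.bin")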
Table 4. Vectorization models compared in this study.
The oldest technology is
fasttext, which is similar to the Word2Vec [
6,
39] word embedding method but handles out-of-vocabulary words by using subword embeddings for character n-grams. This embedding proved to be especially effective in highly inflectional languages like Hungarian [3]. A document representation is generally calculated as the average of the vectors of the whole words and the character n-grams of the text. All the other models are transformer-based.
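As a brief illustration of this averaging, the official package exposes a sentence-level vector that combines the word and subword information of a text; the model file and input sentence below are illustrative, not the study's exact artifacts.

    import fasttext

    model = fasttext.load_model("fasttext_legal_hu.bin")  # model from the training sketch above

    # get_sentence_vector averages the (normalized) word vectors of the text,
    # where each word vector is itself built from the word and its character n-grams.
    doc_vector = model.get_sentence_vector("A bíróság az alperest a kereseti kérelem szerint kötelezi.")
    print(doc_vector.shape)  # (100,)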
The next group consists of models pre-trained on the Hungarian language (hubert) and fine-tuned to provide semantic representations in an S-BERT [25] fashion, denoted sbert_hubert in this study [35]. This model was fine-tuned on the Hunglish 2.0 parallel corpus [
40] to mimic the
bert-base-nli-stsb-mean-tokens [
25] model. The vast majority of the parallel corpus is from the legal domain.
The
danieleff model is based on the most commonly used
huBERT base model [
36], a Hungarian BERT base language model. The model was fine-tuned on 170 question–answer pairs from the university studies domain; the answers were 1000–5000-character-long chunks from the organizational and operational rules of universities.
The next group consists of multilingual models fine-tuned for question answering, semantic similarity and/or passage retrieval tasks: multilingual e5 models [
31], cohere multilingual embedding model (
https://txt.cohere.com/introducing-embed-v3/ (accessed on 2 December 2024)), BGE-M3 multilingual embedding model [
32], Jina AI’s multilingual embedding model [
33] and Facebook’s mContriever model [
38]. These models were fine-tuned on the MSMARCO dataset [
41] (a Q&A dataset), among other datasets.
The
mContriever model is a multilingual BERT-based solution [
12], trained on the CCNet dataset [
42] using contrastive learning [
38]. Contrastive learning means that given the query representation, the goal is to retrieve the representation corresponding to the positive document among all the negatives [
38].
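The following PyTorch sketch illustrates this contrastive objective with in-batch negatives (an InfoNCE-style loss). It is a simplified illustration rather than the exact Contriever training code, and the temperature value is arbitrary.

    import torch
    import torch.nn.functional as F

    def contrastive_loss(query_vecs, doc_vecs, temperature=0.05):
        # query_vecs[i] should be most similar to doc_vecs[i] (its positive document);
        # all other documents in the batch act as negatives.
        q = F.normalize(query_vecs, dim=-1)
        d = F.normalize(doc_vecs, dim=-1)
        logits = q @ d.T / temperature      # (batch, batch) similarity matrix
        labels = torch.arange(q.size(0))    # the positive document is on the diagonal
        return F.cross_entropy(logits, labels)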
The
e5 models are multilingual models that are XLM-RoBERTa-based [
43] and multilingual extensions of the English E5-based text embeddings [
44]. These models have been pre-trained in a weakly supervised contrastive manner on billions of text pairs and fine-tuned in a supervised way on a small quantity of high-quality labeled data [
31].
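A minimal usage sketch for the multilingual e5 models is given below; the released checkpoints expect the "query: " and "passage: " prefixes on the input texts. The model identifier is the publicly available checkpoint and the example texts are illustrative.

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("intfloat/multilingual-e5-large")

    # e5 models expect task prefixes on the input texts.
    query_vec = model.encode(["query: Mikor mondható fel a bérleti szerződés?"],
                             normalize_embeddings=True)
    doc_vec = model.encode(["passage: A bérleti szerződés felmondására ..."],
                           normalize_embeddings=True)
    score = float((query_vec @ doc_vec.T)[0, 0])  # cosine similarity of the normalized vectors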
There are no published details on how
cohere embeddings are created, likely due to commercial reasons. However, according to a blog post (
https://txt.cohere.com/introducing-embed-v3/ (accessed on 2 December 2024)), cohere embeddings capture both topic match and content quality, in contrast to, e.g., the openai_ada vectors, which capture only the topic similarity aspect; according to the post, this leads to higher-quality retrieval results.
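For completeness, a sketch of how such embeddings can be obtained via the provider's API is shown below; the client call and parameter names follow the provider's public documentation at the time of writing and are an assumption rather than the study's code, and the input text and API key are placeholders.

    import cohere

    co = cohere.Client("YOUR_API_KEY")  # placeholder key
    response = co.embed(
        texts=["A felperes keresetet nyújtott be ..."],  # illustrative text
        model="embed-multilingual-v3.0",
        input_type="search_document",   # use "search_query" when embedding queries
    )
    embeddings = response.embeddings    # list of float vectors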
The
bge_m3 is a relatively new model, similar to the
e5 models. It provides meaningful representations across more than 100 languages and at multiple granularities (word, paragraph, and full text up to 8192 tokens). Additionally, it can handle three common retrieval tasks: dense, sparse, and multi-vector retrieval. Sparse retrieval uses word-level importance scores derived from the embeddings to compare and rank documents based on shared words, making it suitable for scenarios where precise word matching is important. Multi-vector retrieval employs multiple vectors to capture different aspects or parts of a document, allowing more accurate relevance alignment between the query and parts of the document, which is useful for capturing broader context and semantic meaning. The model is XLM-RoBERTa-based and supports dense retrieval using the [CLS] vector, while the token vectors can be applied to sparse and multi-vector retrieval.
Jina AI’s v3 multilingual embedding model [
33] was recently introduced and currently leads in many non-English languages for the Semantic Text Similarity task on the MTEB leaderboard (
https://huggingface.co/spaces/mteb/leaderboard (accessed on 2 December 2024)) [
37]. The model is available both through huggingface (
https://huggingface.co/jinaai/jina-embeddings-v3 (accessed on 2 December 2024)) and via an API (
https://jina.ai/embeddings/ (accessed on 2 December 2024)). In this study, we used the API for vectorization. This model is Jina-XLM-RoBERTa-based; like the BGE-M3 model, it provides multilingual, multi-task embeddings and has an 8192-token context window. It can be used for a wide range of tasks, including query–document retrieval, clustering, classification, and text matching. The embedding dimension can be reduced from 1024 to as low as 32 without significantly impacting performance, thanks to Matryoshka Representation Learning (MRL).
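As a minimal sketch of what this dimension reduction amounts to on the client side, assuming an MRL-trained model such as jina_v3 (the function below is illustrative and not part of any provider's API):

    import numpy as np

    def shorten_embedding(vec, target_dim):
        # Keep only the first target_dim components and re-normalize. MRL training
        # arranges the vector so that such prefixes remain useful representations.
        truncated = np.asarray(vec, dtype=float)[:target_dim]
        return truncated / np.linalg.norm(truncated)

    # e.g. reduce a 1024-dimensional jina_v3 vector to 128 dimensions:
    # short_vec = shorten_embedding(full_vec, 128)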
OpenAI’s v3 embedding models have also been released recently. These models were trained using a technique that allows the embeddings to be shortened by removing values from the end of the vector without losing their concept-representing properties [
45].
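With the OpenAI API, this shortening is exposed directly through the optional dimensions parameter; the sketch below follows the public Python client at the time of writing, and the input text is illustrative.

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input=["A másodfokú bíróság az ítéletet helybenhagyta ..."],
        dimensions=256,  # optional: shorten the native 3072-dimensional vector
    )
    vector = response.data[0].embedding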
4.1.1. Handling Long Texts
Transformer-based models excel at capturing semantic relations. However, one major drawback is that these models can only handle a fixed-length token window. This is not a problem if the text is shorter than the context window of the model, but the question arises of how to handle texts that do not fit into this window. To address this, we developed and compared seven different approaches, including Last Chunk Scaling (LCS), which is particularly useful for texts that are approximately 2–5 times longer than the context window. The approaches are as follows:
truncated: this is the simplest approach, keeping only the first context window’s worth of tokens from the document. Note that different models thus cover different numbers of characters. This is the baseline.
chunked: splits the document into context-window-length chunks, ensuring that the split does not cut words in half but only occurs at word ends. The vectors are calculated for these chunks and averaged, resulting in a document vector.
stride: creates chunks similarly to the chunked approach but ensures that tokens overlap between consecutive chunks, as described in [17]. The chunk vectors are averaged to obtain a document vector. We tested two different stride sizes: 25% of the context window and a fixed 16 tokens.
last chunk scaling (LCS): this technique can be applied alongside the chunked or stride methods. After chunking the document, the last chunk generally contains fewer tokens than the context window. Our solution scales the last chunk’s vector by the factor $\frac{\text{tokens in last chunk}}{\text{context window length}}$. This method was tested with the chunked and both stride approaches (25% and fixed 16 tokens).
The pseudo code for all of the chunking strategies is shown in Algorithm 1.
Algorithm 1 Chunking strategies.

function GetVector(tokens, max_len, stride, lcs)
    Input: tokens: list of tokens
    Input: max_len: maximum tokens per chunk
    Input: stride: striding window size, 0 means no striding
    Input: lcs: whether to use last chunk scaling
    slices ← GetChunks(tokens, max_len, stride)
    vectors ← VectorizeSlices(slices)
    if lcs then
        scale_factor ← len(slices[-1]) / max_len
        vectors[-1] ← vectors[-1] * scale_factor
    end if
    return Average(vectors)
end function

function GetChunks(tokens, max_len, stride)
    Input: tokens: list of tokens
    Input: max_len: maximum tokens per chunk
    Input: stride: number of overlapping tokens (0 means no overlap)
    slices ← []
    word_ends ← GetWordEndPositions(tokens)
    start_pos ← 0
    while start_pos < len(tokens) do
        end_pos ← max({e ∈ word_ends : e ≤ start_pos + max_len})
        current_slice ← tokens[start_pos:end_pos]
        slices.append(current_slice)
        if end_pos ≥ len(tokens) − 1 then    ▹ the last chunk has been reached
            break
        end if
        start_pos ← min({e ∈ word_ends : e ≥ end_pos − stride}) + 1    ▹ +1 moves to the next word’s start position
    end while
    return slices
end function

function GetWordEndPositions(tokens)
    Input: tokens: list of tokens
    Output: word_ends: list of indices marking the ends of complete words
    word_ends ← []
    for each i in 0 … len(tokens) − 1 do
        if tokens[i] is the end of a word or punctuation then
            word_ends.append(i)
        end if
    end for
    return word_ends
end function

function VectorizeSlices(slices, vectorization_model)
    Input: slices: list of token chunks
    Input: vectorization_model: model used for vectorization
    Output: vectors: list of vectors corresponding to the slices
    vectors ← []    ▹ initialize an empty list for vectors
    for each slice in slices do
        vector ← vectorization_model.vectorize(slice)
        vectors.append(vector)
    end for
    return vectors
end function
Both the stride and the LCS approaches aim to mitigate the chunking method’s major drawback, namely that the chunks are completely separated from each other. While the stride method addresses this issue by sharing tokens between the chunks, the LCS approach works differently: it prevents the unwanted overweighting of the last, often smaller chunk when averaging the chunk vectors. This is important because such overweighting can easily distort the direction of the average vector, particularly when there are 2–5 chunks per document. Our case fell into this category, as the Facts documents consisted of 2–8 chunks on average, depending on the vectorizer. An example showing the effect of Last Chunk Scaling is shown in
Figure 1.
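A minimal Python sketch of this averaging step, corresponding to the GetVector function in Algorithm 1, is shown below; the function and argument names are illustrative.

    import numpy as np

    def document_vector(chunk_vectors, chunk_token_counts, max_len, lcs=True):
        # chunk_vectors: one embedding per chunk; chunk_token_counts: tokens per chunk;
        # max_len: the model's context window length.
        vectors = np.array(chunk_vectors, dtype=float)
        if lcs and len(vectors) > 1:
            # Down-weight the last (usually shorter) chunk by its relative length.
            vectors[-1] *= chunk_token_counts[-1] / max_len
        return vectors.mean(axis=0)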
All in all, seven approaches were tested on the models having at most 512 tokens: truncated, chunked, chunked+LCS, stride (25%), stride (25%)+LCS, stride (fix 16), and stride (fix 16)+LCS.
4.1.2. Vectorization Architecture
The vectorization architecture diagram is shown in
Figure 2. First, the chunks are generated from the documents. These chunks are then fed into the vectorizer (embedding) model. Vectorization can be done either by calling the vectorizer on a single chunk or on a batch; however, vectorizing batches is more computationally intensive. The vectorizer used the Sentence-BERT [25] framework for the majority of the models, except for fasttext and hubert: fasttext is not transformer-based, and in the case of hubert we used the [CLS] embedding instead of the default S-BERT token averaging. The document vector is finally obtained by averaging the chunk vectors.
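The difference between the two pooling variants mentioned above can be sketched as follows with the Hugging Face transformers library; "SZTAKI-HLT/hubert-base-cc" is the publicly available huBERT checkpoint, and the input sentence is illustrative.

    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("SZTAKI-HLT/hubert-base-cc")
    model = AutoModel.from_pretrained("SZTAKI-HLT/hubert-base-cc")

    inputs = tokenizer("A felperes keresetet nyújtott be a járásbírósághoz.",
                       return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state           # (1, seq_len, 768)

    cls_vector = hidden[:, 0]                                 # [CLS] pooling (used for hubert)
    mask = inputs["attention_mask"].unsqueeze(-1)
    mean_vector = (hidden * mask).sum(1) / mask.sum(1)        # S-BERT-style mean pooling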
4.2. Evaluation Metrics
As evaluation metrics, we applied two widely used measures: Mean Reciprocal Rank (MRR) and Recall at n (R@n), as well as another metric, namely the Cosine Similarity Difference (CSD). The exact calculations of these are described below.
4.2.1. Mean Reciprocal Rank
The mean reciprocal rank is calculated as follows:
$\mathrm{MRR} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{\mathrm{rank}_i}$
where $n$ is the number of documents used during the evaluation (in our case, 100 documents) and $\mathrm{rank}_i$ is the position at which the ground-truth document was retrieved for query $i$. Mean reciprocal rank is a metric that can be applied when the expected document for a given query is known, which was the case with the dataset we created, described in
Section 3.1.
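A minimal implementation sketch of this formula (the function name is illustrative):

    def mean_reciprocal_rank(ranks):
        # ranks[i] is the 1-based position at which the ground-truth document
        # was retrieved for query i.
        return sum(1.0 / r for r in ranks) / len(ranks)

    # Ground-truth documents retrieved at positions 1, 3 and 2:
    # mean_reciprocal_rank([1, 3, 2]) == (1 + 1/3 + 1/2) / 3 ≈ 0.611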
4.2.2. Recall at n (R@n)
Recall at n is a metric used to compare different solutions by measuring whether the relevant document was retrieved within the top n documents. It is calculated by dividing the number of relevant items retrieved in the top n by the total number of relevant items. For example, if the ground-truth document is A and the retrieved list of documents is [C, D, A, E, F], the R@1 score is 0, but the R@3 score is 1. Note that in our setting, where each query has exactly one relevant document, Precision@1 equals Recall@1: if the first retrieved document is not the ground truth, it counts both as a false positive and as a false negative.
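In this single-relevant-document setting, R@n for one query reduces to a membership test, as in the sketch below (names illustrative):

    def recall_at_n(retrieved, ground_truth, n):
        # 1 if the ground-truth document appears among the top-n retrieved documents.
        return int(ground_truth in retrieved[:n])

    # Example from the text: ground truth A, retrieved list [C, D, A, E, F]
    assert recall_at_n(["C", "D", "A", "E", "F"], "A", n=1) == 0
    assert recall_at_n(["C", "D", "A", "E", "F"], "A", n=3) == 1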
4.2.3. Cosine Similarity Difference (CSD)
A major drawback of the MRR metric is that it does not take into account whether the retrieved document is similar to, though not identical with, the expected one. Cosine similarity is a widely used metric to compare two embedded texts semantically. To measure how close the retrieved documents are to the ground-truth one, we calculated the cosine scores for the given query and took the difference between the score of the first retrieved document and that of the ground-truth document.
Example: The retrieval for a given query retrieved the following order of documents: [C, B, A] with cosine scores [0.98, 0.96, 0.95], respectively. The ground truth document for the query is A. The difference is calculated as 0.98 − 0.95 = 0.03. Note that if the first document is the ground truth one, this score is 0. To make the comparison easier, we also multiplied the results by 100, so in the example above the result would have been 3. Another important point is that CSD scores cannot be compared across different vectorization models, as the retrieval scores vary between methods. However, this measure can still be used to compare different chunking methods.
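A minimal sketch of the per-query CSD computation (names illustrative):

    def cosine_similarity_difference(retrieved, scores, ground_truth):
        # retrieved: documents ordered by decreasing cosine score for the query;
        # scores[i]: cosine score of retrieved[i]; the result is scaled by 100.
        ground_truth_score = scores[retrieved.index(ground_truth)]
        return (scores[0] - ground_truth_score) * 100

    # Example from the text: [C, B, A] with scores [0.98, 0.96, 0.95], ground truth A:
    # cosine_similarity_difference(["C", "B", "A"], [0.98, 0.96, 0.95], "A") ≈ 3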