1. Introduction
The advent of extremely large language models (LLMs) in the past decade has pushed Natural Language Processing (NLP) for under-resourced languages beyond all foreseen expectations. While the building and training of these LLMs has been an impetus for low-resource NLP, the deployability and sustainability of these technologies for real-world use cases is an often ignored secondary aspect. Even though multilingual models such as mBERT [1] and XLM-R [2] excel at low-resource and multilingual NLP, they often fall short in this second respect because they are extremely large models with vocabularies covering hundreds of languages, which may not be necessary when deploying a model for a single low-resourced language. Unlike high-resourced languages, under-resourced languages often lack a dedicated monolingual language model, such as CamemBERT [3] for French or RobBERT [4] for Dutch, making large jointly trained multilingual models a necessary evil. While one can argue that mBERT and XLM are still deployment-friendly in some ways, the trend toward an exponential rise in parameters will soon make it impossible to deploy research-grade released models. This is already the case for the mT5-XXL (13 billion parameters) [5] and the Turing ULR (4.6 billion parameters) [6] series of models, which are currently state of the art on the XTREME [7] dataset, a comprehensive benchmark for cross-lingual transfer learning covering a large variety of NLP tasks and languages.
While there have been significant strides forward in reducing model footprints, inference, and training times with methodologies such as Distillation, Quantization, and Pruning, these methodologies are often tested in a general direction, i.e., reducing a multilingual model as a whole, or in a task-specific setting, i.e., creating a smaller model specialized for a particular task. In this work, we apply the ideas behind knowledge distillation to large pre-trained multilingual models in order to filter knowledge specific to a target language into a new, smaller, and faster student language model that performs identically to, or in some cases even outperforms, the teacher. The main contribution of this paper is to dive deep into standard knowledge distillation practices and explore optimal strategies to distill individual target languages from a large multilingual model.
The first objective of the proposed research is to explore the standard knowledge distillation setup designed for generic full-model distillation for two widely used multilingual models, i.e., multilingual BERT (mBERT) and XLM-RoBERTa (XLM-R). Importantly, we attempt to keep only the information for a single target language in the student. We build upon the pilot experiments for
Eliquare, first proposed in Singh and Lefever [
8], and perform all experiments on a set of six carefully selected languages accounting for as much variation as possible with regard to their typologies, language families, and available resources. We consider Dutch and French representative of high-resourced languages, Hindi and Hebrew of moderately resourced languages, and Swahili and Slovene of low-resourced languages. For each language, we evaluate the obtained distilled students on two downstream tasks: a syntactic word-level task such as Part-of-Speech Tagging and a semantic sentence-level task such as Sentiment Analysis.
A second, and perhaps more vital, objective of this research is to propose ideas that specifically benefit the construction of students for low-resourced languages, i.e., Swahili and Slovene in our case. We attempt to do this in two stages. Firstly, we explore the principles behind altering the vocabularies of the final student to better suit the low-resource setting. While joint models have large combined vocabularies which assist in multilingual aspects, a distilled student model only requires the vocabulary of a single target language. While the high-resourced languages used in our work (Dutch and French) have enough sub-words in the multilingual vocabulary to adequately represent the language space, the middle- and under-resourced languages have an extremely poor representation. In mBERT, for example, a medium-resourced language such as Hebrew has around 2483 sub-words in the vocabulary, accounting for approximately 2% of the whole vocabulary, while Thai only has 370 sub-words, amounting to around 0.3% of the vocabulary. We therefore explore techniques to reduce the vocabulary size both pre- and post-distillation while keeping the performance consistent across all benchmarks. Secondly, we perform a detailed ablation study to explore which components and hyper-parameters specifically impact the performance of the distilled student in the low-resource setting. We specifically dive deeper into the two most vital components of the distillation framework: the loss function and the softmax temperature.
The remainder of this paper is structured as follows. In
Section 2, we first describe relevant related research on knowledge distillation and shortcomings of large multilingual language models for low-resourced settings, and
Section 3 discusses the fundamental principles of classic knowledge distillation and builds from the DistilBERT [
9] setup towards a language-specific distillation setting.
Section 4 discusses the experimental setup and the results of the basic setups, demonstrating the viability of the proposed
Eliquare methodology.
Section 5 and
Section 6 venture further into advanced modifications to the distillation setup that suit a low-resource language setting:
Section 5 discusses altering the vocabularies of the multilingual models to accommodate only a low-resourced language, while also speeding up the distillation process, and
Section 6 discusses the impact of some of the key hyper-parameters on the student models for Slovene and Swahili.
Section 7 concludes this paper by summarizing our findings and suggesting ideas for future research.
3. Language Distillation Setup
We begin the system description by explaining the fundamental principles behind a distillation setup in more detail. While there are exceptions that forgo the standard logit setup and use ideas such as mutual information [
28] and graph-based methods [
29], most distillation methodologies share a few common principles at the core. Distillation, as previously described, can be simply thought of as the task of finding the approximation
$$S(x) \approx T(x),$$
where $S(x)$ is the student model's final output for the training data $x$ and $T(x)$ is the teacher model's final output on the same data. There can be three broad variables in a distillation setup. Firstly, there is the data used for distillation, which determines the type of knowledge being distilled. For instance, to distill a specialized model for Natural Language Inference (NLI) (NLI is a sentence-pair task that, given a premise, evaluates whether a hypothesis is an entailment, a contradiction, or unrelated to the premise), only information vital for NLI needs to be filtered from the teacher. This can be done by imitating the teacher's knowledge on an NLI dataset, implying that any stored information not relevant to NLI can be forgotten. A few approaches experiment with augmenting data to boost the learning towards a target task, but this is usually useful in task-specific settings where labeled data is a requirement for distillation. The second variable is the loss function, which essentially determines how we choose to compare the student to the teacher during the learning stage. Given a loss metric $L(x, y)$ and a teacher and student prediction on a sample $i$ represented by $T(x_i)$ and $S(x_i)$, a minimization objective over a dataset of size $N$ can be defined as
$$\min \sum_{i=1}^{N} L(S(x_i), T(x_i)).$$
The third and final variable is how the student model is set up, primarily its architecture and initialization. While most approaches work with an architecture identical to the teacher but with a smaller number of layers, there has been work that adopts simpler architectures for the student than for the teacher. A number of initialization strategies have also been explored, since a better initialization can heavily impact the distillation outcome, as shown by Turc et al. [
30].
Regarding the first variable, i.e., the distillation data, our goal is to distill knowledge relevant to a single target language, which is why we use the entire latest Wikipedia dump of the target language. The minimization objective for a given target language $t$ can then simply be modified as
$$\min \sum_{i=1}^{N_t} L(S(x_i), T(x_i)), \qquad x_i \in D_t,$$
where $D_t$ denotes the Wikipedia data for target language $t$ and $N_t$ its number of samples.
For the second variable, i.e., the distillation objective, Hinton et al. [
11] introduced two vital contributions, which have since become fundamental building blocks of most distillation setups. Firstly, the error function $L(x, y)$ is defined as the cross-entropy $L_{ce}$ between the student and teacher logits:
$$L_{ce} = -\sum_{j} t_j \log(s_j),$$
where $t_j$ and $s_j$ denote the teacher and student probabilities for class $j$.
Secondly, Hinton et al. also introduced the concept of softmax temperature. Instead of using the logits from the teacher directly in the error function $L_{ce}$, they propose using soft targets, determined by a preset temperature value. Given a temperature value $T$ and $z_j$ representing the $j$-th output logit over $K$ classes, the soft targets can be generated with
$$p_j = \frac{\exp(z_j / T)}{\sum_{k=1}^{K} \exp(z_k / T)},$$
thus softening the probability distribution of the logits if $T > 1$ or hardening the distribution if $T < 1$. Softening the targets can produce stable training that reduces the impact of noisy labels from the teacher model, while hardening can be more useful for faster convergence when distillation data is hard to come by. Sanh et al.'s [9] setup inherits the temperature-based soft targets and uses the cross-entropy between the soft targets as the error function $L_{ce}$. Two additional loss functions, a cosine loss $L_{cos}$ and a standard masked language modeling loss $L_{mlm}$, are used in addition to $L_{ce}$. While $L_{cos}$ is expected to minimize the cosine distance between the soft targets and the student logits, $L_{mlm}$ adds a component that learns directly from the data instead of the teacher outputs. This can serve as a self-correction for those examples where the teacher is not always reliable, while also speeding up training by adding an additional learning signal directly from the ground truth. The three losses are combined into a preset weighted sum,
$$L = \alpha_{ce} L_{ce} + \alpha_{mlm} L_{mlm} + \alpha_{cos} L_{cos}.$$
While for the initial setups we inherit the preset loss weights ($\alpha_{ce}$, $\alpha_{mlm}$, $\alpha_{cos}$) and the softmax temperature ($T = 2.0$) from Sanh et al. [
9], we discuss the impact of these components further in
Section 6.
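To make the combined objective concrete, the following is a minimal PyTorch sketch of the three-part loss described above. The tensor shapes, the default weight values, and the choice to compute the cosine term directly over the softened output distributions are illustrative assumptions rather than the exact Eliquare configuration (DistilBERT's public implementation, for instance, computes the cosine term over hidden representations instead).

```python
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, mlm_labels,
                      temperature=2.0, alpha_ce=5.0, alpha_mlm=2.0, alpha_cos=1.0):
    """Three-part distillation objective over MLM logits of shape (batch, seq, vocab)."""
    # Temperature-softened distributions: soft targets for the teacher, log-probs for the student.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)

    # L_ce: cross-entropy between the teacher's soft targets and the student distribution.
    loss_ce = -(soft_teacher * log_soft_student).sum(dim=-1).mean()

    # L_mlm: standard MLM loss against the ground-truth tokens
    # (non-masked positions carry the label -100 and are ignored).
    loss_mlm = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                               mlm_labels.view(-1), ignore_index=-100)

    # L_cos: cosine distance between the soft targets and the student's softened outputs.
    loss_cos = 1.0 - F.cosine_similarity(soft_teacher, log_soft_student.exp(), dim=-1).mean()

    # Weighted combination of the three components (the alpha values here are illustrative defaults).
    return alpha_ce * loss_ce + alpha_mlm * loss_mlm + alpha_cos * loss_cos
```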
For the third and final variable, i.e., the student model’s setup, we use an architecture identical to the teacher, but with 6 encoder layers, in contrast to the 12 teacher layers. We attempt two alternate setups by changing the vocabulary of the teacher pre-distillation or of the student directly through post-distillation. This is further covered in
Section 5, as the initial experiments did not involve any changes to the vocabulary. Another important part of this variable is the initialization of the student. We follow the general approach [
9] where the student is initialized from the teacher’s layers. The authors explore the initialization of the student with the first 6 layers, or the final 6 layers of the teacher model, but concluded that using alternating layers of the teacher offers the best initialization, i.e., layer
$n$ of the student is initialized from layer $2n$ of the teacher, and we, therefore, adopt an identical initialization.
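As an illustration, one straightforward reading of this alternating-layer scheme can be sketched with the Hugging Face Transformers API as shown below; the mapping of student layer $n$ to teacher layer $2n$, as well as the attribute paths, are assumptions of this sketch and not a verbatim reproduction of the original extraction code.

```python
import copy

from transformers import BertForMaskedLM


def init_student_from_teacher(teacher, num_student_layers=6):
    # Student config is identical to the teacher's, except for the number of encoder layers.
    student_config = copy.deepcopy(teacher.config)
    student_config.num_hidden_layers = num_student_layers
    student = BertForMaskedLM(student_config)

    # Copy the embeddings and the MLM head directly from the teacher.
    student.bert.embeddings.load_state_dict(teacher.bert.embeddings.state_dict())
    student.cls.load_state_dict(teacher.cls.state_dict())

    # Alternating-layer initialization: student layer n <- teacher layer 2n.
    for n in range(num_student_layers):
        student.bert.encoder.layer[n].load_state_dict(
            teacher.bert.encoder.layer[2 * n].state_dict())
    return student


teacher = BertForMaskedLM.from_pretrained("bert-base-multilingual-cased")
student = init_student_from_teacher(teacher, num_student_layers=6)
```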
4. Experiments
For the experiments we build upon the pilot experiments discussed in Singh and Lefever [
8] using mBERT [
1], as well as experiment with another state-of-the-art multilingual teacher, i.e., XLM-RoBERTa [
31]. We name our approach
Eliquare, which is Latin for ‘distillation, filtering, or refining’. For both setups, the student (
Eliquare) is initialized from the given teacher (mBERT or XLM-RoBERTa) using the alternating-layer initialization
approach described in
Section 3.
We experiment with six target languages for distillation: French, Dutch, Hindi, Hebrew, Slovene, and Swahili. As can be derived from
Table 2, these languages have been selected because they are varied in terms of typology (they all belong to different language groups), script, and available resources (expressed in the number of available Wikipedia pages; for reference, English has 57.29 million Wikipedia pages). Based on this latter column, we consider Dutch and French as representative of high-resourced languages, Hindi and Hebrew as moderately resourced languages, and Swahili and Slovene as low-resourced languages in our experiments and analyses.
The same Wikipedia dumps of these target languages are used as distillation data in order to construct the Eliquare student models with the basic distillation setup. For each language, we obtain the latest Wikipedia XML dumps and pre-process them for MLM, with a masking probability of 0.15 and word masking, word replacement, and unchanged word proportions of 0.8, 0.1, and 0.1, respectively. We also employ an MLM smoothing parameter (set to 0.7) to emphasize the masking of less frequent words. Next, the pre-processed data is split into training and validation parts with a 90:10 ratio. All students are trained for 10 iterations over this processed data. As learning from larger batches works better for distillation, we opted for a batch size of 32 (8 per device) and performed gradient accumulation for 50 steps (effective batch size of 1600). We use the Adam optimizer. The position embeddings in XLM-RoBERTa are frozen to save some computing time. We store the student model after every epoch and use the version with the best distillation loss on the held-out validation set for the evaluation step.
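A minimal sketch of this masking step is given below, assuming the sentences are already converted to vocabulary ids, that corpus-level counts are available for every token id, and that the count-based exponent implements the smoothing that up-weights rare tokens; the helper is illustrative rather than the exact pre-processing code.

```python
import torch


def mask_tokens(token_ids, token_counts, mask_token_id, vocab_size,
                mlm_prob=0.15, smoothing=0.7):
    """Select ~15% of positions for MLM, favoring rare tokens via count ** (-smoothing)."""
    # Rarer tokens receive proportionally higher masking probability.
    weights = token_counts[token_ids].float() ** (-smoothing)
    num_to_mask = max(1, int(round(mlm_prob * token_ids.size(0))))
    masked_positions = torch.multinomial(weights, num_to_mask, replacement=False)

    labels = torch.full_like(token_ids, -100)   # -100 = ignored by the MLM loss
    labels[masked_positions] = token_ids[masked_positions]

    inputs = token_ids.clone()
    rand = torch.rand(num_to_mask)
    # Of the selected positions: 80% -> [MASK], 10% -> random token, 10% -> unchanged.
    mask_idx = masked_positions[rand < 0.8]
    random_idx = masked_positions[(rand >= 0.8) & (rand < 0.9)]
    inputs[mask_idx] = mask_token_id
    inputs[random_idx] = torch.randint(vocab_size, (random_idx.size(0),))
    return inputs, labels
```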
For the evaluation step, a logical choice could be to look at perplexity and validation loss. However, these are not the best metrics to assess the overall language understanding of an LLM, since they focus on evaluating the Masked Language Modeling objective, rather than general language understanding. Instead, we decided to assess the six monolingual students by fine-tuning them for different language-specific downstream tasks. For each target language, two downstream tasks have been selected, as summarized in
Table 3. One task each time requires higher-level (semantic) sentence understanding (such as Sentiment Analysis or News Classification), while the other is highly syntax-dependent (such as Part-of-Speech Tagging). Please note that for the two under-resourced target languages, Slovene and Swahili, it was not always possible to find available datasets for these tasks. In those cases, we fell back to the task of Named Entity Recognition (NER), which can be perceived as a task requiring both semantic (which real-world entities are referred to) and syntactic (named entities often consist of more than one token) understanding.
For Task 1 we employed Sentiment Analysis data from various sources for three languages: Le et al. [
32] for French, Van der Burgh and Verberne [
33] for Dutch, and Amram et al. [
34] for Hebrew. For Hindi and Swahili we relied on News Genre Classification data from Hindi2Vec (
https://github.com/NirantK/hindi2vec, accessed on 1 January 2023), comprising 14 news classes, and from SNCD (
https://huggingface.co/datasets/swahili_news, accessed on 1 January 2023) comprising 6 news classes in Swahili. Due to the unavailability of a suitable semantic sentence-level task for Slovene, we used NER data from Rahimi et al. [
35] as an alternative. For Task 2 we relied on the Universal Dependencies (
https://universaldependencies.org, accessed on 1 January 2023) (UD) project, which comprises treebanks with unified POS-tagged data for French (GSD), Dutch (Lassy-small), Hebrew (HTB), Hindi (HDTB), and Slovene (SSJ). Since there is no UD (or other) treebank publicly available for Swahili, we fell back to NER and used NER data from the Masakhane initiative [
36].
We train the student of the respective language individually for each downstream task for 10 epochs, with a decay of 0.01 applied after 500 warmup steps. We select the best model on the validation set (train-validation-test splits are used as provided by the datasets; when these are not provided, an 80-10-10 split is used). All tasks are evaluated using the F1-score, except Task 1 for Dutch (DBRD), which is evaluated with accuracy to allow comparison with the upper bound.
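The fine-tuning step itself follows a fairly standard sequence-classification recipe; a hedged sketch for the Swahili news task is given below, where the student checkpoint name is a placeholder, the 0.01 decay is interpreted as weight decay, the learning rate is left at the library default, the validation split is illustrative, and argument names may vary slightly across Transformers versions.

```python
import numpy as np
from datasets import load_dataset
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "eliquare-mbert-sw"  # placeholder path to a distilled student
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=6)

dataset = load_dataset("swahili_news")  # 6 news classes, as used for Task 1
dataset = dataset.map(lambda ex: tokenizer(ex["text"], truncation=True), batched=True)
splits = dataset["train"].train_test_split(test_size=0.1, seed=42)  # held-out validation


def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"f1": f1_score(labels, preds, average="macro")}


args = TrainingArguments(
    output_dir="finetune-sw-news",
    num_train_epochs=10,
    weight_decay=0.01,
    warmup_steps=500,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,   # keep the checkpoint with the best validation F1
    metric_for_best_model="f1",
)

trainer = Trainer(model=model, args=args,
                  train_dataset=splits["train"], eval_dataset=splits["test"],
                  tokenizer=tokenizer, compute_metrics=compute_metrics)
trainer.train()
```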
The results of these experiments are presented in
Table 4. Each time we compare the performance of our student models (
Eliquare-mBERT and
Eliquare-XLM) to a similarly sized reference, namely distilmBERT, which serves as our baseline. Moreover, a comparison is made with the two teacher models, and we also report the upper bound (row in gray), which is each time based on a monolingual transformer of the same size as the standard
BERT-base-uncased for English. These upper bounds are therefore much larger than the distilled students and trained with far more monolingual data for the target language, while also having a significantly larger vocabulary specialized for the script in question. The best results per transformer algorithm (BERT/RoBERTa) for each language and task are indicated in bold. From the table, we can observe that the
Eliquare models often perform similarly or in some cases even better than the respective teachers, i.e., mBERT and XLM, which are much larger in size. The statistical significance of
Eliquare-mBERT’s improvement over the teacher mBERT was validated using a left-tailed Wilcoxon signed-rank test. Moreover, in a number of low-resourced settings, specifically for Hebrew (Task 2), Slovene (Task 1), and Hindi (Tasks 1 and 2), the students sometimes even outperform the upper bound. The monolingual performance of the
Eliquare student models further emphasizes the added value of language-specific distillation, since in low-resource settings (Slovene and Swahili) the much more efficient and sustainable student models are able to compete with their larger upper bounds trained on extensive amounts of monolingual data, making them a better choice for deployment in practical scenarios. It is important to stress the advantages of
Eliquare students for sustainability and efficiency. The base
Eliquare student after
vocabulary reduction (
Section 5) has 66 million parameters, 2.5 times fewer than mBERT and 2 times fewer than distilmBERT, while also having a significantly faster inference speed of 0.066 s, compared to mBERT’s 0.384 s (on a single V100 GPU with a batch size of 32).
These results clearly demonstrate that, even with a vanilla distillation setup, it is possible to obtain better monolingual models for low-resourced languages from a multilingual teacher. In the next sections, we further explore the changes that could be made to the vanilla setup to make the language-specific application of distillation even more viable.
5. Vocabulary Manipulation for mBERT
While the
Eliquare distilled student models achieve results on par with their respective multilingual teacher models (see
Table 4), there are still issues that need to be addressed when using them in a monolingual setting. The most vital of these issues pertains to the multilingual vocabulary of these huge multilingual LLMs.
As visualized in
Figure 1, the vocabulary of multilingual models, in this case mBERT, heavily favors Latin-based languages, while having only a meager few thousand sub-words for large language groups such as Indic (6545 to be exact, which can be derived from the circa 12 languages from the Indian sub-continent included in mBERT) and Cyrillic (10 languages and 13,782 sub-words in mBERT). Having a smaller vocabulary for these languages thus means less diverse sub-words, which inevitably results in semantically meaningless single-character, alphabet-based tokens in the vocabulary.
As an example,
Figure 2 shows the tokenization of long words in English, a similarly high-resourced language (Dutch), a medium-resourced language (Hindi), and a low-resourced language (Farsi). We compare the tokenization by mBERT’s WordPiece tokenizer to that of a monolingual model in the respective language. As illustrated in the figure, the tokenization is consistent between mBERT and the monolingual model for English, with sub-words of around three characters on average. However, this changes as we go down the resource ladder. For Dutch, some sub-words are only two characters long, especially sub-words that do not carry much semantic meaning; the final two sub-words for Dutch, however, still have 4–5 characters and retain some abstract sense. Finally, for the last two examples, in Hindi and Farsi, mBERT ends up breaking the word down into individual characters, whereas the monolingual model treats the example as a single whole sub-word.
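This effect is easy to reproduce. The snippet below contrasts mBERT's WordPiece segmentation of a long Dutch word with that of a monolingual Dutch model; RobBERT is used here as the monolingual reference, and the example word is our own rather than the one shown in the figure.

```python
from transformers import AutoTokenizer

word = "arbeidsongeschiktheidsverzekering"  # Dutch for "disability insurance"

mbert_tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
robbert_tok = AutoTokenizer.from_pretrained("pdelobelle/robbert-v2-dutch-base")

print(mbert_tok.tokenize(word))    # many short, largely meaningless pieces
print(robbert_tok.tokenize(word))  # fewer, longer, more meaningful sub-words
```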
These tokenization issues, combined with the poor overall representation of low-resourced languages in the vocabulary space, are a motivation to investigate strategies for altering a multilingual vocabulary for use in a monolingual target-language setting. While XLM-RoBERTa suffers from many of the same issues, the WordPiece tokenizer of mBERT allows some flexibility to alter the vocabulary even after pre-training, whereas the Byte-Pair Encoding (BPE) tokenizer of XLM-RoBERTa is more rigid and does not allow vocabulary deletions or additions as easily. This is why, in this and the next section, we only experiment with mBERT to alleviate this vocabulary issue. However, we do hope to transfer the methodologies to XLM-RoBERTa in future work.
Hypothetically, two stages can be discerned when building a monolingual student with the ideal low-resourced vocabulary. Firstly, mBERT can be purged of any additional sub-words that may not be needed for a particular target language. We will call this the
VocabReduce step. Two alternate methodologies can be used for this step. On the one hand, the distillation can work identically to the basic setup, and the vocabulary can be reduced post-distillation directly from the student by removing unnecessary tokens (as proposed by Abdaoui et al. [
37]). On the other hand, the vocabulary can be reduced pre-distillation, i.e., directly from the teacher. By purging additional sub-tokens from the teacher, we ensure that the student does not initialize the vocabulary for the additional sub-words. Pre-distillation reduction has a significant advantage over post-distillation reduction, as the distillation itself can proceed significantly faster: the sizes of both the student and the teacher are reduced beforehand, reducing the number of parameters and, by extension, the computing time of each iteration.
For the second stage, additional, richer sub-words for the target language could be added to prevent the tokenization from producing meaningless character-based sub-words; we call this step
VocabAmp. The vocabulary setups for all the discussed methodologies are summarized in
Figure 3. It should be noted that the
VocabAmp step is more complex: in order to learn additional representations for non-existent sub-words, one needs to rely exclusively on external data, since the teacher does not possess representations for these missing sub-words. Moreover, given a mismatch between the logits of the teacher and the student, the standard distillation loss cannot be computed, since it relies on the divergence between the teacher and student logits. Due to these additional challenges, we consider
VocabAmp beyond the scope of this work and focus on
VocabReduce.
We perform experiments with pre- and post-distillation
VocabReduce for all six languages with mBERT as the teacher. We initialize a list of sub-words of the target language that we would like to retain by tokenizing the respective Wikipedia dump and selecting sub-words that occur in at least 0.05% of the sentences. We then reinitialize the transformer's embedding layer and tokenizer so that only the selected sub-words are retained. For pre-distillation reduction, we apply this technique directly to the teacher, while for post-distillation reduction we apply it to the student after the distillation process.
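A hedged sketch of this VocabReduce step for an mBERT-style model is given below, assuming `sentences` is an iterable over the target-language Wikipedia sentences; the 0.05% sentence-frequency threshold follows the description above, while the mechanics of slicing the embedding matrix, rebuilding the tied MLM output layer, and writing a reduced vocab.txt are one possible implementation rather than the exact code used in our experiments.

```python
from collections import Counter

import torch
from transformers import BertForMaskedLM, BertTokenizer


def reduce_vocabulary(model, tokenizer, sentences, min_sentence_fraction=0.0005):
    """Keep only special tokens and sub-words occurring in >= min_sentence_fraction of sentences."""
    # Count in how many sentences each sub-word occurs.
    doc_freq, n_sentences = Counter(), 0
    for sentence in sentences:
        n_sentences += 1
        doc_freq.update(set(tokenizer.tokenize(sentence)))

    keep = list(tokenizer.all_special_tokens)
    keep += [tok for tok, freq in doc_freq.items()
             if freq / n_sentences >= min_sentence_fraction and tok not in keep]
    old_ids = tokenizer.convert_tokens_to_ids(keep)

    # Slice the word-embedding matrix so that only the retained rows remain.
    hidden = model.config.hidden_size
    old_emb = model.get_input_embeddings().weight.data
    new_emb = torch.nn.Embedding(len(keep), hidden)
    new_emb.weight.data = old_emb[old_ids].clone()
    model.set_input_embeddings(new_emb)

    # The MLM output layer is tied to the embeddings: rebuild it (and its bias)
    # with the same row selection so the logits match the reduced vocabulary.
    old_bias = model.cls.predictions.decoder.bias.data
    new_decoder = torch.nn.Linear(hidden, len(keep))
    new_decoder.weight = new_emb.weight
    new_decoder.bias.data = old_bias[old_ids].clone()
    model.cls.predictions.decoder = new_decoder
    model.config.vocab_size = len(keep)

    # Write a reduced vocab.txt and load a WordPiece tokenizer restricted to it.
    with open("reduced_vocab.txt", "w", encoding="utf-8") as f:
        f.write("\n".join(keep) + "\n")
    return model, BertTokenizer("reduced_vocab.txt", do_lower_case=False)


teacher = BertForMaskedLM.from_pretrained("bert-base-multilingual-cased")
mbert_tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
# reduced_model, reduced_tokenizer = reduce_vocabulary(teacher, mbert_tokenizer, sentences)
```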
Table 5 shows the results of the experiments on the two tasks for each of the six target languages. Post-distillation VocabReduce results in near-identical performance to the basic distillation setup, since the reduction only takes place afterwards. Pre-distillation VocabReduce shows minor variance, sometimes performing better and sometimes worse than post-distillation; however, it is consistently faster to train due to the significant reduction in the model's embedding layer size. Since the performance difference is barely noticeable, pre-distillation VocabReduce should be the go-to methodology, given the additional advantages it offers.
6. Analysis for Low-Resourced Settings
While the basic distillation setups seem to be quite robust in obtaining comparable performance to the multilingual counterparts and in some cases even comparable to the respective upper-bounds, we look into further adaptations that can be made to make distillation setups more suitable for the low-resourced setting. To this end, we perform an ablation study with two vital parts of the distillation pipeline:
Loss Components: we attempt to find the most and least impactful components of the three-fold loss function to better tune loss weights for low-resourced settings.
Softmax Temperature: while softening the distribution with a temperature of 2.0 is standard practice in most distillation settings, we dig deeper and see if hardening or further softening can have an impact in the low-resourced setting.
To study the impact of these two variables, we perform additional experiments for the two low-resourced languages: Slovene and Swahili. For the baseline setup, we use the distilled student from the previous section with pre-distillation vocabulary reduction using mBERT as the teacher.
For the first ablation study we thus experiment with the three-fold loss function. The results are presented in
Table 6, where the baseline scores (row 1) represent the setup from
Section 4, with the losses weighted with alpha values of 5.0, 1.0, and 2.0, respectively. The second row gives a general indication of the performance when all losses are weighted equally, while the next three rows show the impact of the individual loss components by removing them from the setup; in each case we notice a drop in performance.
Figure 4 also provides a visual intuition of the trends by visualizing this drop in performance in the even weights setting (row 2). The figure demonstrates that each loss component is vital to the setup, which is in line with the consistent drop in performance (rows 3–5) when removing any of the losses. It is also possible to infer from the figure that
$L_{mlm}$ is the most pivotal component of the loss function. This is quite an intuitive finding, since the student models often perform better than mBERT, their teacher. For these students to learn information missing from the teacher, they have to rely on knowledge that is not present in the teacher but comes from external sources. In that respect, $L_{mlm}$ is the only component able to provide such an external signal. This especially holds in a low-resourced setting, where mBERT's signals may not always be reliable.
For our second ablation study, we experiment with the softmax temperature ($T$). The results are presented in
Table 7. While a $T$ of 2.0 was used in the baseline experiments in
Section 4, four additional experiments have been performed: for two of them the distribution was further softened with a $T$ of 3.0 and 4.0, one uses the unchanged logits from the teacher ($T = 1.0$), and for another the distribution was hardened ($T < 1.0$). While at first sight the other setups seem only marginally worse, the baseline setup with a $T$ of 2.0 is consistently better. This indicates that further softening or hardening the logits does not specifically benefit the student in a low-resourced setting.
Figure 5 elaborates on this finding, as it shows the drop in performance from the peak F1 score at $T = 2.0$. While there are some anomalies, it seems that the further we move from the optimal $T$ of 2.0, the worse the performance becomes. It is also important to note that tasks such as POS tagging for Slovene seem to be quite robust to changes in $T$. However, this might only be because the dataset is comparatively easy and performance is already quite saturated, with extremely high scores of the order of 0.984.
7. Conclusions
In this work, we have further explored and improved upon the novel language distillation methodology first introduced in Singh and Lefever [
8], where it was tested for mBERT [
1]. In this research, we have extended the approach to the more robust and state-of-the-art XLM-RoBERTa [
31] and demonstrated its efficacy. Similarly to the language-distillation systems developed from mBERT, the
Eliquare distillation of XLM-RoBERTa is able to produce consistent student models for all six languages. These languages were carefully selected to account for as much variation as possible with regard to their typologies, language families, and available resources. We considered Dutch and French representative of high-resourced languages, Hindi and Hebrew of moderately resourced languages, and Swahili and Slovene of low-resourced languages. The experimental results confirmed that language distillation is viable, especially in low-resourced settings, and the resulting students were often able to outperform their multilingual teacher models while being up to four times smaller and six times faster at inference than their respective teachers.
The objective of this research was to further progress research in low-resourced languages, in particular by building systems for these languages on top of existing large multilingual models. This area was explored further by looking into the manipulation of the vocabulary of the resulting student models. Two different strategies were proposed to reduce a multilingual vocabulary to a monolingual one as part of the distillation process. We showed that pre-distillation VocabReduce is the better strategy, since it performs just as well as the alternative, post-distillation VocabReduce, while saving computing time.
In addition, we also explored the impact of the different loss components on the students for the two low-resourced languages. We discovered that $L_{mlm}$ is the most impactful component of the triplet loss. However, all losses contribute to the performance, and the ablation of any component results in a drop in performance.
Finally, we also investigated optimal softmax temperatures in the low-resourced setting and concluded that the default value of $T = 2.0$ is optimal; further softening or hardening of the logits results in a drop in performance.
In future work, we would like to venture into more advanced distillation setups described in
Section 2, such as TinyBERT [
21] and MobileBERT [
38], with additional loss components such as Feature Map Transfer and Attention Transfer. We also aim to explore alternate teacher–student setups with multiple teachers, and the construction of bilingual students for two typologically related languages. A logical next step then will be to research strategies for
VocabAmp, while also modifying the
VocabReduce technique for application to XLM-RoBERTa.