4.3. Baseline Models
In recent years, most research on named entity recognition has been based on deep learning. We tried some traditional machine learning methods for comparison, but their results were not satisfactory, so our contrast models were built from common neural networks or cross-language pre-training models; to the best of our knowledge, we are the first to use a cross-language pre-training model to perform NER in Uyghur and Hungarian. We compared our LRLFiT model with several baseline models in different categories, including CNN-LSTM, BiLSTM, BiGRU, mBERT and XLM-R, each of which is described below.
CNN-LSTM: A combination of a CNN [35] and an LSTM [36]. The CNN can be trained on the shape information of vocabulary characteristics and optimizes the model's parameters by sharing them, while the LSTM draws its conclusions according to the context.
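As an illustration only, the following is a minimal PyTorch sketch of such a character-CNN plus LSTM tagger; the layer sizes, vocabulary handling and classifier head are assumptions for the example, not the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn

class CNNLSTMTagger(nn.Module):
    """Character-level CNN features concatenated with word embeddings, fed to an LSTM."""
    def __init__(self, word_vocab, char_vocab, num_labels,
                 word_dim=100, char_dim=30, char_filters=30, hidden=128):
        super().__init__()
        self.word_emb = nn.Embedding(word_vocab, word_dim)
        self.char_emb = nn.Embedding(char_vocab, char_dim)
        # The CNN shares its filter weights across all character positions.
        self.char_cnn = nn.Conv1d(char_dim, char_filters, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(word_dim + char_filters, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, word_ids, char_ids):
        # word_ids: (batch, seq_len); char_ids: (batch, seq_len, max_word_len)
        b, s, c = char_ids.shape
        chars = self.char_emb(char_ids).view(b * s, c, -1).transpose(1, 2)
        char_feat = torch.relu(self.char_cnn(chars)).max(dim=2).values.view(b, s, -1)
        x = torch.cat([self.word_emb(word_ids), char_feat], dim=-1)
        out, _ = self.lstm(x)            # contextual features for each token
        return self.classifier(out)      # (batch, seq_len, num_labels)
```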
BiLSTM: The long short-term memory network (LSTM) was proposed to solve the long-term dependence problem of recurrent neural networks when sentences are long and carry too much information. A forward LSTM is combined with a backward LSTM to form the BiLSTM [37]; its output is the joint result of two recurrent networks running in opposite directions and can be used to predict the probability that each word belongs to each label.
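A minimal PyTorch sketch of this bidirectional combination is shown below; the embedding and hidden sizes are illustrative assumptions, and the softmax head simply turns the concatenated forward and backward states into per-token label probabilities.

```python
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Forward and backward LSTMs whose joint output scores each word's labels."""
    def __init__(self, vocab_size, num_labels, emb_dim=100, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        # bidirectional=True runs one LSTM left-to-right and one right-to-left.
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_labels)

    def forward(self, token_ids):
        out, _ = self.bilstm(self.emb(token_ids))   # (batch, seq_len, 2 * hidden)
        logits = self.classifier(out)               # per-token label scores
        return logits.softmax(dim=-1)               # probability of each label
```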
BiGRU: The GRU (gated recurrent unit) is a variant of the LSTM (long short-term memory) and an improved model of the recurrent neural network (RNN). Like the LSTM, the GRU is well suited to processing sequence data and can remember the information of previous nodes through its gate mechanism, as sketched below. In the BiGRU neural network, context information is obtained from front to back and from back to front at the same time to improve the accuracy of feature extraction [38]. BiGRU has the advantages of low dependence on word vectors, low complexity and a fast response time.
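For reference, one common formulation of the GRU gate equations is given below (update gate z_t, reset gate r_t, candidate state h̃_t); sign conventions for the update gate vary between references, so this should be read as the textbook form rather than a detail specific to our implementation.

```latex
\begin{aligned}
z_t &= \sigma\left(W_z x_t + U_z h_{t-1} + b_z\right) \\
r_t &= \sigma\left(W_r x_t + U_r h_{t-1} + b_r\right) \\
\tilde{h}_t &= \tanh\left(W_h x_t + U_h \left(r_t \odot h_{t-1}\right) + b_h\right) \\
h_t &= z_t \odot h_{t-1} + \left(1 - z_t\right) \odot \tilde{h}_t
\end{aligned}
```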
mBERT: Multilingual BERT [15] is a transformer model pretrained in a self-supervised fashion on a large multilingual corpus. mBERT follows the same model architecture and training procedure as BERT, except that it is pre-trained on the concatenated Wikipedia data of 104 languages. For tokenization, mBERT uses WordPiece embeddings [39] with a shared vocabulary of 110,000 word pieces to facilitate embedding-space alignment across different languages. This means it was pretrained on raw texts only, with no human labeling (which is why it can use large amounts of publicly available data), using an automatic process to generate inputs and labels from those texts.
XLM-R: XLM-R [8] shows that pretraining multilingual language models at scale leads to significant performance gains on a wide range of cross-lingual transfer tasks. It trains a transformer-based masked language model on one hundred languages, using more than two terabytes of filtered CommonCrawl data, and significantly outperforms multilingual BERT (mBERT) [15] on a variety of cross-lingual benchmarks.
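To make the baseline setting concrete, the sketch below shows one way to fine-tune the public xlm-roberta-base checkpoint for token classification with the HuggingFace transformers library; the BIO label set, the example sentence and the absence of a training loop are simplifications we introduce here, not the exact setup of our experiments.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical BIO label set for PER/ORG/LOC; the real datasets may use more tags.
labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base", num_labels=len(labels)
)

# Placeholder sentence; in practice the inputs would be Uyghur or Hungarian text,
# and word-level gold labels would be aligned to sub-word pieces before training.
encoding = tokenizer("George Washington lived in Virginia.", return_tensors="pt")
with torch.no_grad():
    logits = model(**encoding).logits      # (1, seq_len, num_labels)
predictions = logits.argmax(dim=-1)        # predicted label index for each piece
```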
4.5. Results and Analysis
We analyzed the experimental results on the two low-resource datasets in terms of precision, recall and F1. Combining the characteristics of the cross-language pre-training model with its experimental performance, we obtained evidence that it is better than the previous models. We therefore present the results of XLM-R on the named entity recognition task. Finally, we compare multilingual and monolingual models and present results on the low-resource languages.
From Table 7 and Table 8, we can see that the precision, recall and F1 of our model were the best on both the Uyghur and Hungarian datasets, with the F1 scores of our method reaching 95.00% and 96% for the two languages. First, the F1 of BiLSTM-CRF was higher than that of CNN-CRF, which shows that the bi-directional structure of BiLSTM has a stronger ability to acquire context sequence features than the one-way structure. The recognition effect of the BiGRU model was better than those of the CNN-LSTM and BiLSTM models. The reason is that the average length of labeled words in these datasets was generally a little longer than normal, which helped the BiGRU network capture long-distance information during training, extract entity features more accurately, obtain the effective information in each entity and improve the recognition accuracy.
Among the cross-language pre-training models used in the NER field that performed best on the Uyghur and Hungarian data, the latest is the Facebook team's XLM-R. We provide Multilingual BERT training as a reference, and its results were not satisfactory. On the basis of Multilingual BERT, a new pre-training task, the translation language model, was added. Therefore, an experimental comparison with Multilingual BERT was added to Table 6. Finally, our LRLFiT model outperformed both XLM-R and Multilingual BERT in every setting. All of these observations point to the importance of carefully collecting corpora to generate pre-trained language models for each language, especially for languages with fewer resources, which are often under-represented in large multilingual models.
In order to solve the problem of low-resource data with few annotations, a solution based on a cross-language model was proposed to establish a reliable model and make accurate predictions from existing knowledge. The model works as a neural network with self-attention characteristics.
As can be seen in Figure 3, in the comparative analysis of PER, ORG and LOC, the overall F1 score of LRLFiT reached 0.7535 for the Uyghur language. For the entity types PER, ORG and LOC, the F1 scores reached 0.8029, 0.9458 and 0.8847, respectively.
As can be seen in Figure 4, the overall F1 score of LRLFiT reached 0.9288 in Hungarian. For the PER, ORG and LOC entity types, the F1 scores were 0.8029, 0.9458 and 0.8847, respectively.
Taken together, these results suggest that, unlike ordinary deep neural network models, the selected XLM-R pre-training models exploit transfer learning: first, trained on almost unlimited text, they model the context of every input sentence through an implicitly learned universal grammar; second, the knowledge learned from open domains can be transferred downstream to the named entity recognition task to improve low-resource tasks. Their processing of low-resource languages is also very good. Additionally, the pre-training model with a fine-tuning mechanism has good scalability. We can see that the pre-training model achieved the best results in the downstream tasks, and the F1 value improved considerably.
To thoroughly verify the effect of different LRLFiT versions on entity recognition, we compared two versions (base and large), as shown in Figure 5, which gives the scores of the different algorithms in Uyghur and Hungarian. In the Uyghur language, the results showed that the overall effect improved as the size of the pre-training corpus and the number of model parameters increased. XLM-RoBERTa-Large had the best effect, with P, R and F1 scores of 0.7431, 0.7897 and 0.7709, respectively; these results were 0.62%, 1.43% and 1.74% higher than the best results for Multilingual BERT.
In the Hungarian language, as shown in Figure 6, the results likewise showed that the overall effect improved as the size of the pre-training corpus and the number of model parameters increased. XLM-RoBERTa-Large had the best effect, with P, R and F1 scores of 0.9389, 0.9269 and 0.9382, respectively; these results were 1.01%, 1.08% and 1.37% higher than the best results for Multilingual BERT.
4.6. Ablation Study
To evaluate the contributions of the key factors in our method, we conducted ablation experiments with the LRLFiT model, training and testing on the Uyghur and Hungarian datasets. Models with and without data augmentation, and with and without the attention mechanism, were trained, as shown in Figure 7.
The effect of augmentation. We compared the F1 values on the constructed dataset and on the dataset without data augmentation. The experimental results show that the data augmentation method greatly improves the performance of our task and can effectively exploit the vocabulary-sharing feature of the cross-language pre-training model, enabling us to obtain more effective data.
The effect of self-attention. We can observe that our model improved the F1 value when the attention-based fine-tuning was added. After adding self-attention to the named entity recognition task, the F1 value improved significantly. The self-attention mechanism can capture contextual information from multiple different subspaces to better understand the sentence structure, so that entities that would otherwise be misidentified can be correctly recognized.
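For completeness, the multi-head self-attention computation referred to here is the standard transformer formulation below; the number of heads and the projection dimensions are those of the underlying encoder rather than choices specific to this ablation.

```latex
\begin{aligned}
\mathrm{Attention}(Q, K, V) &= \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V \\
\mathrm{head}_i &= \mathrm{Attention}\!\left(Q W_i^{Q},\, K W_i^{K},\, V W_i^{V}\right) \\
\mathrm{MultiHead}(Q, K, V) &= \mathrm{Concat}\!\left(\mathrm{head}_1, \ldots, \mathrm{head}_h\right) W^{O}
\end{aligned}
```

Each head attends in a different learned projection subspace, which is what allows the model to gather contextual information from multiple subspaces as described above.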