Article

SSuieBERT: Domain Adaptation Model for Chinese Space Science Text Mining and Information Extraction

1 Technology and Engineering Center for Space Utilization, Chinese Academy of Sciences, Beijing 100094, China
2 Key Laboratory of Space Utilization, Chinese Academy of Sciences, Beijing 100094, China
3 University of Chinese Academy of Sciences, Beijing 100094, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(15), 2949; https://doi.org/10.3390/electronics13152949
Submission received: 13 June 2024 / Revised: 13 July 2024 / Accepted: 23 July 2024 / Published: 26 July 2024
(This article belongs to the Section Artificial Intelligence)

Abstract

With the continuous exploration of space science, a large amount of domain-related material and scientific literature is constantly generated, mostly in the form of text that contains rich and largely unexplored domain knowledge. Natural language processing has developed rapidly, and pre-trained language models provide promising tools for information extraction. However, space science is highly specialized, with many domain concepts and technical terms, and Chinese texts have complex language structures and word combinations, so general pre-trained models such as BERT may yield suboptimal performance. In this work, we investigate how to adapt BERT to Chinese space science and propose a space science-aware pre-trained language model, SSuieBERT. We validate it on downstream tasks such as named entity recognition, relation extraction, and event extraction, on which it outperforms general models. To the best of our knowledge, SSuieBERT is the first pre-trained language model for space science, and it can promote information extraction and knowledge discovery from space science texts.

1. Introduction

China’s space exploration missions continue to advance, and over their lifetimes they generate a massive amount of related data, including technical manuals, scientific literature, experimental reports, internet web pages, and other data sources and types. These data contain rich, scattered, and unexplored domain knowledge. For example, the results of space science experiments record the experimental process and outcomes, while research institutions and experts record data on the research directions, collaborative relationships, and research achievements of various organizations and individuals.
To date, knowledge mining and analysis in space science remain relatively scarce, and the data involved are of complex types, including numerical, text, image, video, speech, and other data. Moreover, as a disciplinary field, space science is highly specialized. These factors pose significant challenges to mining and analyzing domain data. In this work, we mainly explore knowledge mining of text data. Pre-trained Language Models (PLMs) have become a main research direction in natural language processing. PLMs learn rich language representations through unsupervised pre-training on large amounts of text data, providing a powerful foundation for various downstream tasks [1,2,3,4,5]. Pre-trained models mainly learn embedded representations of each word in sequential texts in an unsupervised manner [6]. With the enhancement of computing power and the accumulation of big data, various types of PLMs have been proposed, changing the research paradigm of natural language processing; models such as BERT [7], BERT-wwm [8], ERNIE [9], XLNet [10], and RoBERTa [11] have demonstrated remarkable success in many downstream tasks.
However, research on pre-trained language models is not without challenges. How to effectively fine-tune these models to adapt to specific tasks, ensure their accuracy and factual consistency, and evaluate and compare the performance of different models are all current research hotspots. There are many limitations in directly applying pre-trained models from general fields to text mining and information extraction tasks in space science. Firstly, most language models are pre-trained on general domain corpora, and transferring them directly to space science may result in performance loss. Secondly, space science differs significantly from other disciplines, including in the distribution of domain concepts and professional terms. In addition, there are many differences in language structure between Chinese and English.
In this work, we develop a pre-trained language model on large-scale Chinese text corpora related to space science. Figure 1 shows the approach for developing SSuieBERT. We verify the performance of the model on different downstream tasks, where it achieves state-of-the-art results compared to other PLMs. This work therefore bridges the gap in the availability of language models for the space science domain, boosting domain information extraction, knowledge discovery, and other downstream tasks, and accelerating the development of the discipline.

2. Related Work

Embedding representation of unstructured data is a fundamental task in natural language processing: texts are usually represented as structured vectors and applied in many tasks, such as text classification, information extraction, question-answering systems, and semantic parsing. Therefore, vector representations of text that can express rich semantic information are very important. A major problem with traditional word embeddings [12,13] is that once a word vector is pre-trained, it is fixed by the training results, so it cannot adjust dynamically to specific contexts.
Pre-trained Language Models (PLMs) have emerged as a very important paradigm in natural language processing in recent years, with examples including ELMo [14], ULMFiT [15], BERT [7], XLNet [10], BERT-wwm [8], and GPT [16]. These models are pre-trained on large amounts of unlabeled text data to capture rich language knowledge and then fine-tuned for specific tasks, demonstrating strong performance and broad application potential. They obtain semantic information from unlabeled data through unsupervised learning and achieve significant performance in many downstream tasks under the pre-train-then-fine-tune paradigm.
The currently released PLMs are mainly trained on general corpora. Although they can be fine-tuned and applied to downstream tasks, several studies have shown that pre-training on domain-related corpora better solves domain-specific tasks. BioBERT [1] developed a biomedical language representation model using a large-scale biomedical corpus. ClinicalBERT [17] predicts hospital readmission with BERT pre-trained on clinical records, and [18] applied BERT to clinical notes and discharge summaries. MatSciBERT [19] was pre-trained on materials science literature for text information mining. SciBERT [3] uses a large corpus of multidisciplinary scientific publications to train a science-domain-specific BERT model to improve the performance of downstream scientific NLP tasks.
However, no space science domain-specific pre-trained language model has been developed, which hinders domain knowledge mining and discovery. To the best of our knowledge, we are the first to develop a domain-specific pre-trained language model for space science.

3. Methodology

3.1. Construction of Domain Pre-Training Corpus

In the development of a pre-trained language model, a large-scale training corpus is necessary. For example, BERT [7] was pre-trained on large-scale textual data, including BookCorpus [20], English Wikipedia, and news, containing approximately 3.3 billion words. However, the corpora of most PLMs are unrelated to the domain of space science. Here, we compile a Chinese space science corpus (CSSC) of approximately 0.27 B tokens from ~0.1 M documents, drawn from the following sources:
Web crawler: Using the Google and Baidu search engine APIs, we search for web pages related to Chinese space science and crawl more than 2500 URLs belonging to the domain.
Wikipedia: We collect articles related to the “中国空间站 (China Space Station)” through the Wikipedia API library (https://github.com/martin-majlis/Wikipedia-API (accessed on 25 July 2024)), starting from 17 January 2021, with a maximum of 5 subcategories per category. When multiple links point to the same article, they are discarded to avoid collecting duplicate content.
Scientific publications: We first define query keywords such as “Tianhe Core Module (天和核心舱)”, “containerless material experimental cabinet (无容器材料实验柜)”, and “ultraviolet edge imaging spectrometer (紫外临边成像光谱仪)”, to name a few. Based on these keywords, we download Chinese scientific literature related to space science from CNKI (https://www.cnki.net (accessed on 25 July 2024)), WanfangDATA (https://www.wanfangdata.com.cn (accessed on 25 July 2024)), and CQVIP (http://www.cqvip.com (accessed on 25 July 2024)). Depending on the journal and publication date, some papers only provide abstracts rather than full text; because abstracts still contain relevant domain content, we include them in our corpus.
Archived files: In the process of space science exploration and the construction and operation of the China space station, a large number of documents have been generated, including the planning, design, and development of the space station platform and space science experiment payloads.
For each space science resource, we apply a cleanup pipeline with custom actions to read data from different sources and formats, split files into sentences, detect language, remove noisy and malformed sentences, eliminate duplication, and finally output data within the boundaries of the original document. Table 1 shows detailed statistics for each part of the corpus.
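A minimal sketch of such a cleanup step is given below. It assumes the langdetect library for language detection; the function name, thresholds, and file handling are hypothetical, since the actual pipeline is not published.

import re
from langdetect import detect  # assumed third-party language detector

def clean_document(raw_text, min_len=10):
    # Split into sentences, keep Chinese sentences, drop short or malformed ones, deduplicate.
    sentences = re.split(r"(?<=[。！？!?])", raw_text)
    seen, kept = set(), []
    for sent in sentences:
        sent = sent.strip()
        if len(sent) < min_len:
            continue  # remove noisy or malformed fragments
        try:
            if not detect(sent).startswith("zh"):
                continue  # keep Chinese text only
        except Exception:
            continue
        if sent in seen:
            continue  # eliminate duplication
        seen.add(sent)
        kept.append(sent)
    return kept  # cleaned sentences, kept within the boundaries of the original document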

3.2. Domain Chinese Word Segmentation

In the original formulation of BERT [7], the WordPiece tokenizer [21] splits the text into WordPiece tokens, which causes some words with complete semantics to be divided into several small pieces. The masked language model randomly selects tokens for masking, and when only part of a word is masked, it is easy to predict the masked part from the unmasked part. BERT-wwm [8] released an updated approach called Whole Word Masking (WWM), which alleviates this disadvantage of the original BERT: if a WordPiece token to be masked belongs to a complete word, the other parts of the same word are masked as well (all Chinese characters that make up the same word are masked), enabling the model to learn the semantic information of whole words, which is effective for various Chinese NLP tasks.
However, BERT-wwm applies the general Chinese word segmentation (CWS) model of LTP [22] to identify word boundaries. The domain of space science contains many professional terms, so the performance of a general segmentation model degrades and it cannot accurately identify professional words. In this work, we combine the segmenter with a domain user dictionary. The resulting professional segmentation model yields more tailored segmentation than existing general segmentation models in the professional domain. To some extent, this is also an approach to injecting domain knowledge into Chinese space science representation learning, covering space missions, space science sub-disciplines, space science experiments, space science experiment payloads, scientific experiment projects, and so on. An example of the domain Chinese word segmentation (DCWS) is shown in Table 2.
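As an illustration only, the snippet below uses jieba with a user dictionary to keep domain terms intact; jieba here merely stands in for the LTP-based segmenter actually used, and the dictionary entries are examples drawn from the text.

import jieba  # stand-in segmenter; the paper builds its domain model on top of LTP [22]

# Example domain dictionary entries: space missions, experiment payloads, key technologies, etc.
domain_words = ["天和核心舱", "无容器材料实验柜", "静电悬浮", "紫外临边成像光谱仪"]
for word in domain_words:
    jieba.add_word(word)  # keep each domain term as a single, unsplittable word

sentence = "基于先进的静电悬浮技术开发的无容器材料实验柜。"
print(jieba.lcut(sentence))
# With the domain dictionary, "静电悬浮" and "无容器材料实验柜" remain whole words,
# matching the +DCWS row of Table 2; a general segmenter would split them into pieces.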

3.3. Informative Masking

PLMs such as BERT and its variants mainly model semantic information through the Masked Language Model (MLM): several randomly selected tokens in the input sentence are replaced with [MASK], every token having an equal probability of being masked, and the model then predicts the masked tokens from the visible ones. However, some randomly masked tokens can be inferred solely from local cues [23], which yields a small loss and is inefficient for training. Moreover, random masking with uniform probability can easily miss important information, whereas named entities composed of certain tokens (e.g., “containerless material experimental cabinet (无容器材料实验柜)”) are more important and require special attention [9,24].
To address this issue, we note a recent work, InforMask [25], which optimizes the masking strategy by leveraging Pointwise Mutual Information (PMI) [24] to select the most informative tokens for masking. Different from InforMask, in this work we propose InforMask++: when calculating informative relevance, we compute the correlation between masked and unmasked words after domain word segmentation, which improves domain adaptability. Meanwhile, because mutual information is calculated over words produced by domain word segmentation rather than over tokens, computational efficiency is improved. This process is illustrated in Algorithm 1.
Algorithm 1. InforMask++ Algorithm
Input: text set T; number of randomly sampled masking candidates s
Output: selected masking candidate for each text t, according to the informative score F_i^t of the i-th candidate
for t ∈ T do
   for i = 1, 2, …, s do
      generate the i-th masking candidate for t
      M_i^t ← masked words
      U_i^t ← unmasked words
      F_i^t ← 0
      for w1 ∈ M_i^t do
         for w2 ∈ U_i^t do
            F_i^t ← F_i^t + pmi(w1, w2), where pmi(w1, w2) = log [ p(w1, w2) / (p(w1) p(w2)) ]
         end for
      end for
   end for
   select the candidate with the maximum F_i^t
end for
Specifically, the following modifications have been made to the original MLM.
  • We use the domain segmentation model to generate domain words; the method for creating them is described in Section 3.2.
  • We apply InforMask++ instead of random masking. We aim to automatically identify words carrying more important semantic information and increase their masking probability, which encourages the model to focus on more informative words and capture richer semantics.
  • To align with BERT, we set the overall masking rate to 15%. Of the selected tokens, 80% are replaced with [MASK], 10% are replaced with random words, and the remaining 10% are kept unchanged.
Note that InforMask++ only affects the masking strategy during the pre-training phase, where it selects informative tokens to mask. Table 2 provides an example of InforMask++, and an illustrative sketch follows.
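The following sketch mirrors Algorithm 1 under simplifying assumptions: PMI is estimated from sentence-level co-occurrence counts over the domain-segmented corpus, and masking candidates are sampled uniformly; names such as build_pmi and informask_pp are hypothetical.

import math
import random
from collections import Counter
from itertools import combinations

def build_pmi(segmented_sentences):
    # Estimate word and word-pair probabilities from sentence-level co-occurrence (assumption).
    word_cnt, pair_cnt, n = Counter(), Counter(), 0
    for words in segmented_sentences:
        n += 1
        uniq = set(words)
        word_cnt.update(uniq)
        pair_cnt.update(frozenset(p) for p in combinations(sorted(uniq), 2))
    def pmi(w1, w2):
        p12 = pair_cnt[frozenset((w1, w2))] / n
        if p12 == 0.0:
            return 0.0
        return math.log(p12 / ((word_cnt[w1] / n) * (word_cnt[w2] / n)))
    return pmi

def informask_pp(words, pmi, num_candidates=8, mask_ratio=0.15):
    # Score each sampled masking candidate by the PMI between its masked and unmasked words
    # (the informative score F in Algorithm 1) and keep the highest-scoring candidate.
    k = max(1, int(len(words) * mask_ratio))
    best_score, best_mask = float("-inf"), None
    for _ in range(num_candidates):
        masked = set(random.sample(range(len(words)), k))
        score = sum(pmi(words[i], words[j])
                    for i in masked for j in range(len(words)) if j not in masked)
        if score > best_score:
            best_score, best_mask = score, masked
    return sorted(best_mask)  # indices of the domain-segmented words chosen for masking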

3.4. Further Pretraining in SSUIE Domain

The knowledge learned by BERT and its variants is general and deviates from the specialized domains in which they are applied, so they often fail to achieve ideal results on specific domains. We have encountered this problem in our practical tasks. To further improve the performance of downstream tasks and make full use of a large amount of unlabeled domain data, we develop a PLM for space science scenarios.
In this work, we develop SSuieBERT based on the Chinese Space Science Corpus (CSSC) described in the previous subsection; Table 1 lists the text corpora used for SSuieBERT pre-training. Pre-training a language model from scratch requires significant computing power and a huge dataset, while universal pre-trained language models already provide fundamental capabilities. We therefore initialize SSuieBERT with the pre-trained BERT-Base-Chinese model [7].
To develop SSuieBERT, we draw inspiration from RoBERTa [11], which has made some improvements based on BERT, including improvements in training data, training methods, and masked language models, achieving better performance. Specifically, SSuieBERT pre-training uses the following modifications:
  • Informative masking: As described in Section 3.3, this aims to automatically identify more informative tokens (e.g., professional terms and phrases) and increase the probability that they are masked.
  • Eliminating the NSP loss from the training objectives: BERT is pre-trained with two tasks, MLM (masked language model) and NSP (next sentence prediction), where the NSP task determines whether two sentences are matched and semantically coherent. The authors of RoBERTa report that downstream performance improves without the NSP loss.
  • Full-length sequences: The maximum input length of BERT is 512. The authors of RoBERTa verified experimentally that the model achieves better results when trained on full-length sequences. Specifically, sentences are continuously extracted from a text to fill the input sequence; when the end of a text is reached, extraction continues from the next text, and content from different texts is separated by the [SEP] token.
  • Larger batch size: In RoBERTa's comparative experiments with different batch sizes and learning rates, increasing the batch size was found to reduce the perplexity on the training data and further improve model performance.
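A minimal sketch of such continued pre-training with the Hugging Face Transformers library is shown below, assuming a hypothetical corpus file cssc_corpus.txt; the stock data collator performs random masking, whereas SSuieBERT would substitute an InforMask++ collator at the marked line.

from datasets import load_dataset
from transformers import (BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese")  # MLM head only, no NSP objective

dataset = load_dataset("text", data_files={"train": "cssc_corpus.txt"})  # hypothetical CSSC text file
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)  # full-length sequences
tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Random 15% masking here; SSuieBERT replaces this collator with InforMask++ (Section 3.3).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(output_dir="ssuiebert",
                         per_device_train_batch_size=16,
                         gradient_accumulation_steps=8,   # about 256 sequences per step on two GPUs
                         learning_rate=1e-4, warmup_steps=10000, weight_decay=0.01)
trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized["train"], data_collator=collator)
trainer.train()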

4. Experimental Setups

4.1. SSUIE Language Model Pretraining

To compare with BERT, we use training settings similar to BERT and inherit the official BERT-Base-Chinese vocabulary and weights rather than pre-training from scratch. The base model contains 12 Transformer layers, 12 self-attention heads, and a hidden size of 768. We develop SSuieBERT using CSSC as the pre-training corpus and set the sequence length to 512, the longest that BERT supports, pre-training on 2 NVIDIA RTX 24 GB GPUs for about 15 days with a batch size of 256 sequences. We use the AdamW optimizer with β1 = 0.92, β2 = 0.96, ε = 1 × 10−6, and weight decay = 1 × 10−2, together with a linear decay schedule for the learning rate with 10,000 warm-up steps and a peak learning rate of 1 × 10−4.
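For reference, these optimizer and schedule settings can be expressed as follows in PyTorch; the total number of training steps here is a placeholder, as the paper does not state it explicitly.

import torch
from transformers import get_linear_schedule_with_warmup

# `model` is the masked-LM model from the pre-training sketch above.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, betas=(0.92, 0.96),
                              eps=1e-6, weight_decay=1e-2)
total_steps = 500_000  # placeholder; the exact step count is not reported
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=10_000,
                                            num_training_steps=total_steps)
# optimizer.step() and scheduler.step() are then called once per training step.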

4.2. Finetuning Tasks

Pre-trained language models provide a foundation for downstream tasks and can be applied to different applications by combining pre-training and fine-tuning: the pre-trained language model encodes and represents the input data, and different downstream task models output the corresponding results.
Once the PLM is constructed, we can fine-tune it for a variety of downstream tasks by adding a task-specific output layer. We use three tasks supported by SSUIE 1.0 [26] to validate the effectiveness of SSuieBERT.
The SSUIE 1.0 dataset was constructed for tasks related to information extraction in Chinese space science texts, including entity recognition, relation extraction, and event extraction. The dataset includes 19 types of entities, 36 types of relationships, and 20 types of events, with a total of 6926 sentences, including 58,771 entities, 30,338 triples, and 3039 events.
Named Entity Recognition (NER) labels each word (character) in a text sequence to indicate whether it is part of a named entity. The common annotation schemes for named entities are BIO and BIOES, and we use the BIO scheme here. The domain of space science includes numerous entity types, such as space missions, scientific experiment payloads, scientific experiment projects, and so on.
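As a small illustration of the BIO scheme (not part of the released dataset tooling), the following hypothetical helper converts character-level entity spans into BIO labels; the example spans follow the first sentence in Table 8.

def spans_to_bio(tokens, entities):
    # entities: list of (start, end, type) with end exclusive, over character positions.
    labels = ["O"] * len(tokens)
    for start, end, etype in entities:
        labels[start] = f"B-{etype}"           # first character of the entity
        for i in range(start + 1, end):
            labels[i] = f"I-{etype}"           # remaining characters of the entity
    return labels

tokens = list("天和核心舱是中国空间站的首发舱段")
labels = spans_to_bio(tokens, [(0, 5, "Space_Mission"), (6, 11, "Space_Mission")])
# ['B-Space_Mission', 'I-Space_Mission', ..., 'O', 'B-Space_Mission', ...]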
The dataset SSUIE-NER for the NER task is labeled from archival documents and web-crawled text related to Chinese space science and contains 19 different entity types. We divide it into training, validation, and test sets of 4155, 1386, and 1385 sentences, respectively.
Relation Extraction (RE) [27] aims to extract relations between entities from a segment of text. The output is a label that represents the directional relationship between two entities. The two entity spans can be expressed as s1 = (m, n) and s2 = (p, q), where m and n are the start and end indexes of the first entity in the input sentence, and p and q are the start and end indexes of the second entity, with m ≤ n, p ≤ q, and (n < p or q < m). The output label belongs to L, where L is the set of relation types.
The dataset SSUIE-RE covers 36 relation labels. The training, validation, and test sets consist of 18,202, 6069, and 6067 instances, respectively. The task is to predict relation types such as “compositional relationship”, “experimental_Of”, etc.
Event Extraction (EE) is an important task in information extraction that identifies event types and extracts the argument roles of events. It can be divided into two subtasks: event classification and event argument extraction. Event classification (EC) distinguishes event types by identifying event trigger words [28] and can be treated as a multi-class classification task [7].
In this paper, we only consider event classification. SSUIE-EE, the dataset used to classify texts related to Chinese space science events, is taken from SSUIE 1.0. The events belong to space science, such as “launch”, “docking”, “experiment”, and so on. We divide the events into a 3:1:1 train-validation-test split.
We use a single NVIDIA Titan RTX (24 GB) GPU to fine-tune SSuieBERT for each task. Note that fine-tuning is computationally much cheaper than pre-training SSuieBERT. For fine-tuning, we choose a batch size from {16, 32, 64} and a learning rate from {4 × 10−5, 3 × 10−5, 1 × 10−5}. Fine-tuning SSuieBERT on the event classification task takes less than an hour because the training data are much smaller than those used in [7]. On the NER and RE tasks, SSuieBERT requires over 20 epochs to reach maximum performance.

4.3. Modeling

We use different PLMs to encode the input text sequence for the three tasks; in this work, the PLMs considered are BERT, RoBERTa, SciBERT, BERT-wwm, and SSuieBERT.
For the NER task, we apply different sequence annotation models on top of the encoded text representations. CRF is a classic scheme for named entity recognition and has inspired many subsequent model improvements. In this work, we use Linear, CRF, and BiLSTM-CRF layers for entity annotation.
Linear: We implement a BERT token classifier using the Transformers library [29], which has a linear layer for entity annotation.
CRF: We replace the linear architecture with a CRF layer [30,31]. After the encoded vectors are passed to the CRF layer, it decodes a label sequence.
BiLSTM-CRF: The BiLSTM-CRF model [32] combines a bidirectional LSTM with a CRF layer. The input of the model is a word sequence, and the output is the label sequence predicted for each word.
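A compact sketch of the PLM-CRF tagger is shown below, using the pytorch-crf package [31]; loading SSuieBERT would simply mean pointing from_pretrained at the domain checkpoint, and the label count (39 = 19 entity types × 2 + O under BIO) is inferred from the dataset description.

import torch
from torch import nn
from torchcrf import CRF                      # pytorch-crf [31]
from transformers import BertModel

class BertCrfTagger(nn.Module):
    # PLM encoder -> linear emission layer -> CRF decoder (sketch of the PLM-CRF scheme).
    def __init__(self, plm_name="bert-base-chinese", num_labels=39):
        super().__init__()
        self.encoder = BertModel.from_pretrained(plm_name)   # replace with the SSuieBERT checkpoint
        self.emissions = nn.Linear(self.encoder.config.hidden_size, num_labels)
        self.crf = CRF(num_labels, batch_first=True)

    def forward(self, input_ids, attention_mask, labels=None):
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        scores = self.emissions(hidden)
        mask = attention_mask.bool()
        if labels is not None:
            return -self.crf(scores, labels, mask=mask, reduction="mean")  # training loss (negative log-likelihood)
        return self.crf.decode(scores, mask=mask)                          # best BIO label sequence per sentence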
We use two recent models as the baselines, W2NER [33] and LE-NER [34].
W2NER [33] proposed an architecture for word-word relationship classification, effectively modeling entity boundaries and adjacent relationships between entity words.
LE-NER [34] proposes a lexicon-enhanced Chinese named entity recognition method based on a skip-word lattice and word-aware attention to identify complex long entities.
For the relation extraction task, we adopt the architecture proposed by [27]. We use special marker tokens to enclose the entity spans in the sentence, wrapping the first entity with [E1] and [/E1] and the second with [E2] and [/E2]. We concatenate the output embeddings of [E1] and [E2] and feed them through a linear layer followed by softmax. The linear layer is trained and the language model fine-tuned with the standard cross-entropy loss.
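An illustrative sketch of this entity-marker classifier follows; the marker token strings and the use of bert-base-chinese as the base encoder are assumptions, and in practice the SSuieBERT weights would be loaded instead.

import torch
from torch import nn
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
tokenizer.add_special_tokens({"additional_special_tokens": ["[E1]", "[/E1]", "[E2]", "[/E2]"]})

class MarkerRelationClassifier(nn.Module):
    # Concatenate the [E1] and [E2] marker embeddings and classify the relation (36 labels in SSUIE-RE).
    def __init__(self, num_relations=36):
        super().__init__()
        self.encoder = BertModel.from_pretrained("bert-base-chinese")   # or the SSuieBERT checkpoint
        self.encoder.resize_token_embeddings(len(tokenizer))            # account for the four marker tokens
        self.classifier = nn.Linear(2 * self.encoder.config.hidden_size, num_relations)

    def forward(self, input_ids, attention_mask, e1_pos, e2_pos):
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        batch = torch.arange(hidden.size(0), device=hidden.device)
        pair = torch.cat([hidden[batch, e1_pos], hidden[batch, e2_pos]], dim=-1)
        return self.classifier(pair)   # logits; trained with the standard cross-entropy loss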
We use three recent models as the baselines, PURE [35], PFN [36], and OntoRE [37].
PURE [35] addresses entity and relation extraction with two independently trained components, an “entity model” and a “relation model”.
PFN [36] uses a GCN to process text, capturing contextual relationships between entities and relations, and proposes partition filters to handle long-distance dependencies, improving model accuracy.
OntoRE [37] uses an ontology knowledge enhancement method to bridge the semantic gap between the general domain and the specific domain.
For the event classification task, we use a simple classifier to make predictions: following the single-sentence classification setup in BERT, we implement a fine-tuned semantic classification model for event detection.
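A minimal sketch of this single-sentence classifier with the Transformers library is given below; the example sentence and the use of bert-base-chinese are placeholders for the SSuieBERT checkpoint and the SSUIE-EE data, and the untrained classification head gives meaningful predictions only after fine-tuning.

import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=20)  # 20 event types

inputs = tokenizer("神舟飞船与天和核心舱完成交会对接。", return_tensors="pt",
                   truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits
predicted_event = logits.argmax(dim=-1).item()   # index of the predicted event type (e.g., "docking")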

5. Results

We use different downstream tasks to validate the effectiveness of SSuieBERT in learning professional information on Chinese space science.

5.1. Named Entity Recognition

The results of NER on the SSUIE 1.0 by SSuieBERT, SciBERT, BERT-wwm, RoBERTa, and BERT are shown in Table 3.
We observe that the PLM-CRF sequence annotation model is superior to the PLM-Linear model. This indicates that CRF can accurately model BIO tags.
Meanwhile, RoBERTa performs better than BERT because of its larger pre-training corpus. All BERT-wwm models perform better than the SciBERT, RoBERTa, and BERT architectures, reflecting the language discrepancy, since BERT-wwm was pre-trained on a general Chinese corpus. SciBERT outperforms BERT, indicating that general-domain PLMs need further domain adaptation to improve their performance; because SciBERT's corpus contains scientific literature, it is closer to CSSC than BERT's pre-training corpus.
We obtain an improvement of ~7.2 F1 score for SSuieBERT vs. BERT-wwm while using the PLM-CRF annotation model. SSuieBERT-BiLSTM-CRF performs better than BERT-wwm-BiLSTM-CRF by ~7.5 F1 score. Similar improvements can also be seen for other architectures.
We notice that SSuieBERT performs better than the current best results. This suggests that SSuieBERT is indeed able to provide better performance on complex problems unique to the Chinese space science domain, using the additional information learned from the CSSC.

5.2. Relation Extraction

Table 4 shows the results of the relation extraction task for each model. We also compare the results to the three recent baseline models, OntoRE, PURE, and PFN. We observe that SSuieBERT consistently outperforms SciBERT, BERT-wwm, RoBERTa, BERT, and the baseline models by salient margins.

5.3. Event Classification

In the event classification task, we consider the PLM's ability to classify a text into one of 20 event categories. Table 5 compares the accuracy of the SSuieBERT, SciBERT, BERT-wwm, RoBERTa, and BERT implementations. SSuieBERT exceeds SciBERT by 2.26% accuracy on the test set, which demonstrates its effectiveness in event extraction.
Altogether, we demonstrate that SSuieBERT performs better than SciBERT, BERT-wwm, RoBERTa, and BERT on all downstream tasks of Chinese space science text mining, namely NER, RE, and event classification. These results also indicate that the scientific literature and other related corpora in space science (on which SSuieBERT was pre-trained) differ significantly from the corpora of SciBERT (pre-trained on computer science and biomedical text), as well as BERT-wwm, RoBERTa, and BERT (pre-trained on general corpora). Each scientific discipline exhibits significant differences in ontology information such as specific concepts and professional vocabulary. Therefore, developing PLMs for specific domains can significantly improve the performance of text mining and information extraction tasks in those domains.

6. Discussion

Based on the experimental analysis, we can see that SSuieBERT has a significant improvement over the traditional BERT in Chinese space science information extraction tasks, indicating its effectiveness and generalizability. In this section, we study task-specific fine-tuning strategies and the effect of the designed components on SSuieBERT. Moreover, we conduct an error analysis to understand the limitations of SSuieBERT and identify potential areas for improvement.

6.1. Fine-Tuning Strategies

6.1.1. Labeling Schemes

We investigate different labeling schemes for the NER task, using standard schemes that distinguish tokens based on their position within the entity. As shown in Table 6, a more complex labeling scheme yields little difference, suggesting that the sequential nature of the tags is less essential for NER modeling when the text is represented by a domain pre-trained language model.

6.1.2. Learning Rates

An appropriate learning rate can improve the training efficiency while ensuring model convergence. We use different learning rates to fine-tune SSuieBERT, and the learning curve for error rates on SSUIE 1.0-NER is shown in Figure 2. We find that a relatively low learning rate, such as 1 × 10−5, can make the model converge stably, while an aggressive learning rate leads to unstable model training and failure to converge.

6.2. Effectiveness of SSuieBERT

We conduct detailed ablation experiments; the results are shown in Table 7. Overall, removing any component of SSuieBERT reduces the performance of the model, indicating that DCWS and informative masking both contribute to the overall improvement. Specifically, DCWS optimizes word segmentation and InforMask++ improves the masking strategy, both being modifications of the MLM task. This is plausible because DCWS implicitly injects domain knowledge into representation learning, while the NER task seems to benefit more from InforMask++. It also indicates that the two components complement each other and together achieve better performance.

6.3. Investigation on MLM Task

As shown in Figure 3, InforMask++ achieves superior training efficiency and maintains its performance advantage throughout the entire training process. It is important to note that random masking and PMI-Masking perform worse than InforMask, indicating that they are not optimal for knowledge-intensive tasks. PMI-Masking is also significantly inferior to the other masking strategies in the early stages of pre-training, suggesting that it may require longer training. As shown in Figure 4, we generate checkpoints of SSuieBERT at different training steps and fine-tune them on the NER task to observe the performance. The model trained with InforMask++ outperformed BERT within less than ~20% of the total pre-training steps, validating the effectiveness of our masking strategy.

6.4. Error Analysis

In this section, we analyze possible errors of the models based on several samples. Table 8 shows the predictions of different models on the NER task. Neither BERT nor SciBERT can identify the complex entity boundaries and entity types of Chinese space science, while SSuieBERT identifies domain entities well. At the same time, SSuieBERT still makes errors when recognizing newly emerging entities, such as space science experiment project names consisting of long token sequences, so its performance on such entities is relatively low. We believe it is necessary to incorporate richer contextual features, such as external domain knowledge, to better understand professional concepts and improve the performance of the model on downstream tasks.

7. Conclusions

In this paper, we develop a pre-trained language model for space science, namely SSuieBERT, which is pre-trained on a Chinese space science corpus (CSSC) collected from a variety of sources. We design a new MLM strategy that combines word informativeness and domain Chinese word segmentation. We validate the effectiveness of SSuieBERT on different downstream tasks, such as named entity recognition, relation extraction, and event extraction, and the results demonstrate that SSuieBERT outperforms previous models on Chinese space science text mining tasks. We hope it will further accelerate text mining and information extraction in the space science domain.

Author Contributions

Conceptualization, Y.L. and S.L.; methodology, Y.L. and S.H.; investigation, L.W.; writing—original draft preparation, Y.L.; writing—review and editing, Y.L. and Y.D.; visualization, S.H.; supervision, S.L. and Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Director’s Fund Project of Technology and Engineering Center for Space Utilization, Chinese Academy of Sciences: Research on Pre-training of Multimodal Data Representation Learning in Space Science and Applications with Grant No. T303241.

Data Availability Statement

The dataset used in this study cannot be published for confidentiality reasons.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C.H.; Kang, J. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020, 36, 1234–1240.
  2. Gururangan, S.; Marasović, A.; Swayamdipta, S.; Lo, K.; Beltagy, I.; Downey, D.; Smith, N.A. Don’t stop pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Stroudsburg, PA, USA, 6–8 July 2020; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 8342–8360.
  3. Beltagy, I.; Lo, K.; Cohan, A. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, 3–7 November 2019; Inui, K., Jiang, J., Ng, V., Wan, X., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 3613–3618.
  4. Araci, D. FinBERT: Financial sentiment analysis with pre-trained language models. arXiv 2019, arXiv:1908.10063.
  5. Lee, J.-S.; Hsiang, J. Patent classification by fine-tuning BERT language model. World Pat. Inf. 2020, 61, 101965.
  6. Manning, C.; Schutze, H. Foundations of Statistical Natural Language Processing; MIT Press: Cambridge, MA, USA, 1999.
  7. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805.
  8. Cui, Y.; Che, W.; Liu, T.; Qin, B.; Yang, Z.; Wang, S.; Hu, G. Pre-Training with Whole Word Masking for Chinese BERT. arXiv 2019, arXiv:1906.08101.
  9. Sun, Y.; Wang, S.; Li, Y.; Feng, S.; Chen, X.; Zhang, H.; Tian, X.; Zhu, D.; Tian, H.; Wu, H. ERNIE: Enhanced Representation through Knowledge Integration. arXiv 2019, arXiv:1904.09223.
  10. Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.; Le, Q.V. XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv 2019, arXiv:1906.08237.
  11. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach. arXiv 2019, arXiv:1907.11692.
  12. Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.; Dean, J. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 5–10 December 2013; pp. 3111–3119.
  13. Pennington, J.; Socher, R.; Manning, C.D. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, Doha, Qatar, 25–29 October 2014; pp. 1532–1543.
  14. Peters, M.E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; Zettlemoyer, L. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, LA, USA, 1–6 June 2018; Volume 1 (Long Papers), pp. 2227–2237.
  15. Howard, J.; Ruder, S. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, 15–20 July 2018; pp. 328–339.
  16. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9.
  17. Huang, K.; Altosaar, J.; Ranganath, R. ClinicalBERT: Modeling clinical notes and predicting hospital readmission. arXiv 2019, arXiv:1904.05342.
  18. Alsentzer, E.; Murphy, J.; Boag, W.; Weng, W.; Jindi, D.; Naumann, T.; McDermott, M. Publicly available clinical BERT embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, Minneapolis, MN, USA, 7 June 2019; pp. 72–78.
  19. Gupta, T.; Zaki, M.; Krishnan, N.M.A.; Mausam. MatSciBERT: A materials domain language model for text mining and information extraction. npj Comput. Mater. 2022, 8, 102.
  20. Zhu, Y.; Kiros, R.; Zemel, R.; Salakhutdinov, R.; Urtasun, R.; Torralba, A.; Fidler, S. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 19–27.
  21. Wu, Y.; Schuster, M.; Chen, Z.; Le, Q.V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv 2016, arXiv:1609.08144.
  22. Che, W.; Li, Z.; Liu, T. LTP: A Chinese language technology platform. In Proceedings of the 23rd International Conference on Computational Linguistics: Demonstrations, Beijing, China, 23–27 August 2010; Association for Computational Linguistics: Stroudsburg, PA, USA, 2010; pp. 13–16.
  23. Joshi, M.; Chen, D.; Liu, Y.; Weld, D.S.; Zettlemoyer, L.; Levy, O. SpanBERT: Improving pre-training by representing and predicting spans. Trans. Assoc. Comput. Linguist. 2020, 8, 64–77.
  24. Levine, Y.; Lenz, B.; Lieber, O.; Abend, O.; Leyton-Brown, K.; Tennenholtz, M.; Shoham, Y. PMI-Masking: Principled masking of correlated spans. In Proceedings of the 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, 3–7 May 2021.
  25. Sadeq, N.; Xu, C.; McAuley, J. InforMask: Unsupervised Informative Masking for Language Model Pretraining. arXiv 2022, arXiv:2210.11771.
  26. Liu, Y.; Li, S.; Wang, C.; Xiong, X.; Zheng, Y.; Wang, L.; Hao, S. SSUIE 1.0: A Dataset for Chinese Space Science and Utilization Information Extraction. In Proceedings of the Natural Language Processing and Chinese Computing, Proceedings of the 12th National CCF Conference, NLPCC 2023, Foshan, China, 12–15 October 2023; pp. 223–235.
  27. Baldini Soares, L.; FitzGerald, N.; Ling, J.; Kwiatkowski, T. Matching the blanks: Distributional similarity for relation learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 2895–2905.
  28. Ji, H.; Grishman, R. Refining event extraction through cross-document inference. In Proceedings of the ACL-08: HLT, Columbus, OH, USA, 15–20 June 2008; Association for Computational Linguistics: Stroudsburg, PA, USA, 2008; pp. 254–262.
  29. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; p. 15947.
  30. Lafferty, J.D.; McCallum, A.; Pereira, F.C.N. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, Sydney, Australia, 28 June–1 July 2001; Morgan Kaufmann: San Francisco, CA, USA, 2001; pp. 282–289.
  31. pytorch-crf—pytorch-crf 0.7.2 Documentation. Available online: https://pytorch-crf.readthedocs.io/en/stable/ (accessed on 25 July 2024).
  32. Huang, Z.; Xu, W.; Yu, K. Bidirectional LSTM-CRF models for sequence tagging. arXiv 2015, arXiv:1508.01991.
  33. Li, J.; Fei, H.; Liu, J.; Wu, S.; Zhang, M.; Teng, C.; Ji, D.; Li, F. Unified named entity recognition as word-word relation classification. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual Event, Canada, 22 February–1 March 2022; Volume 36, pp. 10965–10973.
  34. Wang, C.; Xiong, X.; Wang, L.; Zheng, Y.; Liu, Y.; Li, S. A Lexicon Enhanced Chinese Long Named Entity Recognition Using Word-Aware Attention. In Proceedings of the 2023 6th International Conference on Machine Learning and Natural Language Processing, Sanya, China, 27–29 December 2023; pp. 234–242.
  35. Zhong, Z.; Chen, D. A frustratingly easy approach for entity and relation extraction. arXiv 2020, arXiv:2010.12812.
  36. Yan, Z.; Zhang, C.; Fu, J.; Zhang, Q.; Wei, Z. A partition filter network for joint entity and relation extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Virtual, 7–11 November 2021.
  37. Xiong, X.; Wang, C.; Liu, Y.; Li, S. Enhancing Ontology Knowledge for Domain-Specific Joint Entity and Relation Extraction. In Proceedings of the 22nd Chinese National Conference on Computational Linguistics, Harbin, China, 3–5 August 2023; pp. 713–725.
Figure 1. The approach for developing SSuieBERT. A Chinese space science corpus (CSSC) is created by querying, searching, collecting, and filtering relevant data; a language model is then pre-trained on CSSC to develop SSuieBERT, whose performance is evaluated on downstream tasks.
Figure 2. Learning curve for error rates on SSUIE 1.0-NER.
Figure 3. Macro-F1 with different masking strategies on SSUIE-NER evaluated every 10 k steps.
Figure 4. The comparison results of SSuieBERT with fine-tuned BERT, SciBERT, and BERT-wwm on SSUIE-NER throughout the entire pre-training process. After 400 k training steps, it achieves performance comparable to BERT.
Table 1. List of individual sources in the training corpora.
Corpus Name | Documents | No. Tokens
Web crawler | 46,746 | 123,570,673
Wikipedia | 45,706 | 47,347,061
Scientific publications | 11,342 | 55,683,291
Archived files | 12,284 | 37,845,593
Total | 116,078 | 264,446,618
Table 2. Examples of different Chinese word segmentation and masking strategies. [M] indicates a token selected for masking.
Strategy | Chinese | English
Original Sentence | 基于先进的静电悬浮技术开发的无容器材料实验柜。 | A containerless material experimental cabinet based on advanced electrostatic suspension technology.
+CWS | 基于/先进/的/静电/悬浮/技术/开发/的/无容器/材料/实验柜/。 | —
+DCWS | 基于/先进/的/静电悬浮/技术/开发/的/无容器材料实验柜/。 | —
+BERT Tokenizer | 基/于/先/进/的/静/电/悬/浮/技/术/开/发/的/无/容/器/材/料/实/验/柜/。 | A container ##less material experimental cabinet based on advanced electro ##static suspension technology.
Original Masking | 基/于/先/进/的/[M]/电/悬/浮/技/[M]/开/发/的/[M]/[M]/器/材/料/实/验/柜/。 | A [M] ##less material experimental cabinet [M] on advanced [M] ##static suspension [M].
+WWM | 基于/先进/的/[M][M]/悬浮/[M][M]/开发/的/[M][M][M]/材料/实验柜/。 | A [M] [M] material experimental cabinet based on advanced [M] [M] suspension [M].
+InforMask++ | 基于/先进/的/[M][M][M][M]/技术/开发/的/[M][M][M][M][M][M][M][M]/。 | A [M] [M] [M] [M] [M] based on advanced [M] [M] [M] technology.
Table 3. Results on SSUIE 1.0-NER. The average scores over three seeds are shown in parentheses.
Models | SSuieBERT | SciBERT | BERT-wwm | RoBERTa | BERT | LE-NER | W2NER
Linear | 79.75 (79.62) | 71.32 (71.30) | 72.32 (72.28) | 64.55 (64.50) | 63.11 (63.07) | 78.05 (78.02) | 77.95 (77.20)
CRF | 80.63 (80.53) | 72.76 (72.42) | 73.46 (73.37) | 66.17 (66.15) | 64.23 (64.21) | — | —
BiLSTM-CRF | 81.34 (81.31) | 71.80 (71.53) | 73.82 (71.66) | 67.56 (67.43) | 64.78 (64.65) | — | —
Table 4. Relation extraction test results. The average scores over three seeds are shown in parentheses.
SSuieBERT | SciBERT | BERT-wwm | RoBERTa | BERT | OntoRE | PURE | PFN
65.55 (65.52) | 58.68 (58.67) | 58.74 (58.73) | 58.61 (58.58) | 58.56 (58.45) | 63.40 (63.21) | 61.31 (61.28) | 60.72 (60.62)
Table 5. Results on SSUIE 1.0-EC. The average scores over three seeds are shown in parentheses.
SSuieBERT | SciBERT | BERT-wwm | RoBERTa | BERT
94.56 (94.32) | 92.30 (92.18) | 92.31 (92.21) | 91.54 (91.52) | 91.42 (91.21)
Table 6. Entity-level F1 comparison using different labeling schemes on SSuieBERT for SSUIE 1.0-NER.
Labeling Scheme | BIO | BIOES
Linear | 79.75 | 79.73
CRF | 80.63 | 80.60
BiLSTM-CRF | 81.34 | 81.35
Table 7. Ablations of SSuieBERT on different fine-tuning tasks (DCWS: domain Chinese word segmentation; CWS: Chinese word segmentation). The average scores over three seeds are shown in parentheses.
System | NER | RE | EC
SSuieBERT | 81.34 (81.31) | 65.55 (65.52) | 94.56 (94.32)
DCWS→CWS | 80.21 (80.18) | 65.38 (65.36) | 94.52 (94.52)
w/o DCWS | 79.76 (79.75) | 65.25 (65.21) | 94.06 (94.05)
InforMask++→InforMask | 80.66 (80.58) | 65.48 (65.45) | 94.50 (94.48)
w/o InforMask++ | 78.28 (78.26) | 64.43 (64.42) | 92.43 (92.41)
Table 8. Sample analysis and comparison on the NER task. Bracketed spans denote the entities extracted by each model, with the entity type given inside the brackets.
Model | Result
Sentence天和核心舱是中国空间站的首发舱段,配备有高微重力实验柜和无容器材料实验柜。
The Tianhe Core module is the first segment of the China’s space station, equipped with a high microgravity experiment cabinet and a containerless material experiment cabinet.
Ground Truth[Space_Mission 天和核心舱]是[Space_Mission 中国空间站]的首发舱段,配备有[Scientific_Experiment_Payload 高微重力实验柜]和[Scientific_Experiment_Payload 无容器材料实验柜]。
The [Space_Mission Tianhe Core module] is the first segment of the [Space_Mission China’s space station], equipped with a [Scientific_Experiment_Payload high microgravity experiment cabinet] and a [Scientific_Experiment_Payload containerless material experiment cabinet].
SSuieBERT[Space_Mission 天和核心舱]是[Space_Mission 中国空间站]的首发舱段,配备有[Scientific_Experiment_Payload 高微重力实验柜]和[Scientific_Experiment_Payload 无容器材料实验柜]。
The [Space_Mission Tianhe Core module] is the first segment of the [Space_Mission China’s space station], equipped with a [Scientific_Experiment_Payload high microgravity experiment cabinet] and a [Scientific_Experiment_Payload containerless material experiment cabinet].
SciBERT[Space_Mission 天和核心舱]是[Space_Mission 中国空间站]的首发舱段,配备有[Scientific_Experiment_Payload 高微重力实验柜]和[Scientific_Experiment_Payload 无容器] [Scientific_Domain 材料] [Scientific_Experiment_Payload 实验柜]。
The [Space_Mission Tianhe Core module] is the first segment of the [Space_Mission China’s space station], equipped with a [Scientific_Experiment_Payload high microgravity experiment cabinet] and a [Scientific_Experiment_Payload containerless] [Scientific_Domain material] [Scientific_Experiment_Payload experiment cabinet].
BERT[Space_Mission 天和核心舱]是[Organization 中国] [Space_Mission 空间站]的首发舱段,配备有[Scientific_Domain 高微重力] [Scientific_Experiment_Payload 实验柜]和[Scientific_Experiment_Payload 无容器] [Scientific_Domain 材料] [Scientific_Experiment_Payload 实验柜]。
The [Space_Mission Tianhe Core module] is the first segment of the [Organization China]’s [Space_Mission space station], equipped with a [Scientific_Domain high microgravity] [Scientific_Experiment_Payload experiment cabinet] and a [Scientific_Experiment_Payload containerless] [Scientific_Domain material] [Scientific_Experiment_Payload experiment cabinet].
Sentence两年以来,无容器材料实验柜中已开展多项关键研究项目,目前正在进行的项目包括偏晶合金壳/核型结构及弥散型组织形成机理研究、空间站静电悬浮复相合金相选择与无容器制备研究等,这些研究成果未来将会在许多领域发挥重要作用。
In the past two years, a number of key research projects have been carried out in the containerless material experimental cabinet, and the current projects include the study of monotectic alloy shell/karyotype structure and the formation mechanism of dispersed tissue, the selection of electrostatic suspension complex alloys and the study of containerless preparation in the space station, etc. These research results will play an important role in many fields in the future.
Ground Truth两年以来,[Scientific_Experiment_Payload 无容器材料实验柜]中已开展多项关键研究项目,目前正在进行的项目包括[Scientific_Experiment_Project 偏晶合金壳/核型结构及弥散型组织形成机理研究]、[Scientific_Experiment_Project 空间站静电悬浮复相合金相选择与无容器制备研究]等,这些研究成果未来将会在许多领域发挥重要作用。
In the past two years, a number of key research projects have been carried out in the [Scientific_Experiment_Payload containerless material experimental cabinet], and the current projects include [Scientific_Experiment_Project the study of monotectic alloy shell/karyotype structure and the formation mechanism of dispersed tissue], [Scientific_Experiment_Project the study of the selection of electrostatic suspension complex alloys and containerless preparation in the space station], etc. These research results will play an important role in many fields in the future.
SSuieBERT两年以来,[Scientific_Experiment_Payload 无容器材料实验柜]中已开展多项关键研究项目,目前正在进行的项目包括偏晶合金壳/核型结构及[Scientific_Experiment_Project 弥散型组织形成机理研究]、[Space_Mission 空间站] [Scientific_Domain 静电悬浮]复相合金相选择与[Scientific_Experiment_Project 无容器制备研究]等,这些研究成果未来将会在许多领域发挥重要作用。
In the past two years, a number of key research projects have been carried out in the [Scientific_Experiment_Payload containerless material experimental cabinet], and the current projects include the study of monotectic alloy shell/karyotype structure and [Scientific_Experiment_Project the formation mechanism of dispersed tissue], the study of the selection of [Scientific_Domain electrostatic suspension] complex alloys and [Scientific_Experiment_Project containerless preparation] in the [Space_Mission space station], etc. These research results will play an important role in many fields in the future.
SciBERT两年以来,[Scientific_Experiment_Payload 无容器] [Scientific_Domain 材料] [Scientific_Experiment_Payload 实验柜]中已开展多项关键研究项目,目前正在进行的项目包括[Scientific_Domain 偏晶合金壳]/核型结构及弥散型[Scientific_Domain 组织形成机理]研究、[Space_Mission 空间站]静电悬浮复相合金相选择与[Scientific_Experiment_Payload 无容器]制备研究等,这些研究成果未来将会在许多领域发挥重要作用。
In the past two years, a number of key research projects have been carried out in the [Scientific_Experiment_Payload containerless] [Scientific_Domain material] [Scientific_Experiment_Payload experimental cabinet], and the current projects include the study of [Scientific_Domain monotectic alloy shell]/karyotype structure and [Scientific_Domain the formation mechanism of dispersed tissue], the study of the selection of electrostatic suspension complex alloys and [Scientific_Experiment_Payload containerless] preparation in the [Space_Mission space station], etc. These research results will play an important role in many fields in the future.
BERT两年以来,[Scientific_Experiment_Payload无容器] [Scientific_Domain材料] [Scientific_Experiment_Payload实验柜]中已开展多项关键研究项目,目前正在进行的项目包括偏晶合金壳/核型结构及弥散型组织形成机理研究、[Space_Mission空间站]静电悬浮复相合金相选择与[Scientific_Experiment_Payload无容器]制备研究等,这些研究成果未来将会在许多领域发挥重要作用。
In the past two years, a number of key research projects have been carried out in the [Scientific_Experiment_Payload containerless] [Scientific_Domain material] [Scientific_Experiment_Payload experimental cabinet], and the current projects include the study of monotectic alloy shell/karyotype structure and the formation mechanism of dispersed tissue, the study of the selection of electrostatic suspension complex alloys and [Scientific_Experiment_Payload containerless] preparation in the [Space_Mission space station], etc. These research results will play an important role in many fields in the future.

Share and Cite

MDPI and ACS Style

Liu, Y.; Li, S.; Deng, Y.; Hao, S.; Wang, L. SSuieBERT: Domain Adaptation Model for Chinese Space Science Text Mining and Information Extraction. Electronics 2024, 13, 2949. https://doi.org/10.3390/electronics13152949

