Research on Chinese Semantic Named Entity Recognition in Marine Engine Room Systems Based on BERT

Shen, Henglong; Cao, Hui; Sun, Guangxi; Chen, Dong

doi:10.3390/jmse11071266

Marine Engineering College, Dalian Maritime University, Dalian 116026, China

* Author to whom correspondence should be addressed.

J. Mar. Sci. Eng. 2023, 11(7), 1266; https://doi.org/10.3390/jmse11071266

Submission received: 19 May 2023 / Revised: 16 June 2023 / Accepted: 16 June 2023 / Published: 21 June 2023

(This article belongs to the Section Ocean Engineering)

Abstract

With the growing intelligence of maritime vessels, an organized and scalable approach to knowledge storage for marine engine room systems has become a research hotspot. This study addresses the foundational named entity recognition (NER) task in constructing a knowledge graph for marine engine rooms and proposes an entity recognition algorithm for Chinese marine engine room semantics that integrates language models. Firstly, the bidirectional encoder representation from transformers (BERT) language model is used to extract text features and obtain word-level vector matrices. Secondly, the trained word embeddings are fed into a bidirectional long short-term memory network (BiLSTM) to extract contextual information; the BiLSTM considers the surrounding words and their sequential relationships, enabling a better understanding of the context. Finally, a conditional random field (CRF) model is used to extract the globally optimal sequence of named entities in marine engine room semantics. The CRF model considers the dependencies between adjacent labels, which ensures a coherent and consistent final recognition result. The experimental results demonstrate that the proposed algorithm achieves superior F1 scores for all three entity types. Compared with BERT, the overall precision, recall, and F1 score of entity recognition are improved by 1.36%, 1.41%, and 1.38%, respectively. Future research will address named entity recognition on small sample sets to provide basic support for more efficient entity relationship extraction and the construction of a marine engine room knowledge graph.

Keywords:

marine engine room; named entity recognition; BERT model; deep learning; attention mechanism

1. Introduction

The development of automated vessels, artificial intelligence algorithms, and big data technologies is continuously driving the advancement of intelligent ships [1,2]. In the process of intelligent ship development, the safety and reliability of marine intelligent engine room systems directly impact the safety of ship navigation. Analyzing, mining, and preserving information data resources within the marine engine room, organizing knowledge resources in an orderly manner, and establishing a unified data source for the field of marine engine room systems are the foundations of complex knowledge accumulation. Text, one of the largest and most common data sources on ships, contains important information, such as engine logbooks and oil record books. Therefore, exploring flexible and scalable methods for accumulating knowledge in marine engine room systems is significant.

A knowledge graph is a graphical form of knowledge organization with good readability, scalability, and interpretability [3]. It can provide a decision-making basis for artificial intelligence. Entities are the most important component of a knowledge graph, so NER is of great significance for constructing knowledge graphs. NER technology can extract key information from text, such as domain-specific terms for time, equipment, and systems [4]. At the same time, recognition quality significantly impacts downstream work, such as relationship extraction and knowledge graph construction. NER technology can rely on general or domain-specific knowledge as prior knowledge. Although general domain knowledge covers more entities and has better universality, precision is emphasized more in domain-specific applications, such as entity recognition in military knowledge domains, medical knowledge domains, and network operations [5,6,7]. Currently, there is limited research on NER in the field of marine engine room systems, even though a large amount of unstructured semantic information has accumulated during the operation and maintenance of marine engine room equipment in actual ship operations. Therefore, entity recognition for marine engine rooms not only supports the later establishment of marine engine room knowledge graphs and the mining of a large amount of implicit knowledge, but can also support decision making for intelligent engine room operation and maintenance tasks.

NER methods can be classified into three categories based on their development history: rule-based, machine learning-based, and deep learning-based [8]. Rule-based methods match named entities directly from sentences using dictionaries or rule templates specified by domain experts. Chiticariu, L. et al. (2010) [9] proposed an advanced rule language for building and customizing NER annotators, demonstrating their effectiveness across different domains. However, rule-based methods rely on expert-defined templates, which can be challenging to create and may cover only some of the possible variations. Additionally, they suffer from high transfer costs and are limited to handling simple text data, making it difficult to handle complex organizational data. Machine learning-based NER methods involve manually selecting features and then classifying them. These methods include hidden Markov models (HMM) [10], maximum entropy models (MEM) [11], support vector machines (SVM) [12], and others. The double-layer HMM model has performed well in Chinese term recognition and extraction. However, such methods rely heavily on hand-selected text features and often lack generalizability.

On the other hand, deep learning-based named entity recognition can automatically learn hidden features from text information, eliminating the need for complex feature extraction processes. As a result, deep learning approaches have gained widespread attention in practical applications due to their ability to uncover meaningful patterns and their potential for better generalization. Collobert, R. et al. (2011) [13] introduced a word-level convolutional neural network (CNN) model that utilizes the output of the convolutional layer for prediction with a CRF layer. The model achieved a favorable F1 score of 89.59% based on the English CoNLL2003 dataset. Huang, Z. et al. (2015) [14] incorporated manually designed spelling features into a BiLSTM-CRF model and achieved an F1 score of 88.83% on the CoNLL2003 dataset. Additionally, Wu, F. et al. (2019) [15] proposed a joint segmentation and CNN-BiLSTM-CRF model, enhancing the model’s ability to recognize boundaries in Chinese-named entities. The study also introduced a method for generating pseudo-labeled samples from existing annotated data, further improving entity recognition performance. Liu, W. et al. (2019) [16] proposed a WC-LSTM model that achieved an F1 score of 93.74% on the Microsoft Research (MSRA) public dataset. Adding word information to the beginning or end position of characters enhanced the semantic information. Similarly, using a fragment-based neural network structure, Wang, L. et al. (2018) [17] achieved automatic feature learning and obtained an F1 score of 90.44% on the MSRA dataset.

However, the abovementioned research methods cannot effectively capture the polysemy of words because they mainly focus on feature extraction of individual words, characters, or word-to-word relationships, neglecting contextual semantic information. Consequently, the extracted representations are static word vectors that do not include contextual information, leading to decreased entity recognition performance. In 2018, researchers at Google introduced a pre-training model called BERT, based on the existing attention mechanism [18]. BERT utilizes bidirectional encoding using transformers and has shown promising performance in named entity recognition tasks [19]. Currently, research applying the BERT model to NER for marine engine rooms is rare, for the following reasons:

(1) A single BERT model is insufficient for comprehensively capturing the semantic information of texts in the domain of marine engine rooms.

(2) The softmax layer in the BERT model is unable to effectively handle the requirements of the output sequence for entity labels in the marine engine room domain. As a result, the predicted output sequence may lack coherence and fail to align with the actual entity labeling requirements in real-world scenarios.

In order to solve these two issues, this study focuses on the NER for marine engine room semantics. A modified model called BERT-BiLSTM-CRF is proposed, which combines BERT with the bidirectional long short-term memory network and the conditional random field model. By leveraging the model’s contextual semantic extraction and feature prediction capabilities, it tackles the problems of incomplete semantic information extraction and unrealistic output sequences, thereby improving the accuracy of named entity recognition. The experiment results on marine engine room texts demonstrate promising recognition performance, laying the foundation for establishing a marine engine room information knowledge graph.

The remaining parts of this paper are organized as follows: Section 2 introduces the framework of the BERT-BiLSTM-CRF named entity recognition model. Section 3 presents the semantic entity recognition design process in the marine engine room, including data annotation methods and the selection of evaluation metrics. Section 4 presents the experimental results of the marine engine room semantic entity recognition case study and validates the effectiveness of the model improvements. Section 5 concludes the paper and outlines future work plans and directions for semantic recognition in marine engine room systems.

2. The Method for Semantic Entity Recognition in Marine Engine Room Systems

The BERT-BiLSTM-CRF entity recognition model consists of three components: the BERT pre-trained word embedding model, the BiLSTM semantic extraction layer, and the CRF decoding layer. The marine engine room semantic entity extraction model based on the improved BERT-BiLSTM-CRF algorithm is illustrated in Figure 1.

2.1. BERT Model

BERT is a pre-trained language representation model in natural language processing. It can infer the relationships between words in a text and optimize the weights to extract feature information from the text. Compared with traditional pre-trained models, BERT is based on a self-attention mechanism that allows it to learn relationships between consecutive text segments and capture contextual information. The pre-training structure of the BERT model is illustrated in Figure 2.

As shown in Figure 2, the BERT model consists of several layers. The first layer is the input layer, representing the word embeddings (E₁, E₂, …, E_N). The second and third layers are the encoding layers, where “Trm” denotes the transformer encoders that utilize the self-attention mechanism. The fourth layer represents the model’s output vectors (T₁, T₂, …, T_N). Here, N refers to the total number of input tokens or words.

Trm, a key component of the BERT model, has an encoding structure as shown in Figure 3. To obtain word representations, Trm primarily optimizes the weight coefficient matrix based on the degree of correlation between words within the same sentence. The formula for the output of the self-attention mechanism is given by Equation (1).

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V

(1)

In the equation, Q, K, and V are the query, key, and value matrices of the attention mechanism: Q represents the feature matrix of the sample, K represents the feature matrix of the text information, V represents the content matrix of the text information, and d_k represents the dimension of the key vectors in K.
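To make Equation (1) concrete, the following is a minimal sketch of scaled dot-product attention in plain Python. The helper names (softmax, matmul, attention) and the toy matrices are illustrative assumptions, not part of the model implementation used in this study.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def attention(q, k, v):
    """Equation (1): softmax(Q K^T / sqrt(d_k)) V. Also returns the weights."""
    d_k = len(k[0])                       # dimension of each key vector
    k_t = [list(col) for col in zip(*k)]  # K^T
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

# Toy input: 3 tokens, d_k = 2 (values are illustrative only).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
output, weights = attention(Q, K, V)
```

Each row of weights sums to 1, so every output row is a weighted average of the rows of V; the weights express how strongly one token attends to every other token in the same sentence.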

The BERT model utilizes a multi-head attention mechanism based on self-attention. The number of heads corresponds to the number of self-attention mechanisms run in parallel, and each head focuses on different contextual information for the same word. The calculation formula for the output matrix, MultiHead, is given by Equation (2).

\begin{array}{l} h_{i} = \mathrm{Attention}(Q W_{i}^{Q}, K W_{i}^{K}, V W_{i}^{V}) \\ \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(h_{1}, h_{2}, \dots, h_{k}) W^{O} \end{array}

(2)

In the equation, h_i represents the output matrix of the i-th attention head. W_i^Q, W_i^K, and W_i^V are the weight matrices for Q, K, and V, respectively. Concat denotes the concatenation of the h_i matrices, which is then multiplied by the output weight matrix W^O.
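Equation (2) can be sketched in the same way. The code below is an illustrative toy under stated assumptions: the function names, the two one-dimensional heads, and the identity output matrix are hypothetical choices made only to keep the arithmetic easy to follow.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention, as in Equation (1)."""
    d_k = len(k[0])
    scores = matmul(q, [list(col) for col in zip(*k)])
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)

def multi_head(q, k, v, heads, w_o):
    """Equation (2); heads is a list of (W_q, W_k, W_v) projection triples."""
    outputs = [attention(matmul(q, wq), matmul(k, wk), matmul(v, wv))
               for wq, wk, wv in heads]
    # Concat: join the per-head outputs column-wise, token by token.
    concat = [sum((h[t] for h in outputs), []) for t in range(len(q))]
    return matmul(concat, w_o)

# Toy self-attention: 2 tokens, model width 2, two heads of width 1.
X = [[1.0, 0.0], [0.0, 1.0]]
W_a = [[1.0], [0.0]]            # head 1 looks at the first feature
W_b = [[0.0], [1.0]]            # head 2 looks at the second feature
W_O = [[1.0, 0.0], [0.0, 1.0]]  # identity output projection (assumption)
out = multi_head(X, X, X, [(W_a, W_a, W_a), (W_b, W_b, W_b)], W_O)
```

Because each head receives its own projections, the heads can attend to different aspects of the same word before their outputs are merged by W^O.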

Finally, the fourth layer of the BERT model outputs the word vector representation. In this way, the model obtains character vectors rich in semantic information from the marine engine room training text, enabling comprehensive storage of the text’s semantic information.

2.2. BiLSTM Model

Long short-term memory (LSTM) is a type of recurrent neural network (RNN) that effectively addresses the issues of gradient explosion and vanishing gradients during training by introducing memory cells and gate mechanisms [20]. The core components of LSTM are the forget gate (f), the input gate (i), the output gate (o), and the memory cell (c). The forget gate is responsible for discarding irrelevant information from previous inputs, the input gate is used to retain relevant current information, the memory cell combines useful information from previous and current inputs, and the output gate outputs the final modified information. The structural expression of LSTM at time step t is as follows:

\begin{array}{l} f_{t} = \sigma(W_{f} [h_{t-1}, x_{t}] + b_{f}) \\ i_{t} = \sigma(W_{i} [h_{t-1}, x_{t}] + b_{i}) \\ c_{t}^{\prime} = \tanh(W_{c} [h_{t-1}, x_{t}] + b_{c}) \\ c_{t} = f_{t} * c_{t-1} + i_{t} * c_{t}^{\prime} \\ o_{t} = \sigma(W_{o} [h_{t-1}, x_{t}] + b_{o}) \\ h_{t} = o_{t} * \tanh(c_{t}) \end{array}

(3)

At time t, h_t represents the output of the LSTM, x_t denotes the input information, W_f and b_f are the weight matrix and bias of the forget gate, W_i and b_i are those of the input gate, W_c and b_c are those of the candidate memory cell c_t′, W_o and b_o are those of the output gate, σ represents the sigmoid function, and “*” denotes element-wise multiplication.
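As a hedged illustration of Equation (3), the sketch below computes one LSTM time step in plain Python, using the hyperbolic tangent as the cell activation, which is the standard LSTM choice. The function names, the parameter dictionary layout, and the all-zero toy parameters are assumptions made for readability.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(w, b, z):
    """Compute W z + b for a weight matrix w (list of rows) and vector z."""
    return [sum(wi * zi for wi, zi in zip(row, z)) + bi for row, bi in zip(w, b)]

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step; p maps names such as 'W_f' and 'b_f' to parameters."""
    z = h_prev + x_t  # concatenation [h_{t-1}, x_t]
    f_t = [sigmoid(a) for a in affine(p["W_f"], p["b_f"], z)]     # forget gate
    i_t = [sigmoid(a) for a in affine(p["W_i"], p["b_i"], z)]     # input gate
    c_new = [math.tanh(a) for a in affine(p["W_c"], p["b_c"], z)] # candidate cell
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_new)]
    o_t = [sigmoid(a) for a in affine(p["W_o"], p["b_o"], z)]     # output gate
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

# Toy check with hidden size 1, input size 1, and all-zero parameters:
# every gate equals sigmoid(0) = 0.5 and the candidate cell equals tanh(0) = 0.
zero_params = {name: [[0.0, 0.0]] for name in ("W_f", "W_i", "W_c", "W_o")}
zero_params.update({name: [0.0] for name in ("b_f", "b_i", "b_c", "b_o")})
h_t, c_t = lstm_step([1.0], [0.0], [1.0], zero_params)
```

With these toy parameters the forget gate keeps half of the previous cell state, which shows how the gates scale the information carried from step t − 1 to step t.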

Because a unidirectional LSTM cannot process the preceding and following context at the same time, Graves, A. et al. proposed the BiLSTM neural network [21]. BiLSTM applies a forward and a backward LSTM to each word sequence and combines their outputs at the same time step, so that information from both directions is retained. The specific structure of BiLSTM is shown in Figure 4, and the output formula is given by Equation (4).

h_{t} = [h_{tp}, h_{tb}]

(4)

In the equation, h_tp represents the storage of forward information, and h_tb represents the storage of backward information.
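The concatenation in Equation (4) can be sketched as follows. To keep the example short, a running sum stands in for the LSTM cell; this stand-in, the function names, and the shared step function are assumptions (a real BiLSTM uses two LSTMs with separate parameters).

```python
def run_direction(inputs, step, h0):
    """Apply a recurrent step function along a sequence and collect the states."""
    h, states = h0, []
    for x in inputs:
        h = step(x, h)
        states.append(h)
    return states

def bidirectional(inputs, step, h0):
    """Equation (4): concatenate forward and backward states at each time step."""
    forward = run_direction(inputs, step, h0)
    # Run over the reversed sequence, then restore the original order.
    backward = run_direction(inputs[::-1], step, h0)[::-1]
    return [f + b for f, b in zip(forward, backward)]

def toy_step(x, h):
    """Stand-in for an LSTM cell: a running sum of the inputs seen so far."""
    return [h[0] + x]

states = bidirectional([1.0, 2.0, 3.0], toy_step, [0.0])
# states == [[1.0, 6.0], [3.0, 5.0], [6.0, 3.0]]
```

In each pair, the first value summarizes the inputs up to and including the current position and the second summarizes the inputs from the current position to the end, which is how a bidirectional layer gives every word access to both its left and right context.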

2.3. CRF Model

In the task of NER, BiLSTM performs well in capturing long-distance contextual information but cannot capture dependencies between adjacent labels. Conditional random field (CRF), on the other hand, can leverage the relationships between neighboring labels to obtain an optimal predicted sequence. It helps address the limitations of BiLSTM and enables the segmentation and labeling of sequential data [22]. The scores for all possible labels of each word output by the BiLSTM model are used as the input score matrix for CRF, and the specific score transmission is illustrated in Figure 5.

For an input sequence X = (x₁, x₂, …, x_n), let W be the score matrix output by the BiLSTM model, with a size of n × k, where n is the number of words and k is the number of labels. Each element W_ij represents the score for the j-th label of the i-th word. For a predicted sequence Y = (y₁, y₂, …, y_n), the scoring function s(X, Y) is calculated using Formula (5):

s(X, Y) = \sum_{i=0}^{n} A_{y_{i}, y_{i+1}} + \sum_{i=1}^{n} W_{i, y_{i}}

(5)

where A represents the transition score matrix, and A_ij represents the transition score from label i to label j. A is a square matrix of size (k + 2) × (k + 2), because y₀ and y_{n+1} in Formula (5) stand for start and end labels added to the k entity labels. The probability p(Y|X) of the predicted sequence Y given the input sequence X is calculated using Formula (6).

p(Y \mid X) = \frac{e^{s(X, Y)}}{\sum_{\tilde{Y} \in Y_{X}} e^{s(X, \tilde{Y})}}

(6)

After taking the logarithm on both sides of the equation, the log-likelihood function of the predicted sequence is obtained.

\ln(p(Y \mid X)) = s(X, Y) - \ln\left(\sum_{\tilde{Y} \in Y_{X}} e^{s(X, \tilde{Y})}\right)

(7)

In the equation, Y represents the correctly labeled sequence, Y_X represents all possible labeled sequences, and decoding refers to obtaining the output sequence Y* with the maximum score.

Y^{*} = \underset{\tilde{Y} \in Y_{X}}{\arg\max}\; s(X, \tilde{Y})

(8)
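The CRF scoring, normalization, and decoding steps in Formulas (5) to (8) can be sketched on a toy example. Everything below is an illustrative assumption rather than the implementation used in this study: the label indices, the start and end labels stored as rows and columns k and k + 1 of the transition matrix, the scores, and the brute-force normalizer (practical implementations use the forward algorithm instead).

```python
import math
from itertools import product

def path_score(emissions, transitions, tags, start, end):
    """Formula (5): transition scores along the path plus emission scores."""
    path = [start] + list(tags) + [end]
    trans = sum(transitions[a][b] for a, b in zip(path, path[1:]))
    emit = sum(emissions[i][t] for i, t in enumerate(tags))
    return trans + emit

def log_likelihood(emissions, transitions, tags, k, start, end):
    """Formula (7), with the normalizer enumerated over all k**n sequences."""
    scores = [path_score(emissions, transitions, cand, start, end)
              for cand in product(range(k), repeat=len(emissions))]
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return path_score(emissions, transitions, tags, start, end) - log_z

def viterbi(emissions, transitions, k, start, end):
    """Formula (8): dynamic-programming search for the highest-scoring path."""
    best = [transitions[start][t] + emissions[0][t] for t in range(k)]
    back = []
    for i in range(1, len(emissions)):
        prev, best, pointers = best, [], []
        for t in range(k):
            p = max(range(k), key=lambda s: prev[s] + transitions[s][t])
            pointers.append(p)
            best.append(prev[p] + transitions[p][t] + emissions[i][t])
        back.append(pointers)
    last = max(range(k), key=lambda t: best[t] + transitions[t][end])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Toy setup: labels O, B, I plus start/end; an I label may not follow O or start.
O, B, I, START, END = 0, 1, 2, 3, 4
NUM_LABELS = 3
transitions = [[0.0] * 5 for _ in range(5)]
transitions[O][I] = -10.0
transitions[START][I] = -10.0
emissions = [[0.1, 2.0, 0.3], [0.2, 0.1, 1.5], [1.0, 0.2, 1.2]]
best_path = viterbi(emissions, transitions, NUM_LABELS, START, END)
best_ll = log_likelihood(emissions, transitions, best_path, NUM_LABELS, START, END)
```

For these toy scores the decoded path is B, I, I. The large negative transition scores show how the CRF layer can rule out label sequences, such as an I label directly after O, that a per-word softmax could still produce.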

3. Design of Engine Room Semantic Entity Recognition

This section describes the workflow for the engine room semantic entity recognition task. The recognition process is divided into two main parts: data preprocessing and experimental analysis under different models. This study established a Chinese semantic dataset in the domain of marine engine room systems, which is more specialized and relevant than public datasets. This dataset is therefore valuable for research on marine engine room semantic named entity recognition. The dataset was annotated using the BIO labeling scheme, and the evaluation of the models was based on the Precision (P), Recall (R), and F1-score (F1) metrics.

3.1. The NER Process

Based on the above, the workflow for named entity recognition tasks on marine engine room semantics can be summarized as shown in Figure 6. The specific steps of the entity recognition task are as follows:

Step 1: Construct the marine engine room semantic dataset.

Step 2: Use appropriate text annotation software to annotate the text sets according to the BIO labeling scheme. Divide the dataset into training, validation, and testing sets according to a suitable ratio.

Step 3: Determine the model’s different parameter values and conduct multiple training rounds until the training requirements are met. Save the best-trained model.

Step 4: Use the saved best model to test the testing set, obtain the predicted label sequence, and complete the named entity recognition task on the testing texts.

This workflow outlines the key steps for training and evaluating a named entity recognition model on marine engine room training texts. It emphasizes data preparation, annotation, model training, and evaluation, ensuring the model is trained and tested on appropriate datasets to achieve accurate entity recognition results.

3.2. Dataset Annotation and Evaluation Indicators

There are two commonly used named entity labeling schemes: BIO and BIEO. However, the performance differences between these two labeling schemes in named entity recognition are relatively small. Therefore, this article chooses the BIO scheme to annotate the engine room training text. In the BIO scheme, “B” represents the beginning of an entity, followed by the specific entity type, such as “B-SYS”; “I” represents the inside (non-initial part) of an entity, followed by the specific entity type, such as “I-SYS”; and “O” indicates that the word is not part of an entity.
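The BIO scheme can be illustrated with a short sketch that converts entity spans into per-character tags. The helper name to_bio, the character-level tagging, and the example sentence are assumptions for illustration; the entity types ACT and EQU follow the categories used in this study.

```python
def to_bio(text, spans):
    """Turn (start, end_exclusive, label) entity spans into one BIO tag per character."""
    tags = ["O"] * len(text)
    for start, end, label in spans:
        tags[start] = f"B-{label}"      # first character of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"      # remaining characters of the entity
    return tags

# Hypothetical sentence: "起动" (start) followed by "主海水泵" (main seawater pump).
sentence = "起动主海水泵"
spans = [(0, 2, "ACT"), (2, 6, "EQU")]
bio_tags = to_bio(sentence, spans)
for ch, tag in zip(sentence, bio_tags):
    print(ch, tag)
```

Every character receives exactly one tag, so the tag sequence has the same length as the sentence and can be used directly as the target sequence for a sequence labeling model.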

Popular types of text annotation software mainly include BRAT, Doccano, YEDDA, and Chinese Annotator. YEDDA is an open-source text annotation software that has many advantages and overcomes the inefficiency of traditional text annotation tools. It annotates entities through the command line and shortcut keys and can configure custom labels for entities. Therefore, in the study of engine room semantic entity recognition, the YEDDA annotation software was used to annotate the semantic entities in the dataset. As shown in Figure 7, entities such as “海水系统” (seawater system), “主海水泵” (main seawater pump), and “起动” (start) were annotated in the sentences. The annotated text information was exported and stored in the “.anns” file format. The annotation results are shown in Table 1.

The entity types annotated in the marine engine room training text corpus include three categories: SYS (system), EQU (equipment), and ACT (action). The entire dataset was divided into training, validation, and testing sets in an 8:1:1 ratio. Table 2 displays the distribution of the three entity types in the training, validation, and testing sets. The specific examples of each entity type in the dataset are presented in Table 3.
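An 8:1:1 split can be sketched as follows. The function name, the fixed random seed, and the shuffling step are assumptions; the source states only the ratio, not how the samples were assigned to each subset.

```python
import random

def split_dataset(samples, ratios=(8, 1, 1), seed=42):
    """Shuffle the samples and split them into training, validation, and testing sets."""
    items = list(samples)
    random.Random(seed).shuffle(items)  # seeded for reproducibility (assumption)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    held_out = items[n_train + n_val:]  # remainder goes to the testing set
    return train, val, held_out

# 100 placeholder sample indices standing in for annotated sentences.
train_set, val_set, test_set = split_dataset(range(100))
print(len(train_set), len(val_set), len(test_set))
```

Each sample lands in exactly one subset, so the testing set stays unseen during training and hyperparameter selection.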

This paper adopts the NER evaluation metrics proposed in the MUC-2 conference [23], which include Precision (P), Recall (R), and F1 score (F1) as the main evaluation metrics. A higher F1 score indicates that the experimental method is more effective in performance evaluation.

P = \frac{\mathrm{Num}_{T}}{\mathrm{Num}_{P}} \times 100\%

(9)

R = \frac{\mathrm{Num}_{T}}{\mathrm{Num}_{R}} \times 100\%

(10)

F1 = \frac{2 \times P \times R}{P + R}

(11)

In the formulas, Num_T represents the number of correctly predicted entities, Num_P is the number of entities in the predicted results, and Num_R is the number of entities in the original annotations of the dataset.
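Formulas (9) to (11) can be sketched at the entity level as follows. The function names and the exact-match criterion (an entity counts as correct only when its boundaries and type both match) are assumptions, since the matching rule is not spelled out here.

```python
def extract_entities(tags):
    """Return the set of (start, end_exclusive, type) spans in a BIO tag sequence."""
    entities, start, label = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last entity
        inside = tag.startswith("I-") and label == tag[2:]
        if not inside:
            if label is not None:
                entities.add((start, i, label))
            # A new entity starts only at a B- tag; stray I- tags are ignored.
            start, label = (i, tag[2:]) if tag.startswith("B-") else (None, None)
    return entities

def precision_recall_f1(gold_tags, pred_tags):
    """Formulas (9)-(11), returned as percentages."""
    gold, pred = extract_entities(gold_tags), extract_entities(pred_tags)
    num_t = len(gold & pred)                           # Num_T
    p = 100.0 * num_t / len(pred) if pred else 0.0     # Num_P in the denominator
    r = 100.0 * num_t / len(gold) if gold else 0.0     # Num_R in the denominator
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = ["B-ACT", "I-ACT", "B-EQU", "I-EQU", "I-EQU", "I-EQU"]
pred = ["B-ACT", "I-ACT", "B-EQU", "I-EQU", "I-EQU", "O"]
p, r, f1 = precision_recall_f1(gold, pred)  # one of two entities matches exactly
```

In this toy case the truncated EQU entity counts as an error, giving P = R = F1 = 50%. As a consistency check on Formula (11), the overall values reported in this paper, P = 92.70% and R = 90.69%, give F1 = 2 × 92.70 × 90.69 / (92.70 + 90.69) ≈ 91.68%.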

4. Experiments on Engine Room Semantic Entity Recognition

This section presents the experiments for the engine room semantic entity recognition task: the experimental environment and parameter configuration, the recognition results, and the validation of the model improvements.

4.1. Experimental Environment and Parameter Configuration

The experiment used the PyTorch deep learning framework to build the training model for the engine room semantic entity recognition task. Table 4 displays the detailed configuration of the specific training environment.

To determine the hyperparameter values that give the proposed model its best performance, a set of comparative experiments was conducted. In each experiment, only one hyperparameter was changed; the resulting performance is shown in Figure 8.

According to the curves in Figure 8, the F1 score fluctuates as each hyperparameter changes. Therefore, in the following experiments, each hyperparameter is set to the value at which the F1 score reaches its maximum.

This article used a pre-trained BERT model to represent the input text with vectors in the experiment. The BERT model has 12 layers, 12 attention heads, and outputs vectors of size 768. It has a total of 110 million parameters. The specific parameter settings for the BERT-BiLSTM-CRF model are shown in Table 5.

4.2. Experimental Result

The changes in the precision (P), recall (R), and F1 score of the BERT-BiLSTM-CRF model with the number of training epochs are shown in Figure 9. Figure 9a shows that the performance metrics improve significantly during the first 15 epochs of training. This indicates that the named entity recognition model in this study learns quickly and performs well on the marine engine room training text data. After around 35 epochs, the metrics stabilize, and at the 42nd epoch, the F1 score reaches its peak at 91.68%. After 50 epochs of training, the parameters stabilize and show minimal changes, indicating that the model exhibits good stability.

Comparing (a) and (b) in Figure 9, it can be seen that the fluctuation in the test set curve is larger than that in the training set curve, and the optimal F1 score is smaller.

For the epoch at which the model achieves its maximum F1 score, the precision, recall, and F1 score for recognizing the three types of entities in the engine room training text are shown in Table 6.

From Table 6, it can be observed that the BERT-BiLSTM-CRF training model can achieve the task of named entity recognition in marine engine room training text, with an overall F1 score of 91.68%. The overall classification performance is good. The F1 scores for EQU and ACT are higher than SYS among the three entity types. Analyzing the dataset reveals that EQU and ACT entities have fixed expressions and frequently occur, while SYS entities are relatively fewer in number and have more complex semantic expressions. For example, in the training process, the SYS entity “货油泵透平系统 (Cargo oil pump turbine system)” may be misidentified as “货油泵 (Cargo oil pump)” and “透平系统 (Turbine system)” due to interference from the EQU entity “cargo oil pump.” Therefore, compared with the other two entity types, the performance of SYS is slightly lower.

4.3. Validation of Model Improvement Effectiveness

To validate the effectiveness of the model proposed in this study, the BERT model was first compared with the improved models, focusing on the extraction of contextual semantic information, feature prediction, and the fusion of the two.

(1) Validation of contextual semantic information extraction

An experiment was conducted to validate the effectiveness of using BiLSTM as the improvement method, and the results are shown in Table 7. The comparison of results is illustrated in Figure 10.

Compared with the BERT model, the BERT-BiLSTM model showed improved recognition of all three entity types, with an overall F1 score increase of 0.17%. Specifically, the recognition performance for SYS increased by 0.32%, EQU increased by 0.07%, and ACT increased by 0.12%. These results indicate that incorporating the BiLSTM module to optimize the BERT model further enhances the extraction of contextual features from the text, confirming the effectiveness of extracting contextual semantic information.

(2) Validation of feature prediction

An experiment was conducted to validate the effectiveness of using CRF as the improvement method, and the results are shown in Table 8. The comparison of results is illustrated in Figure 11.

Compared with the BERT model, the BERT-CRF model showed an overall F1 score improvement of 0.38%. Specifically, the recognition performance for SYS increased by 0.46%, EQU by 0.37%, and ACT by 0.32%. These results indicate that using CRF instead of the softmax layer of the BERT model to predict the entity labels in marine engine room training texts captures more comprehensive global information. This confirms the effectiveness of feature prediction.

On the other hand, when optimizing the model, the dataset’s complexity also affects the model’s performance. Reducing the complexity of the dataset may also improve the model’s accuracy [24,25].

(3) Ablation Experiments and Analysis

To investigate the contributions of each component in the proposed method, ablation experiments were conducted on the marine engine room semantic dataset. The comparative results are shown in Table 9.

Table 9 shows that the BERT-BiLSTM model and BERT-CRF model, based on the BERT algorithm, improved the F1 score of marine engine room named entity recognition by 0.17% and 0.38%, respectively. This indicates that both improvement approaches have enhanced the performance of named entity recognition, with the latter showing a better improvement. When the BERT-BiLSTM-CRF model, which combines both approaches, is compared with the basic BERT model, the F1 score is increased by 1.38%. Therefore, the combined model performs better in marine engine room named entity recognition than individual algorithmic models.

(4) Comparative Analysis of the Improved Models

The proposed algorithm was compared with five mainstream named entity recognition algorithms: the machine learning algorithms HMM and CRF, the deep learning algorithms BiLSTM and BERT, and BiLSTM-CRF, which combines machine learning and deep learning. The comparison results are shown in Figure 12. Figure 12 shows that, in the entity recognition of marine engine room semantics, the three evaluation metrics of the machine learning algorithms are generally lower than those of the deep learning algorithms. The performance of the fusion algorithm, BiLSTM-CRF, is slightly inferior to that of the BERT model. The experimental results are shown in Table 10. According to Table 10, all algorithms recognize EQU and ACT entities better than SYS entities. This is because these two types of entities have clear semantic information, and recognizing them does not rely heavily on contextual semantic information. Compared with the other algorithms, the BERT-BiLSTM-CRF model constructed in this paper performs best in recognizing all types of entities in the marine engine room semantic dataset, with F1 scores of over 90% in all categories.

5. Conclusions

This paper addresses the problem of automatic extraction of named entities from marine engine room training texts in the automated construction phase of the marine engine room knowledge graph. It proposes a named entity recognition method based on BERT-BiLSTM-CRF. By leveraging the BERT pre-training model, the method tackles the issue of polysemy in text feature representation. It combines the advantages of the BiLSTM deep learning method in capturing contextual information and the CRF machine learning method in extracting globally optimal labeling sequences, thus obtaining marine engine room training entities. Experimental results demonstrate that this method outperforms common baseline algorithms in named entity recognition. It achieves Precision (P), Recall (R), and F1-score (F1) of 92.70%, 90.69%, and 91.68%, respectively, for the recognition of the three entity categories. The method completes the task of named entity recognition in marine engine room training texts and provides essential technical support for constructing the marine engine room knowledge graph.

In future work, the model will be applied to the entire engine room domain, and the entity types will be further refined. For example, the “Equipment (EQU)” category will be divided into subcategories such as “Main Engine Equipment (MEQU)” and “Auxiliary Engine Equipment (AEQU).” The model will also be improved by incorporating domain-specific dictionary features, enhancing word analysis techniques, and expanding the marine engine room corpus. These efforts will help achieve more accurate and comprehensive entity recognition in the marine engine room domain. In addition, this article relies on a model pre-trained on large-scale sample data, while the amount of engine room text data is still small. Later research can therefore address named entity recognition on small sample sets, providing basic support for more efficient entity relationship extraction and the construction of a marine engine room knowledge graph.

Author Contributions

Conceptualization, H.S. and H.C.; methodology, H.S. and G.S.; validation, H.S., H.C., G.S., and D.C.; formal analysis, H.C.; investigation, G.S. and D.C.; resources, G.S. and D.C.; data curation, H.S. and G.S.; writing—original draft preparation, H.S.; writing—review and editing, H.S. and H.C.; visualization, G.S. and D.C.; supervision, H.C.; project administration, H.C.; funding acquisition, H.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the project Development of Liquid Cargo and Electromechanical Simulation Operation System for LNG Ship, grant number CBG3N21-3-3; the National Key R&D Program of China, grant number 2022YFB4301400.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare that they have no known competing or financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Nomenclature

A	Transition score matrix
ACC	Accuracy
ACT	Action
A_ij	The score of transitioning from label i to label j
BERT	Bidirectional encoder representation from transformers
b_f	Bias of the forget gate
b_i	Bias of input gate
BiLSTM	Bidirectional long short-term memory network
b_o	Bias of output gate
CNN	Convolutional neural network
CRF	Conditional random field
d_k	The dimension of matrix K
EQU	Equipment
F1	F1 score
h_i	Output matrix of the i-th word
HMM	Hidden Markov models
h_t	Output of LSTM
h_tb	Backward information
h_tp	Forward information
K	Feature matrix of the text information
k	The number of labels
MEM	Maximum entropy models
MSRA	Microsoft Research Asia
n	The number of words
NER	Named entity recognition
Num_P	The count of entities in the predicted results
Num_R	The count of entities in the original annotations of the dataset
Num_T	Number of correctly predicted entities
P	Precision
Q	Feature matrix of the sample
R	Recall
SVM	Support vector machines
SYS	System
Trm	Transformer encoding
V	Content matrix of the text information
W	Output score matrix
w_f	Weight matrix of forget gate
w_i	Weight matrix of input gate
W_ij	The score for the j-th label of the i-th word
W_i^K	Weight matrices for K
W_i^Q	Weight matrices for Q
W_i^V	Weight matrices for V
w_o	Weight matrix of output gate
X	A sequence
x_t	Input information
Y	Correct labeled sequence
Y_X	All possible labeled sequences
σ	Sigmoid function
*	Multiplication operation
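
To make the CRF symbols above concrete, the following sketch scores a label sequence from the output score matrix W (n words by k labels) and the transition score matrix A (k by k), and finds the highest-scoring sequence with the Viterbi algorithm. It is an illustration under simplifying assumptions: start and stop transitions are omitted, the toy scores are invented, and the function names are hypothetical rather than taken from the authors' implementation.

```python
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def path_score(W: Matrix, A: Matrix, y: Sequence[int]) -> float:
    """Sum of emission scores W[i][y_i] and transition scores A[y_i][y_{i+1}]."""
    emit = sum(W[i][y[i]] for i in range(len(y)))
    trans = sum(A[y[i]][y[i + 1]] for i in range(len(y) - 1))
    return emit + trans


def viterbi(W: Matrix, A: Matrix) -> Tuple[List[int], float]:
    """Return the label sequence with the highest path score, and that score."""
    n, k = len(W), len(W[0])
    best = list(W[0])              # best score of any path ending in each label
    back: List[List[int]] = []     # back-pointers for positions 1..n-1
    for i in range(1, n):
        prev_best, best, ptr = best, [], []
        for j in range(k):
            cand = [prev_best[p] + A[p][j] for p in range(k)]
            p_star = max(range(k), key=cand.__getitem__)
            best.append(cand[p_star] + W[i][j])
            ptr.append(p_star)
        back.append(ptr)
    last = max(range(k), key=best.__getitem__)
    path = [last]
    for ptr in reversed(back):     # follow back-pointers from the last word
        path.append(ptr[path[-1]])
    path.reverse()
    return path, best[last]


# Toy example with labels 0 = O, 1 = B-EQU, 2 = I-EQU (invented scores).
W = [[0.1, 2.0, 0.0], [0.2, 0.0, 1.5], [1.0, 0.0, 0.9]]
A = [[0.5, 0.0, -100.0],   # O -> I-EQU is heavily penalised
     [0.0, -1.0, 1.0],
     [0.3, 0.0, 0.5]]
best_path, best_score = viterbi(W, A)
print(best_path)  # [1, 2, 2]
```

The large negative entry in A shows how transition scores let the CRF layer discourage invalid label sequences, such as an I- label directly after O.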

References

Duhaney, J.A. Mining and Fusing Data for Ocean Turbine Condition Monitoring. Ph.D. Thesis, Florida Atlantic University, Boca Raton, FL, USA, 2012. [Google Scholar]
Gao, M.; Shi, G.; Li, S. Online Prediction of Ship Behavior with Automatic Identification System Sensor Data Using Bidirectional Long Short-Term Memory Recurrent Neural Network. Sensors 2018, 18, 4211. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Pan, J.Z.; Vetere, G.; Gomez-Perez, J.M.; Wu, H. Exploiting Linked Data and Knowledge Graphs in Large Organizations, 1st ed.; Springer International Publishing: Cham, Switzerland, 2017; pp. 1–14. [Google Scholar] [CrossRef]
Maggini, M.; Marra, G.; Melacci, S.; Zugarini, A. Discovery and Disambiguation of Entity and Relation Instances. IEEE Trans. Neural Netw. Learn. Syst. 2020, 31, 4475–4486. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Baigang, M.; Yi, F. A review: Development of named entity recognition (NER) technology for aeronautical information intelligence. Artif. Intell. Rev. 2023, 56, 1515–1542. [Google Scholar] [CrossRef]
Ning, L.; Qian, H.; Hua, Y.X.; Xing, X.; Meng, X.C. Med-BERT: A Pretraining Framework for Medical Records Named Entity Recognition. IEEE Trans. Ind. Inform. 2022, 18, 5600–5608. [Google Scholar] [CrossRef]
Fei, L.; Liang, L.M.; De, J.Y. Research on Construction Method of Knowledge Graph of US Military Equipment Based on BiLSTM model. In Proceedings of the 2019 International Conference on High Performance Big Data and Intelligent Systems (HPBD&IS), Shenzhen, China, 9 May 2019. [Google Scholar] [CrossRef]
Shaalan, K. A Survey of Arabic Named Entity Recognition and Classification. Comput. Linguist. 2014, 40, 469–510. [Google Scholar] [CrossRef]
Chiticariu, L.; Krishnamurthy, R.; Li, Y.; Reiss, F.; Vaithyanathan, S. Domain Adaptation of Rule-Based Annotators for Named-Entity Recognition Tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Cambridge, MA, USA, 11 September 2010. [Google Scholar]
Eddy, S.R. Hidden Markov models. Curr. Opin. Struct. Biol. 1996, 6, 361–365. [Google Scholar] [CrossRef] [PubMed]
Kapur, J.N. Maximum-Entropy Models in Science and Engineering, 1st ed.; Wiley Eastern: Maitland, FL, USA, 1989; pp. 5–16. [Google Scholar]
Cristianini, N.; Shawe, T.J. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods: Preface, 1st ed.; Cambridge University Press: Cambridge, UK, 2000; pp. 93–122. [Google Scholar]
Collobert, R.; Weston, J.; Bottou, L.; Karlen, M.; Kavukcuoglu, K.; Kuksa, P. Natural language processing (almost) from scratch. J. Mach. Learn. Res. 2011, 12, 2493–2537. [Google Scholar]
Huang, Z.; Wei, X.; Kai, Y. Bidirectional LSTM-CRF Models for Sequence Tagging. Comput. Sci. 2015. [Google Scholar] [CrossRef]
Wu, F.; Liu, J.; Wu, C.; Huang, Y.; Xie, X. Neural Chinese named entity recognition via CNN-LSTM-CRF and joint training with word segmentation. In Proceedings of the World Wide Web Conference, San Francisco, CA, USA, 13 March 2019. [Google Scholar] [CrossRef] [Green Version]
Liu, W.; Xu, T.; Xu, Q.; Song, J.; Zu, Y. An Encoding Strategy Based Word-Character LSTM for Chinese NER. In Proceedings of the North American Chapter of the Association for Computational Linguistics, Minneapolis, MN, USA, 2 June 2019. [Google Scholar] [CrossRef] [Green Version]
Lei, W.; Yun, W.; Shen, J.Z.; Hui, Y.G.; Guang, W.Q. Segment-level Chinese Named Entity Recognition Based on Neural Network. J. Chin. Inf. Process. 2018, 32, 84–90+100. [Google Scholar]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4 December 2017. [Google Scholar] [CrossRef]
Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 11 October 2018. [Google Scholar] [CrossRef]
Hochreiter, S.; Jürgen, A.S. LSTM can solve hard long time lag problems. In Proceedings of the 9th International Conference on Neural Information Processing Systems, Cambridge, MA, USA, 3 December 1996. [Google Scholar]
Graves, A.; Schmidhuber, J. Framewise phoneme classification with bidirectional LSTM networks. Neural Netw. 2005, 18, 602–610. [Google Scholar] [CrossRef] [PubMed]
Lafferty, J.D.; McCallum, A.; Pereira, F.C.N. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the 18th International Conference on Machine Learning, San Francisco, CA, USA, 28 June 2001. [Google Scholar]
Grishman, R.; Sundheim, B. Message understanding conference-6: A brief history. In Proceedings of the 16th Conference on Computational Linguistics, Stroudsburg, PA, USA, 5 August 1996. [Google Scholar] [CrossRef] [Green Version]
Bolon-Canedo, V.; Remeseiro, B. Feature selection in image analysis: A survey. Artif. Intell. Rev. 2020, 53, 2905–2931. [Google Scholar] [CrossRef]
Kabir, H.; Garg, N. Machine learning enabled orthogonal camera goniometry for accurate and robust contact angle measurements. Sci. Rep. 2023, 13, 1497. [Google Scholar] [CrossRef] [PubMed]

Figure 1. BERT-BiLSTM-CRF model.

Figure 2. BERT model.

Figure 3. The structure diagram of multi-head attention.

Figure 4. BiLSTM model.

Figure 5. The score transmission from BiLSTM to the CRF diagram.

Figure 6. Process for named entity recognition of marine engine room.

Figure 7. YEDDA entity annotation.

Figure 8. F1 scores with different hyperparameter values: (a) performance with different learning rates; (b) performance with different hidden sizes; (c) performance with different batch sizes; and (d) performance with different epochs.

Figure 9. P, R, and F1 variation curve: (a) training set curve; (b) test set curve.

Figure 10. Comparison of the BERT and BERT-BiLSTM Named Entity Recognition Model Results.

Figure 11. Comparison of the BERT and BERT-CRF Named Entity Recognition Model Results.

Figure 12. Distribution chart of performance comparison of different models in marine engine room training text.

Table 1. Sample of engine room semantic annotation results.

Word	海	水	系	统	由	3	台	主
English remark	Sea water system							main
BIO label	B-SYS	I-SYS	I-SYS	I-SYS	O	O	O	B-EQU
Word	海	水	泵	,	2	台	中	央
English remark	sea water pump						central
BIO label	I-EQU	I-EQU	I-EQU	O	O	O	B-EQU	I-EQU
Word	冷	却	器	及	……	组	成	.
English remark	cooler
BIO label	I-EQU	I-EQU	I-EQU	O	……	O	O	O
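
The BIO labels in Table 1 can be turned back into entity spans with a short decoding routine. The sketch below is a minimal illustration that uses the first eleven characters of the sample sentence; the function name is hypothetical, and treating an I- label without a matching preceding B- label as outside any entity is an assumption, not a rule stated in the paper.

```python
from typing import List, Sequence, Tuple


def decode_bio(chars: Sequence[str], tags: Sequence[str]) -> List[Tuple[str, str]]:
    """Group characters into (entity_text, entity_type) pairs from BIO tags."""
    entities: List[Tuple[str, str]] = []
    current_type = None
    current_chars: List[str] = []

    def flush() -> None:
        # Close the entity being built, if there is one.
        if current_type is not None:
            entities.append(("".join(current_chars), current_type))

    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            flush()
            current_type, current_chars = tag[2:], [ch]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_chars.append(ch)
        else:
            # "O", or an I- tag that does not continue the open entity (assumption).
            flush()
            current_type, current_chars = None, []
    flush()
    return entities


chars = list("海水系统由3台主海水泵")
tags = ["B-SYS", "I-SYS", "I-SYS", "I-SYS", "O", "O", "O",
        "B-EQU", "I-EQU", "I-EQU", "I-EQU"]
print(decode_bio(chars, tags))
# [('海水系统', 'SYS'), ('主海水泵', 'EQU')]
```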

Table 2. Dataset distribution.

	Training Set	Validation Set	Testing Set	Overall
SYS	237	30	30	297
EQU	1264	158	158	1580
ACT	454	57	57	568
Overall	1955	245	245	2445

Table 3. Examples of various entities.

Entity Type	Examples
SYS	主机遥控系统 (Main engine remote control system), 低温淡水系统 (Low temperature fresh water system), 燃油供给系统 (Fuel supply system) et al.
EQU	轮机模拟器 (Marine engine room simulator), 低温淡水泵 (Low temperature fresh water pump), 应急发电机 (Emergency generator) et al.
ACT	并车 (Synchronizing), 解列 (Disconnection), 启动 (Start) et al.

Table 4. Training environment configuration.

Operating System	Windows
CPU	Intel(R) Core(TM) i7-9700 CPU @ 3.00 GHz
GPU	NVIDIA GeForce GTX 1660 Ti
Python	3.7.15
Pytorch	1.6.0

Table 5. Model parameter settings.

Parameter	Value
Word vector dimension	100
Sentence length	30
Batch size	30
Epochs	50
LSTM layers	2
LSTM hidden size	128
Optimizer	Adam
Learning rate	0.001

Table 6. Recognition Results of Different Semantic Named Entities in a Marine Engine Room System.

Entity Type	P (%)	R (%)	F1 (%)
SYS	90.12	87.98	89.03
EQU	93.61	90.77	92.17
ACT	94.35	93.33	93.84
Overall	92.70	90.69	91.68

Table 7. Experimental Results of the BERT-BiLSTM Named Entity Recognition Model.

Model	SYS F1 (%)	EQU F1 (%)	ACT F1 (%)	Overall F1 (%)
BERT	87.01	91.40	92.49	90.30
BERT-BiLSTM	87.33	91.47	92.61	90.47

Table 8. Experimental Results of the BERT-CRF Named Entity Recognition Model.

Model	SYS F1 (%)	EQU F1 (%)	ACT F1 (%)	Overall F1 (%)
BERT	87.01	91.40	92.49	90.30
BERT-CRF	87.47	91.77	92.81	90.68

Table 9. Experimental results of named entity recognition for each model.

BERT	BiLSTM	CRF	F1 (%)
√			90.30
√	√		90.47
√		√	90.68
√	√	√	91.68
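
The contribution of each component in Table 9 is easiest to read as a gain over the BERT baseline. The following sketch only restates the table's F1 values and subtracts the baseline; the variable names are hypothetical.

```python
# Overall F1 scores (%) copied from Table 9.
f1_scores = {
    "BERT": 90.30,
    "BERT-BiLSTM": 90.47,
    "BERT-CRF": 90.68,
    "BERT-BiLSTM-CRF": 91.68,
}

baseline = f1_scores["BERT"]
# Gain in percentage points over the BERT-only model, rounded to two decimals.
gains = {
    model: round(score - baseline, 2)
    for model, score in f1_scores.items()
    if model != "BERT"
}
print(gains)
# {'BERT-BiLSTM': 0.17, 'BERT-CRF': 0.38, 'BERT-BiLSTM-CRF': 1.38}
```

The combined model's gain of 1.38 points matches the F1 improvement over BERT stated in the abstract, and it is larger than the sum of the two individual gains (0.17 + 0.38).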

Table 10. Comparison of Model Performance for Different Entity Categories of Marine Engine Room Training Text.

Model	Evaluating Indicator	SYS (%)	EQU (%)	ACT (%)	Overall (%)
HMM	P	80.89	82.15	82.08	81.71
	R	75.69	80.26	79.91	78.62
	F1	78.20	81.19	80.98	80.12
BiLSTM	P	83.79	87.01	87.01	85.94
	R	83.17	87.15	86.68	85.67
	F1	83.48	87.08	86.84	85.80
CRF	P	83.55	87.25	86.98	85.93
	R	82.01	87.65	86.76	85.47
	F1	82.77	87.45	86.87	85.70
BERT	P	88.32	92.73	92.98	91.34
	R	85.73	90.11	92.01	89.28
	F1	87.01	91.40	92.49	90.30
BiLSTM-CRF	P	85.67	91.23	92.77	89.89
	R	85.78	90.11	91.63	89.17
	F1	85.72	90.67	92.20	89.53
BERT-BiLSTM-CRF	P	90.12	93.61	94.35	92.70
	R	87.98	90.77	93.33	90.69
	F1	89.03	92.17	93.84	91.68

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
