Article

Entity Recognition for Chinese Hazardous Chemical Accident Data Based on Rules and a Pre-Trained Model

Hui Dai, Mu Zhu, Guan Yuan, Yaowei Niu, Hongxing Shi and Boxuan Chen
1 School of Computer Science and Technology, China University of Mining and Technology, Xuzhou 221116, China
2 State Key Laboratory of NBC Protection for Civilian, Beijing 100038, China
3 Digitization of Mine, Engineering Research Center of Ministry of Education, Xuzhou 221116, China
* Authors to whom correspondence should be addressed.
Appl. Sci. 2023, 13(1), 375; https://doi.org/10.3390/app13010375
Submission received: 28 September 2022 / Revised: 15 December 2022 / Accepted: 21 December 2022 / Published: 28 December 2022

Abstract

Due to the fragile physicochemical properties of hazardous chemicals, the chances of leakage and explosion during production, transportation, and storage are quite high. In recent years, hazardous chemical accidents have occurred frequently, posing a great threat to people’s lives and property. Hence, it is crucial to analyze hazardous chemical accidents and establish corresponding warning mechanisms and safeguard measures. At present, most hazardous-chemical-accident data exist in text format. However, named entity recognition (NER), as a method to extract useful information from text data, has not been fully utilized in the field of Chinese hazardous-chemical handling. The challenge is that Chinese NER is more difficult than English NER, because word boundaries in Chinese are fuzzy. In addition, the descriptions of hazardous chemical accidents are colloquial, and relevant labeled data are lacking. Further, most current models do not consider identifying the entities related to accident scenarios, losses, and causes. To tackle these issues, we propose a model based on a rule template and Bert-BiLSTM-CRF (RT-BBC) to recognize named entities from unstructured Chinese hazardous chemical accident reports. Comprehensive experiments on real-world datasets show the effectiveness of the proposed method. Specifically, RT-BBC outperformed the most competitive method by 6.6% and 3.6% in terms of precision and F1.

1. Introduction

Currently, chemical products are necessary for our daily lives; however, hazardous chemical materials are very dangerous because they are flammable, explosive, toxic, and even radioactive. Therefore, accurate detection and analysis of information concerning hazardous chemical accidents are crucial for social security and economic growth [1,2,3]. The majority of current studies related to accidents with hazardous chemicals concentrate on the data level; examine the times, locations, and types of the accidents; and provide pertinent recommendations [4,5,6]. They do not address the identification and analysis of accident causes, losses, scenarios, or other entities; nor have they processed or analyzed accident text data [7]. Additionally, various real scenarios cannot be adequately covered by the current emergency response plans for hazardous chemicals. Thus, the creation of a hazardous-chemical-accident knowledge map could enhance knowledge representation, aid in our understanding of the physical and chemical properties of hazardous chemicals, and offer technical assistance for emergency decision making that pertains to hazardous chemicals.
Named entity recognition (NER) [8] is a key step for constructing a domain knowledge graph. Its goal is to recognize specified entities, such as time, place, organization, and persons, from unstructured texts. Compared with English NER, Chinese named entity recognition is more challenging. The principal challenges are listed as follows, taking Figure 1 as an example:
  • It is challenging to identify entity boundaries in Chinese writing, since there are no spaces;
  • Word segmentation is often required before Chinese named entity recognition, and various word segmentation algorithms yield substantially different results;
  • Even the same word in Chinese can have completely different meanings in different contexts.
In fact, hazardous chemical accidents are mainly reported in the form of radio news reports or text, with obvious colloquialisms, which makes entity recognition more challenging. At present, a series of methods have been proposed for Chinese named entity recognition, which focus on the extraction of domain features [9,10,11]. However, due to the lack of professional knowledge, current methods cannot accurately identify hazardous chemical entities. In addition, the existing methods also ignore the identification of accident scenarios, losses, and causal entities.
To solve the above problems, we propose a Chinese hazardous-chemical-accident information entity recognition approach based on a rule template and pre-trained model. Rule templates use pattern matching and string matching to recognize the entities with structural features that are few in number (such as date and time). The entities with poor structural features but present in large numbers (such as personnel and causes) are instead identified using the Bi-LSTM-CRF model. In order to increase the recognition accuracy, we also process the source data directly using pre-trained word vector technology.
The main contributions are as follows:
  • We design a collection tool to crawl information on hazardous chemical accidents and, through pre-processing, offer data support for accident analysis.
  • We build rule templates and employ dictionaries, libraries, and other techniques to recognize entities that have structured attributes but appear in small numbers, such as time. To better learn the vector representation of text information and improve recognition performance, a pre-trained model is adopted for entities with weak structural features but large numbers, such as hazardous chemical scenarios.
  • Experimental results show that the proposed method (RT-BBC) can effectively improve the performance of entity recognition using chemical accident information.
The remainder of this paper is organized as follows. In Section 2, we summarize related work on named entity recognition. Section 3 describes in detail the rule templates and pre-trained model we propose. Section 4 describes the datasets and discusses the empirical results. In Section 5, we discuss the contributions and limitations of our model. Finally, Section 6 presents the conclusions and future work.

2. Related Work

Named entity recognition (NER) [12], also known as entity extraction, entity segmentation, and entity identification, is one of the subtasks of information extraction. The goal of NER is to recognize and categorize particular entities in unstructured text data, such as time, place, organization, and persons [13]. Named entity recognition has seen more than 20 years of research and advancement, evolving from early rule-based methods [14,15] to probability-based methods [16,17], and then progressively evolving into deep-learning-based methods. Transfer-learning-based methods are currently very popular [18].

2.1. Rule-Based Methods

The majority of rule-based methods require linguists to choose language traits and manually create rule templates. Pattern and string matching can be used to identify entities. These techniques rely on knowledge bases and domain dictionaries, which are quick and easy to use. According to the literature [19], because complicated rules are written using domain expert knowledge, only a small amount of training data is required to provide high-performance entity recognition results. A novel Chinese Electronic Medical Record (EMR) entity recognition model with a domain lexicon and rules has been proposed in the literature [20]. In the task of Chinese EMR entity recognition, this method outperforms other methods. The method presented in [21] uses regular expressions to build grammatical lexical rules based on Urdu language features, and the results show that the rule-based entity recognition approach outperforms the statistical learning method. This approach, however, has a limited range of applications and poor generalization capabilities, necessitating the creation of features for many domains by experts, and it has constraints that make creating rules difficult. As a result, most rule-based methods are combined with other named entity methods to improve entity recognition performance.

2.2. Probability-Based Methods

Probability-based methods regard the named entity recognition task as a sequence annotation problem and carry out model training with a fully or partially annotated corpus. The hidden Markov model (HMM) [22], Bayesian network [23], maximum entropy (ME) [24], support vector machine (SVM) [25], and conditional random field (CRF) [26] are some of the approaches frequently employed in named entity recognition (NER) tasks. Although the advantages of using an HMM as a generative model include easy model development and short training times, the model classification performance is poor due to the Markov hypothesis. ME requires many calculations and is rarely used in real-world scenarios, despite the fact that it does not need to consider how to use feature information or the independence hypothesis. The CRF model has the advantage of allowing the position to be identified using extensive internal information and context elements. Its disadvantage is that the computations take time. In entity recognition tasks, probability-based algorithms frequently outperform manual rule-building techniques and require less human interaction, despite the need for a large training corpus.

2.3. Deep-Learning-Based Methods

Deep learning approaches have gradually gained popularity in named entity recognition (NER) tasks in recent years [27]. Their main advantage is their outstanding vector representation capability, which helps mine deeper data features. Collobert et al. [28] were early proponents of neural networks (NN) for NER applications. They created two neural network topologies to perform NER tasks: the window approach and the sentence approach. Researchers typically employ recurrent neural networks (RNN) [29] to effectively gather and analyze the temporal information of text context. Long short term memory (LSTM) [30] was specifically designed to address the problem of long-distance dependence in RNN models. In the literature [31,32,33], the LSTM network structure was utilized to extract text semantic information, and F1 values higher than 90% were obtained on the NER task’s open datasets. The authors of [34] proposed a NER model using bi-directional long short term memory (Bi-LSTM) and conditional random field (CRF), which addresses the problem of over-reliance on manual rule making and domain-specific knowledge. The authors of [35] constructed a three-layer NER model utilizing Bi-LSTM, a convolutional neural network (CNN), and CRF; it outperforms the model in [28] in NER. The authors of [36,37] developed neural network models introducing character perception to extract character-level knowledge in order to further access the rich semantic information available in the text. When compared to the baseline technique, the models’ performance was considerably enhanced. IntNet [38] (a funnel-shaped, wide convolutional neural network structure without down-sampling) was designed to learn word features in text data, fully extract text semantic information, and improve NER performance.

2.4. Transfer-Learning-Based Methods

The goal of transfer learning [39] is to solve the lack of annotated data in the source domain. Yang et al. [40] investigated the migration capabilities of multiple deep recurrent neural network (RNN) presentation layers and used representation sharing at many levels to establish a single framework to manage cross-application, cross-language, and cross-domain migration activities. Their model performs better than the LSTM-CRF model on a public dataset for the named entity recognition (NER) task. There is a scarcity of Chinese NER labeling data. The problem can be handled utilizing an adversarial transfer learning strategy that fully uses the analogies between NER task boundary information and Chinese word segmentation, according to one paper [41].
Machine learning models can only process text input that has been transformed into a digital matrix form using the word embedding approach. Researchers discovered that pre-training with a large amount of text data produces a better vector representation of the data than training with limited datasets [42,43]. However, early word embedding approaches, known as static word vectors, such as Word2vec and GloVe, are incapable of addressing the polysemy problem. The training result is a static word vector matrix that cannot be altered dynamically, making it difficult to understand the semantics of the text accurately. The pre-trained language model (PLM) trains word vectors in two stages, first using the neural network language model, and then fine-tuning according to downstream tasks. To obtain better results, ELMo [44] utilized bi-directional long short term memory (Bi-LSTM) as a feature extractor in the early stages to extract bidirectional text information. Generative pre-training (GPT) uses a transformer feature extractor to extract one-way text information. The transformer feature extractor successfully completes machine translation tasks, and polysemy can be resolved using the trained word vectors. Bidirectional encoder representation from transformers (BERT) [45,46] extracts bidirectional text features with transformers and trains a bidirectional language model in the spirit of continuous bag of words (CBOW). In the masked language model (MLM) training approach, some tokens are randomly masked before the language model is trained. As a result, the training’s effect is substantially greater than with GPT.

3. Method

This section introduces a named hazardous-chemical-accident entity recognition approach based on rules and a pre-trained model. It also gives a full description of the data preprocessing, the rule template creation, and the Bert-Bi-LSTM-CRF model. The overall hazardous chemical accident entity recognition framework is shown in Figure 2.

3.1. Rule Templates for the First-Class Entities

The entity recognition based on rule templates mainly includes the following parts: (1) data preprocessing and (2) rule template design.
In order to analyze hazardous chemical accidents in detail, we collected and integrated accident reports from the open network platform. These reports often have problems such as lack of information, redundancy, and disordered structure. Therefore, raw report texts should be preprocessed using word segmentation, stop word removal, and parts-of-speech tagging in order to match rule templates.

3.1.1. Data Preprocessing

Word segmentation is the process of splitting a continuous sentence into a number of independent word segments. Word segmentation in Chinese differs from word segmentation in English. While there are distinct word segmentation signals between English words, there are none between Chinese words, making it difficult to separate Chinese words. Several effective word segmentation approaches are currently available, including Jieba, LTP, HanLP, THULAC, and NLPIR. Choosing the right word segmentation tool can considerably increase the accuracy of experimental results. When evaluating the word segmentation performance of various test corpora using the aforementioned approaches, Jieba and HanLP can be seen to perform the best, whereas HanLP requires more training data. As a result, the experimental data corpus uses the Jieba word segmentation tool for word segmentation.
Stop words are commonly used to describe punctuation, conjunctions, and quantifiers that lack any meaningful semantic content. In this experiment, we used a list of common stop words from the stop words list of the Harbin Institute of Technology.
The goal of parts-of-speech tagging is to recognize various types of words. Nouns, in general, contain the most different types of entities. Natural language processing tools such as Jieba, LTP, and HanLP have parts-of-speech tagging components. However, Jieba’s parts-of-speech tag set is more complete than those of the other two tools. As a result, the parts-of-speech tagging step, like the word segmentation step, employed the natural language processing tool Jieba.
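To make the preprocessing pipeline concrete, the following is a minimal Python sketch of the three steps above using Jieba; the stop-word file path and the example sentence are illustrative placeholders rather than the authors’ actual configuration.

```python
# Minimal preprocessing sketch: segmentation, stop-word removal, and POS tagging with Jieba.
import jieba.posseg as pseg

def load_stopwords(path="hit_stopwords.txt"):
    # Hypothetical path to the Harbin Institute of Technology stop-word list.
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def preprocess(text, stopwords):
    # jieba.posseg segments the sentence and tags each token's part of speech in one pass.
    pairs = [(p.word, p.flag) for p in pseg.cut(text)]
    # Drop stop words so that only tokens carrying semantic content remain.
    return [(word, flag) for word, flag in pairs if word not in stopwords]

if __name__ == "__main__":
    stopwords = load_stopwords()
    print(preprocess("液化气罐车在运输途中发生侧翻泄漏", stopwords))
    # e.g. [('液化气', 'n'), ('罐车', 'n'), ('运输', 'vn'), ('侧翻', 'v'), ('泄漏', 'v')]
```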

3.1.2. Design of Rule Templates

After word segmentation, stop word removal, and parts-of-speech tagging, the relationships between words and parts of speech may be identified. Entities with specific characteristics can be identified based on word construction, context, and part-of-speech information. Figure 3 depicts the overall process of rule templates identifying entities.
The experiment included nine different types of entity categories. Date and time in a chemical accident report are often noted as a specific day, month, year, or day of a certain month. The most frequent units of time expression are hours, minutes, and precise times. Date and time may generally be recognized by establishing rule templates because they all adhere to a set of preset building criteria.
Some of the rules satisfied by date and time entities are shown in Table 1.
The other two first-class entities can be identified by matching against a geographic information database and a hazardous chemicals database. At this point, all first-class entities have been recognized.
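As an illustration of how such rule templates can be applied in code, the following sketch matches simplified date and time patterns with regular expressions; the patterns are illustrative and not the exact rules of Table 1.

```python
# Rule-template sketch: recognize date/time entities by regular-expression matching.
import re

DATE_PATTERN = re.compile(
    r"\d{4}年\d{1,2}月\d{1,2}日"           # e.g. 2021年6月13日
    r"|\d{4}[-/.]\d{1,2}[-/.]\d{1,2}"       # e.g. 2021-06-13 or 2021/6/13
)
TIME_PATTERN = re.compile(r"(?:[01]?\d|2[0-3])[:时][0-5]\d(?:分|:[0-5]\d)?")

def match_first_class(text):
    entities = [("Date", m.group(), m.span()) for m in DATE_PATTERN.finditer(text)]
    entities += [("Time", m.group(), m.span()) for m in TIME_PATTERN.finditer(text)]
    return entities

print(match_first_class("2021年6月13日16时40分左右，液化气罐车发生爆炸"))
# [('Date', '2021年6月13日', (0, 10)), ('Time', '16时40分', (10, 16))]
```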

3.2. Pre-Trained Model for the Second-Class Entities

For the second-class entities, the rule template method would give disappointing results, as each report on a hazardous chemical accident includes a separate account of the scene and the losses that does not adhere to any established guidelines. This makes it difficult to create rules, or causes an incomplete list of rules, which has a negative impact on entity recognition. Based on this, other methods are needed to recognize the second class of entities. Statistical methods not only place a massive burden, text annotation, on researchers, but also cause the extracted features to lose semantic information from the text itself, resulting in poor entity recognition performance. As a result, we use a deep learning method. Deep neural networks are better suited to dealing with sparse text features in unstructured and dynamic data, since they can automatically extract meaningful features from the data. On this foundation, we recognize second-class entities using the Bi-LSTM-CRF model. Instead of manually extracting features, the original data can be analyzed directly using pre-trained word vector technology. The word2vec tool is used to train word vectors in the majority of the neural network models mentioned previously. The two most common training methods are skip-gram and continuous bag of words (CBOW). Skip-gram predicts the context words from the current word, whereas CBOW predicts the current word from its context words.
Although Word2vec captures the contextual characteristics of text sequences, there is still considerable room for improvement. Word2vec, GloVe, and other models are constrained by their capacity for representation. The word vectors obtained rely heavily on context co-occurrence, and the impact of word order on meaning has not been fully taken into account. As a result, Devlin et al. at Google introduced the bidirectional encoder representation BERT model based on the transformer. It employs a deep bidirectional pre-trained representation, which has had a positive effect on the field of natural language processing and can extract semantic information from text at a deeper level. We therefore propose in this study using the Bert-Bi-LSTM-CRF model to recognize the second-class named entities.
A bidirectional encoder representation from transformers (BERT) layer, bi-directional long short term memory (Bi-LSTM) layer, and conditional random field (CRF) layer are the three layers that make up the entire model, as can be seen in Figure 4. The BERT layer contains embedded input characters. The Bi-LSTM layer extracts the overall vector features of the text using the word vector features from the BERT layer. The CRF layer is used for output control to make sure that the tag sequence gives the best results possible.
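For clarity, a compact PyTorch sketch of this three-layer architecture is given below. It assumes the HuggingFace transformers package and the pytorch-crf package; the checkpoint name and tag count are placeholders, and the sketch is not the authors’ exact implementation.

```python
# Sketch of the BERT + Bi-LSTM + CRF architecture described above.
import torch.nn as nn
from transformers import BertModel
from torchcrf import CRF   # assumes the pytorch-crf package

class BertBiLstmCrf(nn.Module):
    def __init__(self, num_tags, lstm_dim=128, bert_name="bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)            # character embedding layer
        self.lstm = nn.LSTM(self.bert.config.hidden_size, lstm_dim,
                            batch_first=True, bidirectional=True)   # sentence-level features
        self.fc = nn.Linear(2 * lstm_dim, num_tags)                 # per-character tag scores
        self.crf = CRF(num_tags, batch_first=True)                  # label-transition constraints

    def forward(self, input_ids, attention_mask, tags=None):
        hidden = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        feats, _ = self.lstm(hidden)
        emissions = self.fc(feats)
        mask = attention_mask.bool()
        if tags is not None:                                        # training: NLL loss from the CRF
            return -self.crf(emissions, tags, mask=mask, reduction="mean")
        return self.crf.decode(emissions, mask=mask)                # inference: best tag sequences
```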

3.2.1. BERT Pre-Trained Language Model

Pre-trained language models have advanced quickly in recent years. Pre-trained models are crucial for many natural language processing (NLP) tasks and primarily have the following two qualities: (1) they can be trained using a huge unlabeled text corpus; (2) no task-specific network structure needs to be created in order to use them for a variety of downstream NLP tasks, and selecting one of several predetermined network architectures for fine-tuning can yield good results. According to Figure 5, the coding vector input to the BERT layer (maximum length 512) is the element-wise sum of three embeddings. Token embeddings represent words as vectors; the first token is the [CLS] flag, which can be used for later downstream NLP operations. Segment embeddings form a sentence vector that is used in classification tasks to distinguish between two sentences. Position embeddings represent the position vectors that the BERT model has learned.
Bidirectional encoder representation from transformers’ (BERT) network architecture adopts a multi-layer transformer structure. It uses an attention mechanism instead of conventional recurrent neural networks (RNNs) and convolutional neural networks (CNNs) to convert the distance between two words at any position into one. This effectively handles the challenge of long-term dependence in the process of feature extraction. The BERT network structure is shown in Figure 6, where $S_0, S_1, \ldots, S_n$ are the input vectors of the model; $W_0, W_1, \ldots, W_n$ are the output vectors of the model; and Trm is the transformer module.
Bert is a multi-task model including two self-supervised tasks: masked language model (MLM) and next sentence prediction (NSP).
The masked language model (MLM) describes the process of masking some words from the input corpus during the training stage, for example, “refined benzene tanker side [mask]”, and then predicting the masked words based on context. In the BERT experiments, 15% of the words in the training samples were randomly selected for masking. Sentences are fed into the model repeatedly in order to learn the model parameters. However, these words are not always hidden. Specifically, 80% of the selected words are replaced with the mask character, 10% retain the original character, and the remaining 10% are replaced with a random word. This is because, if a character in a sentence were masked 100% of the time, some unregistered words would appear when fine-tuning the model. The reason for adding random characters is that the transformer structure must maintain a distributed representation of each input character.
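A small sketch of this 80/10/10 masking scheme, applied to a list of token ids, is shown below; the [MASK] id and vocabulary size are illustrative values for a Chinese BERT vocabulary, not parameters taken from the paper.

```python
# Sketch of BERT's MLM masking: 15% of tokens are selected; of these,
# 80% become [MASK], 10% become a random token, and 10% stay unchanged.
import random

MASK_ID, VOCAB_SIZE = 103, 21128   # illustrative values

def mask_tokens(token_ids, mask_prob=0.15):
    masked, labels = list(token_ids), [-100] * len(token_ids)   # -100: position ignored by the loss
    for i, tid in enumerate(token_ids):
        if random.random() < mask_prob:
            labels[i] = tid                  # the model must recover the original token here
            r = random.random()
            if r < 0.8:
                masked[i] = MASK_ID                            # 80%: replace with [MASK]
            elif r < 0.9:
                masked[i] = random.randrange(VOCAB_SIZE)       # 10%: replace with a random token
            # remaining 10%: keep the original character
    return masked, labels
```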
The purpose of next sentence prediction (NSP) is to determine whether sentence B follows sentence A. If sentence B follows sentence A, the model outputs “is next”; otherwise, it outputs “not next”. Taking “[CLS] refined benzene tank car rollover [SEP] no one is trapped” as an example, the output of NSP is “is next”, while for “[CLS] refined benzene tank car rollover [SEP] actual load of sand is 32 tons”, the output is “not next”. Pairs of consecutive sentences chosen at random from the corpus serve as the training data. With 50% probability, the two extracted sentences remain as they are, meaning that they are related contextually. With the other 50% probability, sentence B is chosen at random from the corpus and does not relate to sentence A contextually.

3.2.2. Bi-LSTM Model

Recurrent neural networks (RNNs) are frequently utilized in the named entity recognition (NER) task to solve such sequence annotation issues. However, it is challenging for the model to capture long-term dependencies when the sequence length is too long, as the gradient vanishes. Long short term memory (LSTM) is a significant advance over the conventional RNN. It introduces a gating mechanism to regulate information input and output and a memory unit to learn which data should be remembered and which forgotten during training. Therefore, LSTM can better capture long-distance dependencies and further learn the semantic feature information in the text.
At time $t$, the long short term memory (LSTM) model is made up of the following components: the input word $W_t$, the cell state $C_t$, the temporary cell state $\tilde{C}_t$, the hidden layer state $h_t$, the forget gate $f_t$, the input gate $i_t$, and the output gate $o_t$. Equations (1)–(7) show the calculation formulas for each of them. Figure 7 depicts the LSTM model’s frame diagram.
Forget gate:
$$f_t = \sigma\left(U_f \cdot [h_{t-1}, W_t] + b_f\right) \quad (1)$$
Input gate:
$$i_t = \sigma\left(U_i \cdot [h_{t-1}, W_t] + b_i\right) \quad (2)$$
$$\tilde{C}_t = \tanh\left(U_c \cdot [h_{t-1}, W_t] + b_c\right) \quad (3)$$
Cell state:
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \quad (4)$$
Output gate:
$$o_t = \sigma\left(U_o \cdot [h_{t-1}, W_t] + b_o\right) \quad (5)$$
$$h_t = o_t \odot \tanh\left(C_t\right) \quad (6)$$
$$Y_t = \mathrm{softmax}\left(h_t\right) \quad (7)$$
where $\sigma$ (sigmoid), $\tanh$, and softmax are nonlinear activation functions, $W_t$ represents the input characteristics of the current network, $U$ represents the weight matrices of the three gates, and $b$ represents the bias terms of the three gates.
The forget gate determines how much of the unit state $C_{t-1}$ of the last moment is retained in the current state $C_t$. The input gate determines how much of the current network input $W_t$ is saved to the unit state $C_t$. The output gate controls how much the unit state $C_t$ contributes to the current output value $h_t$ of the long short term memory (LSTM). The final output consists of the current cell state $C_t$, the current output value $h_t$, and the current prediction $Y_t$ obtained by the softmax function.
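The following NumPy sketch transcribes Equations (1)–(7) directly as one LSTM time step; the dictionaries U and b holding the gate weights and biases are an illustrative packaging, not the authors’ code.

```python
# One LSTM time step, following Equations (1)-(7).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(w_t, h_prev, c_prev, U, b):
    """U and b are dicts keyed by 'f', 'i', 'c', 'o' (gate weight matrices and biases)."""
    x = np.concatenate([h_prev, w_t])                  # [h_{t-1}, W_t]
    f_t = sigmoid(U["f"] @ x + b["f"])                 # forget gate, Eq. (1)
    i_t = sigmoid(U["i"] @ x + b["i"])                 # input gate, Eq. (2)
    c_tilde = np.tanh(U["c"] @ x + b["c"])             # temporary cell state, Eq. (3)
    c_t = f_t * c_prev + i_t * c_tilde                 # cell state, Eq. (4)
    o_t = sigmoid(U["o"] @ x + b["o"])                 # output gate, Eq. (5)
    h_t = o_t * np.tanh(c_t)                           # hidden state, Eq. (6)
    y_t = np.exp(h_t) / np.exp(h_t).sum()              # softmax prediction, Eq. (7)
    return h_t, c_t, y_t
```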
A one-way long short term memory (LSTM) model, however, is unable to process context data concurrently. The bidirectional long short term memory (Bi-LSTM) proposed by Graves et al. [26] takes forward and backward LSTM for each word sequence, combining the output simultaneously. As a result, it contains both forward and backward information for each moment. Figure 8 depicts its model structure.

3.2.3. CRF Model

The bidirectional long short term memory (Bi-LSTM) is effective at handling long-distance text information in the named entity recognition task, but it struggles with the dependency between adjacent tags. Bi-LSTM’s drawbacks can be remedied by conditional random field (CRF), which can obtain an ideal prediction sequence through the relationship between adjacent labels. CRF reasoning layer is thus added following the Bi-LSTM network layer.
A conditional random field is a conditional probability distribution model $P(Z \mid Y)$, which is used to predict an output sequence $Z$ given an input sequence $Y$.
Given a training set $Y$ and the corresponding label sequence $Z$, a linear CRF with $K$ feature functions $f_k(y, z)$ learns the model parameters $v_k$ and the conditional probability $P_v(z \mid y)$, where $Z_v(y)$ is the normalization factor. The conditional probability $P_v(z \mid y)$ and the model parameters $v_k$ satisfy the relationship shown in Equation (8):
$$P_v(z \mid y) = P(z \mid y) = \frac{1}{Z_v(y)} \exp\left(\sum_{k=1}^{K} v_k f_k(y, z)\right) = \frac{\exp\left(\sum_{k=1}^{K} v_k f_k(y, z)\right)}{\sum_{z} \exp\left(\sum_{k=1}^{K} v_k f_k(y, z)\right)} \quad (8)$$
According to Equation (8), given the conditional probability $P(z \mid y)$ of the conditional random field and an observation sequence $y = (y_0, y_1, \ldots, y_n)$, the label sequence $z$ that maximizes $P(z \mid y)$ can be found. Sentence-level sequence features can be fully utilized by conditional random field (CRF) models to combine contextual information. Some constraints are applied to the label prediction of the sequence labeling results in order to ensure the reliability of the output labels. For example, the label result cannot contain the beginning of a person entity (B-Person) followed by the inside of a chemical entity (I-Chemical). The output results may be in the wrong sequence if we simply use the bidirectional long short term memory (Bi-LSTM) layer, since the Bi-LSTM model can only learn character-level features and lacks feature analysis at the complete sentence level; in other words, B-Person may be followed by I-Chemical. Such issues can be avoided by adding a CRF reasoning layer after the Bi-LSTM network layer. Additionally, during data training, the CRF layer can automatically learn these constraints.
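To illustrate how the CRF layer selects a tag sequence that respects such constraints, the following is a compact Viterbi decoding sketch; the transition matrix, which would assign very low scores to illegal pairs such as B-Person followed by I-Chemical, is assumed to have been learned during training.

```python
# Viterbi decoding sketch for the CRF layer.
import numpy as np

def viterbi_decode(emissions, transitions):
    """emissions: (seq_len, num_tags) scores; transitions[i, j]: score of moving from tag i to tag j."""
    seq_len, num_tags = emissions.shape
    score = emissions[0].copy()                       # best score ending in each tag at step 0
    backptr = np.zeros((seq_len, num_tags), dtype=int)
    for t in range(1, seq_len):
        # score of every (previous tag -> current tag) pair at step t
        total = score[:, None] + transitions + emissions[t][None, :]
        backptr[t] = total.argmax(axis=0)             # best previous tag for each current tag
        score = total.max(axis=0)
    best = [int(score.argmax())]                      # best final tag
    for t in range(seq_len - 1, 0, -1):               # follow back-pointers to recover the path
        best.append(int(backptr[t, best[-1]]))
    return best[::-1]
```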

4. Experiments

4.1. Experimental Data Description

4.1.1. Data Acquisition

At present, chemical accident data is sparsely distributed, and there are few public datasets related to chemical accidents on the network platform. Therefore, we use crawler technology to collect experimental data from public network platforms such as chemical accident information network. The crawler process is depicted in Figure 9.
First, we send a request to the target site through the HTTP library, then we wait for the server to respond. The requested content can contain keywords and other information. Second, the server normally processes the request content and returns a response to the client. Third, the client obtains the target URL queue by parsing the response data, and obtains the data on the target URL web page. The data types here can be JSON data, binary data, HTML files, and so on, which can be parsed by corresponding tools. Finally, we choose the appropriate database to save the data from the web page. The chemical accident data collected in the experiment is text data, and we store it in MySQL relational database.
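The following sketch mirrors these four steps with common Python libraries (requests, BeautifulSoup, PyMySQL); the URL, CSS selectors, table schema, and credentials are placeholders, not the actual structure of the chemical accident information network.

```python
# Crawler sketch: request a report page, parse it, and store the text in MySQL.
import requests
from bs4 import BeautifulSoup
import pymysql

def crawl_report(url):
    resp = requests.get(url, timeout=10)                        # steps 1-2: request and response
    resp.encoding = resp.apparent_encoding
    soup = BeautifulSoup(resp.text, "html.parser")              # step 3: parse the returned HTML
    title = soup.select_one("h1").get_text(strip=True)          # placeholder selectors
    body = "\n".join(p.get_text(strip=True) for p in soup.select("div.content p"))
    return title, body

def save_report(title, body):
    conn = pymysql.connect(host="localhost", user="root", password="***",
                           database="accidents", charset="utf8mb4")
    try:
        with conn.cursor() as cur:                              # step 4: save the text data
            cur.execute("INSERT INTO reports (title, body) VALUES (%s, %s)", (title, body))
        conn.commit()
    finally:
        conn.close()
```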
Data corpus: 2520 chemical accident reports from the chemical accident information network were collected, cleaned, and sorted.

4.1.2. Data Annotation

Common data annotation modes in named entity recognition (NER) tasks include the following two types.
  • BIO sequence labeling mode: B-begin, I-inside, O-outside. In this scheme, “B” marks the beginning of an entity, “I” marks its middle (including its end), and “O” marks the non-entity portion;
  • BIOES sequence labeling mode: B-begin, I-inside, O-outside, E-end, S-single. Here, “B” represents the beginning of the entity, “I” represents the middle of the entity, “E” represents the end of the entity, “O” represents the non-entity part, and “S” represents an entity composed of a single character.
The BIOES annotation mode has advantages when dealing with datasets with few entity categories. However, when there are many entity categories, the BIO annotation mode can obtain entity recognition results with little difference in accuracy in a shorter time.
The data annotation phase of the experiment used the brat annotation tool. In order to evaluate the entity recognition performance later, all entities were annotated during the annotation process. The annotation process is displayed in Figure 10 (all accident information is in Chinese; the English part is translated for reference only). The results of data annotation can be converted into BIO or BIOES format as needed, which is convenient for model training and recognition.
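As a small illustration of how character-offset annotations (such as those exported by brat) can be converted to BIO tags for training, consider the sketch below; the entity tuples and example string are hypothetical.

```python
# Convert (label, start, end) character-offset annotations into BIO tags.
def to_bio(text, entities):
    tags = ["O"] * len(text)
    for label, start, end in entities:         # end is exclusive, e.g. ("Chemical", 0, 3)
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return list(zip(text, tags))

print(to_bio("液化气罐车侧翻", [("Chemical", 0, 3)]))
# [('液', 'B-Chemical'), ('化', 'I-Chemical'), ('气', 'I-Chemical'),
#  ('罐', 'O'), ('车', 'O'), ('侧', 'O'), ('翻', 'O')]
```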

4.2. Evaluation Metrics

To evaluate whether the boundary of an entity is correctly recognized and whether its type is accurately labeled, the precision rate, recall rate, and F1-score are chosen as evaluation metrics.
There are four types of outcome in entity recognition results: (1) if the predicted value is an entity and the real value is an entity, the prediction is a true positive (TP); (2) if the predicted value is an entity and the real value is a non-entity, the prediction is a false positive (FP); (3) if the predicted value is a non-entity and the real value is an entity, the prediction is a false negative (FN); (4) if the predicted value is a non-entity and the real value is a non-entity, the prediction is a true negative (TN). The precision rate, recall rate, and F1-score can thus be calculated as in Equations (9)–(11):
$$\mathrm{Precision} = \frac{TP}{TP + FP} \quad (9)$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \quad (10)$$
$$F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \quad (11)$$
Precision is the proportion of correctly recognized entities among all recognized entities. Recall is the proportion of correctly recognized entities among all actual entities. The F1 value is the harmonic mean of precision and recall. The larger the precision, recall, and F1 values are, the better the overall recognition performance is.
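A short sketch of how these entity-level metrics can be computed is given below; predicted and gold entities are assumed to be sets of (type, start, end) tuples, which is one common convention rather than the authors’ exact evaluation script.

```python
# Entity-level precision, recall and F1 following Equations (9)-(11).
def prf1(predicted, gold):
    tp = len(predicted & gold)                 # entities predicted correctly
    fp = len(predicted - gold)                 # predicted but not in the gold annotation
    fn = len(gold - predicted)                 # gold entities that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(prf1({("Date", 0, 10), ("Time", 10, 16)}, {("Date", 0, 10), ("Place", 20, 25)}))
# (0.5, 0.5, 0.5)
```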

4.3. Experimental Results and Analysis

The dataset used in the experiment was 2520 accident reports obtained from the chemical accident information network after cleaning and sorting. Nine types of named entities from the dataset were expected to be recognized in the experiment. For entities with structured features such as date, time, place, and hazardous chemicals, we constructed rule templates to recognize them; and the other entities are recognized by the Bert-Bi-LSTM-CRF method.

4.3.1. Parameter Analysis

Analysis of corpus proportion division: Before conducting an experimental analysis, a reasonable division of the training set, verification set, and test set is crucial in the fields of machine learning and deep learning. When the corpus’s sample size is small, it is common to use between 2/3 and 4/5 of the data for training and the remaining data for testing. In order to reduce the impact of data division on the experimental results, we validated the results of three different corpus division ratios, as shown in Figure 11.
The experimental corpus performs best with a segmentation ratio of 6:2:2, as shown in Figure 11. Thus, the corpus of 2520 chemical accident reports was separated into a training set of 1510 reports, a validation set of 500 reports, and a test set of 510 reports.
Analysis of Bert pretraining: Pre-trained BERT models currently exist in three types: English, Chinese, and multilingual. BERT-Base and BERT-Large models are provided, and the corresponding hyperparameters are shown in Table 2.
The BERT-Base model supports both simplified and traditional Chinese, so we adopted BERT-Base. The dimension of the bi-directional long short term memory (Bi-LSTM) was set to 128, and the training learning rate of BERT was set to 5 × 10⁻⁵. Due to BERT’s strong feature fitting ability, the training learning rate of the Bi-LSTM and the conditional random field (CRF) was set to 5 × 10⁻², and the warm-up proportion of the learning rate was set to 0.1. Dropout was adopted to prevent over-fitting in the experiments; the dropout value was set to 0.5.
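Gathered into one place, the training settings from this subsection would look roughly like the configuration sketch below; the parameter names follow common fine-tuning scripts and are not the authors’ exact code.

```python
# Training configuration summarized from Section 4.3.1.
CONFIG = {
    "bert_model": "bert-base-chinese",     # BERT-Base checkpoint covering simplified and traditional Chinese
    "lstm_hidden_dim": 128,                # Bi-LSTM dimension
    "bert_learning_rate": 5e-5,            # learning rate for the BERT layer
    "downstream_learning_rate": 5e-2,      # learning rate for the Bi-LSTM and CRF layers
    "warmup_proportion": 0.1,              # learning-rate warm-up
    "dropout": 0.5,                        # dropout to prevent over-fitting
    "train_size": 1510,                    # 6:2:2 split of the 2520 reports
    "dev_size": 500,
    "test_size": 510,
}
```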

4.3.2. Overall Performance

In this study, we propose a named entity recognition approach based on rule templates and Bert-Bi-LSTM-CRF. We contrast our approach with generic entity recognition models and deep-learning-based entity recognition models. Examples of traditional general models include the LTP model and the HanLP model. Since both of these models make use of probability and statistical concepts, such as the hidden Markov model and the cascaded hidden Markov model, they can produce the final identification result without training. Deep-learning-based named entity models often employ the LSTM, Bi-LSTM, and Bi-LSTM-CRF models. The deep learning models take character embeddings as input, and training is required before achieving entity recognition.
The same experimental corpus was used for all trials. Only three categories of entities could be identified by the conventional general models: people’s names, places, and organizations. Since the corpus contains no person-name entities, only the two categories of place and organization are counted when comparing with the conventional models. Entity recognition is deemed effective if the result is contained within the labeled entity. The comparison of entity recognition performance among the model RT-BBC proposed in this paper, LTP [49], HanLP [50], LatticeLSTM [51], LexiconAugmentedNER [52], and ME-CNER [53] is shown in Table 3.
From Table 3, it is clear that RT-BBC has high F1 values for place and organization entities, demonstrating high sensitivity to these entity types. The effectiveness of the traditional models’ entity recognition depends on upstream natural language processing tasks: the outcomes of parts-of-speech tagging and word segmentation have a direct impact on how accurately entities are recognized. Instead of relying on the outcomes of upstream tasks, the model in this study directly takes the entire sentence as input to gather feature information, boosting organization entity recognition performance in the process. The efficiency of RT-BBC is further demonstrated by the fact that it outperformed the three deep learning methods in all three measures. The recognition performances for the nine categories in the corpus are shown in Table 4.
As can be seen in Table 4, the average F1 value for the four categories of entities recognized with rules was 86.23%, better than that of the LTP model and that of the HanLP model. The precision of date entity recognition reached 97.20%. Date entities can be easily identified using the rule template method because of their high degree of regularity. The overall recognition performance of the method based on deep learning was poorer than that based on rule templates. This is because the entities recognized by the deep learning method do not have distinguishing traits, and the number of such entities is relatively small. Despite the large number of organization entities, each entity is long, at roughly 15 words, and it is easy for other ambiguous words to be nested in the entity, so the entity is difficult to identify and its F1 value is relatively low. For entities such as reasons, losses, and scenes, the recognition performance was poor due to the small number of entities and their unclear boundaries. In general, the overall F1 value of this model was 73.83%, and its entity recognition performance in the field of hazardous chemicals is significantly higher than that of the current popular entity recognition methods.

4.3.3. Ablation Study

The experimental results of the LSTM model, Bi-LSTM model, Bi-LSTM-CRF model, and BERT-Bi-LSTM-CRF model are shown in Table 5.
It can be seen in Table 5 that the Bi-LSTM model performed entity recognition better than the LSTM model, indicating the importance of the semantic information provided by the text context. The recognition performance was enhanced by 5.16% once the CRF model was added; the CRF applies various constraints to the predicted labels of the sequence annotation results to check the correctness of the output labels. The performance of entity recognition is further enhanced by switching the text embedding layer to BERT embeddings, which demonstrates the significance of the BERT pre-trained language model.

5. Discussion

5.1. Contribution

In this study, we designed collection tools to crawl hazardous-chemical-accident information. Additionally, we proposed a rule template and Bert-BiLSTM-CRF based model (RT-BBC) to construct a NER model specifically for the Chinese hazardous chemicals domain. With RT-BBC, named entities such as reason, scene, and loss that are not considered in other models can now be identified. As described in Section 4.3.2, the experimental results based on real data show that our model is superior to the compared models in terms of entity recognition precision and F1. In summary, this work provides a NER model for the field of hazardous chemical accidents and provides new ideas for analysis in this field. At the same time, it lays a solid foundation for the next step of identifying the relationships between hazardous chemical accident entities and building a knowledge graph of hazardous chemical accidents.

5.2. Limitations

The model based on a rule template and Bert-BiLSTM-CRF (RT-BBC) is effective when applied to the area of Chinese hazardous chemical incidents. This methodology enables us to analyze accident data in a more thorough manner. However, labeling a large amount of data using the pre-training model is unavoidable. A lot of labor and resources are needed for data annotation. In addition, the model was created for Chinese hazardous chemical accidents; thus, it is not entirely transferable to other similar situations. Relabeling the relevant data and retraining the model are required if this model is to be used in different countries. Therefore, there are some domain restrictions with the current RT-BBC. Although it has played an important role in the field of Chinese hazardous chemical accidents, it cannot undertake tasks outside this domain, such as recognizing the causes of accidents caused by non-hazardous chemicals and scene entities.

6. Conclusions and Future Work

In order to reduce the likelihood of chemical accidents, we worked toward a knowledge graph by analyzing chemical accident data automatically and finding the causes of accidents. To this end, we proposed a named entity recognition method combining a rule template and a pre-trained model. First, we pre-processed the text data from hazardous chemical accidents by cleaning, annotating, word-segmenting, and removing stop words. Second, we designed rule templates to recognize entities with structural features that were few in number. Third, we adopted a pre-trained model to recognize entities with weak structural features that are large in number. Finally, thorough tests performed on actual datasets showed how well our proposed method works.
The rule-based and pre-trained models used to identify Chinese hazardous chemical accident entities still require a large amount of labeled data. In the future, we will consider reducing the need for data annotation by utilizing active learning or transfer learning. Meanwhile, nested named entities such as "Longjiang County Public Security Bureau" and "local fire protection, first aid, public security, gas and other departments" frequently show up in hazardous chemical reports, so it would also be worthwhile to learn how to recognize such complex nested entities. The named entity recognition model we put forward can be successfully applied to the study of hazardous chemical accidents in China, making it helpful for other countries that want to do the same. Following the same idea, comparable models can be used in domains other than Chinese hazardous chemical incidents, where they could provide various advantages.

Author Contributions

Validation, H.S., Y.N. and B.C.; writing—original draft, H.D.; Project administration, M.Z. and G.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under grant 71774159, the State Key Laboratory of NBC Protection for Civilian under grant number SKLNBC2020-23 and the Jiangsu Postdoctoral Science Foundation under grant number 2021K565C.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that support the findings of this study can be obtained through the following link: https://github.com/D-hui78/Chinese-Hazardous-Chemical-Accident-Data (accessed on 29 July 2022).

Acknowledgments

The authors all thank Zhixiao Wang of China University of Mining and Technology for his kind help with improving the English.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
NER	Named Entity Recognition
BERT	Bidirectional Encoder Representation from Transformers
Bi-LSTM	Bidirectional Long Short-Term Memory
CRF	Conditional Random Field
RT-BBC	Rule Template and Bert-BiLSTM-CRF based model
EMR	Electronic Medical Record
HMM	Hidden Markov Model
ME	Maximum Entropy
SVM	Support Vector Machine
NN	Neural Network
RNN	Recurrent Neural Network
LSTM	Long Short-Term Memory
CNN	Convolutional Neural Network
PLM	Pre-trained Language Model
GPT	Generative Pre-Training
CBOW	Continuous Bag of Words
MLM	Masked Language Model
NLP	Natural Language Processing
NSP	Next Sentence Prediction

References

  1. Abbasi, A.R.; Mahmoudi, M.R. Application of statistical control charts to discriminate transformer winding defects. Electr. Power Syst. Res. 2021, 191, 106890. [Google Scholar] [CrossRef]
  2. Abbasi, A.R.; Mahmoudi, M.R.; Arefi, M.M. Transformer winding faults detection based on time series analysis. IEEE Trans. Instrum. Meas. 2021, 70, 1–10. [Google Scholar] [CrossRef]
  3. Mahmoudi, M.; Nematollahi, A.; Soltani, A. On the detection and estimation of the simple harmonizable processes. Iran. J. Sci. Technol. 2015, 39, 239. [Google Scholar]
  4. Wang, B.; Wu, C.; Reniers, G.; Huang, L.; Kang, L.G.; Zhang, L.B. The future of hazardous chemical safety in China: Opportunities, problems, challenges and tasks. Sci. Total. Environ. 2018, 643, 1–11. [Google Scholar] [CrossRef]
  5. Hou, J.; Gai, W.M.; Cheng, W.Y.; Deng, Y.F. Hazardous chemical leakage accidents and emergency evacuation response from 2009 to 2018 in China: A review. Saf. Sci. 2021, 135, 105101. [Google Scholar] [CrossRef]
  6. Wang, B.; Li, D.L.; Wu, C. Characteristics of hazardous chemical accidents during hot season in China from 1989 to 2019: A statistical investigation. Saf. Sci. 2020, 129, 104788. [Google Scholar] [CrossRef]
  7. Wang, R.J.; Xu, K.L.; Xu, Y.Y.; Wu, Y.J. Study on prediction model of hazardous chemical accidents. J. Loss. Prevent. Proc. 2020, 66, 104183. [Google Scholar] [CrossRef]
  8. Nasar, Z.; Jaffry, S.W.; Malik, M.K. Named entity recognition and relation extraction: State-of-the-art. ACM Comput. Surv. 2021, 54, 1–39. [Google Scholar] [CrossRef]
  9. Kryvinska, N. An analytical approach for the modeling of real-time services over IP network. Math. Comput. Simulat. 2008, 79, 980–990. [Google Scholar] [CrossRef]
  10. Beshley, M.; Kryvinska, N.; Seliuchenko, M.; Beshley, H.; Shakshuki, E.M.; Yasar, A.U.H. End-to-End QoS “smart queue” management algorithms and traffic prioritization mechanisms for narrow-band internet of things services in 4G/5G networks. Sensors 2020, 20, 2324. [Google Scholar] [CrossRef] [Green Version]
  11. Fedushko, S.; Mastykash, O.; Syerov, Y.; Peracek, T. Model of user data analysis complex for the management of diverse web projects during crises. Appl. Sci. 2020, 10, 9122. [Google Scholar] [CrossRef]
  12. Cheng, J.R.; Liu, J.X.; Xu, X.B.; Xia, D.W.; Liu, L.; Sheng, V.S. A review of Chinese named entity recognition. KSII. Trans. Internet. Inf. 2021, 15, 2012–2030. [Google Scholar]
  13. Humbel, M.; Nyhan, J.; Vlachidis, A.; Sloan, K.; Ortolja-Baird, A. Named-entity recognition for early modern textual documents: A review of capabilities and challenges with strategies for the future. J. DOC 2021, 77, 1–6. [Google Scholar] [CrossRef]
  14. Dias, M.; Boné, J.; Ferreira, J.C.; Ribeiro, R.; Maia, R. Named entity recognition for sensitive data discovery in Portuguese. Appl. Sci. 2020, 10, 2303. [Google Scholar] [CrossRef] [Green Version]
  15. Pushpalatha, M.; Thanamani, A.S. Rule Based kannada named entity recognition. J. Crit. Rev. 2019, 7, 2020. [Google Scholar]
  16. Alves-Pinto, A.; Demus, C.; Spranger, M.; Labudde, D.; Hobley, E. Iterative Named Entity Recognition with Conditional Random Fields. Appl. Sci. 2021, 12, 330. [Google Scholar] [CrossRef]
  17. Ronran, C.; Lee, S.; Jang, H.J. Delayed combination of feature embedding in bidirectional LSTM CRF for NER. Appl. Sci. 2020, 10, 7557. [Google Scholar] [CrossRef]
  18. Li, J.; Sun, A.X.; Han, J.L.; Li, C.L. A survey on deep learning for named entity recognition. IEEE Trans. Knowl. Data Eng. 2020, 34, 50–70. [Google Scholar] [CrossRef] [Green Version]
  19. Kejriwal, M.; Shao, R.; Szekely, P. Expert-guided entity extraction using expressive rules. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, Paris, France, 21–28 July 2019; pp. 1353–1356. [Google Scholar]
  20. Li, Y.; Du, G.D.; Xiang, Y.; Li, S.Z.; Ma, L.; Shao, D.G.; Wang, X.B.; Chen, H.Y. Towards Chinese clinical named entity recognition by dynamic embedding using domain-specific knowledge. J. Biomed. Inform. 2020, 106, 103435. [Google Scholar] [CrossRef]
  21. Kanwal, S.; Malik, K.; Shahzad, K.; Aslam, F.; Nawaz, Z. Urdu named entity recognition: Corpus generation and deep learning applications. ACM Trans. Asian Low-Resour. 2019, 19, 1–13. [Google Scholar] [CrossRef] [Green Version]
  22. Grewal, J.K.; Krzywinski, M.; Altman, N. Markov models—Hidden Markov models. Nat. Methods 2019, 16, 795–796. [Google Scholar] [CrossRef] [PubMed]
  23. Goyal, A.; Gupta, V.; Kumar, M. Analysis of different supervised techniques for named entity recognition. In Proceedings of the International Conference on Advanced Informatics for Computing Research, Shimla, India, 15–16 June 2019; pp. 184–195. [Google Scholar]
  24. Iftikhar, A.; Jaffry, S.W.; Malik, M.K. Information mining from criminal judgments of lahore high court. IEEE Access 2019, 7, 59539–59547. [Google Scholar] [CrossRef]
  25. Muhammad, M.; Rohaim, M.; Hamouda, A.; Abdel-Mageid, S. A comparison between conditional random field and structured support vector machine for Arabic named entity recognition. J. Comput. Sci. 2020, 16, 117–125. [Google Scholar] [CrossRef]
  26. Vo, A.D.; Nguyen, Q.P.; Ock, C.Y. Semantic and syntactic analysis in learning representation based on a sentiment analysis model. Appl. Intell. 2020, 50, 663–680. [Google Scholar] [CrossRef]
  27. Yadav, V.; Bethard, S. A survey on recent advances in named entity recognition from deep learning models. arXiv 2019, arXiv:1910.11470. [Google Scholar]
  28. Collobert, R.; Weston, J.; Bottou, L.; Karlen, M.; Kavukcuoglu, K.; Kuksa, P. Natural language processing (almost) from scratch. J. Mach. Learn. Res. 2011, 12, 2493–2537. [Google Scholar]
  29. Yu, Y.; Si, X.S.; Hu, C.H.; Zhang, J.X. A review of recurrent neural networks: LSTM cells and network architectures. Neural. Comput. 2019, 31, 1235–1270. [Google Scholar] [CrossRef] [PubMed]
  30. Van-Houdt, G.; Mosquera, C.; Nápoles, G. A review on the long short-term memory model. Artif. Intell. Rev. 2020, 53, 5929–5955. [Google Scholar] [CrossRef]
  31. Jin, G.Z.; Yu, Z.Z. A Korean named entity recognition method using Bi-LSTM-CRF and masked self-attention. Comput. Speech Lang. 2021, 65, 101134. [Google Scholar] [CrossRef]
  32. Huang, Z.H.; Xu, W.; Yu, K. Bidirectional LSTM-CRF models for sequence tagging. arXiv 2015, arXiv:1508.01991. [Google Scholar]
  33. Zhao, D.Y.; Huang, J.M.; Jia, Y. Chinese name entity recognition using Highway-LSTM-CRF. In Proceedings of the 2018 International Conference on Algorithms, Computing and Artificial Intelligence, Sanya, China, 21–23 December 2018; pp. 1–5. [Google Scholar]
  34. Tang, P.; Yang, P.L.; Shi, Y.; Zhou, Y.; Lin, F.; Wang, Y. Recognizing Chinese judicial named entity using BiLSTM-CRF. In Proceedings of the Journal of Physics: Conference Series, Kunming, China, 20–22 May 2020; p. 012040. [Google Scholar]
  35. Moqurrab, S.A.; Ayub, U.; Anjum, A.; Asghar, S.; Srivastava, G. An accurate deep learning model for clinical entity recognition from clinical notes. IEEE J. Biomed. Health. 2021, 25, 3804–3811. [Google Scholar] [CrossRef] [PubMed]
  36. Liu, H.T.; Song, J.H.; Peng, W.M.; Sun, J.B.; Xin, X.W. TFM: A Triple Fusion Module for Integrating Lexicon Information in Chinese Named Entity Recognition. Neural Process. Lett. 2022, 54, 3425–3442. [Google Scholar] [CrossRef]
  37. Niu, J.H.; Yang, Y.H.; Zhang, S.H.; Sun, Z.Y.; Zhang, W.S. Multi-task character-level attentional networks for medical concept normalization. Neural Process. Lett. 2019, 49, 1239–1256. [Google Scholar] [CrossRef]
  38. Yan, R.G.; Jiang, X.; Dang, D.P. Named entity recognition by using XLNet-BiLSTM-CRF. Neural Process. Lett. 2021, 53, 3339–3356. [Google Scholar] [CrossRef]
  39. Shoeleh, F.; Asadpour, M. Skill based transfer learning with domain adaptation for continuous reinforcement learning domains. Appl. Intell. 2020, 50, 502–518. [Google Scholar] [CrossRef]
  40. Yang, Z.L.; Salakhutdinov, R.; Cohen, W.W. Transfer learning for sequence tagging with hierarchical recurrent networks. arXiv 2017, arXiv:1703.06345. [Google Scholar]
  41. Kang, K.; Tian, S.W.; Yu, L. Named entity recognition of local adverse drug reactions in Xinjiang based on transfer learning. J. Intell. Fuzzy. Syst. 2021, 40, 8899–8914. [Google Scholar] [CrossRef]
  42. Huang, J.X.; Li, C.Y.; Subudhi, K.; Jose, D.; Balakrishnan, S.; Chen, W.Z.; Peng, B.L.; Gao, J.F.; Han, J.W. Few-shot named entity recognition: A comprehensive study. arXiv 2020, arXiv:2012.14978. [Google Scholar]
  43. Qiao, B.; Zou, Z.Y.; Huang, Y.; Fang, K.; Zhu, X.H.; Chen, Y.M. A joint model for entity and relation extraction based on BERT. Neural. Comput. Appl. 2022, 34, 3471–3481. [Google Scholar] [CrossRef]
  44. Peng, Y.F.; Yan, S.K.; Lu, Z.Y. Transfer learning in biomedical natural language processing: An evaluation of BERT and ELMo on ten benchmarking datasets. arXiv 2019, arXiv:1906.05474. [Google Scholar]
  45. Souza, F.; Nogueira, R.; Lotufo, R. Portuguese named entity recognition using BERT-CRF. arXiv 2019, arXiv:1909.10649. [Google Scholar]
  46. Zhao, S.; Zhang, T.Y.; Hu, M.; Chang, W.; You, F.C. AP-BERT: Enhanced pre-trained model through average pooling. Appl. Intell. 2022, 52, 15929–15937. [Google Scholar] [CrossRef]
  47. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  48. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural. Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  49. Che, W.X.; Feng, Y.L.; Qin, L.B.; Liu, T. N-LTP: An open-source neural language technology platform for Chinese. arXiv 2020, arXiv:2009.11616. [Google Scholar]
  50. Guo, Y. Doing Natural Language Processing in A Natural Way: An NLP toolkit based on object-oriented knowledge base and multi-level grammar base. arXiv 2021, arXiv:2105.05227. [Google Scholar]
  51. Zhang, Y.; Yang, J. Chinese NER using lattice LSTM. arXiv 2018, arXiv:1805.02023. [Google Scholar]
  52. Ma, R.T.; Peng, M.L.; Zhang, Q.; Huang, X.J. Simplify the usage of lexicon in Chinese NER. arXiv 2019, arXiv:1908.05969. [Google Scholar]
  53. Xu, C.W.; Wang, F.Y.; Han, J.L.; Li, C.L. Exploiting multiple embeddings for chinese named entity recognition. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, Beijing, China, 3–7 November 2019; pp. 2269–2272. [Google Scholar]
Figure 1. An example of accident information entity recognition.
Figure 2. Named entity recognition framework.
Figure 3. Named entity recognition process based on rule templates.
Figure 4. NER model based on BERT-Bi-LSTM-CRF.
Figure 5. Input representation for the Bert model.
Figure 6. BERT network structure [47].
Figure 7. LSTM model diagram [48].
Figure 8. Bi-LSTM model diagram.
Figure 9. Process of collecting chemical accident data with the web crawler.
Figure 10. Process of labeling data on Brat platform.
Figure 11. The experimental results of the three division ratios. (a) Precision. (b) Recall. (c) F1.
Table 1. Some rule templates for dates and times.
Format | Regularization Rule
YYYY-MM-DD, YYYY-M-D, YYYY year MM month DD day, YYYY/M/D | ^(?:(?!0000)[0-9]4 ([-/.]year?) (?:(?:0?[1-9]|1[0-2/1(?:0?[1-9]|1[0-9]|2[0-8])|(?:0?[13-9]|1[0-2])/1(?:29|30)|(?:0?[13578]|1[02])/1(?:31))|(?:[09]2(?:0[48]|[2468][048]|[13579][26])|(?:0[48]|[2468][048]|[13579][26])00) ([-/.month]?) 0?2/(?:29))(day)?$
HH:MM:SS | ^(([0-1]?[0-9])|([2][0-3])):([0-5]?[0-9])(:([0-5]?[0-9]))?$
MM:SS | ^(([1-9]1)|([0-1][0-9])|([1-2][0-3])):([0-5][0-9])$
Table 2. Hyperparameters of the two BERT models.
Version | Network Layers | Hidden Units | Heads | Total Parameters
BERT-Base | 12 | 768 | 12 | 110 M
BERT-Large | 24 | 1024 | 16 | 340 M
Table 3. Comparison of results (P = precision, R = recall; "-" indicates the category is not reported).
Model | Place P | Place R | Place F1 | Organization P | Organization R | Organization F1 | Overall P | Overall R | Overall F1
LTP | 42.15% | 41.83% | 41.99% | 44.00% | 9.50% | 15.63% | - | - | -
HanLP | 71.48% | 57.52% | 63.75% | 73.43% | 44.08% | 55.09% | - | - | -
LatticeLSTM | 73.33% | 70.59% | 71.93% | 74.00% | 62.36% | 67.68% | 70.97% | 69.97% | 70.47%
LexiconAugmentedNER | 74.05% | 73.26% | 73.66% | 68.07% | 63.48% | 65.70% | 70.83% | 69.26% | 70.04%
ME-CNER | 72.89% | 64.71% | 68.56% | 59.78% | 60.45% | 60.11% | 63.59% | 65.15% | 64.35%
RT-BBC (ours) | 81.37% | 66.53% | 73.20% | 77.33% | 74.30% | 75.78% | 77.57% | 71.30% | 73.83%
Table 4. Recognition results for 9 types of entities.
Entity Type | Precision | Recall | F1 | Support
Date | 97.20% | 96.47% | 96.83% | 1140
Time | 87.37% | 69.36% | 77.33% | 710
Place | 81.37% | 66.53% | 73.20% | 1870
Chemical | 81.01% | 58.01% | 67.61% | 810
Person | 76.38% | 81.86% | 79.02% | 2370
Organization | 77.33% | 74.30% | 75.78% | 1790
Reason | 38.89% | 29.79% | 33.73% | 470
Loss | 58.49% | 62.00% | 60.19% | 500
Scene | 44.44% | 16.00% | 23.53% | 250
Micro avg | 77.57% | 71.30% | 73.83% | 9910
Table 5. Overall entity recognition results of the four models.
Model | Precision | Recall | F1
RT-BBC (LSTM) | 53.67% | 52.74% | 52.43%
RT-BBC (Bi-LSTM) | 59.19% | 59.50% | 58.46%
RT-BBC (Bi-LSTM-CRF) | 67.33% | 60.71% | 63.62%
RT-BBC (ours) | 77.57% | 71.30% | 73.83%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
