Article

A Method of Sustainable Development for Three Chinese Short-Text Datasets Based on BERT-CAM

1 Zhengzhou Institute of Engineering and Technology, Zhengzhou 450044, China
2 Faculty of Engineering, Technology and Built Environment, UCSI University, Cheras, Kuala Lumpur 56000, Malaysia
* Author to whom correspondence should be addressed.
Electronics 2023, 12(7), 1531; https://doi.org/10.3390/electronics12071531
Submission received: 22 February 2023 / Revised: 21 March 2023 / Accepted: 21 March 2023 / Published: 24 March 2023
(This article belongs to the Section Computer Science & Engineering)

Abstract

Considering the low accuracy of current short text classification (TC) methods and the difficulty they have with effective emotion prediction, a sustainable short TC (S-TC) method using deep learning (DL) in big data environments is proposed. First, the text is vectorized by introducing a BERT pre-training model. When processing language tasks, TC accuracy is improved by masking a word in the text and predicting it from the information of the previous and following words. Then, a convolutional attention mechanism (CAM) model is proposed, which uses a convolutional neural network (CNN) to capture feature interactions in the time dimension and multiple convolution kernels to obtain more comprehensive feature information; CAM further improves TC accuracy. Finally, by optimizing and merging the bidirectional encoder representations from transformers (BERT) pre-training model and the CAM model, a corresponding BERT-CAM classification model for S-TC is proposed. Through simulation experiments, the proposed S-TC method and three other methods are compared and analyzed on three datasets. The results show that its accuracy, precision, recall, F1 value, Ma_F and Mi_F are the highest, reaching 94.28%, 86.36%, 84.95%, 85.96%, 86.34% and 86.56%, respectively. The algorithm's performance is better than that of the three comparison algorithms.

1. Introduction

The rise of artificial intelligence (AI) has greatly promoted the sustainable development of robots. In the field of retail customer service, intelligent customer service robots constructed using AI technologies can significantly reduce enterprises’ labor costs [1,2,3]. Natural Language Processing (NLP) is the fastest growing and most widely used field in AI. NLP uses linguistics, computer science, mathematics and other sciences to understand, transform and produce natural language, enabling information exchange between humans and computers [4,5,6]. NLP is important for the new generation of science and technology.
People are accustomed to using fragmented language to express their love for a book, movie or piece of news, and these opinions are generally displayed as text. To obtain valuable information from complex textual information, these texts need to be classified so that their value can be maximized [7,8,9]. TC is a basic process in NLP: it is a technology that automatically assigns a text to a specific category. Hundreds of millions of pieces of data need to be analyzed and sorted; if only manual operations were used for this, it would generate huge labor and time costs [10,11]. TC technology using AI algorithms is efficient and economical, which is of great significance for sustainable development applications such as sentiment analysis, public opinion analysis, domain recognition and intention recognition [12,13,14].
In the process of sustainable TC, the basic operation is to segment the text; that is, to convert a text into words. This facilitates the later transformation of the words in the word sequence into word vectors (WVs) that can be recognized by the computer. Unlike English, Chinese cannot separate words directly using spaces between characters. Therefore, AI-related technologies are required to learn a general word segmentation model from massive amounts of data in order to achieve automatic word segmentation [15,16,17].
The progress of DL promotes the progress of AI, in which the original feature information is processed in depth using supervised methods [18,19,20]. DL, like traditional machine learning, is a process of solving specific real-world problems by establishing mathematical models [21,22]. At present, DL has achieved good results in AI fields such as image, language and speech processing, which indicates that DL also has good prospects in TC tasks [23,24]. Applying the new generation of DL methods to the key technologies of TC can further improve the accuracy of key steps such as word segmentation, WV representation and classification [25].
Because DL does not require artificial feature extraction and can make full use of unsupervised data, Chinese word segmentation methods based on DL have great advantages, and DL-related technologies have started to be applied to TC. Following the great success of CNNs in the field of imaging, in 2014, Kim [26] created an S-TC model with a CNN composed of three single-layer CNNs with different kernel sizes, which achieved good results. In 2016, Joulin et al. [27] created a fast TC model, FastText, which uses an architecture similar to the continuous bag of words (CBOW) model in Word2Vec to integrate the text representation and TC processes; FastText can also produce WVs while classifying. The FastText model trains very quickly, which makes it very advantageous for time-limited TC tasks. Using an attention mechanism (AM), Yang et al. [28] put forward a hierarchical attention model to deal with the classification of long texts, which represents a text feature through its unique hierarchical structure; the hierarchical attention model performs better in TC. Even with the continuous development of DL in recent years, there is still wide scope for further applications of and improvements to DL in sustainable S-TC tasks. The second part of this study describes the related research; the third part describes the short-text model; the fourth part describes the experiments and analysis; and the fifth part presents the conclusions of the article.

2. Related Research

Regarding sustainable application approaches and DL methods for S-TC, some researchers have made unremitting efforts and obtained corresponding results. The authors of [29] conducted a comparative analysis of four DL models. On this basis, they carried out a comparative study of the DL models and discussed two text preprocessing methods, clarifying the relationship between single-layer and multi-layer architectures; however, no clear method to improve S-TC performance was given. In [30], focusing on AMs for effective classification and combining them with features from Wikipedia, the authors proposed a method to classify short Arabic texts by encoding short texts and their related category sets using a DL model with multiple AMs. However, this cannot be applied to continuous and multidimensional emotional information analyses. The authors of [31] established a new Deep Pyramid Temporal Convolution Network (DPTCN) model for S-TC by optimizing the fusion of a temporal convolution network and a deep pyramid CNN for TC. However, it only analyzes the emotions of tourism consumers, which has certain limitations. Reference [32] studied the effectiveness of DL-based emotion analysis for financial texts by building a corpus and clarifying the prominent advantages of a deep network model based on phrase structure and attention in S-TC. However, that work does not itself propose a new, effective S-TC method. Reference [33] addressed the functional-type classification of short texts of points of interest and alleviated the limitation that point-of-interest names contain few textual features by introducing additional text information into the original text. On this basis, a point of interest (POI) TC method with feature expansion and deep learning was proposed and the corresponding S-TC model was established; however, its accuracy is low. The authors of [34] proposed a long short-term memory (LSTM) model with emotional multi-channels in combination with AM and CNN, focusing on the problem that machine learning features in vector space models lack short-text semantic information and cannot accurately identify grammatical and potential emotional features. However, this model cannot fully capture the information, and its TC accuracy is low. Focusing on the problem that classifiers using traditional word embedding (WoEm) cannot learn enough useful features, reference [35] proposed a dual-channel DL model with word features that sends the original data, in the form of embedded characters and words, through two channels to two different modules composed of CNN and LSTM. However, the embedded characters in this method struggle to achieve high accuracy in information annotation. Reference [36] proposed an attention-based long short-term memory model to perform sentiment analysis on Arabic text, and a local interpretability model was used to perform deep sentiment analysis on the language. However, due to the limited data used and the small number of sentiment categories, further optimization of the model is needed. Reference [37] proposed a RoBERTa sentence vector and error correction scheme for short texts. The semantic information of short texts is fully extracted by the proposed model, and the anisotropy of the RoBERTa output sentence vector is corrected by standard Gaussianization so that the sentence vector can represent the semantics more accurately. However, since the model relies on entity linking over the identified short text, its effectiveness in practical applications is limited by the accuracy of the identified entities. Reference [38] proposed a deep ensemble fake news detection model using sequential deep learning techniques to improve detection accuracy. It is mainly used to distinguish fake news from real news and uses natural language technology to process the news. However, its in-depth analysis of the origin and context of news is insufficient, and further research is needed.
Focusing on the low accuracy of current S-TC methods and the difficulty of effective emotion prediction, a sustainable S-TC method that uses DL in big data environments is proposed. The basic ideas are as follows: ① the WVs are initialized by BERT to prevent polysemy; ② a CNN model with AM is used to enhance the judgment of text emotions; ③ the BERT and CAM models are fused to improve the overall performance of S-TC. Compared with other S-TC methods, the innovations are:
(1) The BERT model can dynamically adjust to the semantic information of words, effectively preventing polysemy and improving the accuracy of S-TC.
(2) RNN is used to obtain global semantic information and CNN is used to capture the correlation of high-level semantic features.
(3) A BERT-CAM classification model is proposed by combining the BERT pre-training model and the CAM model, which further improves the effect of S-TC.

3. S-TC Model based on BERT-CAM

3.1. Overall Framework of the Method

To solve the problem that traditional WoEm is not sufficient for emotional semantic expression, an interactive attention network based on hybrid WoEm is proposed, which aims to improve TC performance by incorporating more relevant information features. Current text pre-training models include ALBERT, RoBERTa, ERNIE, XLNet, etc. ALBERT has disadvantages in processing big data; RoBERTa has high requirements for batch processing; ERNIE relies heavily on word segmentation tools; and XLNet has poor coherence for text information. Compared with these models, BERT is used in this paper because it can better dynamically obtain the vector representation of short texts in different contexts and can be extended more conveniently, laying the foundation for subsequent fusion with CAM. The model uses BERT for pre-training and uses AM to mine deeper emotional semantic features and obtain the internal relevance of the text context. The characteristic of this method is that short-text data are first preprocessed with bidirectional encoder representations from transformers (BERT) to vectorize the text; then, a convolutional attention mechanism (CAM) is used to capture feature interactions in the time dimension, using multiple convolution kernels to obtain more comprehensive feature information; lastly, by optimizing and merging the BERT pre-trained model and the CAM model, the corresponding BERT-CAM classification model for S-TC is proposed.
The construction of the model has two parts. (1) BERT-CAM for S-TC is established with the BERT and Convolutional AM (CAM) model, which is used to evaluate WoEm and the sentence vectors of BERT. (2) The BERT pre-training model is integrated with traditional WoEm, which improves the traditional WoEm by using parts of speech, and finally, AM interacts with two WoEm, mining deeper feature information.
The basic structure of the basic BERT-CAM model of emotion classification is shown in Figure 1.
In Figure 1, before the context sequence $A = \{a_1, a_2, a_3, \dots, a_m\}$ is input into the model, the special tags [CLS] and [SEP] are added to its head and tail, respectively. Each input tag is then expressed as the sum of its token, segment and position embeddings. After processing by the multi-level Transformer encoding layers, BERT generates a hidden state vector for each input tag and outputs the hidden vector sequence $Y = \{E_{[CLS]}, E_1, E_2, \dots, E_m, E_{[SEP]}\}$.
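As an illustration of this input format (not taken from the paper, which does not name a tokenizer), the sketch below uses the Hugging Face transformers package and the bert-base-chinese checkpoint, both of which are assumptions, to show the added [CLS] and [SEP] tags and the segment indices:

```python
# Hypothetical illustration: how [CLS]/[SEP] tags are added to a short text.
# Assumes the Hugging Face "transformers" package and the "bert-base-chinese"
# checkpoint; the paper does not specify a tokenizer or checkpoint.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

text = "今天天气很好"  # example short text
encoding = tokenizer(text, return_token_type_ids=True)

tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"])
print(tokens)                        # ['[CLS]', '今', '天', '天', '气', '很', '好', '[SEP]']
print(encoding["token_type_ids"])    # segment embedding indices (all 0 for a single sentence)
```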
CNN is used on $Y$ to obtain the n-gram features of the text. The $k$-th n-gram feature $F_k$ in $Y$ is calculated as follows:

$$F_k = f(E_{k:k+p-1} \, C + b) \quad (1)$$

where:
$C \in \mathbb{R}^{n \times (d \times p)}$ — convolution kernel sequence.
$f$ — activation function.
$n$ — the number of convolution kernels.
$d$ — BERT hidden layer vector dimension.
$p$ — one-dimensional convolution window size.
After the convolution operation is completed in all windows, the n-gram feature sequence $F = \{F_1, F_2, F_3, \dots, F_{h-p+3}\}$ is generated, where $F \in \mathbb{R}^{n \times (h-p+3)}$ and $h$ is the size of the context input sequence. Then, a global maximum pooling operation (GP) is performed to generate the input sequence representation vector, as shown in Equation (2):

$$G_A = GP(F) \quad (2)$$

where $G_A \in \mathbb{R}^{n}$.
$E_{[CLS]}$ and $E_S$ are spliced, and a fully connected network is then used for linear transformation. Finally, the Softmax function is used to output the category labels, as shown in Equations (3) and (4):

$$x = \tanh(W_F [E_{[CLS]}; E_S] + B) \quad (3)$$

$$y_k = \frac{\exp(x_k)}{\sum_{l=1}^{q} \exp(x_l)} \quad (4)$$

where:
$W_F \in \mathbb{R}^{m \times (n + d)}$ — weight transformation matrix.
$B$ — bias vector.
$q$ — the number of classification label categories.
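To make Equations (1)–(4) concrete, the following minimal sketch builds this head on top of a sequence of BERT hidden vectors with tf.keras; it is an assumption-laden illustration rather than the authors' code, and the hidden dimension, kernel count, window size and class count are placeholders.

```python
# Minimal sketch of the classification head in Equations (1)-(4).
# Assumptions: the BERT hidden vectors Y (shape [batch, seq_len, d]) are already
# computed; d, n, p and the class count q are placeholder hyperparameters.
import tensorflow as tf
from tensorflow.keras import layers

d, n, p, q = 768, 128, 3, 10          # hidden dim, kernels, window size, classes

hidden = layers.Input(shape=(None, d), name="bert_hidden_states")   # Y
cls_vec = layers.Input(shape=(d,), name="cls_vector")               # E_[CLS]

# Equation (1): one-dimensional convolution over the token dimension.
ngram = layers.Conv1D(filters=n, kernel_size=p, activation="relu")(hidden)
# Equation (2): global max pooling yields the sequence representation G_A.
g_a = layers.GlobalMaxPooling1D()(ngram)
# Equation (3): splice [CLS] with the pooled vector, then a tanh projection.
x = layers.Dense(q, activation="tanh")(layers.Concatenate()([cls_vec, g_a]))
# Equation (4): softmax over the q category labels.
y = layers.Softmax()(x)

head = tf.keras.Model(inputs=[hidden, cls_vec], outputs=y)
head.summary()
```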

3.2. BERT

The BERT model is a pre-training model released by Google in 2018 that improves on the embeddings from language models (ELMo) model, the GPT model and other models. BERT adopts a two-stage approach: pre-training followed by fine-tuning to solve downstream tasks. In the pre-training stage, BERT uses a bidirectional language model similar to the ELMo model and uses the Transformer model, which is prominent in the GPT model, as the feature extractor. In the second stage, the network structure of the downstream task is modified using the pre-trained model to solve downstream NLP tasks such as TC. At the time of its release, 11 different NLP tasks were improved by using the BERT model. The BERT model is shown in Figure 2.
In Figure 2, BERT is composed of multiple stacked Transformer blocks. The two biggest features of the BERT model are the bidirectional language model and the Transformer feature extractor. The bidirectional language model in BERT is not like the usual bidirectional model, whose representative is the Bi-LSTM model. Instead, the BERT model uses an idea similar to the CBOW training method of Word2Vec: when completing language tasks, a word is removed from the text and replaced with another symbol, and the information of the previous and following words is then used to predict the removed word. This is the masked bidirectional language model of BERT. A total of 15% of the words are randomly selected from the training corpus for masking. Of these selected words, 80% are replaced with the mask symbol, 10% are randomly replaced with other words and the remaining 10% are left unchanged. This realizes the bidirectional language model.
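The masking rule described above can be illustrated with the following sketch; the token IDs, mask ID, vocabulary size and helper name are invented for illustration and are not taken from the paper.

```python
# Illustrative sketch of BERT-style masked-language-model corruption:
# select 15% of tokens; of those, 80% -> [MASK], 10% -> random token, 10% -> unchanged.
# All IDs and the vocabulary size below are made-up placeholders.
import random

MASK_ID = 103          # placeholder id for the [MASK] token
VOCAB_SIZE = 21128     # placeholder vocabulary size

def mask_tokens(token_ids, select_prob=0.15):
    inputs, labels = list(token_ids), [-100] * len(token_ids)  # -100 = not predicted
    for i, tok in enumerate(token_ids):
        if random.random() >= select_prob:
            continue
        labels[i] = tok                                  # model must predict the original token
        r = random.random()
        if r < 0.8:
            inputs[i] = MASK_ID                          # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = random.randrange(VOCAB_SIZE)     # 10%: random replacement
        # remaining 10%: keep the original token unchanged
    return inputs, labels

masked, targets = mask_tokens([101, 2769, 3221, 704, 1744, 102])
print(masked, targets)
```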
The Transformer was originally designed for machine translation, so it consists of an Encoder and a Decoder, each made up of multiple stacked layers. Compared with the Encoder, the Decoder has an additional Encoder–Decoder attention layer, which requires both the Encoder output and the output of the previous Decoder layer; the remaining Decoder sublayers are the same as those in the Encoder. The BERT model uses the Encoder part of the Transformer as its Transformer block for training.

3.3. CAM

Obtaining an excellent text representation is the first step of TC. At present, most popular text representation methods are based on the DL framework; that is, WoEm plus semantic combination. Because the recurrent neural network and its variants (GRU, LSTM, BLSTM) have structural advantages for time series data, they can model semantic context information, so these models are widely used in NLP tasks. The soft AM can overcome the bias of the RNN model: while reducing the dimension of the high-level semantic features output by the RNN, it can choose to capture the semantic features that make important contributions to the task, regardless of the distance between features in the sequence. Therefore, models based on AM perform well in long TC tasks.
However, in S-TC scenarios, the soft-AM-based RNN model does not achieve the excellent performance it attains on long texts. Because the texts are short, the problem of important feature information being forgotten due to long distances between elements is not pronounced. For high-level semantic features, the semantic information density at each time step is large and the bias of the model is not obvious, so it is difficult to learn the differences between high-level semantic features simply by setting reference vectors. Therefore, the focus of short-text representation is to design an AM with a stronger ability to capture the features of each moment.
Compared with RNN, CNN is more sensitive to local feature information. The convolution kernel of a one-dimensional CNN can simultaneously capture and integrate the feature information of N words; that is, CNN is a model for semantic combination at the N-gram scale. However, because it does not consider global feature information, the performance of CNN-based classification models is slightly inferior to that of RNN. If RNN is used to obtain global semantic information, so that every moment contains the sum of all semantic information preceding that word, and CNN is then used to capture the correlation of features at time t, more accurate attention weights can be obtained. On this basis, the end-to-end CAM model is provided. The model uses CNN to capture feature interactions in the time dimension and obtains more comprehensive features with multiple convolution kernels. The obtained features are then normalized into a weight for each time step t, which is multiplied by the output features of the BLSTM through matrix multiplication to obtain the final text feature representation. When capturing high-level semantic feature information using convolutional mutual attention, the model considers not only the feature dimension but also the correlation information between time steps, obtaining a more accurate attention weight vector. Through this method, this paper can overcome the shortcomings of soft AM in S-TC scenarios and obtain higher TC accuracy.
The CAM-based short-text representation and classification framework is an end-to-end DL framework. Its structure is shown in Figure 3.
The framework has four parts: the WoEm layer, semantic combination, convolutional attention and output classification. In the WoEm layer, the text sequence is mapped into a fixed-dimension shallow semantic feature matrix using pre-trained WVs; in the semantic combination layer, the feature vector matrix is combined to generate a high-level semantic matrix; the convolutional attention layer uses CNN to extract features from the high-level semantic features and generate adaptive attention weights; the output classification has two parts: one uses a fully connected layer to adjust the dimension of the text representation features, and the other uses the Softmax function to normalize it. The model finally selects the category corresponding to the highest value as its predicted category.
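To make the CAM idea concrete, the following minimal sketch (an illustration under assumed hyperparameters, not the authors' released implementation) stacks a word-embedding layer, a BLSTM for semantic combination, a one-dimensional convolution that produces per-time-step attention weights and a Softmax classifier using tf.keras; all layer sizes, kernel sizes and the Lambda-based weighted sum are placeholders chosen for readability.

```python
# Sketch of a CAM-style short-text classifier: embedding -> BLSTM -> convolutional
# attention over time -> weighted sum -> Softmax. All hyperparameters are placeholders.
import tensorflow as tf
from tensorflow.keras import layers

def build_cam_classifier(vocab_size=50000, seq_len=100, emb_dim=200,
                         lstm_units=128, n_kernels=64, n_classes=10):
    inputs = layers.Input(shape=(seq_len,), dtype="int32")
    # WoEm layer (pre-trained vectors could be loaded into this embedding).
    x = layers.Embedding(vocab_size, emb_dim)(inputs)
    # Semantic combination: BLSTM provides global context at every time step.
    h = layers.Bidirectional(layers.LSTM(lstm_units, return_sequences=True))(x)
    # Convolutional attention: Conv1D captures feature interaction across time,
    # then the result is reduced to one score per time step and normalized.
    a = layers.Conv1D(n_kernels, kernel_size=3, padding="same", activation="tanh")(h)
    a = layers.Dense(1)(a)                       # (batch, seq_len, 1)
    a = layers.Softmax(axis=1)(a)                # attention weight per time step
    # Weighted sum of BLSTM outputs gives the final text representation.
    rep = layers.Lambda(lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([h, a])
    outputs = layers.Dense(n_classes, activation="softmax")(rep)
    return tf.keras.Model(inputs, outputs)

model = build_cam_classifier()
model.summary()
```

Because the attention weights are normalized over the time axis, the weighted sum of the BLSTM outputs plays the role of the final text representation described above.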

3.4. Max Pooling

As one of the representative algorithms of DL, CNN is good at image processing, and has also received widespread attention in NLP. This network model can classify the input information in a translation-invariant way, so it is also called a “translation invariant artificial neural network”.
Local connection, weight-sharing and down-sampling methods are added to CNN to solve the disadvantages of a fully connected network. Local connection can avoid large calculations due to the excessive number of input parameters; weight-sharing is also used to solve the problem of having too many parameters. A group of connections in the network structure can share the same weight.
During the application of TC, the input layer of the CNN is an s × d matrix, where s represents the number of words in the input text and d represents the dimension of the WV. Next, the convolution layer performs the convolution operation on the input vector matrix. The convolution process extracts the feature regions that are similar to the distribution of the convolution kernel and extracts features from the input-layer vectors to obtain the local semantic combination information in the sentence. The purpose of using multiple convolution kernels is to extract semantic information from multiple angles and ensure the diversity of the semantic combinations.
The convolution kernel is composed of a weight matrix, which is generally smaller than the original feature. Values at corresponding positions and the convolution kernel are multiplied and added to obtain the corresponding values in the feature map. Finally, the values are input into the activation function.
In CNN, the pooling layer generally follows the convolution layer. Pooling aims to filter the feature output, removing unimportant features while retaining important ones, and prevents over-fitting. In NLP, maximum pooling is applicable when features are relatively sparse, as the most salient features are retained. Average pooling averages the values in a region; its advantage is that it also retains the features of neurons in the region that are not the most activated.
After multiple rounds of convolution and pooling, there is one fully connected layer. Its role is to integrate the output of the convolution or pooling. The output is transferred to the output layer of the CNN, where the Softmax classification method is used to output the probability of each category.
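As a hedged illustration of the ideas in this subsection (an s × d input matrix, several convolution kernel sizes, max pooling, a fully connected layer and a Softmax output), the sketch below uses tf.keras; the dimensions, kernel sizes and filter counts are assumptions rather than values taken from the paper.

```python
# Sketch of a text CNN: s x d input matrix, several kernel sizes,
# global max pooling, a fully connected layer and a Softmax output.
# s, d, kernel sizes and filter counts are illustrative placeholders.
import tensorflow as tf
from tensorflow.keras import layers

s, d, n_classes = 100, 200, 10

words = layers.Input(shape=(s, d), name="word_vector_matrix")
pooled = []
for k in (3, 4, 5):                                # multiple convolution kernel sizes
    c = layers.Conv1D(filters=64, kernel_size=k, activation="relu")(words)
    pooled.append(layers.GlobalMaxPooling1D()(c))  # keep the strongest feature per map
features = layers.Concatenate()(pooled)
features = layers.Dense(128, activation="relu")(features)      # fully connected layer
probs = layers.Dense(n_classes, activation="softmax")(features) # class probabilities

text_cnn = tf.keras.Model(words, probs)
text_cnn.summary()
```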

3.5. Softmax

The model selects the Softmax function as the text semantic feature classifier. The Softmax function is a generalization of the logistic function to multiple classes. It normalizes the output features of the fully connected layer, calculates the probability value of each type of feature output according to Equation (5) and maps the probability values to the interval (0, 1).
$$S_k = \frac{\exp(x_k)}{\sum_{i} \exp(x_i)} \quad (5)$$

where:
$S_k$ — softmax value of the $k$-th element.
$x_k$ — the $k$-th element of the output feature vector.
$i$ — index running over the categories into which the text is divided.
The Softmax function first applies an exponential function to the prediction results of the classification model, ensuring that the classification probabilities are non-negative. Each exponentiated result is then divided by the sum of all exponentiated results; that is, the share of each result in the total is calculated, ensuring that the prediction probabilities sum to 1. The Softmax classifier further widens the score gap between the classification features and makes the classification effect more obvious.
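A small numeric sketch of Equation (5) follows; the scores are invented for illustration, and the max-subtraction is a standard numerical-stability trick rather than part of the paper's formulation.

```python
# Numeric illustration of Equation (5): exponentiate, then normalize to sum to 1.
import numpy as np

scores = np.array([2.0, 1.0, 0.1])           # made-up fully connected outputs
exp_scores = np.exp(scores - scores.max())   # subtract max for numerical stability
probs = exp_scores / exp_scores.sum()
print(probs, probs.sum())                    # e.g. [0.659 0.242 0.099] 1.0
```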

4. Experiments and Analysis

4.1. Experimental Environment

In the experiment, the original dataset was first processed to obtain the word segmentation results, and the new data set was stored in a database for the convenience of word statistics. Then, the WoEm model was trained on the segmented results, and the corresponding WVs were obtained. Finally, the vectors were input into the classification model, and the model was saved as a .c format file. Because there were many hyperparameters during model training, the hyperparameter selection was analyzed during the experiment.
The programming language used in the experiment was Python 3.8, and the DL framework was TensorFlow 1.14.0, which supports deploying models on various servers and mobile devices without requiring separate models and Python interpreters. The Python libraries used include NumPy, Re and Pandas. The specific configuration is shown in Table 1.

4.2. Dataset

The data sources of Chinese short texts are diverse and include news headlines, hot online comments, microblogs and short messages. This experiment required annotated data sets. After investigation, we found that the TextGrocery project had published a Chinese news headline data set of a certain scale, and several classical Chinese datasets are available online. To obtain the short-text samples used in this paper, two operations were performed: if a data set contained title–article pairs, the title was selected as the sample; if a data set contained only articles, the first few characters were intercepted as the sample. The experiment was conducted on three Chinese short-text datasets, as follows:
(1) Chinese News Title (CNT): this data set was compiled by researchers and others. It consists of 32 categories of Chinese news headlines, originally containing 48,000 training samples and 16,000 test samples; after deleting the titles that contain special characters, garbled code, etc., 47,952 training samples and 15,986 test samples remained. The first 100 characters of each sample were intercepted to form a new short-text data set.
(2) Summarizations of Chinese Papers (SCP): this data set is composed of abstracts of Chinese papers from seven disciplines, including 25,900 samples. Here, 20,000 samples were randomly selected as training sets and 5900 samples as test sets. Similarly, this paper intercepted the first 100 characters of each sample to form a new set of short-text subject classification data sets.
(3) Sogou News (SNs): this data set consists of Sogou Chinese news, with a total of 10 categories. Each sample contains Sogou news content. The data set contains 50,000 training sets and 10,000 test sets. Since the data set belongs to social media news, the first 50 characters of each sample of the data set were intercepted to form a new set of short-text topic classification data sets.
A description of the data set is presented in Table 2.

4.3. Evaluation Indicators

As a subtask of TC, S-TC follows the general evaluation indices of classification tasks when evaluating model performance. The following indicators are mainly used to evaluate the prediction effect.
(1) Accuracy A. In classification problems, the simplest and most intuitive evaluation index is accuracy, which represents the percentage of correctly classified samples in the total number of samples, as shown in Equation (6).
$$A = \frac{N_C}{N_T} \quad (6)$$

where:
$N_C$ — the number of correctly classified samples.
$N_T$ — the total number of samples.
(2) Precision P and Recall R
P and R, as evaluation indicators of common classification tasks, can effectively reflect the performance of unbalanced samples. Their calculation methods are shown in Equations (7) and (8).
$$P = \frac{TP}{TP + FP} \quad (7)$$

$$R = \frac{TP}{TP + FN} \quad (8)$$

where:
$TP$ — the number of positive samples predicted as positive samples.
$FP$ — the number of negative samples predicted as positive samples.
$TN$ — the number of negative samples predicted as negative samples.
$FN$ — the number of positive samples predicted as negative samples.
P and R reflect different aspects of the performance. Precision focuses on the accuracy of the model, and recall focuses on the completeness of the model. Increasing one of the indicators may lead to a decline in the other. Therefore, the focus of the model should be determined according to the actual task requirements to select the precision or recall used to evaluate the model.
(3) $F_1$ Value.
To evaluate the performance comprehensively in terms of $P$ and $R$, the $F_1$ value is most commonly used. It is the harmonic mean of $P$ and $R$, with both considered equally important. The calculation method is as follows:

$$F_1 = \frac{2PR}{P + R} \quad (9)$$
(4) Macro-Average ($Ma\_F$) and Micro-Average ($Mi\_F$).
In multi-classification tasks, $Ma\_F$ and $Mi\_F$ are often used to evaluate model performance. $Ma\_F$ denotes the macro-averaged $F_1$ value: the $F_1$ value is calculated for each category, and the arithmetic mean over all categories is then taken, as shown in Equation (10):

$$Ma\_F = \frac{\sum_{k=1}^{n} F_k}{n} \quad (10)$$

$Mi\_F$ denotes the micro-averaged $F_1$ value:

$$Mi\_F = \frac{2 \, Mi\_P \times Mi\_R}{Mi\_P + Mi\_R} \quad (11)$$

The calculation methods of $Mi\_P$ and $Mi\_R$ are shown in Equations (12) and (13), respectively:

$$Mi\_P = \frac{\sum_{k=1}^{n} TP_k}{\sum_{k=1}^{n} TP_k + \sum_{k=1}^{n} FP_k} \quad (12)$$

$$Mi\_R = \frac{\sum_{k=1}^{n} TP_k}{\sum_{k=1}^{n} TP_k + \sum_{k=1}^{n} FN_k} \quad (13)$$
$Ma\_F$ averages the indicators of all categories and treats each category with equal weight; $Mi\_F$ combines the counts of all categories before averaging, so it is more persuasive to use $Mi\_F$ as the evaluation index when the samples of a multi-classification task are unbalanced.
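As a hedged sketch of how these indicators can be computed in practice (scikit-learn is assumed here; the paper does not state which implementation it used, and the labels below are invented):

```python
# Illustrative computation of accuracy, precision, recall, F1, Ma_F and Mi_F
# with scikit-learn; the ground-truth and predicted labels are invented examples.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred = [0, 1, 2, 1, 1, 0, 2, 2]

print("A   :", accuracy_score(y_true, y_pred))
print("P   :", precision_score(y_true, y_pred, average="macro"))
print("R   :", recall_score(y_true, y_pred, average="macro"))
print("F1  :", f1_score(y_true, y_pred, average="macro"))
print("Ma_F:", f1_score(y_true, y_pred, average="macro"))   # macro-averaged F1
print("Mi_F:", f1_score(y_true, y_pred, average="micro"))   # micro-averaged F1
```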

4.4. Influence of WV Dimension on F1 Value

The neural-network-based experiment contained multiple hyperparameters, and the value of each hyperparameter had a great impact on the results. The parameter values used in the experiment were obtained empirically. Before training the classification model, the dimension of the WV and the size of the convolution kernel were tested, and the optimal values were selected. First, an experimental comparison of WV dimensions was made. The WV dimension was set to 50, 100, 150, 200, 250 and 300 in the WV model, with the other parameters of the classification model kept consistent. The results are shown in Figure 4.
Figure 4 shows that the F1 of the classification model is highest when the WV dimension is 200. When the WV dimension is low, words cannot be well represented; when the dimension reaches 250 or 300, the dimension has little further impact on F1 but takes up much more memory. It can therefore be concluded that the optimal WV dimension is 200.

4.5. Comparison of TC accuracy of Various WVs on Different Data Sets

To evaluate the performance of different kinds of WVs in S-TC tasks, experiments were carried out on the proposed model. The BERT module used to generate word encodings in the BERT-CAM model was replaced, respectively, with a randomly initialized WoEm layer and the Word2Vec and GloVe WoEm provided in PyTorch, while the CNN and output layers were retained. Table 3 shows the TC accuracies obtained by models using the various WVs.
The data in Table 3 show that the model using random WoEm achieves the worst TC performance on all three data sets. Although the random WoEm is updated continuously during model training, it carries no semantic information at the initial stage, so its representation ability is insufficient. When the model uses the WoEm generated by the Word2Vec and GloVe pre-training models, the TC performance improves on all three datasets, because these two traditional WoEm methods carry implicit semantic information. Finally, the use of the BERT pre-training model greatly improves the TC performance of the task model: compared with GloVe, the TC accuracy increases by 6.93%, 7.00% and 8.86% on the three data sets.

4.6. Change of Model Precision with Iteration Epochs

This experiment examined the growth of model precision with training on the CNT, SCP and SNs data sets. To explain more clearly how the precision of the model changes as the number of iteration epochs increases, the results are analyzed in Figure 5.
The curve trends and final results in Figure 5 show that the proposed BERT-CAM model for S-TC tended to be stable after the 300th epoch, and its precision was about 95% on the three data sets CNT, SCP and SNs. This is because BERT WV embedding eliminates the problem of entity ambiguity and enhances the understanding of entities through knowledge outside the text, ultimately improving the model's training precision. Longer texts contain more entities, which can introduce more entity embeddings, and a longer sequence provides more context. Therefore, in S-TC, text length and accuracy are closely related.

4.7. Loss Function of Models under Different Data Sets

The experimental analysis of the loss function is as follows. The WV dimension was 200 and the maximum text length was 100. To alleviate over-fitting, an early stopping mechanism was used, with the loss function of the validation set as the monitored quantity: when the validation loss had not decreased for more than 10 epochs, model training was terminated in advance to improve learning efficiency. The specific training process is shown in Figure 6. As shown in Figure 6, after 100 epochs, the loss of the BERT-CAM model was reduced to about 0.12.
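A minimal sketch of such an early stopping setup in tf.keras is shown below; the patience of 10 follows the description above, while the model and data objects are placeholders rather than the paper's actual training script.

```python
# Sketch of the early stopping mechanism: monitor validation loss and stop
# when it has not improved for 10 epochs. `model`, `x_train`, `y_train`,
# `x_val` and `y_val` are placeholders for the actual model and data.
import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",         # watch the validation-set loss
    patience=10,                # stop after 10 epochs without improvement
    restore_best_weights=True)  # roll back to the best weights seen

# model.fit(x_train, y_train,
#           validation_data=(x_val, y_val),
#           epochs=300,
#           callbacks=[early_stop])
```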

4.8. Comparison of Different Methods under Different Numbers of Topics

To further analyze the model's advantages and study whether the potential representations generated by BERT-CAM can provide clear semantic information for TC tasks on three different data sets, the effectiveness of the S-TC method using DL in the big data environment proposed in this paper was compared and analyzed against the sustainable methods of references [30,31,34] on the different data sets.
For the S-TC model, a classifier can be trained on the generated potential document representations, and the TC accuracy can be evaluated from the test results. Training was conducted on the corresponding training data set, and the performance of the model was evaluated on the validation data set. The experimental results for the TC accuracy of the different methods with different numbers of topics are shown in Figure 7, Figure 8 and Figure 9.
Figure 7, Figure 8 and Figure 9 show that the proposed S-TC model with BERT-CAM in the big data environment outperformed the other three comparison methods in all parameter settings when using three different data sets, namely, CNT, SCP and SNs, and achieved the best results when the number of topics was 50.
The results of the time efficiency of different methods with different numbers of topics are shown in Figure 10, Figure 11 and Figure 12.
In Figure 10, Figure 11 and Figure 12, the S-TC model with BERT-CAM required a longer time when the number of topics was less than 30. However, once the number of topics exceeded 30, its training time was significantly shorter than that of the other three comparison methods on the three data sets CNT, SCP and SNs. The optimal number of topics is therefore 50; at this point, the training times were 893 s, 897 s and 815 s on the three datasets, and the training efficiency was better than that of the other three comparison methods.

4.9. Comparison of P–R Values under Different Data Sets

To evaluate the performance of the proposed BERT-CAM in complex multi-category event searches, the proposed model and the other three comparison methods were used to conduct search experiments on the three data sets. The P–R curves from the emergency search experiments corresponding to the different data sets are shown in Figure 13, Figure 14 and Figure 15.
Combining the results in Figure 13, Figure 14 and Figure 15, the BERT-CAM model achieved the best search effect. This is because, compared with the other three comparison methods, BERT-CAM uses a DL structure that retains the potential semantics in the data, integrates attribute features from different perspectives during learning and effectively supplements and mines the rich semantics of emergency messages. In addition, the model integrates more meaningful semantic information when learning cross-modal data features. By representing different modal data from separate perspectives in the description and feature representation of cross-media emergencies, the feature space of emergency messages can be mapped by introducing multi-attribute features and the time distribution features of social networks. The model comprehensively uses all necessary and valuable semantic information in the emergency search to improve the search accuracy.

4.10. Comprehensive Comparative Analysis of Evaluation Indicators on Different Data Sets

To further verify the effectiveness of the proposed model, a comparative experiment was designed using a unified data set, and the above six evaluation indicators were compared and analyzed. The experimental results are shown in Table 4.
Table 4 shows that, on the same data set, the sustainable S-TC method proposed in this paper is superior to the other three comparison methods in terms of the six evaluation indicators of accuracy, precision, recall, F1 value, Ma_F and Mi_F, reaching 94.28%, 86.36%, 84.95%, 85.96%, 86.34% and 86.56%, respectively. This is because the introduction of the BERT pre-training model realizes the dynamic adjustment of semantic information in the text, and CAM makes full use of word information to judge the emotional polarity of the text, combining the WV and the text vector as the input vector of the model, which greatly improves the accuracy of text emotion classification.

4.11. Comparison of TC accuracy of Ablation Models

To further verify the contribution of each layer of the BERT-CAM model to the TC performance improvement, functional components of the model were deleted, without changing the other experimental conditions, to carry out an ablation comparison experiment. The structure ablation experiment focused on three modules (part-of-speech embedding, position embedding and attention calculation) to investigate their impact on the TC accuracy of the corresponding ablation models and evaluate their importance. The experimental results are shown in Table 5.
In Table 5, “w/o pos” means that the part-of-speech embedding module in the traditional WoEm fusion part of BERT-CAM is deleted from the model; “w/o position” means that the position embedding module is deleted; and “w/o attention” means that the entire attention computation layer in the BERT-CAM model is deleted, with only the feature vectors generated by BERT and traditional WoEm retained to generate the final text representation. The experimental results show that, on the three data sets, the BERT-CAM w/o attention model had the worst TC performance, with a TC accuracy of only about 60%. The BERT-CAM w/o pos and BERT-CAM w/o position models had relatively good TC performance, with TC accuracies of about 80%, while the TC accuracy of the complete BERT-CAM model was more than 95%. The analysis shows that position embedding and part-of-speech embedding improve the performance of the model, but the effect is not significant; after the attention computation layer was deleted, the performance of the model decreased much more obviously, which shows that the attention module effectively improves the overall performance of the model.

5. Conclusions

Focusing on the low accuracy of current S-TC methods and the difficulty of effective emotion prediction, a sustainable S-TC method using DL in a big data environment was proposed. The effectiveness and progressiveness of the proposed method were verified through experiments. The test results show that:
(1) When processing language tasks, using the BERT pre-training model to vectorize text can solve the problem of polysemy in text information. The accuracy of TC can be effectively improved by removing a word from the text, replacing it with another symbol and then using the information of the surrounding words to predict it.
(2) Using RNN to obtain global semantic information ensures that the information at each moment contains the sum of all preceding semantic information. On this basis, using CNN to capture the correlations between high-level semantic features at that moment yields more accurate attention weights.
(3) The BERT-CAM classification model obtained by combining the BERT pre-training model and CAM model can further improve the effect of sustainable S-TC.
In the future, we will further study how to effectively combine the real-time nature of social network data and the rapidity of message transmission to mine the change rule of emergencies and predict their development trends, as well as tracking and discovering emergencies in real time.

Author Contributions

Conceptualization, L.P. and W.H.L.; methodology, L.P.; software, Y.G.; validation, L.P. and Y.G.; formal analysis and validation, W.H.L.; resources and data curation, L.P.; writing—original draft preparation, L.P.; writing—review and editing, W.H.L.; visualization, Y.G.; supervision, project administration and funding acquisition, L.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the 2021 Key Scientific Research Project of colleges and universities in Henan Province—”Design and Development of College Dormitory Security Management System Based on Facial Expression Recognition”, Department of Education of Henan Province (No.21A520046).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

References

  1. Borna, K.; Ghanbari, R. Hierarchical LSTM network for TC. SN Appl. Sci. 2019, 1, 1–4. [Google Scholar] [CrossRef] [Green Version]
  2. Ji, L.; Wang, Y.; Shi, B.; Zhang, D.; Wang, Z.; Yan, J. Microsoft Concept Graph: Mining Semantic Concepts for Short Text Understanding. Data Intell. 2019, 1, 238–270. [Google Scholar] [CrossRef]
  3. Yu, Y.; Si, X.; Hu, C.; Zhang, J. A review of recurrent neural networks: LSTM cells and network architectures. Neural Comput. 2019, 31, 1235–1270. [Google Scholar] [CrossRef]
  4. Liu, Z.; Kan, H.; Zhang, T.; Li, Y. DUKMSVM: A Framework of Deep Uniform Kernel Mapping Support Vector Machine for S-TC. Appl. Sci. 2020, 10, 2348. [Google Scholar] [CrossRef] [Green Version]
  5. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; Springer: Cham, Switzerland, 2018; pp. 3–19. [Google Scholar]
  6. Bao, T.; Ren, N.; Luo, R.; Wang, B.; Shen, G.; Guo, T. A BERT-Based Hybrid S-TC Model Incorporating CNN and Attention-Based BiGRU. J. Organ. End. User Comput. 2021, 33, 120–128. [Google Scholar] [CrossRef]
  7. Sharma, A.K.; Chaurasia, S.; Srivastava, D.K. Sentimental Short Sentences Classification by Using CNN DL Model with Fine Tuned Word2Vec. Procedia Comput. Sci. 2020, 167, 1139–1147. [Google Scholar] [CrossRef]
  8. Wang, S.; Zhang, H.; Pan, Y. Autoencoder with improved SPNs and its application in sentiment analysis for short texts. J. Harbin Eng. Univ. 2020, 41, 411–419. [Google Scholar]
  9. Yang, K.Y.; Gao, Y.J.; Liang, L.; Bian, S.; Chen, L.; Zheng, B. CrowdTC: Crowd-powered Learning for TC. Acm Trans. Knowl. Discov. Data 2021, 16, 205–214. [Google Scholar]
  10. Ye, J.; Luo, D.; Chen, S. Short-text Sentiment Enhanced Achievement Prediction Method for Online Learners. Acta Autom. Sin. 2020, 46, 1927–1940. [Google Scholar]
  11. Mittal, V.; Gangodkar, D.; Pant, B. Deep Graph-Long Short-Term Memory: A DL Based Approach for TC. Wirel. Pers. Commun. 2021, 119, 2287–2301. [Google Scholar] [CrossRef]
  12. Li, J.; Zhang, D.Z.; Wulamu, A. Investigating Multi-Level Semantic Extraction with Squash Capsules for S-TC. Entropy 2022, 24, 164–173. [Google Scholar]
  13. Salur, M.U.; Aydin, I. A Novel Hybrid DL Model for Sentiment Classification. IEEE Access. 2020, 8, 58080–58093. [Google Scholar] [CrossRef]
  14. Moirangthem, D.S.; Lee, M. Hierarchical and lateral multiple timescales gated recurrent units with pre-trained encoder for long TC. Expert. Syst. Appl. 2021, 165, 87–96. [Google Scholar] [CrossRef]
  15. Sun, X.J.; Huo, X.Y. Word-Level and Pinyin-Level Based Chinese S-TC. IEEE Access. 2022, 10, 125552–125563. [Google Scholar] [CrossRef]
  16. Huang, X.; Qiu, D.; Xiang, C.; Chen, H. Hybrid Graph Neural Network Model Design and Modeling Reasoning for Text Feature Extraction and Recognition. Wirel. Commun. Mob. Comput. 2022, 2022, 63–71. [Google Scholar] [CrossRef]
  17. Liu, R.; Liu, Y.; Yan, Y.G.; Wang, J. Iterative Deep Neighborhood: A DL Model Which Involves Both Input Data Points and Their Neighbors. Comput. Intell. Neurosci. 2020, 2020, 342–351. [Google Scholar] [CrossRef] [Green Version]
  18. Zhang, L.; Sun, L.; Li, W.; Zhang, J.; Cai, W.; Cheng, C.; Ning, X. A Joint Bayesian Framework based on Partial Least Squares Discriminant Analysis for Finger Vein Recognition. IEEE Sens. J. 2021, 22, 785–794. [Google Scholar] [CrossRef]
  19. Wang, C.; Ning, X.; Sun, L.; Zhang, L.; Li, W.; Bai, X. Learning Discriminative Features by Covering Local Geometric Space for Point Cloud Analysis. IEEE Trans. Geosci. Remote. Sens. 2022, 60, 5703215. [Google Scholar] [CrossRef]
  20. Wang, C.; Wang, X.; Zang, J.; Zhang, L.; Bai, X.; Ning, X.; Zhou, J.; Xiao, B.; Hancock, E. Uncertainty Estimation for Stereo Matching Based on Evidential Deep Learning. Pattern Recognit. 2021, 124, 108498. [Google Scholar] [CrossRef]
  21. Prabhakar, S.K.; Won, D.O. Medical TC Using Hybrid DL Models with Multihead Attention. Comput. Intell. Neurosci. 2021, 2021, 95–105. [Google Scholar] [CrossRef]
  22. Zulqarnain, M.; Alsaedi, A.K.Z.; Ghazali, R.; Ghouse, M.G.; Sharif, W.; Husaini, N.A. A comparative analysis on question classification task based on DL approaches. PeerJ Comput. Sci. 2021, 7, 77–86. [Google Scholar] [CrossRef] [PubMed]
  23. Ning, X.; Tian, W.; Yu, Z.; Li, W.; Bai, X.; Wang, Y. HCFNN: High-order Coverage Function Neural Network for Image Classification. Pattern Recognit. 2022, 131, 108873. [Google Scholar] [CrossRef]
  24. Ning, X.; Tian, W.; He, F.; Bai, X.; Sun, L.; Li, W. Hyper-sausage coverage function neuron model and learning algorithm for image classification. Pattern Recognit. 2022. [CrossRef]
  25. Zhao, R.; Cai, Y.T. Research on online marketing effects based on multi-model fusion and AI algorithms. J. Ambient. Intell. Humaniz. Comput. 2021, 6, 162–170. [Google Scholar]
  26. Kim, Y. CNNs for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in NLP, Doha, Qatar, 25–29 October 2014; pp. 1746–1751. [Google Scholar]
  27. Joulin, A.; Grave, E.; Bojanowski, P.; Mikolov, T. Bag of tricks for efficient TC. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, Valencia, Spain, 3–7 April 2017; pp. 427–431. [Google Scholar]
  28. Yang, Z.; Yang, D.; Dyer, C.; He, X.; Smola, A.; Hovy, E. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA, USA, 12–17 June 2016; pp. 1480–1489. [Google Scholar]
  29. Akhter, M.P.; Jiangbin, Z.; Naqvi, I.R.; Zheng, J. Exploring DL approaches for Urdu TC in product manufacturing. Enterp. Inf. Syst. 2020, 16, 223–248. [Google Scholar] [CrossRef]
  30. Alagha, I. Leveraging Knowledge-Based Features with Multilevel AMs for Short Arabic TC. IEEE Access. 2022, 10, 51908–51921. [Google Scholar] [CrossRef]
  31. Yu, S.; Liu, D.; Zhang, Y.; Zhao, S.; Wang, W. DPTCN: A novel deep CNN model for S-TC. J. Intell. Fuzzy Syst. 2021, 41, 7093–7100. [Google Scholar] [CrossRef]
  32. Rao, D.N.; Huang, S.H.; Jiang, Z.H.; Deverajan, G.G.; Patan, R. A dual deep neural network with phrase structure and AM for sentiment analysis an ablation experiment on Chinese short financial texts. Neural Comput. Appl. 2021, 33, 11297–11308. [Google Scholar] [CrossRef]
  33. Zhou, C.; Yang, H.; Zhao, J.; Zhang, X. Paper: POI Classification Method Based on Feature Extension and DL. J. Adv. Comput. Intell. Intell. Inform. 2021, 24, 944–952. [Google Scholar] [CrossRef]
  34. Zhou, Z.G. Research on Sentiment Analysis Model of Short Text Based on DL. Sci. Program 2022, 2022, 65–74. [Google Scholar]
  35. Yang, Z.; Yan, H. Code-switching short-text sentiment classification method based on multi-channel DL network. Appl. Res. Comput. 2021, 38, 69–74. [Google Scholar]
  36. Abdelwahab, Y.; Kholief, M.; Sedky, A.A.H. Justifying Arabic Text Sentiment Analysis Using Explainable AI (XAI): LASIK Surgeries Case Study. Information 2022, 13, 536. [Google Scholar] [CrossRef]
  37. Gao, L.; Zhang, L.; Zhang, L.; Huang, J. RSVN: A RoBERTa Sentence Vector Normalization Scheme for Short Texts to Extract Semantic Information. Appl. Sci. 2022, 12, 11278. [Google Scholar] [CrossRef]
  38. Ali, A.M.; Ghaleb, F.A.; Al-Rimy, B.A.S.; Alsolami, F.J.; Khan, A.I. Deep Ensemble Fake News Detection Model Using Sequential Deep Learning Technique. Sensors 2022, 22, 6970. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Structure of BERT-CAM network model.
Figure 2. BERT.
Figure 3. Single-feature S-TC model based on CAM.
Figure 4. Influence of WV dimension on F1.
Figure 5. Influence of iteration epoch on precision.
Figure 6. Loss of training set and validation set.
Figure 7. TC accuracy of different methods under CNT dataset [30,31,34].
Figure 8. TC accuracy of different methods under SCP dataset.
Figure 9. TC accuracy of different methods under SNs dataset.
Figure 10. Training time of different methods under CNT dataset.
Figure 11. Training time of different methods under SCP dataset.
Figure 12. Training time of different methods under SNs dataset.
Figure 13. P–R curve of different methods under CNT dataset.
Figure 14. P–R curve of different methods under SCP dataset.
Figure 15. P–R curve of different methods under SNs dataset.
Table 1. Experimental environment settings.

Experimental Environment | Configuration
Operating system | Ubuntu 18.04
CPU | Intel(R) Core(TM) i7-1038NG7 CPU @ 2.00 GHz
Memory | 32 G
Programming language | Python 3.8
DL framework | TensorFlow 1.14.0
Table 2. Details of Chinese short-text public data sets.

Dataset | CNT | SCP | SNs
Training Set | 48,000 | 20,000 | 50,000
Test Set | 16,000 | 5900 | 10,000
Number of categories | 32 | 7 | 10
Maximum length | 27 | 70 | 25
Average length | 9.5 | 65 | 17.32
Table 3. TC accuracies obtained using different WVs.

WV Type | Accuracy (%), CNT | Accuracy (%), SCP | Accuracy (%), SNs
Random | 80.52 | 81.46 | 81.27
Word2Vec | 85.58 | 85.36 | 82.41
GloVe | 88.39 | 88.65 | 86.28
BERT | 95.32 | 95.65 | 95.14
Table 4. Comparison of evaluation indicators of different methods.

Indicator | Proposed Method | Ref. [30] | Ref. [31] | Ref. [34]
A | 94.28 | 90.56 | 86.74 | 82.55
P | 86.36 | 78.25 | 66.49 | 60.38
R | 84.95 | 75.88 | 63.57 | 62.29
F1 | 85.96 | 76.48 | 64.79 | 61.25
Ma_F | 86.34 | 77.21 | 64.86 | 61.93
Mi_F | 86.56 | 77.35 | 64.87 | 61.59
Table 5. TC accuracy of ablation models.

Model | Accuracy (%), CNT | Accuracy (%), SCP | Accuracy (%), SNs
BERT-CAM w/o pos | 78.30 | 78.98 | 78.84
BERT-CAM w/o position | 81.96 | 81.80 | 83.66
BERT-CAM w/o attention | 63.99 | 64.18 | 62.47
BERT-CAM | 95.32 | 95.65 | 95.14

Share and Cite

MDPI and ACS Style

Pan, L.; Lim, W.H.; Gan, Y. A Method of Sustainable Development for Three Chinese Short-Text Datasets Based on BERT-CAM. Electronics 2023, 12, 1531. https://doi.org/10.3390/electronics12071531
