1. Introduction
With the development of technology, countries and enterprises are gradually replacing paper documents with electronic ones. These documents are likely to contain sensitive information, so the organizations concerned need to conduct sensitivity reviews on huge volumes of documents. However, manual review is too time-consuming and costly given the sheer volume of electronic documents. As a result, countries and enterprises have started using artificial intelligence to automatically perceive and categorize electronic documents according to whether they contain sensitive information. Conventional machine classification methods, however, gain speed at the expense of accuracy. Improving the accuracy of sensitive information perception in electronic documents while maintaining its speed has therefore become an urgent goal.
There are two types of electronic document classification methods, the first of which is the vector space model. These models analyze document features by computing vectors and assess the semantic relationships between words by measuring the distances between vectors in space. The earliest one-hot technique [
1] turns the vocabulary into binary codes, mapping the discrete text into a Euclidean space where each word corresponds to a point, so that the feature relations between words can be determined by measuring the distances between points. The one-hot method, however, is unable to capture the semantics of words or the links between them. When the vocabulary is exceedingly broad, the word vectors it produces are excessively high-dimensional and sparse, which makes computation difficult. In the bag-of-words model [
2], which introduces word frequency on the basis of one-hot encoding, a sentence is represented by a vector whose dimension equals the number of words in the lexicon. The bag-of-words approach partially solves the vector sparsity problem of the one-hot method, but it takes into account neither word order nor the relationships between words and is unable to capture contextual information. The N-gram method [
3], which is based on the bag-of-words model, treats every group of N neighboring words as a whole, assuming that each word is related only to the n-1 words that precede it, i.e., that the n-th word can be inferred from the preceding n-1 words. The bi-gram method, which groups two words, and the tri-gram method, which groups three, are the most common. The probability of the text as a whole is the product of the probabilities of its word occurrences. Although this approach includes contextual information and takes word order inside phrases into account, it results in more severe data sparsity when the vocabulary is large. The term frequency-inverse document frequency (TF-IDF) method [
4] determines whether a word is unique to a document by taking into account both its frequency in that document and its frequency in the other documents of the corpus, thereby incorporating lexical features and textual information into the word vector. However, it still does not reflect word position or contextual information, and frequency alone cannot characterize the semantic relationships between words. A brief sketch of these vector space representations follows.
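As a hedged illustration (not part of the original experiments), the following minimal sketch builds bag-of-words, bi-gram, and TF-IDF representations with scikit-learn; the toy English corpus and all parameter choices are our own assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus standing in for real documents (illustrative only).
corpus = [
    "the report contains sensitive information",
    "the report is public",
    "sensitive documents require review",
]

# Bag-of-words: each document becomes a vocabulary-sized count vector;
# word order is discarded, which is exactly the limitation noted above.
bow = CountVectorizer()
bow_matrix = bow.fit_transform(corpus)
print(bow.get_feature_names_out())
print(bow_matrix.toarray())

# Bi-grams: adjacent word pairs become single features, restoring some
# local word order at the cost of a much sparser feature space.
bigram = CountVectorizer(ngram_range=(2, 2))
print(bigram.fit_transform(corpus).shape)

# TF-IDF: down-weights words appearing in many documents, so terms
# distinctive to one document receive higher weights.
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(corpus).toarray().round(2))
```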
The second type of method is the neural network-based language model, which has become very popular in recent years. Inspired by the convolutional neural network (CNN) structure [
5], Bengio et al. [
6] used the first n-1 words to predict the probability of the n-th word through contextual modeling, producing a word vector representation for each word. However, manually extracted features have a restricted ability to represent complicated functions and handle challenging classification problems, and yield incomplete and inaccurate semantic representations. Mikolov et al. proposed the Word2vec model [
7], which contains two approaches, CBOW and Skip-gram, that obtain word vectors by computing the association between words and their contexts. The CBOW model predicts the middle target word given the surrounding words, whereas the Skip-gram model predicts the surrounding words given the middle word. The computational complexity of both methods becomes too great when the vocabulary is large, though. Pennington et al. proposed the GloVe (global vectors for word representation) model [
8] to solve the issue that the Word2vec model does not take global information into account; GloVe applies global co-occurrence features while also considering local contextual information. Combining language model pre-training with word vector training has since grown into the most popular technique. Later, to address the RNN model's [
9] inability to compute in parallel, Google introduced the Transformer model [
10], an encoder-decoder model made up entirely of attention mechanisms. Devlin et al. [
11] then built the BERT model by extracting the encoder from the Transformer model, which produced the best test scores on 11 natural language processing tasks. In 2019, Baidu enhanced BERT's masking technique and offered the ERNIE model with entity-level and phrase-level masks [
12], which produced the best results on five Chinese NLP tasks. Baidu then proposed the ERNIE 2.0 model [
13] in 2020, which builds on ERNIE and continually learns from multiple constructed pre-training tasks, producing the best results on 16 tasks. However, the resulting word embeddings still do not properly exploit the information about significant entities in the text.
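To make the CBOW/Skip-gram distinction above concrete, here is a minimal sketch using the gensim library; the toy corpus and hyperparameter values are illustrative assumptions, not the configuration used in this paper.

```python
from gensim.models import Word2Vec

# Toy tokenized corpus standing in for real training text (illustrative only).
sentences = [
    ["the", "report", "contains", "sensitive", "information"],
    ["the", "public", "report", "contains", "no", "secrets"],
    ["sensitive", "documents", "require", "careful", "review"],
] * 50  # repeated so the toy corpus is large enough to train on

# sg=0 selects CBOW: predict the center word from its surrounding context.
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

# sg=1 selects Skip-gram: predict the surrounding context from the center word.
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

# Either model yields one dense vector per word and supports similarity queries.
print(cbow.wv["sensitive"].shape)              # (50,)
print(skipgram.wv.most_similar("sensitive", topn=3))
```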
To solve these problems, we propose a sensitive information perception model based on ERNIE and knowledge graph embedding (KG-ERNIE), which encodes the text with a pre-trained text encoder and then applies convolutional learning to the intermediate vectors to extract semantic information and intelligently perceive sensitive information in the text. We also propose the TSIIP algorithm, which performs detection at the word, statement, and text levels. The significant contributions of this paper are as follows:
(1) Based on THUNews and Chinese Wikipedia, a Chinese sensitive information dataset, JWBD, and a sensitive vocabulary are built; the proposed KG-ERNIE model and the TSIIP algorithm are trained and tested on these datasets.
(2) The KG-ERNIE sensitive information detection model is proposed, which employs a knowledge graph-based entity embedding technique and an ERNIE-based pre-training model to encode the input text and extract semantic information and features. A convolutional neural network (CNN) is then used to recognize and categorize the encoded information (a hedged sketch of this pipeline follows the contribution list).
(3) The TSIIP algorithm, which perceives sensitive words at the word level and sensitive statements at the statement level to determine the final evaluation score of the text, is created and applied to the proposed KG-ERNIE model to address the low accuracy of machine recognition of sensitive information in text and the limitation of considering only single-level semantics.
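As promised above, the following is a rough, hedged sketch of an encoder-plus-CNN pipeline of the kind contribution (2) describes. The checkpoint name (nghuyong/ernie-1.0-base-zh), the kernel sizes, and the omission of the knowledge graph entity-embedding fusion step are all our assumptions for illustration; the actual KG-ERNIE architecture is specified in Section 3.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class TextCNNHead(nn.Module):
    """Convolutional classifier over the encoder's contextual token vectors."""
    def __init__(self, hidden=768, n_filters=128, kernel_sizes=(2, 3, 4), n_classes=2):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(hidden, n_filters, k) for k in kernel_sizes
        )
        self.fc = nn.Linear(n_filters * len(kernel_sizes), n_classes)

    def forward(self, token_vecs):                 # (batch, seq_len, hidden)
        x = token_vecs.transpose(1, 2)             # Conv1d expects (batch, hidden, seq)
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))   # (batch, n_classes)

# Assumed checkpoint; any ERNIE/BERT-style encoder with 768-dim outputs works.
name = "nghuyong/ernie-1.0-base-zh"
tokenizer = AutoTokenizer.from_pretrained(name)
encoder = AutoModel.from_pretrained(name)
head = TextCNNHead()

inputs = tokenizer("这份文件包含敏感信息", return_tensors="pt")
token_vecs = encoder(**inputs).last_hidden_state   # contextual token vectors

# In KG-ERNIE, knowledge graph entity embeddings would be fused into
# token_vecs before classification; here we classify the text directly.
logits = head(token_vecs)
print(logits.softmax(dim=1))  # e.g., [P(non-sensitive), P(sensitive)]
```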
The rest of the paper is organized as follows:
Section 2 presents related work.
Section 3 describes, in detail, the KG-ERNIE sensitive information detection model and the TSIIP algorithm proposed in this paper.
Section 4 presents the experiments and analysis related to this work.
Section 5 summarizes our work and provides a future outlook.
2. Related Work
With the proliferation of information on a global scale, intelligent perception of sensitive information has become an increasingly important area of research. Sensitive information keyword-matching algorithms were utilized in 2015 by Berardi et al. [
14] to locate classified documents. Since keyword lexicons are manually generated, however, subjectivity can affect classification accuracy. Recurrent neural networks (RNNs) were then employed by Neerbeky et al. [
15] in 2017 to analyze the syntactic and grammatical structure of texts, find sensitive information in text documents, and assess the sensitivity values of the semantic component of the text structure. Neerbeky et al. [
16] then developed a sensitive-phrase recurrent neural network, with an Acc value of 76.78%, to capture the intricacy of detecting sensitive information. In 2018, Xu et al. [
17] introduced a new topic-tracking algorithm. It monitors sensitive words over a period of time and assigns them different weights according to how often they occur in the text and the positions at which they appear; however, the algorithm ignores the semantic context surrounding the words as well as the semantic information at the utterance level. Then, Xu et al. [
18] presented a novel text-CNN-based approach to sensitive information identification, which reduced the training time of the detection model while maintaining detection accuracy, making it faster and more accurate than previous models and techniques. In 2020, Lin et al. [
19] put forth a framework to extract data features more comprehensively in order to improve detection outcomes. The framework is built on Bi-LSTM and CNN structures: convolutional neural networks are used to effectively extract local features from unstructured documents, and Bi-LSTM networks are constructed to extract global features. In the same year, Hong et al. designed a brand-new framework to identify sensitive data using a network traffic recovery method [
20]. A scalable convolutional neural network (CNN) and a bidirectional long short-term memory model (Bi-LSTM) with a multi-channel structure were proposed in 2021 by Gan et al. [
21] to detect sensitive information in Chinese texts and analyze their sentiment tendency, introducing attention mechanisms to improve the model's performance.
Due to the widespread use of the Transformer model [
10] and the BERT model [
11] in 2019, many academics have presented solutions based on these two models. A sensitive information classification model based on BERT-CNN was presented by Wang et al. [
22], which enhances the word embeddings' generalization capability and effectively classifies sensitive network information in short text datasets. Using a variant recognition and similarity calculation method that covers variations in synonyms, pronouns, acronyms, and word forms, combined with rules, Fu et al. [
23] proposed a sensitive word detection method that applies variation recognition and association analysis to identify and judge sensitive words. However, its running cost is too high, and it is difficult to apply to constantly changing sensitive words. Pablos et al. [
24] used a sequence tagging algorithm based on the BERT model in 2020 to find and delete sensitive data from Spanish clinical literature in order to protect the privacy of personal data. A method to extract sensitive information from unstructured data using a hybrid content-based and context-based extraction mechanism was proposed by Guo et al. [
25] in 2021. In 2022, Cong et al. [
26] combined the BERT framework with a knowledge graph to form the KGDetector framework for detecting Chinese sensitive information; using the CNN+FC classifier proposed in that paper, experiments on their self-constructed Chinese sensitive information dataset achieved an F1 value of 93.7%. In the same year, Gibert et al. [
27] successfully generated four datasets for Spanish named-entity detection in the legal sector, utilizing the MAPA project framework and the Transformer model for the de-identification of sensitive data in real-world scenarios. Lelio et al. [
28] proposed a method for the automatic anonymization of personal data that extracts linguistic features from the text while masking the sensitive information it finds; the dataset used for their tests consists of court documents from the Italian Supreme Court. To improve the model's ability to recognize sensitive features and to increase the rate at which sensitive information is detected, Huang et al. [
29] proposed a sensitive information detection approach based on language model word embeddings [
30] and attention mechanisms (A-ELMo). However, the feature extraction ability of the ELMo model differs from that of the Transformer model [
10].
Therefore, we improved on the algorithm [
17], which reviews sensitive information based only on the frequency and location of sensitive words, and proposed the TSIIP algorithm, which perceives sensitive words at the word level and the semantics of sensitive sentences at the sentence level, finally obtaining an overall evaluation score for the input text; a minimal sketch of this multi-level scoring idea follows.
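The sketch below illustrates the multi-level scoring idea only; it is not the TSIIP algorithm itself (which Section 3 defines). The sensitive lexicon, the stand-in sentence scorer, and the combination weights are all assumed for illustration.

```python
import re

# Illustrative sensitive lexicon; the real one is built as described in this paper.
SENSITIVE_WORDS = {"机密", "内部", "泄露"}
WORD_WEIGHT, SENT_WEIGHT = 0.4, 0.6  # assumed combination weights


def sentence_score(sentence: str) -> float:
    """Stand-in for a model's sentence-level sensitivity probability."""
    hits = sum(w in sentence for w in SENSITIVE_WORDS)
    return min(1.0, hits / 2)  # toy heuristic in place of the KG-ERNIE classifier


def text_score(text: str) -> float:
    """Combine word-level hits and sentence-level scores into one text score."""
    sentences = [s for s in re.split(r"[。！？]", text) if s]
    words_found = [w for w in SENSITIVE_WORDS if w in text]
    word_level = min(1.0, len(words_found) / 3)              # word-level evidence
    sent_level = max(sentence_score(s) for s in sentences)   # strongest sentence
    return WORD_WEIGHT * word_level + SENT_WEIGHT * sent_level


print(text_score("这是公开资料。其中不含机密内容，请勿泄露。"))
```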
5. Conclusions
In this research, we propose an intelligent perception algorithm (TSIIP) and the detection and recognition model KG-ERNIE for the problem of text-sensitive information perception. To address the issue of considering only single-level semantics, we design an algorithm that identifies sensitive words at the word level and sensitive statements at the statement level in order to obtain the evaluation score of the final text. We encode the input text using an entity embedding model based on the ERNIE pre-training model and a knowledge graph, and then use our intelligent perception algorithm, with a convolutional neural network (CNN) as the underlying architecture, to intelligently perceive sensitive information in the input text. For the dataset, we created a sensitive vocabulary based on THUNews and Chinese Wikipedia, as well as the Chinese sensitive information dataset JWBD. Numerous experiments were performed on this dataset, and the F1 and F2 scores for the KG-ERNIE model and the TSIIP method proposed in this study were 0.938 (a 0.6% improvement) and 0.946 (a 1% improvement), respectively. Thus, the TSIIP algorithm and the KG-ERNIE model surpass other current methods. Specific application scenarios include: reviewing documents transmitted over networks by enterprises, where perceiving whether the transmitted text contains sensitive information prevents the leakage of enterprise or personal data; and reviewing whether the electronic documents of countries or governments contain sensitive information, improving the efficiency and quality of electronic document review.
Although this solution has achieved good results, there is still room for improvement. In future work, we will continue to expand and improve the sensitive vocabulary and sensitive sentence library, and further investigate the classifier to improve both the recognition precision and the recall rate of sensitive information. Furthermore, we intend to extend our current binary classification system into a quantitative assessment of text sensitivity level, with sensitive information texts classified into three levels: Top Secret, Confidential, and Secret.