1. Introduction
With the development of technology, countries and enterprises are gradually replacing paper documents with electronic ones. These documents are likely to contain sensitive information, so the organizations concerned need to conduct sensitivity reviews on huge volumes of documents. However, manual review is too time-consuming and costly given the sheer volume of electronic documents. As a result, countries and enterprises have started using artificial intelligence to automatically perceive and categorize electronic documents according to whether they contain sensitive information. Conventional machine classification methods, however, gain speed at the expense of accuracy. Improving the accuracy of sensitive information perception in electronic documents while maintaining its speed has therefore become an urgent goal.
There are two types of electronic document classification methods, the first of which is the vector space model. These models analyze document features by computing vectors and assess the semantic relationships between words by measuring the distances between vectors in space. The earliest one-hot technique [
1] turns the vocabulary into binary codes, mapping the discrete text into a Euclidean space where each word corresponds to a point, so that the feature relations between words can be determined by measuring the distances between points. The one-hot method, however, is unable to capture the semantics of words or the links between them. When the vocabulary is exceedingly broad, the word vectors it produces are excessively high-dimensional and sparse, which makes computation difficult. In the bag-of-words model [
2], which introduces word frequency on the basis of one-hot encoding, a sentence is represented by a vector whose dimension equals the number of words in the lexicon. The bag-of-words approach partially solves the vector sparsity problem of the one-hot method, but it takes into account neither word order nor the relationships between words and is unable to capture contextual information. The N-gram method [
3], which is based on the bag-of-words model, treats every group of N neighboring words as a whole, assuming that each word is related only to the n-1 words that precede it, i.e., that the n-th word can be inferred from the preceding n-1 words. The bi-gram method, which groups two words, and the tri-gram method, which groups three, are the most common. The probability of the text as a whole is the product of the probabilities of its word occurrences. Although this approach includes contextual information and takes word order inside phrases into account, it results in more severe data sparsity when the vocabulary is large. The term frequency-inverse document frequency (TF-IDF) method [
4] determines whether a word is unique to a document by taking into account both its frequency in that document and its frequency in the other documents of the corpus, thereby incorporating lexical features and textual information into the word vector. However, it still does not reflect word position or contextual information, and frequency alone cannot characterize the semantic relationships between words. A brief sketch of these vector space representations follows.
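As a hedged illustration (not part of the original experiments), the following minimal sketch builds bag-of-words, bi-gram, and TF-IDF representations with scikit-learn; the toy English corpus and all parameter choices are our own assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus standing in for real documents (illustrative only).
corpus = [
    "the report contains sensitive information",
    "the report is public",
    "sensitive documents require review",
]

# Bag-of-words: each document becomes a vocabulary-sized count vector;
# word order is discarded, which is exactly the limitation noted above.
bow = CountVectorizer()
bow_matrix = bow.fit_transform(corpus)
print(bow.get_feature_names_out())
print(bow_matrix.toarray())

# Bi-grams: adjacent word pairs become single features, restoring some
# local word order at the cost of a much sparser feature space.
bigram = CountVectorizer(ngram_range=(2, 2))
print(bigram.fit_transform(corpus).shape)

# TF-IDF: down-weights words appearing in many documents, so terms
# distinctive to one document receive higher weights.
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(corpus).toarray().round(2))
```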
The second type of method is the neural network-based language model, which has become very popular in recent years. Inspired by the convolutional neural network (CNN) structure [
5], Bengio et al. [
6] used the first n-1 words to predict the probability of the n-th word through contextual modeling, producing a word vector representation for each word. However, manually extracted features have a restricted ability to represent complicated functions and handle challenging classification problems, and yield incomplete and inaccurate semantic representations. Mikolov et al. proposed the Word2vec model [
7], which contains two approaches, CBOW and Skip-gram, that obtain word vectors by computing the association between words and their contexts. The CBOW model predicts the middle target word given the surrounding words, whereas the Skip-gram model predicts the surrounding words given the middle word. The computational complexity of both methods becomes too great when the vocabulary is large, though. Pennington et al. proposed the GloVe (global vectors for word representation) model [
8] to solve the issue that the Word2vec model does not take global information into account; GloVe applies global co-occurrence features while also considering local contextual information. Combining language model pre-training with word vector training has since grown into the most popular technique. Later, to address the RNN model's [
9] inability to compute in parallel, Google introduced the Transformer model [
10], an encoder-decoder model made up entirely of attention mechanisms. Devlin et al. [
11] then built the BERT model by extracting the encoder from the Transformer model, which produced the best test scores on 11 natural language processing tasks. In 2019, Baidu enhanced BERT's masking technique and offered the ERNIE model with entity-level and phrase-level masks [
12], which produced the best results on five Chinese NLP tasks. Baidu then proposed the ERNIE 2.0 model [
13] in 2020, which builds on ERNIE and continually learns from multiple constructed pre-training tasks, producing the best results on 16 tasks. However, the resulting word embeddings still do not properly exploit the information about significant entities in the text.
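To make the CBOW/Skip-gram distinction above concrete, here is a minimal sketch using the gensim library; the toy corpus and hyperparameter values are illustrative assumptions, not the configuration used in this paper.

```python
from gensim.models import Word2Vec

# Toy tokenized corpus standing in for real training text (illustrative only).
sentences = [
    ["the", "report", "contains", "sensitive", "information"],
    ["the", "public", "report", "contains", "no", "secrets"],
    ["sensitive", "documents", "require", "careful", "review"],
] * 50  # repeated so the toy corpus is large enough to train on

# sg=0 selects CBOW: predict the center word from its surrounding context.
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

# sg=1 selects Skip-gram: predict the surrounding context from the center word.
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

# Either model yields one dense vector per word and supports similarity queries.
print(cbow.wv["sensitive"].shape)              # (50,)
print(skipgram.wv.most_similar("sensitive", topn=3))
```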
To solve these problems, we propose a sensitive information perception model based on ERNIE and knowledge graph embedding (KG-ERNIE), which encodes the text with a pre-trained text encoder and then applies convolutional learning to the intermediate vectors to extract semantic information and intelligently perceive sensitive information in the text. We also propose the TSIIP algorithm, which performs detection at the word, statement, and text levels. The significant contributions of this paper are as follows:
(1) Based on THUNews and Chinese Wikipedia, a Chinese sensitive information dataset, JWBD, and a sensitive vocabulary are built; the proposed KG-ERNIE model and the TSIIP algorithm are trained and tested on these datasets.
(2) The KG-ERNIE sensitive information detection model is proposed, which employs a knowledge graph-based entity embedding technique and an ERNIE-based pre-training model to encode the input text and extract semantic information and features. A convolutional neural network (CNN) is then used to recognize and categorize the encoded information (a hedged sketch of this pipeline follows the contribution list).
(3) The TSIIP algorithm, which perceives sensitive words at the word level and sensitive statements at the statement level to determine the final evaluation score of the text, is created and applied to the proposed KG-ERNIE model to address the low accuracy of machine recognition of sensitive information in text and the limitation of considering only single-level semantics.
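As promised above, the following is a rough, hedged sketch of an encoder-plus-CNN pipeline of the kind contribution (2) describes. The checkpoint name (nghuyong/ernie-1.0-base-zh), the kernel sizes, and the omission of the knowledge graph entity-embedding fusion step are all our assumptions for illustration; the actual KG-ERNIE architecture is specified in Section 3.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class TextCNNHead(nn.Module):
    """Convolutional classifier over the encoder's contextual token vectors."""
    def __init__(self, hidden=768, n_filters=128, kernel_sizes=(2, 3, 4), n_classes=2):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(hidden, n_filters, k) for k in kernel_sizes
        )
        self.fc = nn.Linear(n_filters * len(kernel_sizes), n_classes)

    def forward(self, token_vecs):                 # (batch, seq_len, hidden)
        x = token_vecs.transpose(1, 2)             # Conv1d expects (batch, hidden, seq)
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))   # (batch, n_classes)

# Assumed checkpoint; any ERNIE/BERT-style encoder with 768-dim outputs works.
name = "nghuyong/ernie-1.0-base-zh"
tokenizer = AutoTokenizer.from_pretrained(name)
encoder = AutoModel.from_pretrained(name)
head = TextCNNHead()

inputs = tokenizer("这份文件包含敏感信息", return_tensors="pt")
token_vecs = encoder(**inputs).last_hidden_state   # contextual token vectors

# In KG-ERNIE, knowledge graph entity embeddings would be fused into
# token_vecs before classification; here we classify the text directly.
logits = head(token_vecs)
print(logits.softmax(dim=1))  # e.g., [P(non-sensitive), P(sensitive)]
```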
The rest of the paper is organized as follows:
Section 2 presents related work.
Section 3 describes, in detail, the KG-ERNIE sensitive information detection model and the TSIIP algorithm proposed in this paper.
Section 4 presents the experiments and analysis related to this work.
Section 5 summarizes our work and provides a future outlook.
2. Related Work
With the proliferation of information on a global scale, intelligent perception of sensitive information has become an increasingly important area of research. Sensitive information keyword-matching algorithms were utilized in 2015 by Berardi et al. [
14] to locate classified documents. Since keyword lexicons are manually generated, however, subjectivity can affect classification accuracy. Recurrent neural networks (RNNs) were then employed by Neerbeky et al. [
15] in 2017 to analyze the syntactic and grammatical structure of texts, find sensitive information in text documents, and assess the sensitivity values of the semantic component of the text structure. Neerbeky et al. [
16] then developed a sensitive-phrase recurrent neural network, with an Acc value of 76.78%, to capture the intricacy of detecting sensitive information. In 2018, Xu et al. [
17] introduced a new topic-tracking algorithm. It monitors sensitive words over a period of time and assigns them different weights according to how often they occur in the text and the positions at which they appear; however, the algorithm ignores the semantic context surrounding the words as well as the semantic information at the utterance level. Then, Xu et al. [
18] presented a novel text-CNN-based approach to sensitive information identification, which reduced the training time of the detection model while maintaining detection accuracy, making it faster and more accurate than previous models and techniques. In 2020, Lin et al. [
19] put forth a framework to extract data features more comprehensively in order to improve detection outcomes. The framework is built on Bi-LSTM and CNN structures: convolutional neural networks are used to effectively extract local features from unstructured documents, and Bi-LSTM networks are constructed to extract global features. In the same year, Hong et al. designed a brand-new framework to identify sensitive data using a network traffic recovery method [
20]. A scalable convolutional neural network (CNN) and a bidirectional long short-term memory model (Bi-LSTM) with a multi-channel structure were proposed in 2021 by Gan et al. [
21] to detect sensitive information in Chinese texts and analyze their sentiment tendency, introducing attention mechanisms to improve the model's performance.
Due to the widespread use of the Transformer model [
10] and the BERT model [
11] in 2019, many academics have presented solutions based on these two models. A sensitive information classification model based on BERT-CNN was presented by Wang et al. [
22], which enhances the word embeddings' generalization capability and effectively classifies sensitive network information in short text datasets. Using a variant recognition and similarity calculation method that covers variations in synonyms, pronouns, acronyms, and word forms, combined with rules, Fu et al. [
23] proposed a sensitive word detection method that applies variation recognition and association analysis to identify and judge sensitive words. However, its running cost is too high, and it is difficult to apply to constantly changing sensitive words. Pablos et al. [
24] used a sequence tagging algorithm based on the BERT model in 2020 to find and delete sensitive data from Spanish clinical literature in order to protect the privacy of personal data. A method to extract sensitive information from unstructured data using a hybrid content-based and context-based extraction mechanism was proposed by Guo et al. [
25] in 2021. In 2022, Cong et al. [
26] combined the BERT framework with a knowledge graph to form the KGDetector framework for detecting Chinese sensitive information; using the CNN+FC classifier proposed in that paper, experiments on their self-constructed Chinese sensitive information dataset achieved an F1 value of 93.7%. In the same year, Gibert et al. [
27] successfully generated four datasets for Spanish named-entity detection in the legal sector, utilizing the MAPA project framework and the Transformer model for the de-identification of sensitive data in real-world scenarios. Lelio et al. [
28] proposed a method for the automatic anonymization of personal data that extracts linguistic features from the text while masking the sensitive information it finds; the dataset used for their tests consists of court documents from the Italian Supreme Court. To improve the model's ability to recognize sensitive features and to increase the rate at which sensitive information is detected, Huang et al. [
29] proposed a sensitive information detection approach based on language model word embeddings [
30] and attention mechanisms (A-ELMo). However, the feature extraction ability of the ELMo model differs from that of the Transformer model [
10].
Therefore, we improved on the algorithm [
17], which reviews sensitive information based only on the frequency and location of sensitive words, and proposed the TSIIP algorithm, which perceives sensitive words at the word level and the semantics of sensitive sentences at the sentence level, finally obtaining an overall evaluation score for the input text; a minimal sketch of this multi-level scoring idea follows.
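The sketch below illustrates the multi-level scoring idea only; it is not the TSIIP algorithm itself (which Section 3 defines). The sensitive lexicon, the stand-in sentence scorer, and the combination weights are all assumed for illustration.

```python
import re

# Illustrative sensitive lexicon; the real one is built as described in this paper.
SENSITIVE_WORDS = {"机密", "内部", "泄露"}
WORD_WEIGHT, SENT_WEIGHT = 0.4, 0.6  # assumed combination weights


def sentence_score(sentence: str) -> float:
    """Stand-in for a model's sentence-level sensitivity probability."""
    hits = sum(w in sentence for w in SENSITIVE_WORDS)
    return min(1.0, hits / 2)  # toy heuristic in place of the KG-ERNIE classifier


def text_score(text: str) -> float:
    """Combine word-level hits and sentence-level scores into one text score."""
    sentences = [s for s in re.split(r"[。！？]", text) if s]
    words_found = [w for w in SENSITIVE_WORDS if w in text]
    word_level = min(1.0, len(words_found) / 3)              # word-level evidence
    sent_level = max(sentence_score(s) for s in sentences)   # strongest sentence
    return WORD_WEIGHT * word_level + SENT_WEIGHT * sent_level


print(text_score("这是公开资料。其中不含机密内容，请勿泄露。"))
```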
5. Conclusions
In this research, we propose an intelligent perception algorithm (TSIIP) and the detection and recognition model KG-ERNIE for the problem of text-sensitive information perception. To address the issue of considering only single-level semantics, we design an algorithm that identifies sensitive words at the word level and sensitive statements at the statement level in order to obtain the evaluation score of the final text. We encode the input text using an entity embedding model based on the ERNIE pre-training model and a knowledge graph, and then use our intelligent perception algorithm, with a convolutional neural network (CNN) as the underlying architecture, to intelligently perceive sensitive information in the input text. For the dataset, we created a sensitive vocabulary based on THUNews and Chinese Wikipedia, as well as the Chinese sensitive information dataset JWBD. Numerous experiments were performed on this dataset, and the F1 and F2 scores for the KG-ERNIE model and the TSIIP method proposed in this study were 0.938 (a 0.6% improvement) and 0.946 (a 1% improvement), respectively. Thus, the TSIIP algorithm and the KG-ERNIE model surpass other current methods. Specific application scenarios include: reviewing documents transmitted over networks by enterprises, where perceiving whether the transmitted text contains sensitive information prevents the leakage of enterprise or personal data; and reviewing whether the electronic documents of countries or governments contain sensitive information, improving the efficiency and quality of electronic document review.
Although this solution has achieved good results, there is still room for improvement. In future work, we will continue to expand and improve the sensitive vocabulary and sensitive sentence library, and further investigate the classifier to improve both the recognition precision and the recall rate of sensitive information. Furthermore, we intend to extend our current binary classification system into a quantitative assessment of text sensitivity level, with sensitive information texts classified into three levels: Top Secret, Confidential, and Secret.