3.2. Privacy-Sensitive Data Identification Based on Regular Expressions
Determine the fixed structure of sensitive data, $R = \{r_1, r_2, \ldots, r_n\}$, for each specific user. Each rule $r_i \in R$ is a privacy rule with a specific structure associated with the user:

$r_i = a_1 \wedge a_2 \wedge \cdots \wedge a_m$

where $r_i$ represents the target-sensitive rule, each $a_j$ is represented as a logical expression of an instance attribute, and $m$ represents the length of the rule.
For structured power business sensitive data interacting between the State Grid business platform and third-party platforms, regular expressions can be employed for matching. Structured data typically consist of ID numbers, phone numbers, email addresses, IP addresses, etc., and are composed of alphanumeric characters, so they can be matched directly using regular expressions.
The regular expression /^[a-z]([a-z0-9]*[-_]?[a-z0-9]+)*@([a-z0-9]*[-_]?[a-z0-9]+)+[\.][a-z]{2,3}([\.][a-z]{2})?$/i can detect email addresses.
[a-z] means the first character of the username must be a lowercase letter from a to z. ([a-z0-9]*[-_]?[a-z0-9]+)* is used to represent the username part. [a-z0-9]* represents a character class that includes all lowercase letters (a to z) and digits (0 to 9). The asterisk (*) means that this character class can appear zero or more times. [-_]? indicates that there can be zero or one hyphen (-) or underscore (_).
[a-z0-9]+ indicates that there must be one or more lowercase letters or numbers. The plus sign [...]+ means that this character class must appear at least once. The combination part (...)* indicates that the above pattern can be repeated zero or more times.
([a-z0-9]*[-_]?[a-z0-9]+)+ is used to represent the domain name part. The rules for this part are similar to the rules for the username. (...)+ indicates that the above pattern must appear at least once.
[\.][a-z]{2,3} matches the top-level domain part, such as ‘.com’. [\.] represents the dot (.), and [a-z]{2,3} means that the preceding character class [a-z] must match at least two and at most three times.
([\.][a-z]{2})? indicates an optional second-level top-level domain, such as ‘.co.uk’. The specific meaning of the characters is similar to the above.
For example, for ‘john_doe@gmail.com’, ‘john_doe’ matches the username part [a-z]([a-z0-9]*[-_]?[a-z0-9]+)*, ‘gmail’ matches the domain part ([a-z0-9]*[-_]?[a-z0-9]+)+, and ‘.com’ matches the top-level domain part [\.][a-z]{2,3}. Note that a dot is not permitted inside the username by this pattern, so ‘john.doe@gmail.com’ would not match.
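As an illustration, the pattern can be applied directly with Python's re module; the sample addresses below are hypothetical:

```python
import re

# The email pattern from this section; the /.../i delimiters of the original
# notation translate to the re.IGNORECASE flag in Python.
EMAIL_RE = re.compile(
    r"^[a-z]([a-z0-9]*[-_]?[a-z0-9]+)*"     # username part
    r"@([a-z0-9]*[-_]?[a-z0-9]+)+"          # domain part
    r"[\.][a-z]{2,3}([\.][a-z]{2})?$",      # TLD plus optional second level
    re.IGNORECASE,
)

# Hypothetical sample addresses: dots inside the username are not covered
# by this pattern, so the third candidate is rejected.
for addr in ["john_doe@gmail.com", "alice@example.co.uk", "john.doe@gmail.com"]:
    print(addr, "->", bool(EMAIL_RE.match(addr)))
```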
Use the regular expression ^[1-9]\d{5}[1-9]\d{3}((0\d)|(1[0-2]))(([0-2]\d)|3[0-1])\d{4}[0-9Xx]$ to detect an ID number.
[1-9]\d{5} matches the area code of the ID number: [1-9] matches any single digit from 1 to 9 (not 0), \d matches any digit from 0 to 9, and {5} is a quantifier specifying exactly five occurrences of the preceding element. [1-9]\d{3} matches the year of birth in the ID number.
((0\d)|(1[0-2])) ensures the validity of the month: (0\d) matches ‘01’ to ‘09’ and (1[0-2]) matches ‘10’ to ‘12’. (([0-2]\d)|3[0-1]) indicates the day, matching ‘01’ to ‘31’. \d{4} matches the four-digit sequence code in the ID number. [0-9Xx] indicates that the last character can be a digit 0–9 or the letter X/x (check digit).
Use the regular expression ^((\d{11})|(\d{7,8})|((\d{4}|\d{3})-(\d{7,8}))|((\d{4}|\d{3})-(\d{7,8})-(\d{4}|\d{3}|\d{2}|\d{1}))|((\d{7,8})-(\d{4}|\d{3}|\d{2}|\d{1})))$ to detect a phone number.
(\d{11}) indicates 11 consecutive digits, usually a normal mobile phone number. The vertical bar ‘|’ is an alternation operator that acts like a logical OR. (\d{7,8}) matches a local number with seven to eight digits.
(\d{4}|\d{3})-(\d{7,8}) matches a phone number with an area code. (\d{4}|\d{3}) indicates the area code, which can be three or four digits.
(\d{4}|\d{3})-(\d{7,8})-(\d{4}|\d{3}|\d{2}|\d{1}) matches a phone number with an area code and an extension. (\d{4}|\d{3}|\d{2}|\d{1}) indicates the extension, which can be one to four digits.
(\d{7,8})-(\d{4}|\d{3}|\d{2}|\d{1}) matches a local number with an extension. For example, ‘18812341354’ matches an 11-digit mobile phone number, and ‘010-12345678’ matches a local number with a three-digit area code.
An IPv4 address usually consists of four groups of numbers, each ranging from 0 to 255. Use the regular expression ^(?:[0-1]?\d{1,2}|2(?:[0-4]\d|5[0-5]))(?:\.(?:[0-1]?\d{1,2}|2(?:[0-4]\d|5[0-5]))){3}$ to detect an IP address.
(?:[0-1]?\d{1,2}|2(?:[0-4]\d|5[0-5])) is used to match values from 0 to 255. (?: ... ) represents non-capturing groups that are used to group parts of the pattern without capturing the matched text for back-referencing, and [0-1]? matches zero or one occurrence of the digits 0 or 1. \d{1,2} matches one or two digits (0-9). [0-1]?\d{1,2} matches numbers from 0 to 199, and 2(?:[0-4][0-9]|5[0-5]) covers numbers from 200 to 255.
‘\.’ matches the period. (?:[0-1]?\d{1,2}|2(?:[0-4]\d|5[0-5])) repeats the logic of the first octet, matching a value from 0 to 255, and {3} means repeating the preceding dot-and-octet group three times, i.e., the last three octets. For example, the four groups of numbers in ‘192.168.0.1’ are all in the range of 0 to 255.
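As a minimal sketch, the three numeric patterns above (ID number, phone number, IPv4 address) can be combined into a simple structured-data scanner; the whole-token anchoring and the classify helper below are our additions for illustration:

```python
import re

# Cleaned-up, anchored versions of the patterns discussed in this section.
PATTERNS = {
    "id_number": re.compile(
        r"^[1-9]\d{5}[1-9]\d{3}((0\d)|(1[0-2]))(([0-2]\d)|3[01])\d{4}[0-9Xx]$"
    ),
    "phone": re.compile(
        r"^((\d{11})|(\d{7,8})"
        r"|((\d{4}|\d{3})-(\d{7,8}))"
        r"|((\d{4}|\d{3})-(\d{7,8})-(\d{4}|\d{3}|\d{2}|\d{1}))"
        r"|((\d{7,8})-(\d{4}|\d{3}|\d{2}|\d{1})))$"
    ),
    "ipv4": re.compile(
        r"^(?:[01]?\d{1,2}|2(?:[0-4]\d|5[0-5]))"
        r"(?:\.(?:[01]?\d{1,2}|2(?:[0-4]\d|5[0-5]))){3}$"
    ),
}

def classify(token: str):
    """Return the names of all structured sensitive-data types a token matches."""
    return [name for name, pat in PATTERNS.items() if pat.match(token)]

for token in ["18812341354", "010-12345678", "192.168.0.1", "hello"]:
    print(token, "->", classify(token))
```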
These data adhere to fixed format requirements, making detection and identification relatively straightforward. However, when dealing with a large volume of sensitive data requiring prevention from leakage, more advanced semantic analysis and machine learning technologies become essential for accurate identification.
3.3. Privacy-Sensitive Data Identification Based on DeBERTa-BiLSTM-CRF
3.3.1. DeBERTa Layer
Unstructured sensitive data usually have flexible formats with no fixed structural requirements. Such data may appear in documents or images. Identifying them requires semantic understanding and context analysis: based on the context, it is determined whether the data contain sensitive information that must be prevented from leaking. The electric power business involves many types of data structures and a large volume of data, with correlations between different data types, and this huge amount of data is difficult to monitor manually.
For detecting leak-proof private data within these datasets, the DeBERTa-BiLSTM-CRF model is utilized. The model architecture is shown in Figure 2.
The BERT architecture consists of a stack of Transformer blocks, each built around a self-attention mechanism [44]. The attention score is calculated as follows:

$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V, \quad Q = HW_Q,\; K = HW_K,\; V = HW_V$

Among them, $Q$, $K$, and $V$ are all derived from the word-embedding representations $H$, and $W_Q$, $W_K$, and $W_V$ represent the weight matrices of the multi-head attention mechanism. Each Transformer block takes word embeddings, which are constructed through the encoding of word vectors, as the input.
During the utilization of BERT, two main steps are involved: pre-training and fine-tuning. In pre-training, the model undergoes unsupervised training across various tasks. Subsequently, in fine-tuning, the model is initialized with pre-trained parameters and further refined for specific tasks under supervision. Leveraging the pre-trained DeBERTa model, training and fine-tuning operations are performed on the power business dataset to align it more closely with the scenario of power business data recognition.
The DeBERTa model uses a disentangled attention mechanism and an enhanced mask decoder to improve the BERT model. In the original BERT model, each word or character is represented by a single vector. In DeBERTa, each word is represented by two vectors, one encoding content and the other encoding position, and the attention weights among words are computed using disentangled matrices based on their contents and relative positions. When calculating the attention weight, content-related calculations use the content matrix and position-related calculations use the position matrix, thereby decoupling content and position.
$H_i$ represents the content vector of position $i$, and $P_{i|j}$ represents the relative position vector of position $i$ relative to position $j$ in the sequence. The attention weight of the word pair $(i, j)$ can be calculated using the disentangled matrices of content and position, as shown in Equation (3):

$A_{i,j} = \{H_i, P_{i|j}\} \times \{H_j, P_{j|i}\}^{T} = H_i H_j^{T} + H_i P_{j|i}^{T} + P_{i|j} H_j^{T} + P_{i|j} P_{j|i}^{T}$ (3)
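To make Equation (3) concrete, a toy NumPy sketch for a single attention head follows; the sequence length, hidden size, and random content/position vectors are illustrative placeholders, and the row-wise softmax is shown without the scaling factor used in the full model:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 4, 8                           # toy sequence length and hidden size

H = rng.standard_normal((N, d))       # content vectors: H[i] = H_i
P = rng.standard_normal((N, N, d))    # relative position vectors: P[i, j] = P_{i|j}

# Equation (3): sum of the four disentangled terms for each word pair (i, j).
A = np.zeros((N, N))
for i in range(N):
    for j in range(N):
        A[i, j] = (H[i] @ H[j]            # content-to-content
                   + H[i] @ P[j, i]       # content-to-position
                   + P[i, j] @ H[j]       # position-to-content
                   + P[i, j] @ P[j, i])   # position-to-position

# Row-wise softmax turns scores into attention weights (numerically stable).
A -= A.max(axis=1, keepdims=True)
weights = np.exp(A) / np.exp(A).sum(axis=1, keepdims=True)
```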
Masked Language Modeling (MLM) is used in the pre-training of the DeBERTa model, akin to the BERT model. In this process, the model learns to predict masked words using surrounding context. However, DeBERTa enhances MLM by incorporating both content and positional information from the context. Although the decoupled attention mechanism in DeBERTa considers content and relative position, it overlooks the absolute position of words, which is often pivotal for accurate prediction. Many grammatical subtleties rely heavily on the absolute position of words within a sentence.
There are two approaches to incorporating absolute positions. The BERT model includes absolute positions in the input layer. In DeBERTa, by contrast, absolute positions are merged after all Transformer layers, with mask token prediction conducted before the softmax layer, as depicted in Figure 2. This design enables DeBERTa to capture relative positions across all Transformer layers while utilizing absolute positions as additional information during masked word decoding. This mechanism is referred to as the Enhanced Mask Decoder (EMD) of DeBERTa.
Due to the numerous circulation paths of power business data, the high processing demands for large-scale data, and the diverse range of data types, power business data typically necessitate relatively standardized formatting requirements, with variations across different data types. Consequently, textual sequences require character-level representation. Beyond absolute positions, the model also needs to consider relative positions between characters to accurately capture dependencies between words, encompassing their content and relative positions. Moreover, the DeBERTa model excels at capturing long-distance dependencies between words and outperforms the RoBERTa model on extended sequences, thus addressing the challenges posed by lengthy data sequences in the power business domain and enhancing recognition and matching capabilities.
3.3.2. BiLSTM Layer
Following the DeBERTa layer, a Bidirectional Long Short-Term Memory network (BiLSTM) is introduced. BiLSTM is adept at capturing contextual information within a sequence by simultaneously considering both preceding and succeeding words. It is composed of forward and backward LSTMs, whose outputs are concatenated after processing. BiLSTM is commonly employed in NLP tasks for modeling context.
Effective information is preserved and selected through processes of forgetting and remembering within Long Short-Term Memory (LSTM). At each time step, forgetting, remembering, and output are regulated by forget gates, memory gates, and output gates, respectively. These gates are computed based on the hidden state from the previous time step and the current input.
The principle and calculation processes of a certain unit at a certain moment within the BiLSTM structure are as follows.
Step 1: Calculate the forget gate to decide what information to forget or discard from the unit state. Obtain both the current input $x_t$ and the hidden state $h_{t-1}$ from the previous instant. The sigmoid function $\sigma$ takes $h_{t-1}$ and $x_t$ as input and outputs a value $f_t$ between (0,1) to show the degree of forgetting the knowledge in the unit state $C_{t-1}$ (0: totally forgotten; 1: completely accepted). The calculation formula is

$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$

In the formula, $b_f$ is the forget gate bias vector.
Step 2: To decide what new information to keep in the unit state, compute the input gate. (1) Take as input the hidden state $h_{t-1}$ from the previous moment and the current input $x_t$. (2) Use the tanh layer to create a new candidate vector $\tilde{C}_t$, which is added to the unit state. (3) Calculate and output a value $i_t$ between (0,1) to indicate which information in the unit state needs to be updated. The calculation formula is

$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$
$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$

where $b_C$ is the bias vector of the memory unit and $b_i$ is the bias vector of the update gate.
Step 3: Update the unit state $C_{t-1}$ at the previous moment to the unit state $C_t$ at the current moment. The calculation formula is

$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$
Step 4: Determine what information needs to be output by computing the output gate $o_t$ and the hidden state $h_t$ at the present time. Select which information from the unit state $C_t$ needs to be output by receiving the input of the hidden state $h_{t-1}$ at the previous moment and the current input $x_t$. Then, input the unit state $C_t$ into the tanh layer for processing, and finally perform a product operation with $o_t$ to output the information we need. The calculation formula is

$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$
$h_t = o_t * \tanh(C_t)$

where $b_o$ is the output gate bias vector.
The BiLSTM layer can extract sequence context semantic features and input the features to the CRF layer.
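To ground Steps 1–4, a minimal NumPy sketch of a single LSTM unit follows; the weight layout (one matrix per gate applied to the concatenated [h_{t-1}, x_t]) and the toy dimensions are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM time step following Steps 1-4; W and b hold the parameters
    of the forget (f), input (i), candidate (C), and output (o) gates."""
    z = np.concatenate([h_prev, x_t])           # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])          # Step 1: forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])          # Step 2: input (update) gate
    C_tilde = np.tanh(W["C"] @ z + b["C"])      # Step 2: candidate vector
    C_t = f_t * C_prev + i_t * C_tilde          # Step 3: unit state update
    o_t = sigmoid(W["o"] @ z + b["o"])          # Step 4: output gate
    h_t = o_t * np.tanh(C_t)                    # Step 4: hidden state
    return h_t, C_t

# Toy usage: hidden size 3, input size 2, random parameters.
rng = np.random.default_rng(0)
n_h, n_x = 3, 2
W = {k: rng.standard_normal((n_h, n_h + n_x)) for k in "fiCo"}
b = {k: np.zeros(n_h) for k in "fiCo"}
h, C = np.zeros(n_h), np.zeros(n_h)
for x in rng.standard_normal((5, n_x)):         # forward pass over 5 steps
    h, C = lstm_step(x, h, C, W, b)
```

A BiLSTM runs one such unit forward over the sequence and a second one backward, then concatenates the two hidden states at each position.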
3.3.3. CRF Layer
After the BiLSTM layer, a Conditional Random Field (CRF) layer is added [45]. The CRF layer models dependencies between tags across the entire annotation sequence, ensuring the consistency of the generated annotation sequences.
CRF is a probabilistic graphical model used to solve sequence labeling problems. It receives an observation sequence $(x_1, x_2, \ldots, x_n)$ and outputs a state sequence $(y_1, y_2, \ldots, y_n)$. The score of a sentence's label sequence is calculated from the emission scores output by BiLSTM and the transition scores. The calculation formula is

$s(X, y) = \sum_{i=1}^{n} P_{i, y_i} + \sum_{i=0}^{n} A_{y_i, y_{i+1}}$

where $P_{i, y_i}$ is the score of the $i$-th character predicted to be the $y_i$ label, and $A_{y_i, y_{i+1}}$ is the score of transferring from label $y_i$ to label $y_{i+1}$.
Consider a sentence with $n$ words, each having $m$ possible tags. This yields $m^n$ possible tag sequences for the sentence. The CRF layer assigns a score to each possible label sequence by learning the adjacent dependencies between labels. The sequence with the highest score is identified as the optimal label sequence, determining the category of the named entity. For instance, in the sequence labeling task, the first word in a sequence is typically labeled "B-" or "O" and cannot be labeled "I-" according to certain rules. By adhering to these rules, the CRF layer outputs the optimal label sequence.
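As a sketch of how the highest-scoring sequence can be recovered without enumerating all $m^n$ candidates, the following Python implements Viterbi decoding over emission scores $P$ and transition scores $A$; the toy scores and the omission of separate start/stop transitions are simplifications:

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Find the tag sequence maximizing emission + transition scores.
    emissions: (n, m) scores P[i, y] from BiLSTM; transitions: (m, m) A[y, y']."""
    n, m = emissions.shape
    score = emissions[0].copy()                 # best score ending in each tag
    backptr = np.zeros((n, m), dtype=int)
    for i in range(1, n):
        total = score[:, None] + transitions + emissions[i][None, :]
        backptr[i] = total.argmax(axis=0)       # best previous tag per tag
        score = total.max(axis=0)
    best = [int(score.argmax())]                # backtrack the best path
    for i in range(n - 1, 0, -1):
        best.append(int(backptr[i][best[-1]]))
    return best[::-1]

# Toy example: 3 characters, tags {0: "O", 1: "B", 2: "I"}; a large negative
# transition score makes the forbidden jump O -> I effectively impossible.
emissions = np.array([[0.5, 1.0, 0.1], [0.2, 0.1, 1.5], [1.0, 0.3, 0.2]])
transitions = np.array([[0.0, 0.5, -10.0], [0.0, 0.0, 1.0], [0.5, 0.0, 0.5]])
print(viterbi_decode(emissions, transitions))   # [1, 2, 0] -> "B I O"
```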
3.4. Response Handling
If the model identifies certain data as sensitive information within the electric power business, we will desensitize any privacy-sensitive data it contains. Concurrently, we will issue real-time alerts and implement interception measures for data flows containing sensitive information to safeguard data security. The original data will be retained for risk analysis, enabling us to develop a method for preventing leakage in power business data interaction.
The power business data interactive leakage prevention system encompasses abnormal traffic detection, which involves establishing a traffic-monitoring system to conduct the real-time monitoring of data traffic and access patterns. This monitoring includes tracking data sources, destinations, frequencies, and other relevant metrics. Anomaly detection algorithms, such as statistical methods, machine learning, or deep learning techniques, are employed to identify irregular behaviors like large-scale data downloads or abnormal access frequencies.
For post-event response, an alarm system is established to promptly detect and notify the security team or administrator of abnormal events. Additionally, a comprehensive emergency response plan is formulated, outlining processing procedures, responsible personnel, communication channels, etc., to address security incidents promptly. Data are backed up regularly, and a data-recovery mechanism is established to mitigate potential data leakage or damage.
Furthermore, investigations and audits are conducted following security incidents to analyze their causes and impacts. Corresponding measures are then implemented to reinforce security protection based on the findings of these investigations.
The specific process of this electric power business data interaction and leakage prevention method is shown in Algorithm 1.
Algorithm 1 Method for preventing leakage of electric power business data interaction
Input: electric power business data
Output: sensitive information of electric power business data
1: Divide traffic data according to fields and perform preprocessing operations on the data
2: Determine the rule set $R$ for each specific user
3: Use regular expression rules to match structured power business sensitive data
4: Use the pre-trained DeBERTa model to extract text feature vectors $F$
5: $Q_c = HW_{q,c},\ K_c = HW_{k,c},\ V_c = HW_{v,c},\ Q_r = PW_{q,r},\ K_r = PW_{k,r}$
6: Calculate disentangled attention: $A_{c \to c} = Q_c K_c^{T}$
7: for $i = 0, \ldots, N-1$ do
8:  $\tilde{A}_{c \to p}[i,:] = Q_c[i,:] K_r^{T}$
9: end for
10: for $i = 0, \ldots, N-1$ do
11:  for $j = 0, \ldots, N-1$ do
12:   $A_{c \to p}[i,j] = \tilde{A}_{c \to p}[i, \delta(i,j)]$
13:  end for
14: end for
15: for $j = 0, \ldots, N-1$ do
16:  $\tilde{A}_{p \to c}[j,:] = K_c[j,:] Q_r^{T}$
17: end for
18: for $i = 0, \ldots, N-1$ do
19:  for $j = 0, \ldots, N-1$ do
20:   $A_{p \to c}[i,j] = \tilde{A}_{p \to c}[j, \delta(j,i)]$
21:  end for
22: end for
23: $\tilde{A} = A_{c \to c} + A_{c \to p} + A_{p \to c}$; $S = \text{softmax}(\tilde{A}/\sqrt{3d})\, V_c$
24: Use BiLSTM model to capture contextual information in input sequences $S$
25: $s(X, y) = \sum_{i=1}^{n} P_{i, y_i} + \sum_{i=0}^{n} A_{y_i, y_{i+1}}$
26: $y^{*} = \arg\max_{y} s(X, y)$
27: Carry out alarm and blocking operations for sensitive power business data
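For reference, a condensed PyTorch sketch of one way the DeBERTa-BiLSTM-CRF pipeline could be assembled, assuming the Hugging Face transformers and pytorch-crf packages; the checkpoint name, LSTM hidden size, and tag count are illustrative placeholders rather than the exact configuration used in this work:

```python
import torch.nn as nn
from transformers import AutoModel           # Hugging Face transformers
from torchcrf import CRF                     # pytorch-crf package

class DebertaBiLstmCrf(nn.Module):
    def __init__(self, num_tags: int, lstm_hidden: int = 256):
        super().__init__()
        # DeBERTa layer: pre-trained encoder (checkpoint name is illustrative).
        self.encoder = AutoModel.from_pretrained("microsoft/deberta-base")
        # BiLSTM layer: forward and backward context over encoder outputs.
        self.bilstm = nn.LSTM(
            input_size=self.encoder.config.hidden_size,
            hidden_size=lstm_hidden,
            batch_first=True,
            bidirectional=True,
        )
        self.classifier = nn.Linear(2 * lstm_hidden, num_tags)  # emission scores
        # CRF layer: learns tag transition scores and decodes the best path.
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, tags=None):
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        features, _ = self.bilstm(hidden)
        emissions = self.classifier(features)
        mask = attention_mask.bool()
        if tags is not None:                   # training: negative log-likelihood
            return -self.crf(emissions, tags, mask=mask)
        return self.crf.decode(emissions, mask=mask)  # inference: best tag paths
```

Tokens decoded as sensitive entity tags would then be passed to the desensitization, alerting, and interception steps described above.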