Article

Expert System for Extracting Hidden Information from Electronic Documents during Outgoing Control

1 Institute of Automation, Beijing Information Science and Technology University, Beijing 100192, China
2 Institute of Information Management, Beijing Information Science and Technology University, Beijing 100192, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(15), 2924; https://doi.org/10.3390/electronics13152924
Submission received: 12 June 2024 / Revised: 21 July 2024 / Accepted: 23 July 2024 / Published: 24 July 2024
(This article belongs to the Special Issue Knowledge Engineering and Data Mining Volume II)

Abstract

For confidential and sensitive electronic documents within enterprises and organizations, failure to conduct proper checks before sending can easily lead to security incidents such as the leakage of classified content. The transmission of sensitive information has become one of the main channels of internal data leakage. However, existing methods and systems cannot extract hidden data and do not support mining the binary structure of hidden information in files. In this paper, an expert system for mining hidden information in electronic documents is designed for various office documents, compressed files, and image files. The system can quickly mine more than 40 common types of electronic documents for various forms of concealment, such as file type tampering, encryption concealment, structure concealment, and redundant data concealment, and extract the hidden information. Feature information in the binary structure of each document type is extracted to form a feature information base, from which an expert knowledge base is constructed. Finally, a hidden information mining engine is designed on top of the knowledge base to realize security control of outgoing files, with good extensibility and ease of integration. By scanning outgoing documents for the sensitive information they contain, classified content can be identified effectively, preventing data leakage by technical means while also facilitating forensics. Actual test results show that the system can quickly mine various means used to conceal information, extract the corresponding hidden data, and provide a fast, practical diagnostic approach for outgoing control of electronic documents.

1. Introduction

The issue of data security protection in the era of big data has become increasingly prominent. In the field of information security, common means of data theft often involve using “attacks” to breach the protection barrier from outside, such as password cracking and searching for system vulnerabilities [1], thus gaining direct access to sensitive data. Some enterprises with a strong awareness of data protection have incorporated fine-grained data classification into their security measures. However, many enterprises still face various problems, including simple data protection strategies, incomplete protective measures, low levels of protection, and potential data leakage [2]. Most instances of enterprise information security breaches occur within the system itself [3]. Documents without obvious confidential markings can easily bypass regulatory measures and circulate freely both inside and outside the enterprise, ultimately leading to unauthorized disclosure [4]. Implementing control over all documents at the terminal is an effective way to prevent this type of leakage. Data leakage prevention measures include physical isolation techniques such as isolating networks or setting up network gates [5]; prohibiting private storage devices from entering or leaving; and implementing encryption for sensitive data storage, access, and operations [6]. Another important measure for preventing data leakage is channel control through sensitive content identification technology. This involves identifying documents containing sensitive information and establishing appropriate control measures to close potential loopholes [7]. A comprehensive solution for preventing data leakage based on sensitive content identification includes scanning all outgoing documents sent by terminals to check for sensitive information and then taking relevant preset actions such as warnings or blocking if necessary [8].
Steganography [9] is a technique used to conceal secret information within normal files. It is often employed for covertly transmitting information, which poses significant challenges to information security and stands as one of the primary threats to data leakage protection.
While text can be utilized as a medium for steganography, text steganography primarily involves altering the content of the text. This can be achieved through methods such as line or word shifting [10], synonym substitution [11], word abbreviation substitution [12], and word spelling substitution [13]. Changhao Ding et al. proposed a joint linguistic steganography method that combined conditional generative steganography and substitution-based steganography, using a pre-trained BERT model to embed secret information [14]. Yu, L. et al. proposed a steganography method based on multiple time steps, which inherited the decoding advantages of fixed-length encoding and improved text quality by integrating multiple time steps to select words that can both carry secret information and conform to the statistical distribution [15]. In addition to text, data in other formats can also be the target of steganography; however, most of these techniques are extensions of multimedia and text-based steganography. Liu, G. et al. designed a steganographic cost function based on the statistical distribution of JPEG images in the spatial domain and proposed a high-security steganography for JPEG images using a "microscope" that enhances the texture regions of JPEG images in the spatial domain [16]. Embedded steganography schemes conceal secret information by altering the carrier itself. This modification inevitably changes the statistical characteristics of the carrier, making it difficult to evade detection by various steganalysis tools. In contrast, generative steganography schemes directly generate stego vectors from the secret information, which offers superior anti-detection performance compared to embedded schemes. Sultan, B. et al. used generative adversarial networks (GANs) in combination with color models to hide secret data, evaluating the quality of steganographic images by three metrics: capacity, security, and concealment [17].
In recent years, research on steganography has advanced rapidly. Concurrently, steganalysis and methods for defeating steganographic techniques have also developed quickly, and both areas have become focal points of information security research. Currently, most technologies for analyzing hidden information in electronic documents detect secret data by identifying statistical characteristics within the file or by utilizing artificial intelligence algorithms such as feature extraction [18] and classification [19]. Steganographic detection methods for text files are mainly based on formatted text information [20] and natural language analysis [21]. Xue, Y. et al. proposed an innovative text steganalysis method based on hierarchical supervised learning and a dual attention mechanism. They introduced a dual attention mechanism that dynamically fuses semantic and statistical features, thereby creating the discriminative feature representations essential for text steganalysis [22].
The binary formats of different types of electronic documents vary greatly, and conveying sensitive information through data hiding in electronic documents has become one of the main means of internal data leakage. Existing methods and systems are generally based on probabilistic judgment and cannot extract the hidden data. In order to enhance the ability to prevent illegal document leakage, this paper is dedicated to constructing a preliminary but comprehensive expert system for document outgoing control. The remainder of this paper is structured as follows: Section 2 presents the design of the expert system for mining hidden information in electronic documents, which comprises a feature information base, rule base, system mining engine, permission tree, and human–computer interface. In Section 3, the knowledge base of the expert system is designed, and features are extracted to form the feature information base. File permissions are determined according to rights management rules so that specified users receive the permissions necessary for document use, fully protecting document security. In Section 4, a sensitive word matching algorithm based on the AC automaton performs multi-pattern matching with hierarchical management of keywords across multiple risk levels to classify files by security level. The hidden information mining engine is designed using knowledge from the knowledge base in Section 5. In Section 6, a test dataset for electronic document steganography checking is created by steganography injection to verify the effectiveness of the proposed expert system. The test results confirm its advantages in electronic document steganography checking and outgoing permission management control. Finally, Section 7 presents conclusions and outlines future work.
Thus, an expert system for mining hidden information in electronic documents is designed for various office documents, compressed files, and image files. This system can quickly mine more than 40 common types of electronic documents for various forms of concealment, such as file type tampering, encryption concealment, structure concealment, and redundant data concealment, and extract the hidden information. The benefit of this work is to enable automatic batch inspection of electronic documents containing sensitive information while implementing protection measures, without itself constituting a form of information leakage.

2. Structure Design of the Expert System

The text content check is designed to safeguard the security of important documents. Utilizing advanced content recognition technology, it analyzes outgoing documents for sensitive and confidential information, providing a precise assessment of disclosure risk levels to accurately identify violations within a large volume of outgoing documents. The expert system for mining hidden information in electronic documents outlined in this paper consists of a feature information base, rule base, system mining engine, permission tree, and human–computer interface. In addition to evaluating the presence of sensitive words in outgoing files, the system also supports structural checks on outgoing documents to prevent the use of binary structures for embedding other binary content into ordinary files as a means to conceal secret files and evade inspection methods.
Therefore, the electronic document hidden information check is divided into two parts: security analysis for file structure and sensitivity classification for text content. The security analysis of file structure involves targeted passive attack steganography technology [23]. Its purpose is to identify hidden channels that may be used for information hiding by analyzing the structure of various commonly used document files and then designing the corresponding analysis method. Due to the complicated and diverse nature of file formats, a deep analysis of the binary format of electronic documents is conducted, along with an examination of the format of various types of files. Subsequently, binary structure features [24] are extracted based on format analysis, followed by a detailed analysis of various data hiding methods and the construction of a knowledge base [25] for electronic document hidden information. Finally, this knowledge base is utilized to design an electronic document hidden information mining engine that displays its results based on a display template. The system structure diagram can be seen in Figure 1.
There are four steps in applying the expert system for mining electronic document hidden information to document outgoing control.
Step 1: Electronic document preprocessing.
Binary format analysis technology for electronic documents is utilized to analyze various common file types, including compressed files, image files, and office documents. This includes the examination of the content and structure of the file header, the analysis of the structure of the file data block, and the characteristics of the file tail. By parsing the binary format of electronic documents, potential locations for data hiding are also analyzed.
Step 2: Construction of the electronic document hidden information knowledge base.
After conducting a detailed analysis of the binary format of common electronic document types, the feature information base is established based on the binary structure features of various file types. This includes the examination of file headers, data blocks, file tails, and file organization modes. Using this rule base, the feature information of the document is verified, and the location of any hidden data is analyzed.
Step 3: Design of the electronic document hidden information inference engine.
Based on the knowledge base of hidden information in electronic documents, an inference engine for electronic document hidden information is developed. This engine utilizes an information mining rule-based matching inference algorithm to quickly identify and extract hidden information. Additionally, a fast keyword search algorithm is designed to efficiently retrieve a large number of keywords within high-capacity documents and promptly display the results of mining hidden information. Furthermore, a rapid character replacement algorithm for keywords is implemented to decrypt sensitive information within documents. The use of XML format for result representation facilitates the display of results for secondary development in the mining system, providing excellent expandability.
Step 4: Document permission management and outgoing control.
Users have the capability to transmit documents from a terminal computer to other terminals or external devices, known as outgoing operations. These operations include email, printing, file transfer, and web disk sharing. This section provides details on how to establish the policy for intercepting outgoing file transmissions and sending them to expert systems for inspection. The expert system assesses the file type based on its binary structure, enforces the relevant inspection policy according to document permission management rules, and triggers alarms when intercepting outgoing files.

3. Knowledge Base Design of the Expert System

For a given document file, the system traverses all feature information bases supported by the file type, uses specific feature extraction methods to compile the feature data, and employs specific feature detection methods to verify whether the features adhere to the specified rules. Ultimately, based on the file's classification level and the permission management rules, the appropriate file permissions are determined. This not only grants the necessary permissions to specified users for accessing the document but also ensures comprehensive security measures for document usage.
The knowledge base comprises various file checking classes, including file format filtering, file size verification, encryption validation, file integrity inspection, keyword examination, merged file assessment, compressed file scrutiny, NTFS (New Technology File System) data stream analysis [26], and Office document evaluation. Among these classes, the keyword check knowledge base supports functions such as keyword input, keyword library management, keyword validation, and keyword cleansing. Additionally, it facilitates the conversion between ASCII characters and Unicode characters [27]. The following diagram in Figure 2 illustrates the structure of the knowledge base along with a description of the fifteen file checking classes within the knowledge base.
  • Class 1: Scanner class
The Scanner class completes the related scanning functions, supporting both asynchronous and synchronous scanning modes, cleaning functions, and achieving specific check functions through the subclass. In asynchronous scanning mode, it provides functions for starting, pausing, continuing, and stopping the scanning program. Synchronous scanning mode offers functions such as scanning programs and querying scanning status.
  • Class 2: Permission class
The Permission class offers functions for managing permissions. It determines the security level of a document based on scan results and sets the user’s permissions for various operations on the document according to the current user’s identity.
  • Class 3: ReportGenerator class
The ReportGenerator class contains the function for generating reports, which can produce a scan result report based on the specified output path, content, and template. It supports HTML and XML output report templates and is capable of sorting the result content.
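As a sketch of how such a report generator might emit a sorted XML result file, consider the following Python fragment. The element names and result fields (`ScanReport`, `file`, `check`, `level`, `detail`) are hypothetical illustrations, not the system's actual template schema:

```python
import xml.etree.ElementTree as ET

def generate_report(results, path):
    """Write scan results as an XML report to `path`.

    `results` is a list of dicts with hypothetical keys:
    file, check, level, detail."""
    root = ET.Element("ScanReport")
    # Sort results by risk level, highest first, mirroring the result
    # sorting the ReportGenerator class supports.
    for r in sorted(results, key=lambda r: r["level"], reverse=True):
        item = ET.SubElement(root, "Result", file=r["file"], check=r["check"])
        item.set("level", str(r["level"]))
        item.text = r["detail"]
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)
```

Because the output is plain XML, downstream tools can parse the report for secondary development, which is the expandability benefit the text describes.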
  • Class 4: ConfigManager class
The ConfigManager class allows users to read the configuration file, parse its contents one by one, set software functions individually, and modify the configuration file based on user settings.
  • Class 5: FileIterator class
The FileIterator class implements the functionality of iteratively reading files. When scanning a path or a compressed package containing multiple levels of folders and files, the system needs to read the files iteratively. Once the file iteration reading function is completed, the path or compressed package is parsed, and the files within are found layer by layer. The path and file name of each file are recorded, generating a list of file information for subsequent scanning classes.
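The layer-by-layer traversal can be sketched with Python's standard library; this is only the directory-walking half of the behavior described, since the real FileIterator also descends into compressed packages via the UnZip classes:

```python
import os

def iterate_files(root_path):
    """Yield (directory, filename) pairs for every file under root_path,
    descending into subfolders layer by layer, so that later scanning
    classes receive a complete file list."""
    for dirpath, _dirnames, filenames in os.walk(root_path):
        for name in filenames:
            yield dirpath, name
```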
  • Class 6: UnPack/UnRAR/UnZip/TarFile class
The UnPack, UnRAR, UnZip, and TarFile classes belong to the UnZip family, which is responsible for compression and decompression. When dealing with compressed files in various formats, these classes decompress the package and retrieve a list of the files within it. A specific file can be decompressed by its serial number and saved to a designated directory for subsequent scanning by other classes.
  • Class 7: CmdRunner class
The CmdRunner class runs external add-in programs. It constructs the corresponding command content based on the path and interface of the add-in program, then invokes and runs it to achieve various functions. External programs include compression and decompression programs such as 7z; file parsing and format conversion tools, such as RTF to TXT, PDF to TXT, Office file parsing, and CHM file parsing; and OCR programs, such as Tesseract-OCR [28].
  • Class 8: FileTypeFilter class
The FileTypeFilter class implements file format filtering.
  • Class 9: FileSizeFilter class
The FileSizeFilter class implements file size filtering.
  • Class 10: FileChecker class
The FileChecker class is the base class for all checkers below, providing basic functions such as parameter setting, document scanning, result generation, etc.
  • Class 11: EncryptedChecker class
The EncryptedChecker class checks whether the file is encrypted.
  • Class 12: NTFSChecker class
The NTFS data stream is a feature of the NTFS disk format that is commonly utilized to conceal data. Because of its hidden nature, nothing unusual is observed when reading, writing, copying, or viewing the properties of files carrying an NTFS data stream. The NTFSChecker class identifies and extracts the NTFS data stream to a separate file, thereby detecting this hidden flow of data.
  • Class 13: MergedFileChecker class
By using the Windows command “copy file1 + file2 + ... file3”, multiple files can be combined into one. When the merged file is opened, only the first file is accessible, while the remaining files are disregarded. The MergedFileChecker class is responsible for identifying such merged files and separating any redundant data at the end of the file in order to detect any hidden content within the merged file.
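For a JPEG carrier, such a check can be sketched as follows. This is a simplification for illustration, not the MergedFileChecker's actual logic: a real parser would walk the JPEG segment structure rather than scan for the marker bytes, since an end-of-image sequence can also occur inside embedded thumbnails.

```python
def find_trailing_data(jpeg_bytes: bytes) -> bytes:
    """Return bytes appended after the first JPEG end-of-image marker
    (FF D9). A file produced with `copy image.jpg + secret.zip out.jpg`
    still opens as the first image but carries the second file after EOI."""
    idx = jpeg_bytes.find(b"\xff\xd9")
    if idx == -1:
        return b""                 # no EOI marker found; nothing to report
    return jpeg_bytes[idx + 2:]    # everything past EOI is redundant data
```

A non-empty return value is the signal to separate the redundant tail and scan it as its own file.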
  • Class 14: OfficeChecker class
The OfficeChecker class parses the format of Office files, extracts data from Office documents, and scans various data in Office documents.
  • Class 15: KeywordsChecker class
The KeywordsChecker class conducts a rapid scan for sensitive words within the file by examining each byte in the target file for a predetermined sensitive word.

3.1. Feature Information Base

The primary function of the feature information base is to facilitate the integration of different feature information from documents in different formats. This allows for a unified format and attributes for the extraction of these features and rule judgment information. This paper presents the design and implementation of a feature information database for uncovering hidden information within electronic documents. The feature information database encompasses the expression of hidden data features, supported file types, preprocessing of file binary data, extraction of hidden data, methods for saving extraction results, functional settings, detection methods for hidden data, etc. The feature information base, as shown in Table 1, adopts an object-oriented structure and is implemented in the following ways:
The content of the object-oriented feature information base includes four kinds of feature information: the feature information on file type, the feature information on document structure, the feature information on document content, and the feature information on document permissions. The following is a description of these four kinds of feature information.
(1)
Feature information on file type
The feature information on the file type includes its file suffixes and file sizes. The file suffixes are whitelisted, meaning that only files matching the enumerated file types in the whitelist can be approved. Simultaneously, it is important to check for any tampering with the file type and ensure that the actual content of the file matches its suffix name. Any manual changes to the suffix name of a file will result in it being reported as suspicious.
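A minimal sketch of the suffix-versus-content check, assuming a small magic-number table (a real feature base would cover every whitelisted type, and note that Office Open XML formats such as .docx are themselves ZIP containers):

```python
import os

# Hypothetical subset of a file-signature table for illustration.
MAGIC = {
    ".png": b"\x89PNG\r\n\x1a\n",
    ".jpg": b"\xff\xd8\xff",
    ".zip": b"PK\x03\x04",
    ".pdf": b"%PDF",
}

def suffix_matches_content(filename: str, header: bytes) -> bool:
    """Check that the file's actual header bytes match its suffix.

    Returns False (suspicious, i.e., a possibly tampered file type) when
    the suffix claims a known type but the content does not start with
    that type's signature."""
    suffix = os.path.splitext(filename)[1].lower()
    signature = MAGIC.get(suffix)
    if signature is None:
        return True  # unknown suffix: leave the decision to the whitelist rule
    return header.startswith(signature)
```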
(2)
Feature information on document structure
By analyzing the format and structural features of various electronic documents, we extract the structural features of different types of electronic files and process them accordingly. The extracted features include the binary format header of the file, description of the file data block, and tail features of the file. These provide a solid foundation for subsequent analysis and mining hidden information. The following Figure 3 illustrates the binary structure of the document.
(a)
The header information in binary format typically includes the file type, feature description, file length verification information, file content description, and the start address of the data block.
(b)
The file data block provides information about each individual data block within a file.
(c)
The tail feature of a file includes the end identifier of the file and a description of the tail feature.
(3)
Feature information on document content
According to the custom thesaurus, the system identifies and retrieves sensitive words in the document and extracts the document content from any hidden data. There are various means of data hiding, the most common including multi-layer compressed packages, encryption hiding, structure hiding, redundant data hiding, embedding hiding, data flow hiding, metadata hiding, self-hiding, and image content hiding. Additionally, determining the file type and level based on sensitive words provides a basis for document permission management.
(4)
Feature information on document permissions
Control the access permissions of protected files, establish the corresponding document access levels based on the user's permissions, and manage the user's operations, including reading, copying, printing, modifying, saving, and sending documents. A user authorized to use a document must hold at least read-only permission. The user's permission information can be categorized into seven permissions based on these actions.
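One natural encoding of such per-user permissions is a bit-flag set, sketched below. The flag names are hypothetical (the source names read, copy, print, modify, save, and send, without specifying the seventh permission or the exact encoding):

```python
from enum import IntFlag

class DocPermission(IntFlag):
    """Hypothetical bit-flag encoding of the document operations named
    in the text; names and values are illustrative only."""
    NONE   = 0
    READ   = 1
    COPY   = 2
    PRINT  = 4
    MODIFY = 8
    SAVE   = 16
    SEND   = 32

def normalize(perms: DocPermission) -> DocPermission:
    """An authorized user must hold at least read-only permission,
    so READ is always included in a granted permission set."""
    return perms | DocPermission.READ
```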

3.2. Rule Base for Mining Hidden Information

Specific detection rules are utilized to determine whether the features meet the specified criteria, such as the expression of the file format filtering rule, the expression of the file size filtering rule, and the files’ formats check rule. If certain conditions are met, the document is classified as a sensitive file. In this paper, production rules [29] are used to represent feature detection rules. What follows is a full introduction to the feature detection rules.
(1)
The expression of the file format filtering rule realized by the FileTypeFilter class is:
IF file suffixes belong to the whitelist
THEN this document is considered a supported file.
The expert system employs a whitelist filtering policy to verify the file formats in the whitelist. As research and development work continues in the future, the whitelist can be continuously expanded. The whitelist includes:
(a)
Office documents: .doc, .docx, .xls, .xlsx, .ppt, and .pptx.
(b)
Image files: .jpg, .jpeg, .bmp, .png, .tif, and .gif.
(c)
Compressed packages: .zip, .rar, .7z, .tar, and .gz.
(d)
Other types: .pdf, .txt, .chm, .cpp, .csv, .java, .apk, .mf, .sf, .rsa, .arsc, .exe, and .dll.
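The whitelist filtering rule above is straightforward to express in code; a minimal sketch using the suffixes enumerated in the whitelist:

```python
import os

# Suffix whitelist from the rule above; extendable as the system grows.
WHITELIST = {
    ".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx",      # Office documents
    ".jpg", ".jpeg", ".bmp", ".png", ".tif", ".gif",        # image files
    ".zip", ".rar", ".7z", ".tar", ".gz",                   # compressed packages
    ".pdf", ".txt", ".chm", ".cpp", ".csv", ".java",
    ".apk", ".mf", ".sf", ".rsa", ".arsc", ".exe", ".dll",  # other types
}

def is_supported(filename: str) -> bool:
    """IF the file suffix belongs to the whitelist,
    THEN the document is considered a supported file."""
    return os.path.splitext(filename)[1].lower() in WHITELIST
```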
(2)
The expression of the file size filtering rule realized by FileSizeFilter class is:
IF the file size is smaller than the maximum value
THEN this document is a supported file.
The expert system utilizes a file size filtering policy to identify oversized files and designate them as sensitive documents.
(3)
The file format check rule, realized by the FileChecker class, works as follows:
Based on a comprehensive understanding of document formats and structures, corresponding mining methods are designed. With the rapid advancement of computer technology, electronic document formats are freely defined and numerous, and many systems have their own specific formats. Therefore, the knowledge base uses both format-checking and content-checking strategies. The main prevailing electronic document formats are shown in Table 2.
Based on the file format, this article establishes the following checking rules:
(a)
Develop validation rules for nested files, including but not limited to: drag and drop embedding, creating from a file, a new embedded file, etc.
(b)
Develop a multi-layer document hiding system with nested detection rules. For ZIP and RAR packages, the system will decompress all files and conduct a comprehensive security scan. If the decompressed file still contains a compressed package, it will continue to decompress layer by layer until all subfiles in all compressed packages have been scanned to ensure that no files are overlooked.
(c)
Design guidelines for verifying the endpoints of files, such as docx, xlsx, pptx, and other file formats, to determine whether there are any redundant Office documents at the end of the file.
(d)
Develop inspection rules for Object Linking and Embedding (OLE) embedded objects [30], analyze the binary stream file of Office documents, implement analysis and inspection of OLE embedded objects, and extract information from embedded objects.
(e)
Design NTFS data flow detection rules. The NTFS data stream is a feature of the NTFS disk format commonly utilized for concealing data and serving as a highly covert method of hiding information. Files containing NTFS data streams can be read, written, or copied, and even viewing the file properties will not raise any exceptions. These rules aim to identify and extract NTFS data streams into separate files.
(f)
Develop check rules for metadata entrainment [31]. Verify the metadata entrainment behavior of common office documents such as doc, ppt, docx, xlsx, and pptx. This is necessary because files with modifiable metadata in their properties can be utilized to conceal data.
(g)
Develop the guidelines for checking file splicing, identifying header and tail splicing across different files, detecting files merged using the Windows command copy, and removing redundant data from the file tail.
(h)
Develop rules for file encryption recognition to determine whether a document is encrypted. Office documents, RAR and ZIP packages, and PDF documents all support the use of document passwords, which may indicate an intentional effort to conceal content.
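The layer-by-layer decompression described in rule (b) can be sketched for ZIP containers with Python's standard library. This handles only the ZIP case; RAR and 7z packages would be delegated to external tools such as 7z, as the CmdRunner class does:

```python
import io
import zipfile

def scan_zip_recursively(data: bytes, scan, prefix=""):
    """Decompress a ZIP held in memory layer by layer, calling
    scan(path, payload) on every non-archive member. Nested ZIPs are
    descended into until no compressed packages remain, so that no
    subfile is overlooked."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            payload = zf.read(info)
            path = prefix + info.filename
            if info.filename.lower().endswith(".zip"):
                # A compressed package inside the package: recurse.
                scan_zip_recursively(payload, scan, prefix=path + "!/")
            else:
                scan(path, payload)
```

The `scan` callback stands in for the other checker classes, which would receive each extracted member for keyword and structure inspection.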

3.3. Rights Management Rules

Files that meet specific criteria can be directly classified as sensitive files. The level of secrecy and category of the files can be specified based on the sensitive words contained in the documents, thereby enhancing the accuracy of interception alarms. Additionally, permission information regarding sensitive files is recorded, and an alarm is triggered when these files are transmitted.
In terms of file classification, the level of confidentiality (e.g., “top secret”, “confidential”, “secret”, “general”) represents the degree of protection for the file. Simultaneously, it also designates the user’s clearance level for accessing the file (e.g., “ordinary user”, “intermediate user”, “advanced user”). The file’s classification level should be combined with the user’s clearance level to create a two-dimensional permission table. This table is then used to determine which users have permission to access encrypted documents. Table 3 illustrates this two-dimensional permission table formed by both file and user levels.
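A lookup against such a two-dimensional permission table might be sketched as below. The concrete entries of Table 3 are site policy and are not reproduced here; the clearance ceilings in `MAX_FILE_FOR_USER` are illustrative assumptions:

```python
# File confidentiality levels in ascending order of protection.
FILE_LEVELS = ["general", "secret", "confidential", "top secret"]

# Hypothetical mapping of each user clearance level to the highest
# file level it may open; real entries come from Table 3.
MAX_FILE_FOR_USER = {
    "ordinary user": "general",
    "intermediate user": "secret",
    "advanced user": "top secret",
}

def may_access(user_level: str, file_level: str) -> bool:
    """Consult the two-dimensional permission table: access is granted
    when the file's rank does not exceed the user's clearance ceiling."""
    ceiling = MAX_FILE_FOR_USER[user_level]
    return FILE_LEVELS.index(file_level) <= FILE_LEVELS.index(ceiling)
```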

4. Document Classification Based on Keywords

In addition to simple text formats, the expert system will search the text content of Office, PDF, RAR, ZIP, and other files to determine whether they contain sensitive keywords. The expert system will then classify their security levels. Additionally, the expert system supports hierarchical management of keywords and provides multiple risk-level keywords. It also generates a risk level report for each file based on the search results. Images may also contain text content. The expert system utilizes OCR technology to quickly and accurately identify Chinese and English characters in images and conduct keyword retrieval. Representative words are extracted from documents and grouped together with related vocabulary categories, levels, match requirements, and occurrence times to construct a sensitive word base. The expert system supports ten keyword levels (level 1 to level 10), with increasing alert levels using keyword hierarchical management and scanning rules.
Electronic documents typically store text information in three formats: Unicode, UTF-8, and ANSI codes, which require transcoding during the search process. Firstly, the expert system analyzes the binary storage format of the electronic document being searched and matches it with the encoding type of the keywords. If an image format is encountered, OCR content recognition with Tesseract is required to extract the text content. A quick search is then performed against the sensitive word base using the Aho–Corasick (AC) automaton algorithm [32]. Finally, the security level of the document is determined according to the search results. The specific implementation process is shown in Figure 4.
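The transcoding step can be sketched as follows. The byte-order-mark checks and the GBK fallback are illustrative assumptions (the source does not specify which ANSI code page is used, but GBK is plausible given the Chinese/English keyword targets):

```python
def decode_text(raw: bytes) -> str:
    """Transcode document bytes to one representation before keyword
    matching. BOMs identify UTF-16 ("Unicode") and UTF-8 files;
    otherwise UTF-8 is tried, then a legacy ANSI code page."""
    if raw.startswith(b"\xff\xfe") or raw.startswith(b"\xfe\xff"):
        return raw.decode("utf-16")            # Unicode with BOM
    if raw.startswith(b"\xef\xbb\xbf"):
        return raw.decode("utf-8-sig")         # UTF-8 with BOM
    try:
        return raw.decode("utf-8")             # BOM-less UTF-8 / ASCII
    except UnicodeDecodeError:
        return raw.decode("gbk", errors="replace")  # ANSI fallback
```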

4.1. Algorithm of Document Sensitivity Classification

The system assigns each sensitive word a weight derived from the documents that contain it. When configuring sensitive words, users may enter each word’s sensitivity level manually or let the system set it adaptively from a user-supplied training set. A sensitive word’s weight grows with its frequency within documents but shrinks with the number of training documents in which it occurs. For a sensitive word $s$, let $N$ be the total number of documents in the training set, $n$ the number of training documents containing $s$, and $tf_i^s$ the number of occurrences of $s$ in documents with security level $i$. The word’s weight is then calculated as follows:
$$w_s = \max\left(tf_1^s, tf_2^s, \ldots, tf_i^s\right) \cdot \log\left(\frac{N}{n} + 1\right)$$
When evaluating documents, let $r$ denote the sensitivity level of a sensitive word $s$, $w_r$ its weight, and $f(s, d)$ the number of times $s$ is detected in document $d$. The weighted score of each sensitive word $s$ in document $d$ is then calculated as:
$$S(s, d) = w_r \times f(s, d)$$
The weighted scores of all sensitive words in document $d$ are summed to obtain the total weighted sensitivity score $T(d)$; the security level of $d$ is then determined from the value of $T(d)$.
$$T(d) = \sum_{s \,\in\, \text{sensitive words in } d} S(s, d)$$
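A minimal sketch of the three formulas above, assuming per-level term frequencies have already been counted and reading the IDF-style factor as $\log(N/n + 1)$:

```cpp
#include <algorithm>
#include <cmath>
#include <map>
#include <string>
#include <vector>

// Weight of sensitive word s: the maximum term frequency across security
// levels, scaled by an IDF-style factor log(N/n + 1).
// tfPerLevel[i] holds the occurrences of s in training documents of level i.
double wordWeight(const std::vector<int>& tfPerLevel, int totalDocs,
                  int docsWithWord) {
    int maxTf = *std::max_element(tfPerLevel.begin(), tfPerLevel.end());
    return maxTf * std::log((double)totalDocs / docsWithWord + 1.0);
}

// T(d): sum over detected words of w_r * f(s, d), where `weights` maps a
// word to its level weight and `counts` to its detection count in d.
double documentScore(const std::map<std::string, double>& weights,
                     const std::map<std::string, int>& counts) {
    double total = 0.0;
    for (const auto& wc : counts) {
        auto it = weights.find(wc.first);
        if (it != weights.end()) total += it->second * wc.second;
    }
    return total;
}
```

Thresholds on the returned score would then map $T(d)$ to a document security level.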

4.2. Sensitive Word Matching Algorithm Based on the AC Automaton

The primary advantage of the sensitive word-matching algorithm based on the AC automaton is that it completes multi-pattern matching of text in linear time. Compared with traditional single-pattern string-matching algorithms, the AC automaton offers markedly higher efficiency.
The system loads sensitive words and their levels from the thesaurus file and inserts every sensitive word into a dictionary tree (trie). A trie is a tree-shaped data structure in which each node represents a character: the root is empty, and the path from the root to a node spells a string prefix. The system can dynamically add new sensitive words and levels to the thesaurus file and update the AC automaton accordingly. To keep sensitive words unique, the program first checks whether a word already exists; only if it does not is the word, with its level, appended to the thesaurus file, after which the trie and its failure pointers are updated, avoiding repeated additions. The failure pointer indicates the node to jump to when a match fails, allowing the algorithm to continue quickly after a mismatch.
During sensitive word matching, the trie nodes are traversed starting from the first character of the text. If a character matches, the next character is examined; if a match fails, the failure pointer is followed and matching continues from there. Whenever a node with a non-empty output list is reached, a sensitive word has been detected, and its location and content are returned with the match results.
Suppose the sensitive word base includes the first-level sensitive words abcdb and db, and the second-level sensitive word cdbe. The AC automaton constructed from these three keywords is shown in Figure 5; the numbers in the node circles represent the matching results of sensitive words, and the red arrows represent the failure pointers.
The AC automaton is highly efficient when handling a large number of patterns and matches multiple sensitive words simultaneously. The matching process is linear, with time complexity O(n + m + z), where n is the length of the text, m is the total length of all sensitive words in the set, and z is the number of matched sensitive words.
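The construction and matching described above can be sketched as follows, using the three example keywords from Figure 5. This is a minimal illustration; the production system also tracks keyword levels and supports dynamic thesaurus updates.

```cpp
#include <map>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// A minimal Aho-Corasick automaton: a trie whose nodes carry failure
// pointers, built once and then run over the text in a single pass.
struct AcAutomaton {
    struct Node {
        std::map<char, int> next;   // trie edges
        int fail = 0;               // failure pointer (node index)
        std::vector<int> out;       // patterns ending at (or via fail) here
    };
    std::vector<Node> nodes;
    std::vector<std::string> patterns;

    AcAutomaton() : nodes(1) {}     // node 0 is the empty root

    void add(const std::string& word) {
        int cur = 0;
        for (char c : word) {
            if (!nodes[cur].next.count(c)) {
                nodes.emplace_back();
                nodes[cur].next[c] = (int)nodes.size() - 1;
            }
            cur = nodes[cur].next[c];
        }
        nodes[cur].out.push_back((int)patterns.size());
        patterns.push_back(word);
    }

    // Breadth-first construction of failure pointers; each node inherits
    // the output list reachable through its failure pointer.
    void build() {
        std::queue<int> q;
        for (auto& kv : nodes[0].next) q.push(kv.second);
        while (!q.empty()) {
            int u = q.front(); q.pop();
            for (auto& kv : nodes[u].next) {
                char c = kv.first;
                int v = kv.second;
                int f = nodes[u].fail;
                while (f && !nodes[f].next.count(c)) f = nodes[f].fail;
                nodes[v].fail =
                    nodes[f].next.count(c) && nodes[f].next[c] != v
                        ? nodes[f].next[c] : 0;
                for (int p : nodes[nodes[v].fail].out)
                    nodes[v].out.push_back(p);
                q.push(v);
            }
        }
    }

    // Single left-to-right pass; returns (end index, word) for every hit.
    std::vector<std::pair<int, std::string>> matchAll(const std::string& text) const {
        std::vector<std::pair<int, std::string>> hits;
        int cur = 0;
        for (int i = 0; i < (int)text.size(); ++i) {
            char c = text[i];
            while (cur && !nodes[cur].next.count(c)) cur = nodes[cur].fail;
            auto it = nodes[cur].next.find(c);
            cur = it != nodes[cur].next.end() ? it->second : 0;
            for (int p : nodes[cur].out) hits.emplace_back(i, patterns[p]);
        }
        return hits;
    }
};
```

Scanning the text "abcdbe" against {abcdb, db, cdbe} reports all three words in one pass, including the overlapping abcdb and db ending at the same position.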

5. Inference Engine for Concealed Information

The inference engine for extracting hidden information from electronic documents is designed to carry out mining tasks based on a knowledge base. This mining engine consists of two main components: hidden information extraction and keyword retrieval. The process of executing a scan task using the inference engine for hidden information is as follows:
(1)
First, the expert system identifies the target file.
(2)
The file type is filtered so that only types on the whitelist pass through. File type verification checks whether the content of a file conforms to the format declared by its suffix; unrecognized file types are reported as suspicious. The file encryption check scans for supported encryption schemes and reports encrypted files as suspicious. The NTFS data stream check examines whether a file carries additional NTFS alternate data streams and reports such files as suspicious. The tail data check guards against sensitive data appended to the end of normal files and reports files containing tail data as suspicious. Files that pass these checks proceed to the next stage.
(3)
A structure check is performed on Office documents, followed by keyword scanning of the text. The structure check examines the file structure of Office documents for hidden data, as shown in Figure 6, and flags files that contain it. Text keyword scanning searches for user-defined keywords by scanning every byte of the file in binary mode against a specified keyword table. If hidden data or user-defined keywords are found, the file is flagged as suspicious; otherwise, it proceeds to the next test.
(4)
For image files, OCR is performed to read text in multiple languages from images in various formats; the recognized words are converted into text, which is then scanned for keywords.
(5)
For compressed packages, the archive is first decompressed, file type filtering is applied, and then every file inside the package is scanned.
(6)
When the file scan is completed, the report generator gathers all detection results, organizes them into mining result display files based on HTML and XML result display templates, and saves them to the specified path. At this stage, the task execution is considered complete. Figure 7 shows the flow chart of mining hidden information by the inference engine.
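Step (2)'s file type verification can be sketched with a few magic-byte signatures taken from Table 2. This is an illustrative subset; a full whitelist would also record the skew (header offset) and end-of-file marker columns.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

struct Signature {
    std::string ext;
    std::vector<uint8_t> magic;
};

// Signatures from Table 2 (illustrative subset).
static const std::vector<Signature> kSignatures = {
    {"pdf", {0x25, 0x50, 0x44, 0x46, 0x2D}},  // "%PDF-"
    {"zip", {0x50, 0x4B, 0x03, 0x04}},        // "PK\x03\x04" (also jar)
    {"rar", {0x52, 0x61, 0x72, 0x21}},        // "Rar!"
};

// Returns the extension implied by the file's leading bytes, or "" if the
// header matches no known signature (reported as suspicious upstream).
std::string detectType(const std::vector<uint8_t>& head) {
    for (const auto& sig : kSignatures) {
        if (head.size() >= sig.magic.size() &&
            std::equal(sig.magic.begin(), sig.magic.end(), head.begin()))
            return sig.ext;
    }
    return "";
}

// A file is flagged when the suffix it declares disagrees with the format
// its binary header actually announces (file type tampering).
bool suffixTampered(const std::string& declaredExt,
                    const std::vector<uint8_t>& head) {
    std::string actual = detectType(head);
    return !actual.empty() && actual != declaredExt;
}
```

Checking the header rather than trusting the suffix is what lets the engine catch a ZIP archive renamed to .doc.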

6. System Implementation and Testing

6.1. Experimental Environment and Test File Preparation

The following is the operation and development environment of this system:
(1)
Development language: C++.
(2)
Development tool: Visual Studio 2022.
(3)
Basic software operating environment: Windows 11 Professional Edition operating system, 11th Gen Intel® Core™ i9-11950H @ 2.60 GHz processor.
To ensure the accuracy and timeliness of the system test, this paper collects four types of electronic documents: text data, image data, compressed package data, and web page data. Subsequently, sensitive word injection and steganography injection are performed on the collected documents of different types to generate a document test set with 490 test cases for verifying the effectiveness of the expert system. Steganography methods include file type tampering, multiple compression of compressed packages, encryption concealment, structure concealment, redundant data concealment, embedding concealment, data stream concealment, metadata concealment, self-concealment, and image content concealment.
Another portion of the document test set incorporates sensitive words of various levels throughout the text and includes images containing sensitive content. The details of the resulting test set are outlined in Table 4. The test dataset is created as follows:
  • Collect a total of 10 compressed package files in 5 different formats, along with 20 image files in 6 formats, and 40 Office documents in 6 formats as well as 30 files in more than 10 other formats, bringing the total number of files to 100, which constitutes the fundamental dataset.
  • Include sensitive words of varying degrees in the compressed package, image, and the title, body, header, footer, comments, and hidden fonts of the document within the fundamental dataset. After these operations, build another 100 test cases on the basis of the fundamental dataset.
  • Modify the file suffix of each item in the fundamental dataset to a different format to construct another 100 test cases.
  • Use commands such as type and copy to attach hidden NTFS data streams to each file within the fundamental dataset and to merge additional files into them, constructing another 100 test cases.
  • Insert attachment files and objects into Office documents within the fundamental dataset to construct 80 test cases.
  • Encrypt the compressed package files and Office files within the fundamental dataset to construct 10 test cases.

6.2. Test Result

Experimental test results in Figure 8 have confirmed the effectiveness of the scanning system and permission management. The scanning system is capable of identifying various information-steganography methods and issuing alerts. It can detect files with tampered suffixes, marking them with their original file suffixes. Additionally, it can identify files with hidden objects or inserted data, as well as files containing hidden data streams and encrypted files. Furthermore, the system assigns appropriate document permissions based on the results of the scanning process.
The expert system presents results in two forms: an HTML display and an XML display of the deep mining results, each produced from a corresponding template. During program execution, the generated result file can reach several megabytes, and HTML character replacement over a file of that size is quite slow. To speed up result display, all template operations are performed in memory; allocating a few megabytes of memory is a modest price for fast replacement. Figure 9 illustrates the generated XML report, and Table 5 shows the meanings of the keywords in the XML format report of the scan results.
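The in-memory replacement described above can be sketched as a simple placeholder substitution over a template string. The placeholder syntax here is an assumption for illustration, not the system's actual template format.

```cpp
#include <string>

// Replace every occurrence of `key` in a template held entirely in memory.
// Working on the in-memory string avoids repeated file I/O when the
// generated report grows to several megabytes.
std::string fillTemplate(std::string tpl, const std::string& key,
                         const std::string& value) {
    std::size_t pos = 0;
    while ((pos = tpl.find(key, pos)) != std::string::npos) {
        tpl.replace(pos, key.size(), value);
        pos += value.size();  // skip past the inserted value
    }
    return tpl;
}
```

The report generator would call this once per template placeholder before writing the finished HTML or XML file to the specified path.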

7. Conclusions and Future Work

The design principle of this expert system for mining hidden information in electronic documents, aimed at outgoing document control, is that support for any electronic document type can be rapidly added through the knowledge base’s technical scheme. Supported file types and future data hiding methods can likewise be extended, enabling detection of documents in new formats and enhancing the system’s scalability. Based on reverse binary analysis of electronic documents, this paper presents the design of an expert system for deep mining of hidden information. First, the structural features of various document types are extracted and their binary formats reverse-analyzed to identify the locations of, and rules governing, potential hidden information; this yields the feature information base and the hidden information mining rule base. Second, sensitive keywords within electronic documents are managed through keyword classification and scanning rules, with ten levels established to trigger alarms in sequence. Third, an inference engine for hidden information is designed on top of the knowledge base, guiding the mining task and displaying results through a display template. The expert system enables automatic batch inspection of electronic documents containing sensitive information and applies protection measures according to rights-management rules to prevent document leaks.
It is important to note that hidden information and hiding methods vary greatly among different types of electronic documents. We had hoped to evaluate the expert system on an open steganographic dataset; however, after extensive research we found no publicly available test set applicable to the system presented here. In future work, we may further enrich the document types and concealment detections in the expert system’s knowledge base and publish a test set for this research direction. Additionally, more file formats will be supported, and the whitelist scope will be expanded to meet a wider range of actual business needs.

Author Contributions

Conceptualization, software, and writing—original draft: L.T.; validation, methodology, project administration: J.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work is funded by the Young Backbone Teacher Support Plan of Beijing Information Science & Technology University (YBT 202417).

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Gao, J.; Jiang, Y.; Liu, Z.; Yang, X.; Wang, C.; Jiao, X.; Yang, Z.; Sun, J. Semantic Learning and Emulation Based Cross-platform Binary Vulnerability Seeker. IEEE Trans. Softw. Eng. 2019, 47, 2575–2589. [Google Scholar] [CrossRef]
  2. Jegorova, M.; Kaul, C.; Mayor, C.; O’Neil, A.Q.; Weir, A.; Murray-Smith, R.; Tsaftaris, S.A. Survey: Leakage and Privacy at Inference Time. arXiv 2021, arXiv:2107.01614. [Google Scholar] [CrossRef] [PubMed]
  3. Kleij, R.V.D.; Wijn, R.; Hof, T. An application and empirical test of the Capability Opportunity Motivation-Behaviour model to data leakage prevention in financial organizations. Comput. Secur. 2020, 97, 101970. [Google Scholar] [CrossRef]
  4. Liang, Z.; Guo, J.; Qiu, W.; Huang, Z.; Li, S. When graph convolution meets double attention: Online privacy disclosure detection with multi-label text classification. Data Min. Knowl. Discov. 2024, 38, 1171–1192. [Google Scholar] [CrossRef]
  5. Akyildiz, T.A.; Guzgeren, C.B.; Yilmaz, C.; Savas, E. MeltdownDetector: A Runtime Approach for Detecting Meltdown Attacks. Future Gener. Comput. Syst. 2020, 112, 136–147. [Google Scholar] [CrossRef]
  6. Suma, M.; Madhumathy, P. Brakerski-Gentry-Vaikuntanathan fully homomorphic encryption cryptography for privacy preserved data access in cloud assisted Internet of Things services using glow-worm swarm optimization. Trans. Emerg. Telecommun. Technol. 2022, 33, e4641. [Google Scholar] [CrossRef]
  7. Kunhu, A.; Al-Ahmad, H.; Mansoori, S.A. A Reversible Watermarking Scheme for Ownership Protection and Authentication of Medical Images; Applications Development and Analysis Section, Mohammed bin Rashid Space Centre, College of Engineering and IT, University of Dubai: Dubai, United Arab Emirates, 2024. [Google Scholar] [CrossRef]
  8. Deshpande, P.M.; Joshi, S.; Dewan, P.; Murthy, K.; Mohania, M.; Agrawal, S. The Mask of ZoRRo: Preventing information leakage from documents. Knowl. Inf. Syst. 2015, 45, 705–730. [Google Scholar] [CrossRef]
  9. Akshaya, S.; Viji, A. Image steganography using deep reinforcement learning. J. Instrum. Soc. India Proc. Natl. Symp. Instrum. 2021, 8, 2058–2064. [Google Scholar]
  10. Tong, Y.; Liu, Y.; Wang, J.; Xin, G. Text steganography on RNN-generated lyrics. Math. Biosciences Eng. 2019, 16, 5451–5463. [Google Scholar] [CrossRef] [PubMed]
  11. Peng, W.; Wang, T.; Qian, Z.; Li, S.; Zhang, X. Cross-modal text steganography against synonym substitution-based text attack. IEEE Signal Process. Lett. 2023, 30, 299–303. [Google Scholar] [CrossRef]
  12. Chang, C.Y.; Clark, S. Practical Linguistic Steganography using Contextual Synonym Substitution and a Novel Vertex Coding Method. Comput. Linguist. 2014, 40, 403–448. [Google Scholar] [CrossRef]
  13. Shirali-Shahreza, M. Text Steganography by Changing Words Spelling. In Proceedings of the ICACT 2008, 10th International Conference on Advanced Communication Technology, Gangwon, Republic of Korea, 17–20 February 2008. [Google Scholar] [CrossRef]
  14. Ding, C.; Fu, Z.; Yu, Q.; Wang, F.; Chen, X. Joint Linguistic Steganography With BERT Masked Language Model and Graph Attention Network. IEEE Trans. Cogn. Dev. Syst. 2023, 16, 772–781. [Google Scholar] [CrossRef]
  15. Yu, L.; Lu, Y.; Yan, X.; Yu, Y. MTS-Stega: Linguistic Steganography Based on Multi-Time-Step. Entropy 2022, 24, 585. [Google Scholar] [CrossRef] [PubMed]
  16. Liu, G.; Huang, F.; Li, Z. Designing adaptive JPEG steganography based on the statistical properties in spatial domain. Multimed Tools Appl. 2019, 78, 8655–8665. [Google Scholar] [CrossRef]
  17. Sultan, B.; ArifWani, M. A new framework for analyzing color models with generative adversarial networks for improved steganography. Multimed Tools Appl. 2023, 82, 19577–19590. [Google Scholar] [CrossRef]
  18. Dai, H.; Wang, R.; Xu, D.; He, S.; Yang, L. HEVC Video Steganalysis Based on PU Maps and Multi-Scale Convolutional Residual Network. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 2663–2676. [Google Scholar] [CrossRef]
  19. Miranda, J.D.; Parada, D.J. LSB steganography detection in monochromatic still images using artificial neural networks. Multimed Tools Appl. 2022, 81, 785–805. [Google Scholar] [CrossRef]
  20. Yang, Z.; Huang, Y.; Zhang, Y.J. TS-CSW: Text steganalysis and hidden capacity estimation based on convolutional sliding windows. Multimed Tools Appl. 2020, 79, 18293–18316. [Google Scholar] [CrossRef]
  21. Wang, H.; Yang, Z.; Yang, J.; Chen, C.; Huang, Y. Linguistic Steganalysis in Few-Shot Scenario. IEEE Trans. Inf. Forensics Secur. 2023, 18, 4870–4882. [Google Scholar] [CrossRef]
  22. Xue, Y.; Kong, L.; Peng, W.; Zhong, P.; Wen, J. An effective linguistic steganalysis framework based on hierarchical mutual learning. Inf. Sci. 2022, 586, 140–154. [Google Scholar] [CrossRef]
  23. Li, M.; Liu, Q. Steganalysis of SS Steganography: Hidden Data Identification and Extraction. Circuits Syst. Signal Process. 2015, 34, 3305–3324. [Google Scholar] [CrossRef]
  24. Mendoza, V.N.; Ledeneva, Y.; García-Hernández, R.A. Unsupervised extractive multi-document text summarization using a genetic algorithm. J. Intell. Fuzzy Syst. 2020, 39, 2397–2408. [Google Scholar] [CrossRef]
  25. Qian, Y.; Liang, J.; Dang, C. Knowledge structure, knowledge granulation and knowledge distance in a knowledge base. Int. J. Approx. Reason. 2009, 50, 174–188. [Google Scholar] [CrossRef]
  26. Karresand, M.; Axelsson, S.; Dyrkolbotn, G.O. Disk Cluster Allocation Behavior in Windows and NTFS. Mobile Netw. Appl. 2020, 25, 248–258. [Google Scholar] [CrossRef]
  27. Hakak, S.; Kamsin, A.; Shivakumara, P.; Idris, M.Y.I. Partition-based pattern matching approach for efficient retrieval of arabic text. Malays. J. Comput. Sci. 2018, 31, 200–209. [Google Scholar] [CrossRef]
  28. Bipin Nair, B.J.; Shobha Rani, N.; Khan, M. Deteriorated Image Classification Model for Malayalam Palm Leaf Manuscripts. J. Intell. Fuzzy Syst. 2023, 45, 4031–4049. [Google Scholar] [CrossRef]
  29. Mahajan, P.; Kandwal, R.; Vijay, R. Rough set-based approach for automated discovery of censored production rules. J. Exp. Theor. Artif. Intell. 2014, 26, 151–166. [Google Scholar] [CrossRef]
  30. Yang, F.; Zhang, C.L. A New Approach of Expanding Data Processing Ability for Configuration Monitoring Software MCGS Based on OLE. Appl. Mech. Mater. 2011, 65, 295–298. [Google Scholar] [CrossRef]
  31. Cabarrão, V.; Batista, F.; Moniz, H.; Trancoso, I.; Mata, A.I. Acoustic-prosodic Entrainment in Structural Metadata Events. In Proceedings of the Interspeech 2018, Hyderabad, India, 2–6 September 2018; pp. 2176–2180. [Google Scholar] [CrossRef]
  32. Hendrian, D.; Inenaga, S.; Yoshinaka, R.; Shinohara, A. Efficient Dynamic Dictionary Matching with DAWGs and AC-automata. Theor. Comput. Sci. 2019, 792, 161–172. [Google Scholar] [CrossRef]
Figure 1. Diagram of the expert system structure.
Figure 2. The knowledge base diagram.
Figure 3. The binary structure of the document.
Figure 4. The parsing and search process of keywords.
Figure 5. The construction process of the AC automaton.
Figure 6. The data hiding method.
Figure 7. Flow chart of mining hidden information by the inference engine.
Figure 8. The scanning results.
Figure 9. The generated XML report.
Table 1. The content of the object-oriented feature information base.
Number | Name | Explanation
1 | Features | Characteristic information
2 | SupportFilesType | File types supported by this feature
3 | PreProcess | File binary data preprocessing
4 | FeaturesExtract | Feature extraction method
5 | HtmlXmlReturn | Result saving mode
6 | setOptions | Functional setting
7 | Check | Feature detection mode
Table 2. The prevailing electronic document formats.
Number | Filename | Extension | File Header | Skew | End-of-File
1 | MS Office/OLE2 | doc; xls; ppt; ppa; pps; pot; dot; db | 0xD0CF11E0A1B11AE1 | 0 | 0x0100FEF03
2 | Rich Text Format | rtf | 0x7B5C7X7466434452 | 0 |
3 | XML | xml | 0x3C3F786D6C7B5C72 | 0 |
4 | HTML | html; htm; php; php3; php4; phtml; shtml | 0x68746D6C | 0 |
5 | MS Access | mdb; mda; mde; mdt; fdb | 0x5374616E64617264 | 4 |
6 | Adobe Acrobat | pdf | 0x255044462D | 0 |
7 | Quicken | qdf | 0xAC9EBD8F | 0 |
8 | Windows Registry | registry | 0x72656766 | 0 |
9 | ZIP Archive | zip; jar | 0x504B0304 | 0 |
10 | RAR Archive | rar | 0x52617221 | 0 |
11 | 7-ZIP Archive | 7z | 0x377A | 0 |
12 | Compiled HTML | chm | 0x4954534603000000 | 0 |
Table 3. Two-dimensional permission table.
Secret Grade | Ordinary User | Intermediate User | Advanced User
Top secret | read | read, print | read, print, write, copy
Confidential | read, print | read, print, write | read, print, write, copy, send
Secret | read, print, write | read, print, write, copy | read, print, write, copy, send
Normal | read, print, write, copy | read, print, write, copy, send | read, print, write, copy, send
Table 4. Description of the test set.
Test Folder | Problem | Sensitive Word Level | Expected Permissions
test_file001~100 | Normal files with no problem | / | read, print, write, copy, send
test_file101~120 | Include sensitive word | level 1 | read, print, write, copy, send
test_file121~140 | Include sensitive word | level 3 | read, print, write, copy
test_file141~160 | Include sensitive word | level 5 | read, print, write
test_file161~180 | Include sensitive word | level 7 | read, print
test_file181~200 | Include sensitive word | level 9 | read
test_file201~300 | Modify the file suffix to a different file type | | Warning, no permissions
test_file301~350 | Use type command to hide data | | Warning, no permissions
test_file351~400 | Use copy command to merge data | | Warning, no permissions
test_file401~440 | Office insert file | | Warning, no permissions
test_file441~480 | Office insert object | | Warning, no permissions
test_file481~490 | Compressed file encryption | | Warning, no permissions
Table 5. Meanings of keywords in the XML format report of scan results.
Keyword in XML Format Report | Implication
scan_report | test report
information | summary information
scan_type | whether the check is complete
generate_time | completion time of the check report
file_count | number of files
folder_count | number of folders
suspicious_count | number of suspicious files
results | specific document report
file type | type of file
filepath | file path/name
scan_result | scan result
result_description | result description
attachments | attachments

Share and Cite

MDPI and ACS Style

Tan, L.; Yi, J. Expert System for Extracting Hidden Information from Electronic Documents during Outgoing Control. Electronics 2024, 13, 2924. https://doi.org/10.3390/electronics13152924
