1. Introduction
Image steganography is a technology that hides secret information in images. Due to its simplicity, variability, and difficulty of detection and extraction [1,2], it can be easily used by illegal organizations to engage in activities that will endanger both national and public security. This situation makes steganalysis—an attack technology against steganography—a research hotspot in the field of cyberspace security.
Traditional steganalysis methods fall into two categories: specific steganalysis and universal steganalysis. Specific steganalysis is an effective detection method for specific steganography algorithms; its advantage is a low false alarm rate that accurately reflects the steganographic facts, but its application scope in practice is small. Classical specific steganalysis algorithms include regular-singular (RS) analysis [3], based on the correlation between neighboring pixels; raw quick pair (RQP) analysis [4], which observes changes in test statistics through active steganography; and blockiness analysis on OutGuess [5]. Universal steganalysis regards steganographic detection as a classification problem, extracting high-dimensional features for classification based on machine learning. Classical methods include subtractive pixel adjacency matrix (SPAM) [6] feature analysis for steganography that corrupts the correlation between neighboring pixels, steganalysis of JPEG images based on Markov features [7], spatial rich model (SRM) [8] features extracted by multiple submodels, and several of its variants [9,10,11]. These methods significantly improve detection performance, but inevitably increase training time due to the use of high-dimensional features. Feature design is the core element in steganalysis, and the features involved in these models are often obtained by manual design. On the one hand, such features require a substantial amount of manual intervention and professional knowledge; on the other hand, the performance of the model depends directly on the quality of the manually defined features.
In recent years, deep learning has flourished in various fields, and some researchers have applied it to steganalysis with remarkable achievements. Classical methods include the Gaussian-neuron convolutional neural network (GNCNN), based on convolutional neural networks and a Gaussian activation function, proposed by Qian et al. [12]. Xu et al. [13] proposed a CNN structure called Xu-Net containing five convolutional layers, whose detection performance exceeded the spatial rich model for the first time. Ye et al. [14] designed a new truncated linear unit (TLU) as the activation function, based on which TLU–CNN was proposed. Boroumand et al. [15] designed SRNet on the basis of residual networks, and You et al. [16] designed Ke-Net on the basis of a Siamese network. Deep neural networks automatically obtain the feature representations for steganographic detection through sample training, avoiding dependence on manually defined features; the core problem thus shifts to the structural design of the deep neural network. In spatial image steganalysis tasks, what needs to be extracted is the very subtle steganographic signal hidden behind the image content and texture, which differs significantly from traditional computer vision tasks. Therefore, increasing the signal-to-noise ratio and maximizing the residual information are usually necessary in order to improve steganographic detection performance.
In this paper, we propose a new end-to-end network that improves steganalysis performance while balancing the accuracy and efficiency of steganographic detection. Given that steganographic detection relies on weak signals hidden in the image content, most previous approaches have introduced high-pass filters to enhance the signal-to-noise ratio. Here, separable convolution and an adversarial mechanism are introduced to separate the steganographic signal from the content signal in the spatial image, thus enabling better extraction of steganographic embedding features and improving the performance of image steganographic detection without the interference of image content. The following steps were taken in the design of the network in order to improve its performance:
A separable convolution module was introduced into the network, which not only enables higher accuracy but also makes the network converge quickly, improving efficiency. The module divides normal convolution into two parts—pointwise convolution and depthwise convolution—separating spatial feature learning from channel feature learning, maximizing the channel correlation of the residuals, and effectively enhancing the signal-to-noise ratio (a minimal code sketch follows this list).
We introduced an adversarial mechanism into the network structure to suppress image content information and highlight steganographic information as much as possible. In the process of adversarial training, the generator extracts more image content features to mislead the classifier and, thus, isolates the required steganographic features. The introduction of a gradient reversal layer (GRL) allows the network to better extract the steganographic embedding features and improve the performance of steganographic detection, without the interference of image content.
To better train and evaluate the proposed method, we detected a variety of adaptive steganography algorithms on BOSSBase1.01 and BOWS2. The experimental results demonstrate that the separable convolution and the adversarial mechanism have better effects on the extraction of the existence features of hidden information. The introduced separable convolution improves the signal-to-noise ratio, maximizes the channel correlation of the residuals, reduces the number of training parameters, and improves efficiency. More steganographic embedding features can be separated via the adversarial mechanism, which effectively improves the performance of steganalysis.
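As a concrete illustration of the separable convolution module described in the first point above, the following is a minimal PyTorch sketch (PyTorch is the environment used for training in Section 4). The channel count of 30 and the 3 × 3 kernel are illustrative assumptions rather than our network's exact configuration; the depthwise-then-pointwise ordering follows the description in Section 2.

```python
import torch
import torch.nn as nn

class SeparableConv2d(nn.Module):
    """Depthwise-separable convolution: per-channel spatial filtering
    (depthwise) followed by a 1x1 cross-channel mix (pointwise)."""
    def __init__(self, in_channels, out_channels, kernel_size=3):
        super().__init__()
        # Depthwise: one spatial filter per input channel (groups=in_channels)
        # performs the spatial feature learning.
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size,
                                   padding=kernel_size // 2, groups=in_channels)
        # Pointwise: a 1x1 convolution recombines channels, modeling the
        # channel correlation of the noise residuals.
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# Example: 30 residual maps (e.g., from SRM-initialized filters) in and out.
x = torch.randn(1, 30, 256, 256)
print(SeparableConv2d(30, 30)(x).shape)  # torch.Size([1, 30, 256, 256])
```

Under these assumed sizes, a standard 3 × 3 convolution would carry 30 × 30 × 9 = 8100 weights, while the separable version carries 30 × 9 + 30 × 30 = 1170 (bias terms ignored), which is where the efficiency gain noted above comes from.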
The rest of the paper is organized as follows: In Section 2, we briefly review classical steganalysis network architectures. Section 3 focuses on the network model based on the separable convolution and the adversarial mechanism proposed in this paper. Section 4 provides and analyzes the experimental results on BOSSBase1.01 and BOWS2. Finally, Section 5 presents the concluding remarks.
2. Related Works
The earliest use of deep learning for steganalysis can be traced back to 2014, when Tan et al. [17] used a stacked convolutional autoencoder for steganographic detection. They found that a randomly initialized CNN applied directly to the steganalysis task usually failed to converge, and that using a KV kernel to initialize the weights of the first layer of the network could effectively improve the accuracy.
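For reference, the KV kernel mentioned above is the classical 5 × 5 high-pass filter from the SRM filter family. The sketch below shows one way such an initialization can be written in PyTorch; the single-channel preprocessing convolution is a simplifying assumption for illustration, not the exact setup of [17].

```python
import torch
import torch.nn as nn

# The classical 5x5 KV high-pass kernel (normalized by 1/12).
KV = torch.tensor([[-1.,  2.,  -2.,  2., -1.],
                   [ 2., -6.,   8., -6.,  2.],
                   [-2.,  8., -12.,  8., -2.],
                   [ 2., -6.,   8., -6.,  2.],
                   [-1.,  2.,  -2.,  2., -1.]]) / 12.0

# Preprocessing layer whose weights are initialized with the KV kernel.
pre = nn.Conv2d(1, 1, kernel_size=5, padding=2, bias=False)
with torch.no_grad():
    pre.weight.copy_(KV.view(1, 1, 5, 5))

# The layer outputs a noise residual that the rest of the CNN consumes.
residual = pre(torch.randn(1, 1, 256, 256))
```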
Qian et al. [12] proposed a customized GNCNN for steganographic detection; the network structure contains three parts: a preprocessing layer with high-pass filters, a convolutional layer for feature extraction, and a fully connected layer for classification. This method was the first to apply CNNs to the task of steganalysis, achieving results that are comparable to traditional methods using hand-crafted features.
Xu et al. [13] proposed a CNN structure with five convolutional layers, introducing batch normalization (BN) and global average pooling, which are commonly used in image classification tasks. The network uses various activation functions—including the absolute value (ABS) activation, the hyperbolic tangent (TanH) activation function, and the rectified linear unit (ReLU)—to improve the experimental results; its performance exceeds the SRM scheme [8], and the improved Xu-Net achieves better results for steganalysis in the JPEG domain [18].
Ye et al. [14] proposed a new method in 2017, which uses a set of high-pass filters from the SRM to detect the steganographic signal in the image; this way of initializing the preprocessing-layer parameters is significantly better than random initialization. The method used the TLU for the first time, and TLU–CNN was designed on this basis. The idea of selection-channel-aware steganalysis was also introduced, yielding a selection-channel-aware TLU–CNN network. Experimental results show that the detection performance of this network has obvious advantages over the traditional rich-model method.
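The TLU itself is simple to state: it is the identity on [−T, T] and saturates outside that range, i.e., a clamp. A minimal PyTorch sketch follows; the default threshold of 3 anticipates the setting reported in Section 4.3.

```python
import torch
import torch.nn as nn

class TLU(nn.Module):
    """Truncated linear unit: identity on [-T, T], saturating outside.
    Bounding the activations limits the dynamic range of the residuals,
    which helps suppress image content relative to the stego signal."""
    def __init__(self, threshold=3.0):
        super().__init__()
        self.threshold = threshold

    def forward(self, x):
        return torch.clamp(x, -self.threshold, self.threshold)
```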
Yedroudj et al. [19] proposed Yedroudj-Net in 2018. This method borrows effective elements from Xu-Net and Ye-Net, uses the 30 filters from the SRM [8] as initialization values for the preprocessing layer, and then adds batch normalization layers and truncated linear units. The method achieves good performance even without the use of selection-channel awareness.
Li et al. [20] designed a CNN with a parallel subnet structure using linear and nonlinear filters, which further improved detection performance. Boroumand et al. [15] proposed SRNet, which does not use high-pass filters in the traditional sense, but instead maximizes the noise residuals introduced by the steganography algorithms; it is one of the current methods that can achieve high accuracy. Zhu et al. [21] proposed a CNN based on separable convolution, multilevel pooling, and spatial pyramid pooling for steganalysis, which achieved good performance in detecting arbitrary-sized images.
The main idea of the above approaches is to regard the image steganographic detection task as a binary image classification problem, and then to use a classical CNN-based image classification framework. Nevertheless, a significant difference exists between the steganalysis task and the image classification task: image classification relies on content information, whereas steganographic detection relies on subtle noise signals hidden under the image content. Consequently, the CNN framework cannot simply be adopted as-is for steganalysis tasks. Existing methods generally address this by adding high-pass filters in the preprocessing layer, but manually defined filters are not always optimal and may suppress part of the steganographic signal. Given that the performance of image steganographic detection depends heavily on the signal-to-noise ratio, this paper introduces separable convolution and an adversarial mechanism to enhance the signal-to-noise ratio and improve detection performance.
Introduction of the separable convolution module: Separable convolution splits normal convolution into depthwise convolution and pointwise convolution. The module first separates the channels and performs an independent spatial convolution for each channel; it then combines the output channels via pointwise (1 × 1) convolution. Spatial feature learning and channel feature learning are thereby decoupled, maximizing the channel correlation of the noise residuals and improving the signal-to-noise ratio so that the subtle differences between the cover signal and the steganographic signal can be detected.
Introduction of adversarial training: The image contains content information, reflecting the visual perception of the image, along with steganographic information, reflecting the embedding of steganographic messages. In this paper, we draw on the idea of transfer learning and introduce adversarial training [22] to suppress content information and highlight steganographic information as much as possible; doing so allows the network to better extract steganographic embedding features and improves its detection performance without the interference of content information (a minimal sketch of the gradient reversal layer is given at the end of this section).
By introducing the above two modules, the proposed network significantly improves the accuracy of steganalysis.
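The adversarial training above hinges on the gradient reversal layer. A minimal PyTorch sketch of a GRL is given below; the reversal strength lambda = 1.0 and the toy feature/classifier shapes are assumptions for illustration, not values from our network.

```python
import torch
import torch.nn as nn
from torch.autograd import Function

class GradReverse(Function):
    """Gradient reversal layer (GRL): identity in the forward pass;
    multiplies the gradient by -lambda in the backward pass, so the
    feature extractor is trained against the content branch."""
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # The reversed (scaled) gradient pushes the shared features to
        # discard content information.
        return -ctx.lamb * grad_output, None

def grad_reverse(x, lamb=1.0):
    return GradReverse.apply(x, lamb)

# Toy wiring: shared features feed a content classifier through the GRL.
features = torch.randn(8, 128, requires_grad=True)
content_logits = nn.Linear(128, 2)(grad_reverse(features, lamb=1.0))
content_logits.sum().backward()  # features.grad carries the reversed signal
```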
4. Experiments
4.1. Dataset and Software Platform
To obtain fair comparison results, all experiments used the same data; the two standard datasets were as follows:
BOSSBase1.01 [27]: This dataset contains 10,000 uncompressed grayscale images with a size of 512 × 512 pixels, derived from 7 different brands of cameras.
BOWS2 [28]: This dataset also contains 10,000 uncompressed grayscale images with a size of 512 × 512 pixels, and its image distribution is very similar to that of BOSSBase1.01.
Experiments on the steganographic detection of the spatial adaptive steganography algorithms spatial-universal wavelet relative distortion (S-UNIWARD) [29], high-pass, low-pass, and low-pass (HILL) [30], and wavelet obtained weights (WOW) [31] were performed on the two image databases described above. MATLAB was used to embed each cover image at payloads of 0.2 bpp and 0.4 bpp, with a random embedding key used during the steganography process. The network was trained, validated, and tested in a PyTorch environment. The method was compared with Ye-Net [14], SRNet [15], Yedroudj-Net [19], and Zhu-Net [21].
4.2. Training, Validation, and Testing
Due to the limited GPU computing power, training the network on the original 512 × 512 images would be time-consuming. Accordingly, we used MATLAB to resize the original images to 256 × 256 pixels, and all subsequent experiments were conducted on these 256 × 256 images.
The designed experiment was divided into three parts:
The first part of the experiment focused on the effectiveness of the separable convolution and the adversarial mechanism. The experiment used 10,000 modified images based on BOSSBase1.01, with each cover image having its own corresponding steganographic image, for a total of 20,000 images. The training set contained 6000 pairs of images, the validation set contained 2000 pairs, and the remaining 2000 pairs were used as the test set, with no overlap among the three subsets. This part of the experiment verified the effectiveness of the separable convolution and the adversarial mechanism by removing the separable convolution module and the GRL, respectively, from the network structure.
The second part of the experiment compared our network with other steganalysis methods based on CNNs. The size of the original cover images in the BOSSBase1.01 dataset was modified, and then multiple adaptive steganography algorithms were performed to obtain 10,000 pairs of images as the dataset. Similarly, 6000 pairs were used as the training set, 2000 pairs as the validation set, and 2000 pairs as the test set. This part of the experiment compared the detection performance of the method proposed in this paper with various CNN-based steganalysis methods at 0.2 bpp and 0.4 bpp.
The third part of the experiment considered the impact of data expansion on network performance. Considering that a larger training set is effective in avoiding overfitting for experiments based on CNNs, 10,000 pairs of images from the BOWS2 dataset were added to this part of the experiment; together with 6000 pairs of images from BOSSBase1.01, the training set totaled 16,000 pairs of images; 20% of BOSSBase1.01 was used as the validation set, and the remaining part was used as the test set for the experiments.
According to the above experimental design, the proposed method was trained and tested with the same hyperparameters and settings as the previous method, and the test results were taken as the final performance of the model.
4.3. Hyperparameters
The method proposed in this paper applies mini-batch stochastic gradient descent (SGD) to train the CNN, with the momentum set to 0.9 and the weight decay set to 0.0005. Due to the limited computing power, the batch size during training was set to 16 (8 cover/stego pairs). All convolutional layers in the network were initialized using the Kaiming method [32], and all linear layers were initialized with random numbers drawn from a zero-mean Gaussian distribution with a standard deviation of 0.01. The parameters of the preprocessing layer were initialized with the values of the high-pass filters in the SRM, and the threshold T of the TLU in this layer was set to 3. The experiments used a cross-entropy loss function, and the cross-entropy loss decreased continuously during network training. The initial learning rate was 0.01, and the number of epochs was set to 200. As training progressed, the learning rate was reduced to one-fifth of its previous value after a certain number of steps; this reduction ensured that the loss continued to decrease rather than oscillating repeatedly in the later stages of training, further improving accuracy.
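The training setup above can be written compactly in PyTorch. In the sketch below, the momentum, weight decay, initial learning rate, epoch count, and one-fifth decay factor follow the text, while the decay milestone of epoch 50 is an assumption (it is consistent with the loss drop observed around the 50th epoch in Section 4.4.2, but the exact step count is not fixed here).

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in placeholder for the proposed CNN
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=0.0005)
# Reduce the learning rate to one-fifth at the (assumed) milestone epoch.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[50], gamma=0.2)
criterion = nn.CrossEntropyLoss()

for epoch in range(200):
    # ... one pass over mini-batches of 16 images (8 cover/stego pairs) ...
    scheduler.step()
```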
4.4. Results
4.4.1. Verification of the Effectiveness of Separable Convolution and the Adversarial Mechanism
To investigate whether the introduced separable convolution and adversarial training can retain less information about the image content in the extracted features, we removed the separable convolution module and the GRL from the network structure in order to verify the performance of the network separately. We compared the networks without the introduction of separable convolution (labelled as Our method/wosep) and with the introduction of separable convolution (labelled as Our method/wisep); Table 1 shows the experimental results. We also compared the networks without the introduction of the adversarial mechanism (labelled as Our method/woadv) and with the introduction of the adversarial mechanism (labelled as Our method/wiadv) on the same dataset and with the same hyperparameters; Table 2 shows the experimental results.
In this subsection, we experimentally verified the effectiveness of the introduced separable convolution and adversarial mechanism. Table 1 shows the performance comparison between the networks without and with separable convolution. From the data in Table 1, the network with separable convolution obtains higher accuracy in steganographic detection at different payloads. Owing to the introduction of separable convolution, the accuracy of the network improves by 4.8% and 4.4% for S-UNIWARD at 0.2 bpp and 0.4 bpp, respectively. This indicates that separable convolution can maximize the residual information and extract more steganographic embedding features, thus improving accuracy.
In addition, we compared the results achieved by the networks without and with the adversarial mechanism. As can be observed in Table 2, the network with the adversarial mechanism outperforms the one without it, improving the accuracy by 3.5% and 2.8% for S-UNIWARD at 0.2 bpp and 0.4 bpp, respectively. The above experimental results verify the effectiveness of introducing separable convolution and the adversarial mechanism into the network structure.
4.4.2. Performance Comparison between This Method and Other CNN-Based Steganalysis Methods
The experimental results reported in this section can be divided into two parts: the first part visualizes the training process of the proposed method via accuracy and loss curves over the epochs, while the second part compares the performance of the method proposed in this paper with other popular steganalysis methods. All of the experimental results are from the final iteration. When training and validating on images sourced from BOSSBase1.01 for S-UNIWARD at 0.4 bpp, our proposed network converges quickly; the detailed data are shown in Figure 4.
We trained the network on BOSSBase1.01 for 200 epochs—a process which took ~7 h. From the chart, we can observe that the loss and accuracy tended to stabilize around the 100th epoch; to prevent the network from overfitting, we stopped training at the 200th epoch. The loss curve drops noticeably at the 50th epoch, which we attribute to the learning-rate decay strategy effectively reducing the loss and improving accuracy.
The proposed method was compared with several common steganalysis networks, such as Ye-Net, Yedroudj-Net, SRNet, and Zhu-Net.
Table 3 shows the experimental results. The proposed method achieves good results regardless of the embedding method and payload. Given that the network introduces separable convolution and an adversarial mechanism on the foundation of the high-pass filters, it can better extract the steganographic embedding features and, thus, improve the accuracy of steganographic detection.
In Table 3, we further illustrate the detection accuracy for three common steganography methods—HILL, S-UNIWARD, and WOW—at payloads of 0.2 bpp and 0.4 bpp. Based on the data in Table 3, the network proposed in this paper clearly outperforms several other CNN-based steganalysis methods: it is 12.3–20.3% better than Ye-Net, 3.4–12.5% better than Yedroudj-Net, 2.3–6.4% better than SRNet, and 0.5–4.4% better than Zhu-Net. For the WOW algorithm, the proposed method achieves an accuracy of 89.2% at 0.4 bpp.
Briefly, these experimental results demonstrate that the method proposed in this paper can extract steganographic features more effectively, achieving higher accuracy than other networks. According to the results of the first part of the experiment, we believe that the introduction of separable convolution and the adversarial mechanism contributes greatly to the superior performance of our CNN-based steganalyzer over the other approaches. Note that the above experiments were conducted without using selection-channel awareness, a larger database, or virtual augmentation of the database.
4.4.3. Impact of Data Expansion on Network Performance
In deep learning, using a larger database is important both for achieving good performance and for avoiding overtraining, and researchers often use large datasets to improve network performance and prevent overfitting. This part of the experiment expanded the dataset by adding 10,000 pairs of images from BOWS2 to BOSSBase1.01, for a total of 16,000 pairs of training images: 6000 pairs from BOSSBase1.01 and 10,000 pairs from BOWS2. The remaining images in BOSSBase1.01 were used as the validation and test sets, and the enlarged dataset is denoted as the extended BOSS. The network was trained on this dataset to verify whether expanding the dataset could improve the accuracy of detecting steganographic images.
Table 4 shows the comparisons of Ye-Net, Yedroudj-Net, Zhu-Net, and our method trained on the original BOSSBase1.01 and the extended BOSS, against the steganography algorithm S-UNIWARD at payloads of 0.2 bpp and 0.4 bpp.
From the data in Table 4, we can observe that the detection performance of the network gradually improves as the training set grows. For all of the steganalysis algorithms involved in the experiments, better results were achieved using the extended BOSS than when training with BOSSBase1.01 alone: Ye-Net, Yedroudj-Net, Zhu-Net, and our method improved accuracy by up to 4.9%, 3.1%, 4.2%, and 4.3%, respectively. In particular, for S-UNIWARD at 0.4 bpp, our method trained on the extended dataset achieved the best result across all experimental runs, reaching 89.8%. Similarly, when detecting steganography at the lower payload, the network trained with the extended BOSS also achieved the best performance. These results encourage the use of larger datasets for training; compared with using the BOSS training set only, the extended BOSS can significantly improve detection accuracy. During the experiments, we also found that using a larger training set was effective in mitigating overfitting.
5. Conclusions
Benefiting from the application of CNNs in the field of image steganalysis, traditional manually defined features are gradually being replaced by features extracted automatically by CNNs. In this paper, we introduced separable convolution and an adversarial mechanism into the traditional CNN structure, and proposed a new method for spatial image steganalysis that detects steganographic images well. The algorithm shows significant improvement over current CNN-based methods. We attribute the improved performance of steganographic detection to the following factors: a set of high-pass filters in the preprocessing layer, the separable convolution module, and the introduction of the adversarial mechanism. The separable convolution module separates spatial feature learning from channel feature learning, maximizing the channel correlation of the residuals and increasing the signal-to-noise ratio; the adversarial mechanism forces the generator to extract more content-information features, isolating more useful steganographic embedding features. Together, these mechanisms extract more steganographic embedding features and improve the accuracy of steganographic detection. We also experimentally demonstrated that network performance can be further improved by data expansion. Extensive experiments demonstrate that the method proposed in this paper significantly improves detection accuracy compared with other steganalysis networks.
We hope that our method can provide some inspiration for future research in image steganalysis. Our future work will build on the current foundation in conjunction with more advanced network backbones, in order to extract more valuable steganographic features for the steganalysis of color images.