*2.2. SRM*

The SRM [14] is a handcrafted feature-based steganalytic method that applies various linear and nonlinear high-pass filters (HPFs) (Figure 3) to extract a large set of meaningful features from images. The features are then classified with an ensemble classifier that uses Fisher linear discriminants, each trained on a random subspace of the features, as base learners.
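As a concrete illustration, the sketch below applies one linear high-pass kernel (a second-order horizontal predictor, one of the kernels underlying the 30 SRM filters) and then quantizes and truncates the residual, which is the standard SRM residual pipeline. The helper name `extract_residual` and the values `q = 1` and `T = 2` are illustrative choices, not the exact SRM configuration.

```python
import numpy as np

# One linear high-pass kernel (second-order horizontal predictor);
# the full SRM uses 30 such 5x5 linear and nonlinear filters.
K = np.array([[1, -2, 1]], dtype=float)

def extract_residual(img, kernel, q=1.0, T=2):
    """Convolve, quantize, and truncate -- the SRM residual pipeline.
    q (quantization step) and T (truncation threshold) are typical values."""
    h, w = img.shape
    kh, kw = kernel.shape
    res = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(res.shape[0]):
        for j in range(res.shape[1]):
            res[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    # Quantization and truncation keep the residual in a small alphabet,
    # from which co-occurrence features are later collected.
    return np.clip(np.round(res / q), -T, T).astype(int)

img = np.array([[10, 12, 11, 13],
                [11, 11, 12, 12]], dtype=float)
print(extract_residual(img, K))
```

The truncation step is what makes the subsequent co-occurrence feature space tractable: only values in {-T, ..., T} remain.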

**Figure 3.** Thirty linear and nonlinear 5 × 5 SRM filters [19]. The filters are padded with zeros to obtain a unified size of 5 × 5.

The SRM was the most effective method for detecting image steganography before CNN-based steganalytic methods emerged, and it remains highly accurate even in comparison with CNN-based methods. Its strategy of extracting many features with various types of HPFs has also been widely adopted in CNN-based methods [19,20,25–27].

### *2.3. CNN-Based Image Steganalysis*

CNNs can automatically extract the optimal features required to detect and recognize objects in images and can classify those features with high accuracy [15,16]; consequently, the number of studies applying CNNs to image steganalysis is growing rapidly. Unlike other deep learning tasks, however, CNN-based image steganalysis includes a preprocessing step that applies HPFs to the input images. This step amplifies the subtle pixel variations caused by message embedding so that the CNN can detect them, while suppressing the low-frequency content, where messages are less likely to be embedded.

Xu and Wu proposed a simple yet effective initial CNN for image steganalysis [17]. Their network comprises five convolutional layers and a single fully connected layer (Figure 4). They applied a 5 × 5 HPF in a preprocessing stage, generated eight feature maps in the first convolutional layer, and, in each subsequent convolutional layer, doubled the number of feature maps while halving their spatial size. Each convolutional layer performs convolution, batch normalization, activation, and pooling. They improved the steganalytic performance by adding an absolute-value (ABS) layer to the first convolutional layer and by using the tanh activation function in the first two convolutional layers. Yuan et al. used the same network structure as the initial CNN but utilized three HPFs in the preprocessing stage [18].
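The doubling/halving pattern above can be sketched as a simple shape walk-through. This is an illustrative abstraction that assumes stride-2 average pooling in the first four blocks and global average pooling in the last; exact kernel and padding choices are omitted.

```python
# Simplified shape walk-through of the initial CNN: channels double in
# each block while the spatial size halves, and the final block applies
# global average pooling, yielding 128 1x1 feature maps from 512x512 input.
def xu_net_shapes(size=512, first_maps=8, blocks=5):
    shapes = []
    ch = first_maps
    for b in range(blocks):
        if b == blocks - 1:
            size = 1        # global average pooling in the last block
        else:
            size //= 2      # stride-2 average pooling
        shapes.append((ch, size, size))
        ch *= 2             # feature maps double in the next block
    return shapes

print(xu_net_shapes())
```

The final entry, (128, 1, 1), matches the 128 1 × 1 feature maps shown in Figure 4.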

**Figure 4.** Initial CNN for image steganalysis [17]. The CNN extracts 128 1 × 1 feature maps from a 512 × 512 input image.

ReST-Net [19] uses three different filter sets, namely 16 simplified linear SRM filters, 14 nonlinear SRM filters, and 16 Gabor filters (Figures 3 and 5), in the preprocessing stage to extract many more features from the input images. In addition, ReST-Net constructs three subnetworks (Figure 6). After separately training each subnetwork with its own preprocessing filter set, it trains a new fully connected layer via transfer learning while keeping the parameters of the three subnetworks fixed.

**Figure 5.** Sixteen 6 × 6 Gabor filters with different orientations and scales [19].

**Figure 6.** Structure of ReST-Net [19], which comprises three subnetworks, each a modification of the initial CNN [17], and uses transfer learning.

Yedroudj-Net [20] has a structure similar to that of the initial CNN [17] but uses linear SRM filters in the preprocessing stage and has two additional fully connected layers (Figure 7). It removes the average pooling in the first convolutional layer to prevent the loss of information caused by pooling. In the first two convolutional layers, it uses the truncated linear unit (TLU) instead of the tanh function to suppress strong but statistically insignificant signals, and it adds a scaling step after batch normalization. Yedroudj-Net achieved an approximately 4–5% improvement in binary classification accuracy over the initial CNN [17].
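A minimal sketch of the TLU is shown below; `T = 3` follows the value reported for Yedroudj-Net, but it should be read here as a tunable hyperparameter.

```python
import numpy as np

def tlu(x, T=3.0):
    """Truncated linear unit used in Yedroudj-Net's first two layers:
    identity inside [-T, T], clipped outside, so unusually strong
    filter responses cannot dominate the learned statistics."""
    return np.clip(x, -T, T)

print(tlu(np.array([-5.0, -1.0, 0.5, 4.0])))
```

Unlike tanh, the TLU is exactly linear inside the truncation interval, which preserves the small embedding-induced residual values without distortion.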

**Figure 7.** Structure of Yedroudj-Net [20].

Deep residual networks for image steganalysis have also been proposed [22,23]. These networks can be made much deeper by employing residual shortcuts (Figure 8). In [22], instead of fixing the preprocessing filters or initializing the filter coefficients with the SRM filters, the preprocessing stage is significantly expanded with several convolutional and residual layers to realize a completely data-driven steganalysis.

**Figure 8.** Structure of a deep residual network used in [22].

Ke et al. proposed a multi-column CNN that extracts diverse features using convolutional filters of different sizes and accepts input images of arbitrary size or resolution [24]. Taking a multi-task learning approach, Yu et al. extended a CNN with fully convolutional networks that take the output of each convolutional layer as input for a per-pixel binary classification estimating whether each pixel in an image has been modified by steganography [26].

Wu et al. proposed a new normalization, called shared normalization, which normalizes every training and test batch with the same fixed mean and standard deviation instead of the minibatch statistics, thereby addressing a limitation of batch normalization in image steganalysis [21]. Meanwhile, Ni et al. proposed a selective ensemble method that uses reinforcement learning to decide whether to keep or delete each base classifier, reducing the number of base classifiers while maintaining classification performance [25].
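The idea behind shared normalization can be sketched as follows; the function name and the illustrative statistics are ours, not the authors' implementation.

```python
import numpy as np

def shared_normalize(batch, mu, sigma, eps=1e-8):
    """Shared-normalization sketch: every training and test batch is
    normalized with the same fixed statistics (mu, sigma) rather than
    its own minibatch mean and standard deviation, so the faint stego
    signal is not rescaled differently from batch to batch."""
    return (batch - mu) / (sigma + eps)

# Two very different batches are mapped consistently under shared stats;
# per-batch normalization would instead rescale each batch to unit spread.
mu, sigma = 0.0, 1.0   # illustrative shared statistics
a = np.array([1.0, 2.0])
b = np.array([100.0, 200.0])
print(shared_normalize(a, mu, sigma), shared_normalize(b, mu, sigma))
```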

As such, the existing CNN-based steganalytic methods have successfully increased classification accuracy by deepening or widening the CNNs and using various types of preprocessing filters. However, these methods were designed for the binary classification of cover and stego images and thus may not be directly applicable to *N*-ary (*N* > 2) classification. The two adaptive steganographic methods WOW and UNIWARD embed small messages in a similar manner (Section 2.1); hence, binary classifiers are very likely to misclassify WOW and UNIWARD stego images.

### **3. Proposed Method**

### *3.1. Similarity between WOW and UNIWARD*

The adaptive steganographic methods WOW and UNIWARD use directional filters to estimate how much embedding a message into each pixel would change that pixel's differences from its neighbors (i.e., the degree of image distortion), and then preferentially embed the message into pixels with a small degree of distortion. Although WOW and UNIWARD measure the distortion with different functions, their embedding processes are very similar; thus, existing CNN-based binary classifiers become confused when discriminating between WOW and UNIWARD and are very likely to misclassify them.

To demonstrate the difficulty of discriminating WOW and UNIWARD with binary classifiers, we conducted an experiment in which UNIWARD stego images were input to a binary classifier trained on WOW, and vice versa. The CNN of [17] (Figure 4) was used for the experiment, and the other experimental conditions were the same as those given in Section 4.

Table 1 shows that, even when different steganographic methods were used in the training and testing phases (i.e., WOW for one and UNIWARD for the other), the classification rates for the stego images remained high. For example, the classification rate was 67.13% when UNIWARD stego images were input into the classifier trained on WOW stego images. In other words, the two methods are very likely to be confused with each other because their stego images are too similar to discriminate. Therefore, existing binary classifiers are ineffective for classifying WOW and UNIWARD.


**Table 1.** Cross identification between WOW and UNIWARD (*bpp* = 0.4).
