1. Introduction
With the rapid development of digital technology, information security has become particularly important in many critical scenarios, such as military communications, commercial confidentiality protection, and personal privacy protection. Computer security [1] protects access to data and operating systems through a variety of mechanisms; however, traditional techniques can no longer resist some current attacks. Therefore, more and more researchers are exploring new defense techniques such as information hiding [2].
Information hiding techniques, a product of the ancient art of steganography combined with modern cryptography, hinge on a core principle: exploiting the redundancy of multimedia information and the human eye's visual masking of specific information. This allows information to be concealed within a medium, providing robust technical support for encrypted communication and digital content protection. Image steganography [3,4,5], a significant branch of information hiding in the image domain, seeks to transmit secret information discreetly by leveraging the characteristics of human visual perception. Research in image steganography can be approached from various angles: one line of work focuses on embedding secret information into cover images efficiently and securely, while another emphasizes maintaining concealment against sophisticated detection methods, such as machine learning models.
Traditional image steganography algorithms have laid the groundwork for research in this area, integrating sensitive information into cover images through meticulously designed keys and specific embedding techniques. They can generally be classified into two main categories: transform domain algorithms and spatial domain algorithms. Transform domain algorithms operate in the discrete wavelet transform (DWT) domain [6,7,8], the discrete Fourier transform (DFT) domain [9,10,11], and the discrete cosine transform (DCT) domain [12,13,14,15], achieving efficient information embedding through the unique properties of each transform. Spatial domain algorithms include least significant bit (LSB) substitution [16,17], pixel value differencing (PVD) [18], and wavelet obtained weights (WOW) [19], which directly manipulate pixel values to embed information covertly.
However, machine learning techniques, with their powerful learning and generalization capabilities, can now discover traces left by traditional steganography algorithms that are imperceptible to the human visual system, threatening the security of secret information. To overcome this limitation, researchers have actively explored new image steganography methods, such as the adaptive steganography algorithms S-UNIWARD [20] and J-UNIWARD [21] proposed by J. Fridrich et al. Although these two algorithms account for counter-detection factors, their steganography is still challenged by machine learning detectors. In addition, most existing image steganography techniques hide textual information or grayscale images as secret payloads, while relatively little research has addressed color images.
In this context, the proposed VRIS (Visually Robust Image Steganography) model aims to embed color images as secret information into cover images. By introducing random noise, optimizing embedding strategies, and utilizing adversarial training techniques, it seeks to deceive both the human visual system and machine learning detectors. This research not only provides a new direction and ideas for the development of color image steganography but also offers a technological breakthrough in the field of information security. Moreover, it has broad application prospects in military, commercial, and legal domains, safeguarding information security and ensuring the safety and concealment of critical information during transmission.
The main work of this article is as follows:
(1) Proposed VRIS image steganography model: This study proposes a novel image steganography model named VRIS, which aims to deceive both the human visual system and the machine learning model to ensure the security and concealment of secret information during transmission.
(2) Feature processing of secret images: The VRIS model extracts and processes features of the secret image and fuses them at the feature level with the cover image, generating visually indistinguishable encrypted images that successfully deceive the human eye. Meanwhile, adding random Gaussian noise and training adversarially against a discriminator ensures that the encrypted image can also deceive machine learning models.
(3) Extraction of Secret Information, Quality Assessment and Testing of Secretly Reconstructed Image: The VRIS model ensures that legitimate users are able to extract and reconstruct the secret image without any loss, and at the same time prevents unauthorized users from accessing the secret information. In addition, the quality of the encrypted image is evaluated by PSNR and SSIM, and its effectiveness in spoofing machine learning models is verified by classification tests.
2. Related Work
Research on steganography has made significant progress in recent years [22]: the introduction of adversarial examples and the application of Generative Adversarial Networks (GANs), Graph Neural Networks (GNNs), and feature-space hiding strategies have collectively advanced the field. The work of Goodfellow, Szegedy, and others [23,24] revealed vulnerabilities in machine learning models and inspired novel ideas for the advancement of steganography. The DeepFool method, proposed by Balaaditya et al. [25], significantly advanced adversarial sample generation techniques: it can efficiently generate adversarial samples capable of deceiving deep neural networks (DNNs), thereby providing new adversarial strategies for steganography. Moreover, the C&W++ method, proposed by Du et al. [26], not only enhances the generation efficiency of adversarial samples but also assesses the robustness of DNNs, offering a more comprehensive tool for adversarial sample generation and evaluation in the context of steganography.
Building on the concept of GANs, Zhang et al. [27] proposed a GAN-based steganographic image generation method that uses the generative adversarial mechanism to train a generator network producing realistic encrypted images visually indistinguishable from the originals, thereby improving the covertness of steganography. To meet the needs of different application scenarios, Li et al. [28] subsequently proposed a multi-scale GAN method capable of generating encrypted images at different scales; this allows the encrypted image to adapt to different resolutions and sizes while maintaining high covertness, greatly enriching the application scenarios of steganography.
However, despite the significant progress of multi-scale GAN methods in encrypted image generation, further improving the covertness of encrypted images and their resistance to steganalyzers remains an urgent problem. Tang and colleagues proposed ASDL-GAN [29], based on Generative Adversarial Networks (GANs), which enables adversarial training between a generator and a discriminator through unsupervised learning, generating realistic steganographic images. Compared to traditional prior-knowledge-based methods, this approach significantly enhances the imperceptibility of secret images while maintaining the statistical properties of the images; however, its steganographic effect may still be exposed to more advanced steganalyzers. To further improve the concealment of encrypted images and their resistance to steganalyzers, Volkhonskiy and others introduced SGAN [30], which embeds secret information into adversarial examples using a steganalysis network, enhancing the realism of cover images while bolstering the resistance of embedded images against steganalysis detection.
Nevertheless, the semantic shortcomings of SGAN may make embedded images noticeable. To address this issue, Balijia and collaborators designed a new model based on graph neural networks (GNNs) [31], associating the color channels of secret and cover images with a multi-level embedding strategy. This approach achieves high concealment and relatively low distortion, resulting in visually more natural embedded images. On this basis, Zhang and colleagues introduced ISGAN [32], which addresses the color-channel strategy by embedding secret information within the Y channel of cover images, thus avoiding color distortion. Yet there remains room for improvement in image quality and resistance to steganalysis detection.
Targeting the deception and security of learning-based detection, Din R et al. [33] proposed a steganography method based on feature embedding: the secret information is embedded into the features of the image, and the model's dependence on those features is exploited to improve concealment. Meanwhile, Wang, Chen, and others [34,35] proposed a deep learning-based method for semantic information that can detect semantic information embedded in an image, providing a new means for the security assessment of steganography and new ideas for its subsequent development.
Finally, the work of Li et al. [36] focuses on understanding and exploiting vulnerabilities in machine learning models to design covert methods that can evade or attack detection mechanisms. Research in this area not only enhances the effectiveness of steganographic techniques but also holds significant implications for personal privacy and copyright protection. In summary, although these algorithms have achieved remarkable results, they still have certain limitations. The VRIS model proposed in this paper aims to overcome them: by introducing random noise and optimizing the embedding strategy, it hides secret information more efficiently and reduces the risk of detection by both the human visual system and machine learning detectors; by using adversarial training, it enhances resistance to machine learning detection and improves the security and reliability of steganography. Compared with existing methods, the VRIS model offers higher concealment and stronger resistance while maintaining image quality, providing new ideas and methods for the development of steganography (as shown in Table 1).
3. VRIS Image Steganography Model
To achieve dual deception of machine learning models and the human visual system, this study designs the VRIS image steganography model (as shown in Figure 1), aiming for efficient completion of image steganography tasks. VRIS combines the data reconstruction capability of autoencoders with the generative-discriminative framework of Generative Adversarial Networks (GANs), falling under the category of unsupervised learning.
3.1. VRIS Model Architecture and Design
The VRIS model consists of three key components:
Visual-Masker Module: This module extracts features of the secret image using multi-scale convolutional kernels and leverages multi-layer convolution and batch normalization to enhance feature integration [37]. Ultimately, these features are mapped onto the cover image, producing a visually indistinguishable first-level encrypted image.
Hidden-Insight Module: This module extracts features from the second-level encrypted image, processes them through activation functions and batch normalization, and subsequently retrieves the information embedded within the hidden image. These refined features are then transformed into the secretly reconstructed image.
AI-Evasion Shield Module: Serving as a discriminator, this module distinguishes between noisy second-level encrypted images and cover images. Through adversarial training, it refines the steganography strategy to bolster the model’s resilience against machine learning-based deceptions.
To elevate the complexity and confidentiality of the first-level encrypted image, random noise is introduced prior to decoding. This gives rise to the second-level encrypted image, characterized by randomness, controllability, and diversity. Randomness ensures that each noise instance is unique, while controllability facilitates experimental adjustments. By varying the noise intensity, a range of experimental conditions can be explored. The incorporation of random noise serves to obscure image details, enhance the stealth of the hidden image, and fortify the model against potential attacks.
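As a minimal sketch of how this noise-injection step could be implemented in PyTorch (the paper does not publish code, so the clamping to the tanh output range and the default noise factor of 0.05, matching a factor reported in Section 4.3, are our assumptions):

import torch

def add_gaussian_noise(encrypted: torch.Tensor, noise_factor: float = 0.05) -> torch.Tensor:
    """Form the second-level encrypted image by adding zero-mean Gaussian noise
    whose intensity is set by the controllable noise factor."""
    noise = torch.randn_like(encrypted) * noise_factor  # fresh noise on every call (randomness)
    noisy = encrypted + noise
    return noisy.clamp(-1.0, 1.0)  # keep values in the tanh output range (assumption)

Varying noise_factor across runs yields the controllability and diversity described above.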
3.2. Image Preprocessing
Objective: Through a series of preprocessing steps, increase the diversity of the training dataset, enhance the model’s generalization ability, and optimize the model’s learning of image features.
Input: Original image dataset $D = \{I_1, I_2, \ldots, I_N\}$, where N is the total number of images.
Output: Preprocessed image set $D'$.
Steps:
1. Random Crop (Training Set):
For each image in the training set, use the RandomCrop function to randomly select a fixed-size area for cropping, generating a cropped image. Repeating this for every image yields the cropped training set.
2. Center Crop (Testing Set):
For each image in the testing set, use the CenterCrop function to crop a fixed-size area around the center point of the image. Repeating this for every image yields the cropped testing set.
3. Convert to Tensor:
For each image in the cropped training and testing sets, use the ToTensor function to convert it into tensor format, producing the tensorized training and testing sets.
4. Image Normalization:
For each tensor in the tensorized training and testing sets, apply the transforms.Normalize() function. This normalization eliminates scale differences between features by adjusting pixel value ranges, improving the model's learning efficiency on image features.
5. Dataset Random Subsampling:
If the dataset is too large, a random subset can be drawn with a sampling function to obtain a subsampled training set, which reduces the consumption of computing resources and time (as shown in Figure 2 and Figure 3) and improves the efficiency of the training process. A minimal sketch of this preprocessing pipeline is shown below.
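The steps above map naturally onto torchvision's transform pipeline. The following is a minimal sketch under stated assumptions: the 64 × 64 crop size follows Section 4.1, while the normalization statistics and the sampling helper are illustrative.

import random
from torchvision import transforms
from torch.utils.data import Subset

CROP = 64  # images are 64 x 64 in this study (Section 4.1)

train_tf = transforms.Compose([
    transforms.RandomCrop(CROP),                             # step 1: random crop (training set)
    transforms.ToTensor(),                                   # step 3: convert to tensor
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),  # step 4: normalize (assumed stats)
])

test_tf = transforms.Compose([
    transforms.CenterCrop(CROP),                             # step 2: center crop (testing set)
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])

def random_subsample(dataset, k):
    """Step 5: randomly keep k samples to cut compute and training time."""
    return Subset(dataset, random.sample(range(len(dataset)), k))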
3.3. Basic Results of the VRIS Basic Model
3.3.1. Visual-Masker
In designing the Visual-Masker module (as shown in Figure 4 and Algorithm 1), we focused on achieving efficient steganography of secret images within cover images while maintaining the visual naturalness of the cover image. To achieve this, we constructed the module based on convolutional neural networks, dividing it into two parts: feature extraction and information hiding.
Algorithm 1. Algorithm of the Visual-Masker.
1: FUNCTION Initialize_Visual-Masker(input_S, input_C):
2:   # Step 1: Process the secret image through the initial convolutional layers
3:   x1 = leaky_relu(bn1(conv1(input_S)))   # process input_S with conv1, bn1
4:   x2 = leaky_relu(bn2(conv2(input_S)))   # process input_S with conv2, bn2
5:   x2 = PAD(x2, (0, 1, 0, 1))             # pad x2 with zeros at the bottom right
6:   x3 = leaky_relu(bn3(conv3(input_S)))   # process input_S with conv3, bn3
7:   x4 = CONCATENATE(x1, x2, x3)           # concatenate x1, x2, x3 along the channel dimension
8:   # Step 2: Further process feature map x4 through deeper convolutional layers
9:   x1 = leaky_relu(bn4(conv4(x4)))        # process x4 with conv4, bn4
10:  x2 = leaky_relu(bn5(conv5(x4)))        # process x4 with conv5, bn5
11:  x2 = PAD(x2, (0, 1, 0, 1))             # pad x2 with zeros at the bottom right
12:  x3 = leaky_relu(bn6(conv6(x4)))        # process x4 with conv6, bn6
13:  x4_prime = CONCATENATE(x1, x2, x3)     # obtain processed feature map x4'
14:  # Step 3: Concatenate the cover image input_C with feature map x4'
15:  x_combined = CONCATENATE(input_C, x4_prime)  # concatenate along the channel dimension
16:  # Step 4: Further process x_combined through the hidden network
17:  FOR i FROM 1 TO N:                     # N is the number of repetitions
18:    x1_hidden = leaky_relu(bn7(conv7(x_combined)))  # process with conv7, bn7
19:    x2_hidden = leaky_relu(bn8(conv8(x_combined)))  # process with conv8, bn8
20:    x2_hidden = PAD(x2_hidden, (0, 1, 0, 1))        # pad x2_hidden
21:    x3_hidden = leaky_relu(bn9(conv9(x_combined)))  # process with conv9, bn9
22:    x_combined = CONCATENATE(x1_hidden, x2_hidden, x3_hidden)  # concatenate hidden feature maps
23:  x_hidden = x_combined                  # final hidden feature map containing secret image information
24:  # Step 5: Output the final image
25:  output_image = tanh(conv16(x_hidden))  # process with conv16 and apply tanh activation
26:  RETURN output_image                    # first-level encrypted image
First, the feature extraction module preprocesses the secret image through three parallel convolutional layers (conv1, conv2, conv3). Each layer employs a different kernel size (3 × 3, 4 × 4, and 5 × 5). This multi-scale approach is rooted in the varying abilities of kernel sizes to capture image details: the 3 × 3 kernel excels at extracting local information, while the 4 × 4 and 5 × 5 kernels are better at capturing global features. The resulting feature maps are concatenated along the channel dimension to form a richer feature map, which is then concatenated with the cover image in the same dimension. This fuses the processed secret-image features with the cover image; a minimal sketch of such a multi-scale branch is given below.
Figure 4. Visual-Masker module structure.
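A minimal PyTorch sketch of one such multi-scale branch follows; the padding of the even-kernel branch mirrors Algorithm 1, while the channel width and exact layer configuration are illustrative assumptions rather than the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleBlock(nn.Module):
    """Three parallel conv branches (3x3, 4x4, 5x5) concatenated along channels."""
    def __init__(self, in_ch=3, out_ch=50):  # out_ch is an illustrative assumption
        super().__init__()
        self.conv3 = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.conv4 = nn.Conv2d(in_ch, out_ch, kernel_size=4, padding=1)
        self.conv5 = nn.Conv2d(in_ch, out_ch, kernel_size=5, padding=2)
        self.bn3 = nn.BatchNorm2d(out_ch)
        self.bn4 = nn.BatchNorm2d(out_ch)
        self.bn5 = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        x1 = F.leaky_relu(self.bn3(self.conv3(x)))  # local detail (3x3)
        x2 = F.leaky_relu(self.bn4(self.conv4(x)))  # even kernel loses one row/column...
        x2 = F.pad(x2, (0, 1, 0, 1))                # ...so pad it back, as in Algorithm 1
        x3 = F.leaky_relu(self.bn5(self.conv5(x)))  # wider context (5x5)
        return torch.cat([x1, x2, x3], dim=1)       # richer multi-scale feature map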
The information-hiding module is responsible for deeply merging the extracted features of the secret image with the cover image. We utilize a series of convolutional layers and batch normalization layers (from conv7 to conv21) to further process the fused feature map obtained from the feature extraction module. Throughout this process, we not only adjust the dimensions of the feature maps but also enhance feature extraction and incorporate information from the secret image. This ensures that the embedded secret information remains concealed while preserving the cover image’s visual naturalness and authenticity. Additionally, multiple concatenation operations occur to combine outputs from different convolutional layers along the channel dimension, allowing us to form a more complex feature representation.
Finally, the Conv22 layer converts the final feature map into the output image. This convolutional operation reduces the feature map’s channels to three, corresponding to the RGB color channels, and applies the tanh activation function to normalize the output values within the range of −1 to 1, satisfying image data representation requirements.
The design of Visual-Masker follows an end-to-end principle, enabling it to learn the mapping relationship between the input image and its corresponding first-level encrypted output image. Each convolution layer is followed by a batch normalization layer to accelerate the training process and reduce internal covariate shifts. Additionally, at several critical points in the network, we utilize the LeakyReLU activation function, which introduces non-linear features and maintains effective gradient propagation throughout the network. This approach aids the model in learning complex mapping relationships.
With the above design, the Visual-Masker module is able to learn the deep features of the input image and also enables efficient steganography of secret images while maintaining the visual naturalness of the cover images. This design allows secret images to be hidden in the cover images without attracting the attention of the human eye, providing an efficient and stealthy solution to the image steganography task.
3.3.2. AI-Evasion Shield
In the architecture of VRIS, the AI-Evasion Shield module (Table 2) is the key component for improving the steganographic performance of encrypted images. The discriminator at the core of this module adopts a multi-layer cascaded convolutional network structure, with each layer incorporating a LeakyReLU activation function (negative slope set to 0.2). This choice avoids the vanishing-gradient problem, ensures that gradient information propagates effectively even in the deeper layers of the network, accelerates learning, and enhances nonlinear feature extraction and the model's ability to characterize complex image data.
To further enhance the stability and generalization ability of the model, a batch normalization layer is embedded after some of the convolutional layers. This strategy effectively reduces internal covariate shift, accelerates training, and facilitates the extraction of more stable feature representations from the input image; by gradually reducing the spatial dimensionality of the features, it lets the discriminator focus on the most discriminative abstract information in the image.
Compared with the traditional ReLU activation function, the LeakyReLU activation function allows small negative gradients to pass through, thus alleviating the problem of neuron ‘death’, while the batch normalization layer stabilizes the training process and speeds up convergence by normalizing the inputs between layers.
During the training process, the discriminator of AI-Evasion Shield receives two sets of inputs: one is the original cover image, and the other is the second-level encrypted image processed by the VRIS steganography algorithm with the addition of random Gaussian noise. Through a series of convolution, activation and batch normalization operations, the discriminator maps these input images to a scalar value between 0 and 1, which is converted by a sigmoid function and used as the discriminator’s confidence score for the authenticity of the input images. This score not only reflects the discriminator’s judgement on the authenticity of the image, but also provides an important reference for subsequent image processing and steganography techniques.
Through this design, the AI-Evasion Shield discriminator not only learns to differentiate between authentic cover images and second-level encrypted images, but also significantly enhances the difficulty for machine learning models to detect hidden information. This adversarial training approach improves the discriminator’s performance while simultaneously increasing the concealment of hidden information within the encrypted image.
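A minimal sketch of a discriminator of this shape (cascaded strided convolutions, LeakyReLU with slope 0.2, batch normalization after some layers, and a sigmoid confidence output) could look as follows; the channel widths and layer count are illustrative assumptions, not the exact configuration of Table 2.

import torch.nn as nn

discriminator = nn.Sequential(
    nn.Conv2d(3, 64, 4, stride=2, padding=1),     # 64x64 -> 32x32
    nn.LeakyReLU(0.2),
    nn.Conv2d(64, 128, 4, stride=2, padding=1),   # 32x32 -> 16x16
    nn.BatchNorm2d(128),
    nn.LeakyReLU(0.2),
    nn.Conv2d(128, 256, 4, stride=2, padding=1),  # 16x16 -> 8x8
    nn.BatchNorm2d(256),
    nn.LeakyReLU(0.2),
    nn.Conv2d(256, 1, 8),                         # 8x8 -> 1x1: one scalar per image
    nn.Flatten(),
    nn.Sigmoid(),                                 # confidence score in [0, 1]
)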
3.3.3. Hidden-Insight
The Hidden-Insight module (as shown in Figure 5 and Algorithm 2) is designed to extract the concealed secret image from the second-level encrypted image, serving the needs of legitimate users. Its design leverages the convolutional neural network architectures found in deep learning, exploiting their superior capabilities in feature extraction and representation learning.
Algorithm 2. Algorithm of the Hidden-Insight.
1: FUNCTION Initialize_Hidden_Insight(input_S):
2:   # Step 1: Perform preliminary feature extraction and transformation on the carrier image
3:   x1 = relu(conv1(input_S))            # extract first feature map
4:   x2 = relu(conv2(input_S))            # extract second feature map
5:   x2 = PAD(x2, (0, 1, 0, 1))           # pad x2 to ensure consistent sizes
6:   x3 = relu(conv3(input_S))            # extract third feature map
7:   x_new1 = CONCATENATE(x1, x2, x3)     # concatenate feature maps along the channel dimension
8:   # Step 2: Further transform the concatenated feature map x_new1
9:   x1' = relu(conv4(x_new1))            # extract new features
10:  x2' = relu(conv5(x_new1))            # extract new features
11:  x2' = PAD(x2', (0, 1, 0, 1))         # pad x2'
12:  x3' = relu(conv6(x_new1))            # extract new features
13:  x_new2 = CONCATENATE(x1', x2', x3')  # concatenate feature maps
14:  x_atlas = CONCATENATE(input_S, x_new2)  # concatenate carrier image with x_new2
15:  # Step 3: Repeat the process from Step 2 to deepen the network
16:  FOR i FROM 1 TO 2:                   # repeat twice
17:    x1 = relu(conv7(x_new2))           # extract additional features
18:    x2 = relu(conv8(x_new2))           # extract additional features
19:    x2 = PAD(x2, (0, 1, 0, 1))         # pad x2
20:    x3 = relu(conv9(x_new2))           # extract additional features
21:    x4_deep = CONCATENATE(x1, x2, x3)  # update x4_deep
22:  # Step 4: Transform the final feature map
23:  output_image = tanh(conv16(x4_deep)) # process with conv16 and apply tanh activation
24:  RETURN output_image                  # secretly reconstructed image
In the architecture of the Hidden-Insight module, the convolutional layers significantly reduce the number of parameters through parameter sharing, lowering the risk of overfitting while capturing the local correlations within images, which is crucial for extracting key features of secret images. To cope with the multi-scale characteristics that secret-image information may present, the module uses convolutional kernels of different sizes and configurations in its first few layers to capture features at different scales; this also accelerates training and enhances the model's generalization to different input data.
As the second-level encrypted image processes through the layers, Hidden-Insight gradually refines the core features of the secret image through multiple convolutions, activations (such as ReLU), and feature concatenations. Particularly in the final layers, carefully designed combinations of convolutional layers further enrich the representation and accuracy of the features.
Lastly, to ensure the visual naturalness and authenticity of the output image, the Hidden-Insight module employs a tanh activation function in the final layer. The tanh function normalizes the output values within a suitable range for image representation, yielding a high-quality secretly reconstructed image. This design guarantees that the extracted secret image visually aligns with the original image, satisfying the needs of legitimate users.
3.4. Hybrid Loss Function
3.4.1. Mean Square Error Loss
Mean Squared Error (MSE) is a commonly used regression loss function that measures the difference between predicted and true values. In the image steganography task, our goal is that the encrypted image formed after embedding a secret image into a cover image is as visually similar to the cover image as possible, so as to avoid arousing suspicion. To quantify this difference, we use the mean squared error as the loss function to compute the reconstruction loss between the secret and cover images:

$$\mathrm{MSE} = \frac{1}{T}\sum_{i=1}^{T}\left(y_i - \hat{y}_i\right)^2$$

where $y_i$ represents the true value of the ith sample, $\hat{y}_i$ represents the predicted value of the ith sample, and $T$ is the number of samples. Because the MSE is the mean of the squared differences between predicted and true values, it is especially sensitive to large errors.
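In PyTorch this is simply nn.MSELoss; the tensors below are dummies standing in for the real cover and encrypted images:

import torch
import torch.nn as nn

mse = nn.MSELoss()                           # mean of squared element-wise differences
cover_image = torch.rand(4, 3, 64, 64)       # dummy batch of cover images
encrypted_image = torch.rand(4, 3, 64, 64)   # dummy batch of encrypted images
reconstruction_loss = mse(encrypted_image, cover_image)  # penalizes large deviations heavily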
3.4.2. Adam Optimizer
In image steganography tasks, the model usually needs to handle multiple aspects of the image, and the gradients of different parameters may vary greatly during training. Adam is a gradient-descent-based optimization algorithm that combines the ideas of Momentum and RMSprop and has an adaptive learning rate. Its ability to adapt the learning rate keeps the model efficient and stable in the presence of these differences. Moreover, the Adam optimizer considers not only the first-order moment estimate of the gradient (i.e., the mean of the gradient) but also the second-order moment estimate, so it can dynamically adjust the learning rate of each parameter and converge quickly toward the optimal solution during training.
3.4.3. Adam Optimizer Core Formula
Momentum estimation:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$$

where $m_t$ represents the first-order estimate of the gradient (momentum), $g_t$ is the current gradient, and $\beta_1$ represents the exponential decay rate of the momentum term.

Uncentred variance estimation:

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$

where $v_t$ represents the second-order estimate of the gradient (uncentred variance) and $\beta_2$ represents the exponential decay rate of the variance term.

Parameter update:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\,\hat{m}_t$$

where $\theta_t$ represents the parameters, $\eta$ is the learning rate, $\hat{m}_t = m_t/(1-\beta_1^t)$ and $\hat{v}_t = v_t/(1-\beta_2^t)$ are the bias-corrected versions of the first-order and second-order moment estimates of the gradient, respectively, and $\epsilon$ is a small constant used to prevent division by zero.
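This update rule is what torch.optim.Adam implements. A minimal usage sketch matching the settings reported in Section 4.1 (learning rate 0.001 with a ReduceLROnPlateau scheduler) follows; the stand-in model, toy objective, and scheduler hyperparameters are illustrative assumptions.

import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import ReduceLROnPlateau

model = torch.nn.Linear(8, 8)  # stand-in for the VRIS generator/discriminator
optimizer = Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999), eps=1e-8)
scheduler = ReduceLROnPlateau(optimizer, mode="min", factor=0.5, patience=5)

for epoch in range(100):
    x = torch.randn(16, 8)
    loss = ((model(x) - x) ** 2).mean()  # toy objective in place of the real hybrid loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step(loss)  # lowers the learning rate when the monitored loss stalls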
3.4.4. A Variant of the Binary Cross-Entropy Loss
The binary cross-entropy loss function updates the model's parameters through optimization algorithms such as gradient descent, making the model's predictions closer to the true labels. By minimizing the binary cross-entropy loss, we can train more accurate and efficient image steganography models. Its basic form can be expressed as follows:

$$L_{\mathrm{BCE}} = -\left[y \log \hat{y} + (1 - y)\log(1 - \hat{y})\right]$$

where $y$ represents the true label of the sample and $\hat{y}$ represents the predicted probability assigned to the sample by the discriminator.
The discriminator in GAN is essentially a binary classifier that can distinguish whether the input sample is a real sample or a generated sample. Therefore, during training for image steganography tasks, we use a variant of the binary cross-entropy loss function that is often used to measure the accuracy of the model in predicting hidden information.
The loss function of the discriminator is divided into two parts: the first represents the loss of the discriminator's predictions on real samples, and the second the loss of its predictions on generated samples. In general we would like $D(x)$ to be close to 1 and $D(G(z))$ to be close to 0. The loss function of the discriminator is as follows:

$$L_D = -\mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\left[\log D(x)\right] - \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]$$

where $D$ represents the discriminator, $G$ is the generator, $p_{\mathrm{data}}(x)$ is the real data distribution, and $p_z(z)$ represents the noise distribution.
The generator's goal is to trick the discriminator into producing false predictions; in general, the generator wants to maximize the discriminator's prediction probability for the generated samples. Therefore, in actual training, we maximize $D(G(z))$. The generator's loss function is shown below:

$$L_G = -\mathbb{E}_{z \sim p_z(z)}\left[\log D(G(z))\right]$$
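A minimal sketch of how these two losses could drive one adversarial training step, written with the binary cross-entropy form above (the function names and the detach-based separation of the two updates are our assumptions):

import torch
import torch.nn as nn

bce = nn.BCELoss()

def discriminator_loss(D, cover, stego_noisy):
    """Real covers are labeled 1, noisy second-level stego images 0."""
    real_pred = D(cover)
    fake_pred = D(stego_noisy.detach())  # do not backpropagate into the generator
    return bce(real_pred, torch.ones_like(real_pred)) + \
           bce(fake_pred, torch.zeros_like(fake_pred))

def generator_loss(D, stego_noisy):
    """Maximize D(G(z)) by labeling generated images as real."""
    fake_pred = D(stego_noisy)
    return bce(fake_pred, torch.ones_like(fake_pred))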
4. VRIS Multidimensional Assessment
In this section, the details of the experiment, the selection of the dataset, and the evaluation metrics will be presented.
4.1. Experimental Setup and Dataset
In this study, the Adam optimizer was used to optimize the parameters of the generator and discriminator, the learning rate was set to 0.001, and the ReduceLROnPlateau scheduler was used to adjust the learning rate dynamically. The model was trained with a batch size of 46 for 1000 epochs, with each epoch randomly selecting 3200 images for training. Tiny ImageNet-200, Mini-ImageNet, 256_ObjectCategories, and LFW were selected as experimental datasets. Tiny ImageNet-200 is a small dataset consisting of 200 image categories, each containing 500 training images, 50 validation images, and 50 test images; it is commonly used for training and evaluating image classification models. Mini-ImageNet is also a dataset for small-sample image classification, consisting of a subset of ImageNet categories with 600 images each. Compared to Tiny ImageNet-200, the images of Mini-ImageNet are larger and clearer; the dataset is commonly used in few-shot learning, meta-learning, transfer learning, and adversarial training. The 256_ObjectCategories dataset contains 256 different object categories, collected from web image searches and manually filtered, and is mainly used for tasks such as image classification, object recognition, and image segmentation. The LFW dataset is a widely used face recognition dataset containing real-world face images from the Internet with large variations in pose, expression, and lighting conditions. In this study, all images from these datasets were resized to 64 × 64 pixels.
4.2. Visual Quality Assessment
4.2.1. Evaluation Indicator Selection
1. PSNR
PSNR (Peak Signal-to-Noise Ratio) is an indicator used to evaluate image quality; in image steganography it measures the quality degradation of encrypted images relative to the originals. It is derived by calculating the Mean Squared Error (MSE) between the original and encrypted images and normalizing it to the maximum pixel value of the image. A higher PSNR value indicates a smaller quality difference between the encrypted and original images, meaning that the embedding process has a minimal impact on image quality and the secret information is effectively hidden. PSNR focuses on pixel-level error, so in image steganography the PSNR value directly reflects how well image quality is maintained after embedding the secret image.
2. SSIM
SSIM is an image quality metric that simulates the human visual system, evaluating image similarity in terms of brightness, contrast, and structure. In image steganography, SSIM quantifies the similarity between the encrypted and original images and evaluates the influence of the embedded information on image quality. A higher SSIM value indicates that the image maintains better visual quality after embedding while still effectively hiding the secret information. Compared with PSNR, SSIM pays more attention to image structure and provides an assessment more in line with human perception.
3. Noise factor
In this study, the noise factor is an important parameter controlling the intensity of the random Gaussian noise; adjusting its value affects the performance of the image steganography model and produces encrypted images of different quality. A minimal sketch of the PSNR and SSIM computations is given below.
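The sketch below shows how PSNR can be computed from the MSE and how SSIM can be obtained from scikit-image (structural_similarity with the channel_axis argument requires scikit-image ≥ 0.19); the dummy images are illustrative.

import torch
from skimage.metrics import structural_similarity

def psnr(original, encrypted, max_val=1.0):
    """PSNR derived from the MSE between two images with pixels in [0, max_val]."""
    mse = torch.mean((original - encrypted) ** 2)
    return float(10 * torch.log10(max_val ** 2 / mse))

cover = torch.rand(3, 64, 64)
stego = (cover + 0.01 * torch.randn_like(cover)).clamp(0, 1)  # dummy encrypted image
print(psnr(cover, stego))
print(structural_similarity(cover.permute(1, 2, 0).numpy(),
                            stego.permute(1, 2, 0).numpy(),
                            channel_axis=-1, data_range=1.0))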
4.2.2. Results of the Experiment
Robustness is a crucial metric for evaluating the resistance of image steganography models against various attacks and noises. In assessing the robustness of the VRIS model, we selected the Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM) as the two primary performance evaluation criteria.
In this study, we evaluate the performance of the VRIS model by comparing PSNR and SSIM across datasets under different noise factors (as shown in Table 3). On Tiny ImageNet-200, PSNR fluctuates slightly as the noise factor increases while SSIM remains relatively stable, indicating that the VRIS model is robust in maintaining image structure even as random noise affects pixel-level fidelity. On Mini-ImageNet and LFW, both PSNR and SSIM remain relatively stable as the noise factor increases, showing that the VRIS model can resist the interference of random Gaussian noise while maintaining the high quality of the encrypted image. On 256_ObjectCategories, PSNR fluctuates more under different noise factors while SSIM remains relatively stable, implying that on this dataset the VRIS model is more sensitive to changes in luminance and contrast, while still showing strong stability in preserving image structure.
We compared the PSNR and SSIM of encrypted images generated by the VRIS model with those of encrypted images generated by state-of-the-art steganography models (shown in Table 4). The VRIS model performs better on the ImageNet and LFW datasets, with PSNR and SSIM values higher than or equal to those of ISGAN, SGAN, and the method in the literature [38]. This indicates that VRIS matches or outperforms the compared methods in terms of both image quality and similarity.
To further validate the quality of images generated by VRIS, we compared the per-channel histograms of the cover image and the first-level encrypted image. The results are illustrated in Figure 6. The x-axis of each histogram represents pixel values, corresponding to the brightness or color values of each pixel in the image; the y-axis indicates frequency, i.e., the number of pixels with the same value. The histograms thus provide a clear view of the brightness and color distribution before and after steganography. The minor differences in channel histograms before and after steganography indicate that the proposed steganographic model possesses good embedding capacity and quality: it effectively maintains the visual quality of the images after embedding the secret image into the cover image, avoiding noticeable changes or distortions.
4.3. Security Analysis
4.3.1. Evaluation Indicator Selection
The misclassification rate is a key metric for evaluating the performance of a classification model: it is the ratio of the number of misclassified samples to the total number of samples. In the image steganography task, the misclassification rate measures how effectively encrypted images deceive the target machine learning model; specifically, it is the proportion of images containing steganographic information that successfully mislead a machine learning classifier into producing a wrong classification. A high misclassification rate indicates that the steganographic embedding and encoding are highly effective at evading machine learning detection. A minimal sketch of this computation follows.
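As a minimal sketch (the classifier stands for any pretrained detector, such as torchvision's InceptionV3 or ResNet50; names are illustrative):

import torch

@torch.no_grad()
def misclassification_rate(classifier, stego_images, true_labels):
    """Fraction of stego images that the classifier labels incorrectly."""
    classifier.eval()                               # inference mode
    preds = classifier(stego_images).argmax(dim=1)  # predicted class per image
    return (preds != true_labels).float().mean().item()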
4.3.2. Experimental Results
Security is also a vital metric in evaluating image steganography models. In this study, we enhanced the security of the image steganography algorithm by adding random noise to the first-level encrypted image, thereby increasing the difficulty for unauthorized users to extract the secret information. The noise prevents unauthorized users from directly extracting the hidden secret information from the encrypted image without additional processing, making the secret information harder to analyze and extract. Without sufficient information and keys, unauthorized users find it difficult to restore the original secret information, thereby protecting its confidentiality. The specific experimental results are presented in Table 5.
In summary, by adding noise to the first-level encrypted image, we can confuse unauthorized users' analysis of the image content, enhancing the security of the image steganography algorithm. Moreover, the misclassification rate is a crucial evaluation metric for image steganography models. We selected InceptionV3 and ResNet50 to measure the misclassification rates of second-level encrypted images generated by our steganography model under varying noise intensities. According to the experimental results (Table 6), for InceptionV3 the misclassification rate shows an overall increasing trend as the noise factor rises, reaching 87.18% on the LFW dataset. For ResNet50, the misclassification rate also fluctuates under different noise factors, peaking at a notable 99.24% when the noise factor is 0.05. This indicates that our proposed image steganography model can effectively deceive machine learning models into erroneous classifications while concealing secret images, demonstrating significant security and robustness. This is crucial for protecting steganographic information and privacy, especially in the transmission and storage of sensitive data.
In addition, to further verify that the VRIS model also has significant advantages in effectiveness and robustness, we conducted a comparative analysis of multiple steganographic algorithms using the well-established steganalysis algorithms SRM and XuNet. The experimental results show that the VRIS model achieves a higher misclassification rate than the other algorithms under both steganalysis algorithms, indicating that VRIS has a stronger ability to evade steganalysis detection (as shown in Table 7).
4.4. Analysis of Concealment
To better demonstrate the pixel differences between the encrypted image and the cover image, we enhanced the residual images of the encrypted image and the cover image by factors of 5, 10, 15, and 20 (Figure 7). As the residual images are enhanced by larger factors, the noise points gradually increase and become more scattered, making it more difficult to identify valid information. This helps protect the privacy of the carrier image and the secret image and prevents unauthorized access to sensitive information. Furthermore, the scattered noise points interfere with the vision of unauthorized users, making it difficult for them to distinguish the information in the image, thereby increasing the difficulty of analysis and ensuring the concealment of the secret image. A minimal sketch of the residual-enhancement computation follows.
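A minimal sketch of this residual enhancement, assuming pixel values in [0, 1]:

import torch

def enhanced_residual(cover, stego, factor):
    """Amplify the absolute cover/stego difference so it becomes visible for inspection."""
    residual = (stego - cover).abs()
    return (residual * factor).clamp(0.0, 1.0)

# The magnifications used in Figure 7:
# views = [enhanced_residual(cover, stego, k) for k in (5, 10, 15, 20)]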
We compared the residual results of the method proposed in the literature [38], the VRIS model, and the ISGAN model (shown in Figure 8). Generally speaking, the color depth of a residual image reflects the degree of pixel difference between the encrypted and cover images: the darker the color, the smaller the pixel difference, the less the image distortion, and the better the visual quality of the encrypted image. At a residual multiplier of 1, the infrared imaging features of the face can already be vaguely identified in the method from the literature [38], while the residual images generated by both the VRIS model and the ISGAN model show better concealment. As the residual magnification increases, the differences become more obvious: at magnifications of 5 and 10, the cover images can be almost completely recognized in the residual images of the ISGAN model and the literature [38]. Although some detailed differences can also be observed in the residual images of the VRIS model, the degree of difference is significantly lower, and the cover images cannot be recognized. This suggests that, compared with the literature [38] and the ISGAN model, the VRIS model maintains the visual quality of the cover images while embedding the secret images, resulting in a higher degree of concealment.
6. Summary and Future Prospects
This paper introduces VRIS, an image steganography model capable of simultaneously deceiving both the human visual system and machines. By replacing the generator role in Generative Adversarial Networks (GANs) with an autoencoder, VRIS achieves the steganography of secret images. Random Gaussian noise is added to the encrypted image to ensure that unauthorized users cannot access the specific information of the secret image, thereby preserving its security and concealment. Experimental validation of the misclassification rates of the encrypted images generated by VRIS on InceptionV3 and ResNet50 demonstrates its strong security and robustness, particularly against various attacks, including those based on deep learning models.
However, as the noise factor increases, slight color differences emerge between the reconstructed secret image and the original, potentially compromising the integrity and accuracy of the secret information, particularly in applications where image quality is stringently required. Therefore, future research should focus on reducing the noise introduced during embedding through noise suppression techniques or enhancing the quality and accuracy of the secretly reconstructed image by optimizing the embedding algorithm. Furthermore, considering the various interference factors that may exist in practical application scenarios, deep learning methods can be explored to improve the robustness of the embedding algorithm, enabling effective embedding and extraction of secret images in diverse environments.
Given that the current VRIS model primarily focuses on image steganography, future endeavors can explore extending the model to multimodal data such as video and audio to cater to diverse application needs. This necessitates addressing the unique challenges of multimodal data, including synchronization and continuity while maintaining high concealment.