This paper proposes a data hiding network and an extraction network based on the Transformer. To evaluate how well the learned model fits the data, the losses from the hiding network and the extraction network are weighted and summed. Before hiding the secret image in the cover image, we encrypt the secret image to prevent the leakage of its information. Thus, the generated container image is double-encrypted. To encrypt the secret image, this paper proposes an image encryption method based on recursive permutation. After the image is encrypted, the encrypted secret image and the cover image are passed to the data hiding model to generate a container image for transmission. When a receiver receives the container image, the encrypted secret image is first extracted by the extraction network and then decrypted to recover the secret image.
3.1. Recursive Permutation
Traditional encryption algorithms usually encrypt an entire image by treating data at a single granularity equally, such as bit-level, two-bit-level (DNA-level), pixel-level, and/or block-level data. The encryption procedure is repeated until all data have been encrypted at least once.
It is known that many repeated tasks can be solved by introducing the idea of recursion. However, few existing encryption algorithms consider using such a strategy to conduct encryption. Here, we propose a type of recursive permutation for image encryption. The operations of the recursive permutation are determined by the sequence generated by the widely used logistic chaotic system [50], defined as

$$x_{n+1} = \mu x_n (1 - x_n),$$

where $x_0$ is an initial value in the range of $(0, 1)$ and $\mu$ is a positive parameter in the range of $(0, 4]$.
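As a concrete illustration, the following NumPy sketch generates a chaotic sequence of a given length from the logistic map; the values x0 = 0.4 and mu = 3.99 are illustrative choices, not the paper's actual key.

import numpy as np

def logistic_sequence(length, x0=0.4, mu=3.99):
    """Generate a chaotic sequence of `length` values with the logistic map."""
    seq = np.empty(length, dtype=np.float64)
    x = x0
    for i in range(length):
        x = mu * x * (1.0 - x)  # x_{n+1} = mu * x_n * (1 - x_n)
        seq[i] = x
    return seq

# Example: one chaotic value per pixel of a 256 x 256 image.
S = logistic_sequence(256 * 256)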
To our knowledge, this is the first time recursive ideas have been applied to image encryption. The proposed encryption algorithm mainly consists of four steps, shown as follows.
Step 1. Generate a chaotic sequence of the same size as the image.
Step 2. Divide the image to be encrypted into four parts: upper left, upper right, lower left, and lower right.
Step 3. Perform logistic scrambling encryption with the chaotic sequence on the overall image composed of the four parts.
Step 4. Recursively conduct Steps 2 and 3 on each of the four parts until the width or height of a part is 1.
By these four steps, a cipher image is obtained. Algorithm 1 shows the pseudocode for recursive encryption. Note that the called logistic_scramble_encryption function in Algorithm 1 refers to Algorithm 2. The decryption algorithm of recursive permutation is the inverse of the encryption algorithm.
Algorithm 1 Recursion_encryption(img, width, height, S)
Input: The secret image to encrypt, img; The width of the image, width; The height of the image, height; The generated chaotic sequence, S;
Output: The encrypted image, img;
nw ← ⌊width/2⌋, nh ← ⌊height/2⌋
if nw < 1 or nh < 1 then
  return img
else
  //Divide the image into four parts and encrypt the four parts, respectively. Encrypt the upper left part.
  Recursion_encryption(img[0:nw, 0:nh], nw, nh, S)
  //Use the logistic_scramble_encryption function (Algorithm 2) to scramble the part with the chaotic sequence generated by the logistic map.
  img[0:nw, 0:nh] ← logistic_scramble_encryption(img[0:nw, 0:nh], S)
  //Encrypt the lower left part.
  Recursion_encryption(img[0:nw, nh:height], nw, height − nh, S)
  img[0:nw, nh:height] ← logistic_scramble_encryption(img[0:nw, nh:height], S)
  //Encrypt the upper right part.
  Recursion_encryption(img[nw:width, 0:nh], width − nw, nh, S)
  img[nw:width, 0:nh] ← logistic_scramble_encryption(img[nw:width, 0:nh], S)
  //Encrypt the lower right part.
  Recursion_encryption(img[nw:width, nh:height], width − nw, height − nh, S)
  img[nw:width, nh:height] ← logistic_scramble_encryption(img[nw:width, nh:height], S)
  //Encrypt the entire image.
  img[0:width, 0:height] ← logistic_scramble_encryption(img[0:width, 0:height], S)
end if
return img
Algorithm 2 logistic_scramble_encryption(img, S)
Input: The image to encrypt, img; The generated chaotic sequence, S;
Output: The encrypted image, img;
w, h ← img.shape //Get the width (w) and height (h) of img.
img ← img.reshape(w × h, 3) //Flatten the spatial dimensions so that each row is one pixel.
idx ← argsort(S[0 : w × h]) //Obtain the indices that sort the first w × h elements of S.
img ← img[idx, :] //Permute the pixels according to idx.
img ← img.reshape(w, h, 3)
return img
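For concreteness, here is a minimal NumPy sketch of Algorithms 1 and 2. It follows the paper's img[width, height, channel] indexing, assumes a channel-last uint8 image, and reuses the first w × h entries of S for each sub-block (how S is subsetted per block is our assumption); a matching decryption routine, which applies the inverse permutations in reverse order, is included.

import numpy as np

def logistic_scramble(block, S, inverse=False):
    """Permute (or un-permute) the pixels of `block` by the sort order of S."""
    w, h = block.shape[:2]
    flat = block.reshape(w * h, -1)
    idx = np.argsort(S[:w * h])
    if inverse:
        out = np.empty_like(flat)
        out[idx] = flat              # undo the forward permutation
        flat = out
    else:
        flat = flat[idx]
    return flat.reshape(block.shape)

def _quads(w, h, nw, nh):
    """Slices of the upper-left, lower-left, upper-right, lower-right parts."""
    return [(0, nw, 0, nh), (0, nw, nh, h), (nw, w, 0, nh), (nw, w, nh, h)]

def recursion_encrypt(img, S):
    w, h = img.shape[:2]
    nw, nh = w // 2, h // 2
    if nw < 1 or nh < 1:
        return img
    for xs, xe, ys, ye in _quads(w, h, nw, nh):
        img[xs:xe, ys:ye] = recursion_encrypt(img[xs:xe, ys:ye], S)
        img[xs:xe, ys:ye] = logistic_scramble(img[xs:xe, ys:ye], S)
    img[:] = logistic_scramble(img, S)   # finally scramble the whole block
    return img

def recursion_decrypt(img, S):
    w, h = img.shape[:2]
    nw, nh = w // 2, h // 2
    if nw < 1 or nh < 1:
        return img
    img[:] = logistic_scramble(img, S, inverse=True)  # invert the last step first
    for xs, xe, ys, ye in _quads(w, h, nw, nh):
        img[xs:xe, ys:ye] = logistic_scramble(img[xs:xe, ys:ye], S, inverse=True)
        img[xs:xe, ys:ye] = recursion_decrypt(img[xs:xe, ys:ye], S)
    return img

# Round-trip check on a random RGB image (logistic_sequence is the earlier sketch):
rng = np.random.default_rng(0)
img = rng.integers(0, 256, (64, 64, 3), dtype=np.uint8)
S = logistic_sequence(64 * 64)
assert np.array_equal(recursion_decrypt(recursion_encrypt(img.copy(), S), S), img)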
3.2. Hiding Network
The hiding network uses a neural network structure based on the Swin-Transformer to hide the secret image in the cover image. The specific structure is shown in Figure 1. An RGB cover image and an RGB secret image are used as the network input, and an RGB container image is produced as the network output. All three images have the same size. The hiding network consists of three modules: shallow information hiding, deep information hiding, and container image construction. The shallow information hiding module uses a convolutional layer. The convolutional layer is good at early visual processing, leading to more stable optimization and better results [51]. It also provides a simple way to map the input image space to a high-dimensional feature space. Then, the deep information hiding module, composed of one Patch Embedding, four residual Swin-Transformer blocks (RSTB), one LayerNorm, one Patch Unembedding, and a convolutional layer, is used to hide deep information of the images. Finally, the container image construction module uses a convolutional layer to construct the container image of the same size as the inputs.
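The following is a minimal PyTorch sketch of this pipeline, not the paper's exact implementation: how the cover and encrypted secret are combined (here, channel-wise concatenation), the channel width embed_dim = 96, and the 3 x 3 kernels are our assumptions, and the RSTB is abstracted as a plain residual convolution block (the windowed attention described next would go inside it), with patch (un)embedding omitted.

import torch
import torch.nn as nn

class RSTBPlaceholder(nn.Module):
    """Stand-in for a residual Swin-Transformer block (RSTB); a real RSTB
    contains patch embedding/unembedding and Swin-Transformer layers."""
    def __init__(self, dim):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1), nn.GELU(),
            nn.Conv2d(dim, dim, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)              # residual connection

class HidingNetwork(nn.Module):
    def __init__(self, embed_dim=96, num_rstb=4):
        super().__init__()
        # Shallow information hiding: map the 6-channel input
        # (cover + encrypted secret) to a high-dimensional feature space.
        self.shallow = nn.Conv2d(6, embed_dim, 3, padding=1)
        # Deep information hiding: four RSTBs followed by a convolution.
        self.deep = nn.Sequential(
            *[RSTBPlaceholder(embed_dim) for _ in range(num_rstb)],
            nn.Conv2d(embed_dim, embed_dim, 3, padding=1))
        # Container image construction: project features back to RGB.
        self.construct = nn.Conv2d(embed_dim, 3, 3, padding=1)

    def forward(self, cover, secret_enc):    # both (B, 3, H, W)
        x = self.shallow(torch.cat([cover, secret_enc], dim=1))
        x = x + self.deep(x)                 # residual over the deep module
        return self.construct(x)             # container image (B, 3, H, W)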
As shown in
Figure 1, RSTB is a residual block with Patch Unembedding, Patch Embedding, Swin-Transformer layers (STL) and a convolutional layer. STL is based on the standard multi-head self-attention of the original Transformer layer [24,49]. The main differences lie in local attention and the shifted window mechanism. As shown in Figure 2, given an input of size $H \times W \times C$, the Swin-Transformer first reshapes the input to a $\frac{HW}{M^2} \times M^2 \times C$ feature by partitioning the input into non-overlapping $M \times M$ local windows, where $\frac{HW}{M^2}$ is the total number of windows. Then, the standard self-attention is calculated for each window, i.e., local attention. For a local window feature $X \in \mathbb{R}^{M^2 \times C}$, the query, key and value matrices $Q$, $K$ and $V$ are computed as

$$Q = XP_Q, \qquad K = XP_K, \qquad V = XP_V,$$

where $P_Q$, $P_K$ and $P_V$ are projection matrices that are shared across different windows. Generally, we have $Q, K, V \in \mathbb{R}^{M^2 \times d}$. As shown in Figure 3, the attention matrix is thus computed by the self-attention in a local window as

$$\mathrm{Attention}(Q, K, V) = \mathrm{SoftMax}\!\left(QK^{T}/\sqrt{d} + E\right)V,$$

where $E$ is the learnable relative positional encoding. In practice, following [24], we perform the attention function six times in parallel and concatenate the results for multi-head self-attention (MSA).
Next, a multi-layer perceptron (MLP) that has two fully connected layers with a GELU non-linearity between them is used for further feature transformations. The LayerNorm (LN) layer is added before both MSA and MLP, and the residual connection is employed for both modules. The whole process is formulated as

$$X = \mathrm{MSA}(\mathrm{LN}(X)) + X,$$
$$X = \mathrm{MLP}(\mathrm{LN}(X)) + X.$$
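A minimal PyTorch sketch of one STL follows. The six heads match the text above; the window size M = 8 and the MLP expansion ratio of 4 are illustrative assumptions, and the positional term E is simplified to one learnable bias per head rather than a full relative-position table.

import torch
import torch.nn as nn

class WindowAttention(nn.Module):
    """W-MSA: multi-head self-attention within non-overlapping M x M windows."""
    def __init__(self, dim=96, window=8, heads=6):
        super().__init__()
        assert dim % heads == 0
        self.m, self.h, self.dk = window, heads, dim // heads
        self.qkv = nn.Linear(dim, dim * 3)      # P_Q, P_K, P_V stacked
        self.proj = nn.Linear(dim, dim)
        # Simplified learnable positional term E (per head, per position pair).
        self.E = nn.Parameter(torch.zeros(heads, window**2, window**2))

    def forward(self, x):                        # x: (B, H, W, C); H, W % M == 0
        B, H, W, C = x.shape
        m = self.m
        # Partition into (B * HW/M^2) windows of M^2 tokens each.
        x = x.view(B, H // m, m, W // m, m, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, m * m, C)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        split = lambda t: t.view(-1, m * m, self.h, self.dk).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)
        # Attention(Q, K, V) = SoftMax(Q K^T / sqrt(d) + E) V, per window.
        attn = (q @ k.transpose(-2, -1)) / self.dk**0.5 + self.E
        x = (attn.softmax(-1) @ v).transpose(1, 2).reshape(-1, m * m, C)
        x = self.proj(x)
        # Reverse the window partition back to (B, H, W, C).
        x = x.view(B, H // m, W // m, m, m, C)
        return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)

class SwinTransformerLayer(nn.Module):
    """X = MSA(LN(X)) + X, then X = MLP(LN(X)) + X."""
    def __init__(self, dim=96, window=8, heads=6):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.msa = WindowAttention(dim, window, heads)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):
        x = x + self.msa(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x

# y = SwinTransformerLayer()(torch.randn(1, 64, 64, 96))  # shape preserved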
However, when the partition is fixed for different layers, there is no connection across local windows. Therefore, regular and shifted window partitioning are used alternately to enable cross-window connections [49], where shifted window partitioning means shifting the feature by $(\lfloor M/2 \rfloor, \lfloor M/2 \rfloor)$ pixels before partitioning. To enable such cross-window connections, the number of STL modules must be even.
Figure 2 shows the two successive Swin-Transformer blocks. From
Figure 4, in W-MSA window partitioning, a regular window partitioning scheme is adopted, and self-attention is computed within each window. In SW-MSA window partitioning, the window partitioning is shifted, resulting in new windows.
The self-attention computation in the new windows crosses the boundaries of the previous windows in W-MSA window partitioning, providing connections among them. In two successive Swin-Transformer layers, the first module uses a regular window partitioning strategy that starts from the top-left pixel, and the feature map is evenly partitioned into windows of size $M \times M$. Then, the next module adopts a windowing configuration that is shifted from that of the preceding layer by displacing the windows by $(\lfloor M/2 \rfloor, \lfloor M/2 \rfloor)$ pixels from the regularly partitioned windows. W-MSA and SW-MSA denote window-based multi-head self-attention using regular and shifted window partitioning configurations, respectively.
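In practice, the shifted configuration is commonly realized by cyclically shifting the feature map before the regular partition and shifting it back afterwards, as in the original Swin-Transformer. A sketch of SW-MSA wrapping the W-MSA module above (the attention mask for wrapped-around pixels is omitted for brevity):

import torch

def shifted_window_msa(msa, x, window=8):
    """SW-MSA: cyclically shift by floor(M/2), apply regular W-MSA, shift back."""
    s = window // 2
    x = torch.roll(x, shifts=(-s, -s), dims=(1, 2))   # shift along H and W
    x = msa(x)                                        # e.g., a WindowAttention module
    return torch.roll(x, shifts=(s, s), dims=(1, 2))  # undo the cyclic shift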
3.4. Loss Function
The evaluation criteria of traditional image data hiding schemes include the peak signal-to-noise ratio (PSNR), the mean squared error (MSE), etc., which are used to quantify the difference between the original cover image and the container image, and the difference between the secret data and the extracted data. Therefore, the MSE is used as the model loss function in this paper. In the hiding network, the MSE is used to measure the difference between the cover image $C$ and the container image $C'$, while in the extraction network, the MSE is used to measure the difference between the secret image $S$ and the extracted secret image $S'$. The MSE function can be formulated as

$$\mathrm{MSE}(I, I') = \frac{1}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N}\left(I(i,j) - I'(i,j)\right)^{2},$$

where $I$ and $I'$ denote the two matrices for the MSE operation, and $M$ and $N$ denote the length and width of the matrices, respectively. The loss function of the data hiding network is defined as
$$\mathcal{L} = \mathcal{L}_h + \beta\,\mathcal{L}_e = \mathrm{MSE}(C, C') + \beta\,\mathrm{MSE}(S, S'),$$

where $\mathcal{L}_h$ and $\mathcal{L}_e$ are the cost of the hiding network and the extraction network, respectively, and $\beta$ is a tradeoff factor to balance the two types of loss. Here, the weight of the error term $\mathrm{MSE}(C, C')$ of the hiding network is not shared with the extraction network, while the weight of the error term $\mathrm{MSE}(S, S')$ is shared between the two networks. This ensures that both networks adjust their training by receiving this error term, minimizing the error between the container image reconstructed by the hiding network and the cover image while ensuring that the information of the secret image is completely encoded into the cover image.