Inspired by the visual word embeddings designed in TNT [30] and the dual-branch structure in EDTER [36], we propose a local information-enhanced dual-branch Transformer network for CMFD tasks named the Local Branch Refinement Transformer (LBRT). LBRT ensures that the local branch partitions the image by further splitting the large patches of the global branch into small patches, so that the detail features within each global patch can be extracted. Meanwhile, LBRT extracts local information and refines the local details of the global information while adding very few parameters. Moreover, pre-trained weights are not needed to initialize the local branch, and joint training with the global branch shortens the overall training time.
As shown in Figure 2, our network follows an overall encoder–decoder structure. Given an input image $X$, we first apply a ViT-base [26] encoder with a dual-branch design to extract global similarity features $f_g$ and local detail features $f_{\ell}$ for copy–move forgery. The feature extraction process consists of the global context modeling branch and the local refinement branch. The global branch extracts features from a global perspective, enabling the model to focus on the source/target regions where copy–move manipulations occur in the image. The local branch extracts features within smaller image patches, allowing the model to focus on the edge details of local regions. Following the feature extraction stage, a fusion module combines the features $f_g$ derived by the global branch and $f_{\ell}$ derived by the local branch, producing a final feature representation $F$ that incorporates both the similarity of long-range tampered regions and the detailed information of local regions. After feature fusion is finished, the 2D feature map $F$ is sent to the decoder to ultimately obtain a copy–move tampering mask $M$.
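For clarity, the overall data flow can be summarized by the following PyTorch-style sketch. It is only an illustration of the pipeline described above, not our released implementation; the four sub-modules are placeholders whose internal structure is discussed in the remainder of this section.

```python
import torch
import torch.nn as nn

class LBRT(nn.Module):
    """Minimal sketch of the LBRT pipeline: dual-branch encoder -> fusion -> decoder."""
    def __init__(self, global_branch, local_branch, fusion, decoder):
        super().__init__()
        self.global_branch = global_branch  # global context modeling branch (Section 3.1)
        self.local_branch = local_branch    # local refinement branch (Section 3.2)
        self.fusion = fusion                # feature fusion module (Section 3.3)
        self.decoder = decoder              # up-sampling decoder (Section 3.3)

    def forward(self, x):                   # x: (B, 3, h, w) input image
        f_g = self.global_branch(x)         # global similarity features
        f_l = self.local_branch(x)          # local detail features
        fused = self.fusion(f_g, f_l)       # 2D feature map with both kinds of information
        mask = self.decoder(fused)          # (B, 1, h, w) predicted copy-move mask
        return mask
```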
3.1. Global Context Modeling Branch
Figure 3 depicts the specific structure of the global context modeling branch. Using an encoder built to ViT-base standards [26] for feature extraction, the model is able to extract two correlated regions from the original image, namely the source and target regions of the copy–move forgery. The ViT-base encoder has the advantage of being more in line with the characteristics of the CMFD task. Its core mechanism, the global self-attention calculation, is able to focus on region similarity in the global view based on the global patches divided from the image. The process is analogous to how the human eye first observes the entire image when detecting copy–move forgery and then searches for conspicuously similar regions, which are subsequently included among the areas suspected of being tampered with. Moreover, the global self-attention calculation also replaces, to some extent, the self-correlation computation module widely used in DCNN-based models, because it has been shown to be capable of extracting the correlation between different regions of the image itself and updating the feature representation accordingly.
We denote an input image as $X \in \mathbb{R}^{h \times w \times 3}$, where $h$ and $w$ are the height and width of the image, respectively. Firstly, to convert $X$ into a 1D sequence that can be used as input for the Transformer encoder, the image has to be processed through the image pre-processing block. The image is filtered by a convolution layer with a kernel size of $p \times p$ and a stride of $p$, which is equivalent to flattening the image and linearly projecting it into a patch embedding $E_g = [x_g^1; x_g^2; \dots; x_g^n] \in \mathbb{R}^{n \times c}$, where $n = hw/p^2$, $x_g^i$ represents the $i$-th global patch, $i \in \{1, \dots, n\}$, and $c$ represents the channel number of $E_g$. Position embeddings $E_{pos} \in \mathbb{R}^{n \times c}$ are then added to $E_g$ element-wise, generating the global patch embedding $Z_g^0$ as the input of the global Transformer encoder, which retains the same size as $E_g$. The image pre-processing step can be formulated as
$$ Z_g^0 = E_g + E_{pos} = [x_g^1; x_g^2; \dots; x_g^n] + E_{pos}, $$
where the superscript $l$ in $Z_g^{l}$ represents the $l$-th layer of the encoder ($Z_g^0$ denotes the input to the first layer), and $Z_g^{l} \in \mathbb{R}^{n \times c}$. Following the standards of the ViT-base backbone, the number of encoder layers, the patch size $p$, and the channel number $c$ are set to 12, 16 and 768, respectively.
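For illustration, the pre-processing block can be realized with a strided convolution, as in standard ViT implementations. The following PyTorch sketch assumes a 256 × 256 RGB input (consistent with $n = 256$ below), $p = 16$, and $c = 768$; it is an illustrative sketch rather than our exact code.

```python
import torch
import torch.nn as nn

class GlobalPatchEmbedding(nn.Module):
    """Image pre-processing of the global branch: p x p strided conv + position embedding."""
    def __init__(self, img_size=256, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2          # n = hw / p^2 = 256
        # A convolution with kernel = stride = p is equivalent to flattening each
        # p x p patch and applying a shared linear projection.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, x):                       # x: (B, 3, h, w)
        e = self.proj(x)                        # (B, c, h/p, w/p)
        e = e.flatten(2).transpose(1, 2)        # (B, n, c) patch embedding E_g
        return e + self.pos_embed               # Z_g^0 = E_g + E_pos

z0 = GlobalPatchEmbedding()(torch.randn(1, 3, 256, 256))   # -> (1, 256, 768)
```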
Subsequently, $Z_g^0$ is fed to the global Transformer encoder, which is composed of multiple layers of multi-head self-attention (MSA) and multi-layer perceptron (MLP) modules. The global encoding procedure provides a global perspective for identifying regions that exhibit copy–move tampering traits throughout the image at the patch level. Each MSA and MLP block is preceded by a LayerNorm layer and wrapped with a residual connection.
Figure 4 shows the specific steps of the global Transformer encoder.
In the MSA module, $Z_g^{l-1}$ is first multiplied by three learnable matrices, $W_Q$, $W_K$, and $W_V$, to obtain the query $Q$, key $K$, and value $V$, respectively. Next, there are $N_h$ attention layers, namely heads, running in parallel within each block, which divide the channels of $Q$, $K$, and $V$ into $N_h$ groups $(Q_i, K_i, V_i)$. Each group is used for a parallel self-attention calculation to obtain a self-attention matrix $A_i$. Then, $A_i$ is multiplied by $V_i$ to update the values of $V_i$ according to the attention weights between different patches. The results of all heads are then concatenated to obtain the final output of this module, $\mathrm{GMSA}(Z_g^{l-1})$. The mapping of $Q$, $K$, and $V$ to the output is calculated as
$$ Q = Z_g^{l-1} W_Q, \quad K = Z_g^{l-1} W_K, \quad V = Z_g^{l-1} W_V, $$
$$ A_i = \sigma\!\left(\frac{Q_i K_i^{\top}}{\sqrt{c / N_h}}\right), \quad \mathrm{GSA}(Q_i, K_i, V_i) = A_i V_i, \quad i = 1, \dots, N_h, $$
$$ \mathrm{GMSA}(Z_g^{l-1}) = \mathrm{Concat}\big(\mathrm{GSA}(Q_1, K_1, V_1), \dots, \mathrm{GSA}(Q_{N_h}, K_{N_h}, V_{N_h})\big), $$
where $Q, K, V \in \mathbb{R}^{n \times c}$. $n$ is set to 256, which is equal to the length of $Q$, $K$, and $V$; $N_h$ is set to 12; $\mathrm{GSA}(\cdot)$ denotes a global self-attention calculation whose output has the same size as $Q_i$, $K_i$, and $V_i$; $\mathrm{GMSA}(\cdot)$ denotes a global multi-head self-attention calculation whose output has the same size as the input of the module, $Z_g^{l-1}$; $\sigma(\cdot)$ denotes an activation function implemented by softmax; and $\mathrm{Concat}(\cdot)$ denotes channel concatenation.
The GMSA calculation computes the feature similarity between each global patch and the other patches using a dot product. The resulting self-attention matrix $A_i$, which represents the correlation between elements, is then used as the weights of a weighted summation that updates the feature representation of each global patch. During the MSA calculation, elements with a low correlation have a diminished impact, allowing the network to gradually focus on regions with high similarity, which are suspected to be regions that have been tampered with.
After performing the multi-head self-attention calculations, $\mathrm{GMSA}(Z_g^{l-1})$ is input into the MLP module to enrich the feature representation, where the number of channels is quadrupled and then projected back. The output of this module is the final output of one layer of the encoder, denoted as $Z_g^{l}$, which is also the input of the next layer of the encoder.
The final output of the global Transformer encoder, denoted as $f_g$, is obtained after completing the encoding procedure for all 12 stacked layers.
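One layer of the global Transformer encoder can be sketched as follows. The sketch uses `nn.MultiheadAttention` as a stand-in for the GMSA computation (it performs the same Q/K/V projections, per-head scaled dot-product attention, and head concatenation) and follows the ViT-base settings quoted above; it is illustrative rather than our exact implementation.

```python
import torch
import torch.nn as nn

class GlobalEncoderLayer(nn.Module):
    """One layer of the global encoder: LN -> GMSA -> residual, LN -> MLP -> residual."""
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # nn.MultiheadAttention realizes the Q/K/V projections, the per-head
        # softmax(QK^T / sqrt(d)) V computation, and the head concatenation (GMSA).
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(                 # channels quadrupled, then projected back
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Linear(dim * mlp_ratio, dim))

    def forward(self, z):                         # z: (B, n, c) = Z_g^{l-1}
        h = self.norm1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]   # z + GMSA(LN(z))
        z = z + self.mlp(self.norm2(z))                     # z + MLP(LN(z))
        return z                                            # Z_g^{l}
```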
3.2. Local Refinement Branch
Modeling the global context alone is not enough to capture all of the forgery feature information. The image is only coarsely divided into global patches in the global branch, and in the ensuing MSA calculations the self-attention between the query and the key is calculated at the level of these patches to discover the correlations between them, which are essential for feature extraction. Nevertheless, modeling of the local information within each patch is lacking, which may affect the localization performance on tampered regions. Consequently, we design a local refinement branch, implemented inside the global patches defined by the global branch, to carry out further local Transformer encoding. The feature extraction procedures of the two branches are run concurrently. The architecture of the local refinement branch is illustrated in Figure 5.
In the local branch, we still need to divide $X$ into several patches, called global patches, and then flatten these patches into a 1D sequence. Note that in this branch, every global patch is treated as an input image; i.e., during image pre-processing, every global patch is re-divided, and each global patch $x_g^m$ is split into several sub-patches, called local patches. Afterwards, each batch of local patches is sent to the linear projection layer to be processed as a local patch embedding and then added to the position embedding as the input of the local Transformer encoder, $Z_{\ell}^{m,0}$. In this case, $m \in \{1, \dots, n\}$ indexes the $m$-th global patch, $k \in \{1, \dots, n'\}$ indexes the local patches, and $x_{\ell}^{m,k}$ represents the $k$-th local patch belonging to the $m$-th global patch, where $n'$ is the number of local patches per global patch. Thus, pre-processing by the local branch is finished. The procedure for obtaining the local patches is performed in the Intra-Patch Re-Dividing Layer (IPRL), and it can be formulated as
$$ Z_{\ell}^{m,0} = \big[\mathrm{LP}(x_{\ell}^{m,1}); \mathrm{LP}(x_{\ell}^{m,2}); \dots; \mathrm{LP}(x_{\ell}^{m,n'})\big] + E_{pos}^{\ell}, $$
where $Z_{\ell}^{m,0} \in \mathbb{R}^{n' \times c'}$ represents the local patch embedding obtained from the $m$-th global patch during image pre-processing, the superscript $l$ in $Z_{\ell}^{m,l}$ represents the $l$-th layer of the encoder, $\mathrm{LP}(\cdot)$ is the linear projection layer, $E_{pos}^{\ell}$ is the local position embedding, and $c'$ is the channel number of the local patch embedding, which is set to a lower value of 48 in order to limit the computational cost of re-concatenating the local patch embeddings after the following encoding process.
The IPRL design ensures that local patches are divided based on a global patch, and each batch of local patches belonging to different global patches is fed into the encoder for parallel local self-attention calculation, thereby facilitating the extraction of detailed features from each local area in the image. Thanks to the incorporation of position embedding in every local patch, the locations of all local patches in the image are recorded, which facilitates the subsequent fusion between the local and global features.
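The following sketch illustrates the IPRL under assumed settings: 16 × 16 global patches, 4 × 4 local patches (the local patch size here is an assumption for illustration), and the 48-channel local embedding mentioned above. It is an illustrative sketch of the splitting and projection logic, not our exact code.

```python
import torch
import torch.nn as nn

class IntraPatchReDividingLayer(nn.Module):
    """IPRL: re-divide each global patch into local patches, project, add position embedding."""
    def __init__(self, global_patch=16, local_patch=4, in_chans=3, local_dim=48):
        super().__init__()
        self.gp, self.lp = global_patch, local_patch
        self.num_local = (global_patch // local_patch) ** 2          # n' local patches per global patch
        self.proj = nn.Linear(local_patch * local_patch * in_chans, local_dim)  # linear projection LP(.)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_local, local_dim))

    def forward(self, x):                                  # x: (B, 3, h, w)
        b, c, h, w = x.shape
        # Split the image into global patches, then each global patch into local patches.
        x = x.unfold(2, self.gp, self.gp).unfold(3, self.gp, self.gp)      # (B, 3, h/gp, w/gp, gp, gp)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(-1, c, self.gp, self.gp)   # (B*n, 3, gp, gp)
        x = x.unfold(2, self.lp, self.lp).unfold(3, self.lp, self.lp)      # (B*n, 3, gp/lp, gp/lp, lp, lp)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(x.size(0), self.num_local, -1)  # (B*n, n', lp*lp*3)
        return self.proj(x) + self.pos_embed               # local patch embeddings Z_l^{m,0}
```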
Afterwards, all local patch embeddings $Z_{\ell}^{m,0}$ obtained from a single image are fed into the local Transformer encoder in batches. This encoder has the same structure as the global Transformer encoder. The local Transformer encoder calculates the self-attention within each global patch based on the local patches in order to find the local details inside the global patches. The MSA calculation of the local branch can be formulated as
$$ \mathrm{LSA}(Q_i, K_i, V_i) = \sigma\!\left(\frac{Q_i K_i^{\top}}{\sqrt{c' / N_h}}\right) V_i, \qquad \mathrm{LMSA}(Z_{\ell}^{m,l-1}) = \mathrm{Concat}\big(\mathrm{LSA}(Q_1, K_1, V_1), \dots, \mathrm{LSA}(Q_{N_h}, K_{N_h}, V_{N_h})\big), $$
where $Q, K, V \in \mathbb{R}^{n' \times c'}$; $N_h$ is set to 12; $\mathrm{LSA}(\cdot)$ is the local self-attention calculation; $\mathrm{LMSA}(\cdot)$ is a local multi-head self-attention calculation whose output has the same size as the input of the module, $Z_{\ell}^{m,l-1}$; and $\sigma(\cdot)$ and $\mathrm{Concat}(\cdot)$ denote the same operations as in the global branch.
The LMSA calculation facilitates the comprehensive exploration of the correlations among the distinct local patches within each global patch. This type of local feature effectively captures the detailed semantic information in each local area of the image, thereby enabling the network to focus on meaningful local areas and compensating for the shortcomings of the basic Transformer backbone in extracting the internal information of global patches.
After completing the 12-layer encoding, we concatenate the local patch embeddings of all global patches to obtain the final output of the local branch, denoted as $f_{\ell}$. This output is fed into the fusion module along with the final output of the global branch, $f_g$.
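To make the batched computation explicit, the sketch below runs a generic local encoder with the global-patch index folded into the batch dimension and then re-concatenates the per-patch outputs into $f_{\ell}$. The function and its arguments are illustrative; `local_encoder` stands for any stack of Transformer layers operating on 48-channel local sequences, under the same assumptions as the IPRL sketch.

```python
import torch

def local_branch_forward(x, iprl, local_encoder, num_global=256):
    """Run the local refinement branch: IPRL -> per-global-patch encoding -> concatenation.

    x:             (B, 3, h, w) input image
    iprl:          IntraPatchReDividingLayer from the previous sketch
    local_encoder: any module mapping (batch, n', c') -> (batch, n', c'), e.g. a stack
                   of Transformer layers with c' = 48
    """
    z = iprl(x)                                   # (B*n, n', c'): one sequence per global patch
    z = local_encoder(z)                          # LMSA attends only within each global patch
    b = x.size(0)
    # Re-concatenate the local patch embeddings of all global patches into f_l.
    f_l = z.reshape(b, num_global, -1)            # (B, n, n'*c') local detail features
    return f_l
```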
3.3. Feature Fusion Module and Decoder
As previously indicated, we have extracted the long-range global contextual information of the whole image with the global branch and the local information within the patches with the local branch. Subsequently, the features derived from the two branches need to be combined by the feature fusion module so that the network can simultaneously focus on the two extremely similar manipulation regions in the global context of the image and on the edge artifacts created by the copy–move forgery in the local regions. This ensures that the attended regions are indeed the copy–move forgery regions and that they are localized more accurately.
Specifically, the global branch output $f_g$ and the local branch output $f_{\ell}$ are first reshaped into 2D feature maps $F_g$ and $F_{\ell}$, respectively. The two feature maps are then concatenated along the channel dimension and fed into a convolution layer with a kernel size of 1 × 1, which keeps the spatial size of the output consistent with the two inputs while adjusting the number of output channels to 256. After the convolution, a BatchNorm layer and a ReLU activation function are applied to obtain the final output of the module, $F$.
The design of the fusion module is quite simple, but it has proven to be effective. Since the IPRL designed in the local branch ensures that extraction of the local information is based on each global patch, and a position embedding has been added to every local patch to record its position in the image while generating the local patch embedding, the global and local features can be aligned in space. Therefore, only channel-wise fusion needs to be considered in the fusion process. Mechanically, each channel of the feature map can be regarded as a feature descriptor. In this situation, a simple 1 × 1 convolution layer can be used for channel selection, eliminating extraneous description information and efficiently incorporating the local feature descriptors into the global features to refine the description of the dominant global features. The overall feature representation not only retains the judgment of two highly similar regions in the global features, but it also helps to judge whether tampering traces actually exist in these suspected regions according to the local features and locates the edges of the tampered regions more accurately using the semantic information within each local region.
Figure 6 illustrates the specific structure of the feature fusion module.
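Assuming both branch outputs have already been reshaped into spatially aligned 2D feature maps, the fusion module can be sketched as follows (an illustrative sketch; the 768-channel local map follows from the assumptions used in the earlier local-branch sketches):

```python
import torch
import torch.nn as nn

class FusionModule(nn.Module):
    """Channel-wise fusion: concatenate F_g and F_l, 1x1 conv to 256 channels, BN, ReLU."""
    def __init__(self, global_dim=768, local_dim=768, out_dim=256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(global_dim + local_dim, out_dim, kernel_size=1),  # channel selection
            nn.BatchNorm2d(out_dim),
            nn.ReLU(inplace=True))

    def forward(self, f_g, f_l):          # f_g, f_l: (B, C, h/p, w/p), spatially aligned
        return self.fuse(torch.cat([f_g, f_l], dim=1))   # F: (B, 256, h/p, w/p)
```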
After a 2D feature map $F$ with both global and local information is obtained, the feature map needs to be decoded to obtain the pixel-level predicted mask. We create a simple but useful decoder to up-sample the 2D feature map in a learnable way. The decoder also relies on the convolution layer as its fundamental module. Following the fusion of the two types of feature information by the feature fusion module, the output $F$ is fed into two 3 × 3 convolution layers. Each convolution layer keeps the spatial size of its output consistent with its input and is followed by a BatchNorm and ReLU layer. However, the first 3 × 3 convolution layer keeps the number of channels unchanged, whereas the second 3 × 3 convolution layer reduces the channel number to 1. Following the second 3 × 3 convolution layer, we up-sample the feature map by a factor of $p$ using bilinear interpolation so that the final decoded mask $M$ matches the spatial resolution of the original image. Thus, $M$ is the predicted mask that indicates the position of the copy–move forgery in the original image.
Figure 6 illustrates the decoder’s specific structure.
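A minimal sketch of the decoder, following the description above (two 3 × 3 convolutions, each with BatchNorm and ReLU, followed by ×16 bilinear up-sampling); it is illustrative rather than our exact code:

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Two 3x3 convs (the second reduces channels to 1) followed by x16 bilinear up-sampling."""
    def __init__(self, in_dim=256, scale=16):
        super().__init__()
        self.conv1 = nn.Sequential(
            nn.Conv2d(in_dim, in_dim, kernel_size=3, padding=1),  # keeps the channel number
            nn.BatchNorm2d(in_dim), nn.ReLU(inplace=True))
        self.conv2 = nn.Sequential(
            nn.Conv2d(in_dim, 1, kernel_size=3, padding=1),       # reduces channels to 1
            nn.BatchNorm2d(1), nn.ReLU(inplace=True))
        self.up = nn.Upsample(scale_factor=scale, mode='bilinear', align_corners=False)

    def forward(self, fused):              # fused: (B, 256, h/p, w/p)
        m = self.conv2(self.conv1(fused))  # (B, 1, h/p, w/p)
        return self.up(m)                  # (B, 1, h, w) predicted mask M
```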
3.4. Loss Function
In order to train LBRT, we use the binary cross-entropy loss ($\mathcal{L}_{bce}$) for the localization task and fully supervise the predicted mask with the pixel labels of the ground truth (GT) mask, which has the same size as the original image, where a value of 0 represents an original pixel and a value of 1 represents a tampered pixel. $\mathcal{L}_{bce}$ is formulated as
$$ \mathcal{L}_{bce}(M, G) = -\frac{1}{hw}\sum_{j=1}^{hw}\big[G_j \log M_j + (1 - G_j)\log(1 - M_j)\big], $$
where $G$ is the ground truth mask that indicates the position of the copy–move forgery and $j$ indexes the pixels. In addition, LBRT adds corresponding auxiliary decoding heads, which take the outputs of different layers of the Transformer encoder, feed them into a fusion module and decoder identical to those of the main task, and calculate the auxiliary losses by comparing the predicted masks with the GT, denoted as $\mathcal{L}_{aux}^{i}$, where $i$ represents the $i$-th encoder layer. The auxiliary losses are then summed with the loss of the main task, denoted as $\mathcal{L}_{main}$, to obtain the final total loss. The addition of an auxiliary loss has been shown to contribute to the convergence of model training [37]. The calculation of the total loss is formulated as
$$ \mathcal{L}_{total} = \mathcal{L}_{main} + \sum_{i} \mathcal{L}_{aux}^{i}, $$
where $\mathcal{L}_{aux}^{i}$ represents the loss of the $i$-th auxiliary decoding head, $\mathcal{L}_{main}$ represents the loss of the main task, and $\mathcal{L}_{total}$ is the total loss. In this study, we feed the outputs of the 3rd, 6th, 9th and 12th layers of the Transformer encoder into the auxiliary decoding heads.
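A sketch of the training loss, assuming each auxiliary head outputs a full-resolution logit map and that the auxiliary losses are simply summed with the main loss (no per-head weights are stated in the text):

```python
import torch
import torch.nn.functional as F

def lbrt_loss(main_pred, aux_preds, gt):
    """Binary cross-entropy on the main prediction plus the auxiliary heads.

    main_pred: (B, 1, h, w) logits from the main decoder
    aux_preds: list of (B, 1, h, w) logits from the auxiliary heads attached
               to encoder layers 3, 6, 9 and 12
    gt:        (B, 1, h, w) float ground-truth mask, 0 = original pixel, 1 = tampered pixel
    """
    loss_main = F.binary_cross_entropy_with_logits(main_pred, gt)
    loss_aux = sum(F.binary_cross_entropy_with_logits(p, gt) for p in aux_preds)
    return loss_main + loss_aux            # L_total = L_main + sum_i L_aux_i
```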