Article
Peer-Review Record

LBRT: Local-Information-Refined Transformer for Image Copy–Move Forgery Detection

Sensors 2024, 24(13), 4143; https://doi.org/10.3390/s24134143
by Peng Liang *,†, Ziyuan Li, Hang Tu and Huimin Zhao
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Reviewer 4: Anonymous
Submission received: 15 May 2024 / Revised: 19 June 2024 / Accepted: 24 June 2024 / Published: 26 June 2024
(This article belongs to the Special Issue Image Processing and Analysis for Object Detection: 2nd Edition)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

Contributions:

This study proposes a local-information-refined, dual-branch Transformer-based network for image copy-move forgery detection. My comments are as follows:

 

  1. The full names of abbreviations should be provided when they first appear, such as LBRT, USCISI, CASIA, CMFD, and DEFACTO. Please check them throughout this paper.
  2. The performance is acceptable.
  3. (Line 78 on page 3) Why is the abbreviation LBRT?
  4. (Page 6) Please briefly introduce the ViT-based standard.
  5. (Line 248 on page 7) The group number should be n1 rather than n.
  6. (Line 249 on page 7) Matrix A does not appear in eq. (2).
  7. (Page 7) What is the value of Dh?
  8. (Page 7) The symbols of matrices and vectors should be bolded.
  9. (Page 7) What are the dimensions of CSA(Zl) in eq. (2) and GMSA(Zl) in eq. (3)?
  10. (Line 254 on page 7) Are q, k, and v matrices? If so, they are inconsistent with Fig. 2, where they have been flattened.
  11. (Line 262 on page 7) Why do you use the symbol Z12?
  12. (Page 8) The borderline in Fig. 5 should be removed.
  13. (Line 285 on page 8) The statement should be rewritten.
  14. (Page 9) As shown in eq. (5), q and k are matrices. Why does it need a transpose operator on k?
  15. (Page 10) Equation (7) is unclear. Please define it explicitly.
  16. (Page 10) Section 4 needs to be renamed. It can be “experimental results”.
Comments on the Quality of English Language

The quality of the English language should be further improved.

Author Response

We greatly appreciate your professional review of our article. As you noted, several problems needed to be addressed. Following your suggestions, we have made extensive corrections to the previous draft; the detailed corrections are listed below.

Comment 1: The full names of abbreviations should be provided when they first appear, such as LBRT, USCISI, CASIA, CMFD, and DEFACTO. Please check them throughout this paper.

Response 1: Thank you for pointing this out. The full names of the abbreviations have been added where they first appear in the main body of the manuscript. However, USCISI, CASIA, and DEFACTO are the official names of these datasets; we could not find expanded forms in their original papers, so we cannot provide them.

 

Comment 2: The performance is acceptable.

Response 2: Thank you for your recognition of our work.

 

Comment 3: (Line 78 on page 3) Why is the abbreviation LBRT?

Response 3: Thank you for pointing this out. The full name of LBRT is Local Branch Refinement Transformer. We have added LBRT's full name on page 3, line 81.

 

Comment 4: (Page 6) Please briefly introduce the ViT-based standard.

Response 4: Thanks for your suggestion. The ViT-base encoder uses 12 stacked encoding blocks; the patch size is 16x16, the number of heads in the multi-head self-attention is 12, and the number of channels of the patch embeddings is 768. The encoder of our global branch follows these hyperparameters, which are denoted l, p1, n1, and d1 in our manuscript. We have also added a detailed description of the ViT-base encoder on page 7, lines 236-245.
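For concreteness, the shape bookkeeping in these hyperparameters can be sketched in a few lines of numpy. The 256x256 input size is an assumption inferred from N1 = 256 and a 16x16 patch size; it is not stated in this response.

```python
import numpy as np

# Hypothetical ViT-base-style patch-embedding shapes (input size assumed).
H = W = 256          # assumed image height/width
p1 = 16              # patch size
d1 = 768             # channels of each patch embedding
n1 = 12              # attention heads
l = 12               # stacked encoder blocks

# Number of global patches: one token per 16x16 tile of the image.
N1 = (H // p1) * (W // p1)
print(N1)            # 256

# One image becomes a sequence of patch embeddings of shape (N1, d1).
Z = np.random.randn(N1, d1)
print(Z.shape)       # (256, 768)
```

This matches the N1 = 256 and d1 = 768 values quoted later in Response 9.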

 

Comment 5:  (Line 248 on page 7) The group number should be n1 rather than n.

Response 5: We are really sorry for our careless mistake. Thank you for your reminder; the mistake has been corrected in the revised version.

 

Comment 6:  (Line 249 on page 7) Matrix A does not appear in eq. (2).

Response 6: Thank you for pointing this out. We have modified formulas 2-3 and their descriptions on page 8 by adding the representation of matrix A.

 

Comment 7: (Page 7) What is the value of Dh?

Response 7: Thank you for pointing this out. The value of Dh is set to 256, which is equal to the length of the vectors q, k and v. We have added the description on page 8, lines 272-273.

 

Comment 8: (Page 7) The symbols of matrices and vectors should be bolded.

Response 8: Thank you for pointing this out. According to your suggestion, all of the symbols of matrices and vectors in Section 3 have been checked and bolded.

 

Comment 9: (Page 7) What are the dimensions of CSA(Zl) in eq. (2) and GMSA(Zl) in eq. (3)?

Response 9: Thank you for pointing this out. The dimensions of CSA(Zl) in eq. (2) and GMSA(Zl) in eq. (3) are both N1×d1, equal to the dimension of Zl. In our work, N1 and d1 denote the length of the patch-embedding sequence and the number of channels of the patch embedding, which are set to 256 and 768, respectively.

 

Comment 10: (Line 254 on page 7) Are q, k, and v matrices? If so, they are inconsistent with Fig. 2, where they have been flattened.

Response 10: q, k, and v are two-dimensional matrices of length N1 with d1 channels. However, considering only their spatial dimensions, they are obtained by flattening the image in space, i.e., from a two-dimensional grid to a one-dimensional sequence.
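A minimal numpy sketch of this flattening, assuming a 16x16 patch grid (inferred from N1 = 256) and d1 = 768 channels; the array names are illustrative, not taken from the paper:

```python
import numpy as np

# A 16x16 spatial grid of patch features, each with 768 channels.
grid = np.random.randn(16, 16, 768)   # (rows, cols, channels)

# Flatten the two spatial dimensions into one sequence dimension:
# the 2-D grid becomes a 1-D sequence of 256 tokens, so q is a matrix.
q = grid.reshape(-1, 768)
print(q.shape)                        # (256, 768)
```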

 

Comment 11: (Line 262 on page 7) Why do you use the symbol Z12?

Response 11: The global patch embedding is denoted Zl, where l represents the l-th layer of the encoder. Therefore, Z12 is the final output of the global Transformer encoder after all 12 stacked layers. We have also added the relevant description on page 8, line 290.

 

Comment 12: (Page 8) The borderline in Fig. 5 should be removed.

Response 12: Thank you for pointing this out. We have removed the borderline in Fig.4 and Fig.5 in the revised version.

 

Comment 13: (Line 285 on page 8) The statement should be rewritten.

Response 13: We are really sorry for our careless mistake. Thank you for your reminder; the statement has been rewritten in the revised version.

 

Comment 14: (Page 9) As shown in eq. (5), q and k are matrices. Why does it need a transpose operator on k?

Response 14: As mentioned in Response 10, q and k are two-dimensional matrices of length N1 with d1 channels. In the matrix multiplication, k must be transposed into a d1×N1 matrix so that each channel of q can be multiplied and summed with the corresponding channel of k, finally yielding the attention matrix A of dimension N1×N1.
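Under the dimensions given in Responses 7 and 9 (N1 = 256, d1 = 768, Dh = 256), the multiplication can be sketched as follows; `attention_matrix` is an illustrative helper written for this sketch, not the paper's code:

```python
import numpy as np

def attention_matrix(q, k, Dh=256):
    """A = softmax(q k^T / sqrt(Dh)); k is transposed so channels align."""
    scores = q @ k.T / np.sqrt(Dh)    # (N1, d1) @ (d1, N1) -> (N1, N1)
    # Numerically stable softmax over each row.
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

N1, d1 = 256, 768
q = np.random.randn(N1, d1)
k = np.random.randn(N1, d1)
A = attention_matrix(q, k)
print(A.shape)    # (256, 256): one similarity score per pair of patches
```

Without the transpose, the product q @ k would be undefined, since (N1, d1) cannot be multiplied by (N1, d1).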

 

Comment 15: (Page 10) Equation (7) is unclear. Please define it explicitly.

Response 15: Thank you for pointing this out. In the revised version, the equation is numbered (9); we have added equation (8) to the manuscript, along with a detailed description of equation (9), to define the formula clearly.

 

Comment 16: (Page 10) Section 4 needs to be renamed. It can be “experimental results”.

Response 16: Thanks for your suggestion. We have modified the name of Section 4 in the revised version.

 

Comment 17: The quality of the English language should be further improved.

Response 17: Thanks for your suggestion. We feel sorry for our poor writings. We have used the paid editing service of MDPI to help us polish the language in the revised manuscript. We hope the revised manuscript could be acceptable for you.

Reviewer 2 Report

Comments and Suggestions for Authors

1. Innovation and Originality

  • Novel CMFD Network: The introduction of a Transformer-based CMFD network with a dual-branch structure to capture both global and local information is innovative. This approach addresses the limitations of previous DCNN-based methods by leveraging the strengths of Transformers in modeling global context and refining local details.
  • Local Information Refinement: The refinement of Transformer encoding in the local branch and the implementation of the Intra-Patch Re-dividing Layer (IPRL) demonstrate an original approach to enhancing local feature extraction without significantly increasing model complexity.

2. Technical Contribution

  • Dual-Branch Structure: The use of a dual-branch structure that separately processes global and local information before fusing them is a significant technical contribution. This method ensures a more comprehensive feature extraction process, potentially leading to more accurate forgery detection.
  • Intra-Patch Re-dividing Layer (IPRL): The IPRL is a novel layer that re-divides global patches into smaller local patches, ensuring detailed local feature extraction. This innovation balances the model's complexity and performance, enhancing local detail extraction with minimal parameter increase.

3. Methodology

  • Model Description: The proposal should provide detailed descriptions of the global context modeling branch and the local refinement branch, including how the multi-head self-attention mechanism is applied and how the IPRL operates.
  • Feature Fusion: The process of fusing features from the global and local branches should be clearly explained, detailing how the integration enhances the overall feature representation.

4. Experimental Validation

  • Datasets: The use of multiple datasets (USCISI, CASIA CMFD, and DEFACTO CMFD) for validation is a strong point, providing comprehensive evaluation across different scenarios and image types.
  • Comparative Analysis: The proposal claims that the LBRT outperforms advanced techniques, including traditional DCNN models and those with additional attention mechanisms. Specific performance metrics, results, and comparisons should be included to substantiate this claim.

5. Practical Implications

  • Improvement in Accuracy: By refining global features with local edge information, the proposed method improves the accuracy of locating copy-move forgery regions, which is a critical aspect of CMFD.
  • Model Efficiency: The minor increase in model parameters due to the local branch refinement suggests that the proposed method is efficient, making it practical for real-world applications without the need for extensive computational resources.

6. Future Work

  • Refinement and Optimization: The future work should focus on refining the network's encoding processes and optimizing computational costs, ensuring that the model remains lightweight and efficient.
  • Source and Target Differentiation: Enhancing the model's capability to differentiate between the source and target of copy-move forgery will further improve its practical applicability and robustness.

 

Author Response

Thank you very much for taking the time to review this manuscript. Following your suggestions, we have made extensive corrections to the previous draft; the detailed corrections are listed below.

Comment 1:

Innovation and Originality

  • Novel CMFD Network: The introduction of a Transformer-based CMFD network with a dual-branch structure to capture both global and local information is innovative. This approach addresses the limitations of previous DCNN-based methods by leveraging the strengths of Transformers in modeling global context and refining local details.
  • Local Information Refinement: The refinement of Transformer encoding in the local branch and the implementation of the Intra-Patch Re-dividing Layer (IPRL) demonstrate an original approach to enhancing local feature extraction without significantly increasing model complexity.

Response 1:

Thank you for your nice comments on our article. We appreciate your recognition of the innovation and originality of our work.

 

Comment 2:

Technical Contribution

  • Dual-Branch Structure: The use of a dual-branch structure that separately processes global and local information before fusing them is a significant technical contribution. This method ensures a more comprehensive feature extraction process, potentially leading to more accurate forgery detection.
  • Intra-Patch Re-dividing Layer (IPRL): The IPRL is a novel layer that re-divides global patches into smaller local patches, ensuring detailed local feature extraction. This innovation balances the model's complexity and performance, enhancing local detail extraction with minimal parameter increase.

Response 2:

Thank you for your nice comments on our article. We appreciate your recognition of the technical contribution of our work.

 

Comment 3:

Methodology

  • Model Description: The proposal should provide detailed descriptions of the global context modeling branch and the local refinement branch, including how the multi-head self-attention mechanism is applied and how the IPRL operates.
  • Feature Fusion: The process of fusing features from the global and local branches should be clearly explained, detailing how the integration enhances the overall feature representation.

Response 3:

Thank you for pointing this out. We agree with the comments about the model description and feature fusion. In response, we have revised the descriptions of the global branch, the local branch, and the feature fusion module in Sections 3.1 to 3.3, respectively. We have added a detailed explanation of how multi-head self-attention works in the global branch on page 8, lines 277-283. We have added descriptions of how the IPRL works in the local branch and of the local multi-head self-attention mechanism on page 9, lines 320-325, and page 10, lines 335-340, respectively. We have also refined the description of how the feature fusion module fuses global and local features, and the explanation of why it fully considers both features and enhances the overall feature representation, on page 10, lines 360-374.

 

Comment 4:

4. Experimental Validation

  • Datasets: The use of multiple datasets (USCISI, CASIA CMFD, and DEFACTO CMFD) for validation is a strong point, providing comprehensive evaluation across different scenarios and image types.
  • Comparative Analysis: The proposal claims that the LBRT outperforms advanced techniques, including traditional DCNN models and those with additional attention mechanisms. Specific performance metrics, results, and comparisons should be included to substantiate this claim.

Response 4:

Thank you for pointing this out. We agree with the comment about comparative analysis; the models compared with LBRT in this section already include both traditional DCNN models and models with additional attention mechanisms. We have accordingly revised the description of each baseline on page 12, lines 437-443, indicating which models are traditional DCNN models and which have additional attention mechanisms.

 

Comment 5:

Practical Implications

  • Improvement in Accuracy: By refining global features with local edge information, the proposed method improves the accuracy of locating copy-move forgery regions, which is a critical aspect of CMFD.
  • Model Efficiency: The minor increase in model parameters due to the local branch refinement suggests that the proposed method is efficient, making it practical for real-world applications without the need for extensive computational resources.

Response 5:

Thank you for your nice comments on our article. We appreciate your recognition of the practical implications of our work.

 

Comment 6:

Future Work

  • Refinement and Optimization: The future work should focus on refining the network's encoding processes and optimizing computational costs, ensuring that the model remains lightweight and efficient.
  • Source and Target Differentiation: Enhancing the model's capability to differentiate between the source and target of copy-move forgery will further improve its practical applicability and robustness.

Response 6:

Thank you again for your positive comments and valuable suggestions to improve the quality of our manuscript. We will consider focusing our future work on these aspects according to your suggestions.

Reviewer 3 Report

Comments and Suggestions for Authors

1.The structure of the abstract is generally clear, but some sentences lack smooth transitions. It is suggested to begin the introduction of the LBRT model with a brief explanation of how TNT and EDTER inspired this model, followed by a detailed description of LBRT's innovations.

2. The internal logic within each section is clear, gradually guiding readers to understand the design and implementation of the LBRT model. However, there is a lack of transitional sentences and conjunctions to enhance coherence between paragraphs.

3. The description of the shortcomings of existing methods is somewhat brief. It is recommended to elaborate on the specific deficiencies of deep convolutional neural networks and Transformer structures in CMFD, to enhance readers' understanding of the necessity of the research. Detailed descriptions of the innovative features and advantages of the LBRT model can increase readers' interest and appreciation of this approach.

4. The description of the "Local Refinement Branch" is quite detailed, but it is suggested to provide more detailed explanations on how the "global branch" and "local refinement branch" interact with each other.

5. For the "ViT-base encoder with dual-branch design" section, it is recommended to provide a detailed explanation of why the ViT-base encoder was chosen and its advantages in CMFD tasks. For the "feature extraction process" section, detailed explanations of the specific implementation methods and interactions between the global context modeling branch and the local refinement branch are suggested. Regarding the "fusion module" section, it is advisable to provide detailed explanations of the specific methods and advantages of feature fusion.

 

Author Response

We greatly appreciate your professional review of our article. As you noted, several problems needed to be addressed. Following your suggestions, we have made extensive corrections to the previous draft; the detailed corrections are listed below.

Comment 1:

The structure of the abstract is generally clear, but some sentences lack smooth transitions. It is suggested to begin the introduction of the LBRT model with a brief explanation of how TNT and EDTER inspired this model, followed by a detailed description of LBRT's innovations.

Response 1:

Thank you for your nice comments on our article. According to your suggestions, we have checked the content of the abstract and adjusted some of the statements. We have modified the description of TNT and EDTER in lines 194-208 on page 5, and briefly explained which contents of these two works inspired our proposed model in lines 210-218. In addition, we have added a description of some LBRT innovations in sections 3.1 to 3.3, which can be found in paragraph 1 of section 3.1 on page 7, paragraph 1 of section 3.2 on page 8, and paragraph 3 of section 3.3 on page 10.

 

Comment 2:

The internal logic within each section is clear, gradually guiding readers to understand the design and implementation of the LBRT model. However, there is a lack of transitional sentences and conjunctions to enhance coherence between paragraphs.

Response 2:

Thank you again for your positive comments and valuable suggestions to improve the quality of our manuscript. We have edited the English throughout the manuscript, and we believe these changes make the transitions between paragraphs more coherent.

 

Comment 3:

The description of the shortcomings of existing methods is somewhat brief. It is recommended to elaborate on the specific deficiencies of deep convolutional neural networks and Transformer structures in CMFD, to enhance readers' understanding of the necessity of the research. Detailed descriptions of the innovative features and advantages of the LBRT model can increase readers' interest and appreciation of this approach.

Response 3:

Thank you for your suggestions on our article. In two paragraphs on page 2, we have elaborated on the specific shortcomings of existing DCNN models and Transformer-based approaches in CMFD. The innovative features and benefits of LBRT are summarized on page 3, and further details are added where the specific content of the method is introduced in the Methodology. If there are any further modifications we could make, we would be glad to make them, and we really appreciate your help.

 

Comment 4:

The description of the "Local Refinement Branch" is quite detailed, but it is suggested to provide more detailed explanations on how the "global branch" and "local refinement branch" interact with each other.

Response 4:

Thanks for your suggestion. At the feature extraction stage, the two branches are completely parallel; their interaction is reflected in the comprehensive consideration and fusion of the two features at the feature fusion stage. Following your suggestions, we have provided a more detailed explanation of the fusion mechanism of the two branches and how they jointly contribute to the overall feature representation on page 10, lines 360-374.

 

Comment 5:

For the "ViT-base encoder with dual-branch design" section, it is recommended to provide a detailed explanation of why the ViT-base encoder was chosen and its advantages in CMFD tasks. For the "feature extraction process" section, detailed explanations of the specific implementation methods and interactions between the global context modeling branch and the local refinement branch are suggested. Regarding the "fusion module" section, it is advisable to provide detailed explanations of the specific methods and advantages of feature fusion.

Response 5:

Thank you for your nice comments on our article. Following your suggestions, we have added a detailed explanation of the reasons for choosing the ViT-base encoder and its advantages in CMFD tasks on pages 6-7, lines 233-245. As mentioned in Response 4, the two branches run in parallel during feature extraction, and their interaction is reflected in the fusion module. Regarding the feature fusion section, we have added a detailed description of the feature fusion process and its advantages on page 10, lines 345-374.

Thank you again for your positive comments and valuable suggestions to improve the quality of our manuscript.

Reviewer 4 Report

Comments and Suggestions for Authors

The authors describe an approach for detecting and locating copy-move forgery based on a dual-branch Transformer structure that extracts both local and global information. It performs Transformer encoding on global patches divided from the image and on local patches re-divided from the global patches, using a global modeling branch and a local refinement branch, respectively. The self-attention features from both branches are combined, up-sampled, and decoded. Empirical results show the quality of the proposal. Overall, the paper is nicely written and well motivated, and the proposal is well described. The manuscript provides a nice contribution to the literature.

Author Response

Comment 1: The authors describe an approach for detecting and locating copy-move forgery based on a dual-branch Transformer structure that extracts both local and global information. It performs Transformer encoding on global patches divided from the image and on local patches re-divided from the global patches, using a global modeling branch and a local refinement branch, respectively. The self-attention features from both branches are combined, up-sampled, and decoded. Empirical results show the quality of the proposal. Overall, the paper is nicely written and well motivated, and the proposal is well described. The manuscript provides a nice contribution to the literature.

Response 1: Thank you very much for taking the time to review this manuscript. We appreciate your recognition of our work. If there are any further modifications we could make, we would be glad to make them, and we really appreciate your help.

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The authors have improved the quality of this paper. I think this paper can be accepted for publication.

Comments on the Quality of English Language

The quality of the English language is acceptable.
