Article
Peer-Review Record

A Novel Two-Stream Transformer-Based Framework for Multi-Modality Human Action Recognition

Appl. Sci. 2023, 13(4), 2058; https://doi.org/10.3390/app13042058
by Jing Shi 1,†, Yuanyuan Zhang 1,†, Weihang Wang 1, Bin Xing 2, Dasha Hu 1,* and Liangyin Chen 1,3,*
Reviewer 1: Anonymous
Reviewer 2:
Reviewer 4:
Reviewer 5: Anonymous
Submission received: 28 December 2022 / Revised: 2 February 2023 / Accepted: 3 February 2023 / Published: 5 February 2023

Round 1

Reviewer 1 Report

A novel two-stream Transformer-based framework for multi-modality human action recognition is proposed. It is interesting but not complete enough for publication, and I recommend a major revision. Specific comments are listed as follows:

(1) Some spelling mistakes remain and the writing style can be further improved; for example, lines 260–261 on page 8: “The fused classification token”. These small mistakes need to be corrected before the paper is published.

(2) Some references in the introduction section are too old. The authors need to refer to some of the latest methods and literature to replace them.

(3) Figure 2 on page 4: The size of the RGB heatmap does not reflect that it is smaller than the original frame, and the noisy background and extraneous information are not cropped from the original frame.

(4) How is the loss function set and on what basis? The authors should elaborate on it in the paper.

(5) Figure 2 on page 4: Why only H1, H2, H3, H4?

(6) How is the dataset divided and what is the training method? The authors should elaborate on it in the paper.

(7) Heatmaps for limbs and joints on page 13: Why are only joint-only heatmaps and joint-limb heatmaps compared, but not limb-only heatmaps?

(8) 3D poses vs. 2D poses on page 13: Why use joint-only heatmaps when the joint-limb heatmap results are better than the joint-only results? The authors should elaborate on this in the paper.

(9) 3D poses vs. 2D poses on page 13: When a 3D skeleton is divided into three sets of 2D points, how are the results for the 3D skeleton obtained? And why do 3D skeletons that have been reduced to 2D in dimensionality have only one image?

(10) It is suggested to revise the conclusion to highlight the most important results and the significance of the research.

(11) In the introduction section, some important references on Transformers are omitted. More related work on Transformers should be added, such as:

1)  https://www.doi.org/10.3390/rs14092019

2)  https://www.doi.org/10.1109/TAES.2022.3174826

3)  https://www.doi.org/10.3390/rs14081884

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 2 Report

The topic of the article is not very interesting and makes a modest contribution to the literature. There are some studies similar to this work. The methods section is missing. The methods could be explained in a more detailed way; for example, each step of the methods could be summarized with bullets. The methods should also be referenced against the literature. This leaves a large gap. In addition, the importance of the study should be highlighted; in the current structure it is not clear. The language of the paper should be improved; there are some typos. To sum up, overall improvements are needed for publication. The current version is not enough for publication.

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 3 Report

The authors present good work demonstrating their two-stream Transformer-based framework for multi-modality human action recognition. However, the following points should be addressed to improve the manuscript's outcome.

1. Clarify the colour sections in your figures (i.e., Fig. 2, Fig. 3, Fig. 4, Fig. 5, and Fig. 9) so that they are clear and clean, particularly the right-most sections with grey backgrounds and white fonts.

2. Explain the differences between your work and the study at https://doi.org/10.48550/arXiv.2110.13708.

3. In Fig. 2, to minimise costs, can the two layers (upper/lower) be merged into a single fused one?

4. Outline your datasets' descriptive statistics to show how their points diverge and vary.

5. Table 2 shows no significant improvement of your method over some related ones (i.e., [11], [33]). Discuss why.

6. Figure 8: does it represent a confusion matrix or correlation coefficients?

7. In Fig. 10 (left), does the x-axis represent the number of layers or the number of neurons inside a layer?

8. Evaluating classification requires measuring both accuracy and error rate(s). Discuss how the error rate is calculated in your system (a minimal illustration of the relationship is sketched after this list).

9. In Table 5, discuss why the variation between the 2D and 3D Top-1 accuracy values is so large, whereas the Top-5 accuracy looks good.

10. On page 14, line 384: "preidiction" should be "prediction".

11. State your framework's disadvantages.
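[Editor's note on item 8: the snippet below is a minimal sketch, not the authors' code, showing only that the top-1 error rate is the complement of top-1 accuracy; the label arrays are hypothetical.]

```python
# Minimal sketch (not the authors' code): top-1 accuracy and its complementary
# error rate for a classifier. The label arrays below are purely hypothetical.
import numpy as np

def top1_accuracy_and_error(y_true, y_pred):
    """Return (accuracy, error_rate); the error rate is simply 1 - accuracy."""
    accuracy = float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
    return accuracy, 1.0 - accuracy

# Example: 4 of 5 predictions correct -> accuracy 0.80, error rate 0.20.
acc, err = top1_accuracy_and_error([0, 1, 2, 2, 3], [0, 1, 2, 1, 3])
print(f"top-1 accuracy = {acc:.2f}, error rate = {err:.2f}")
```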

I can accept the manuscript after amendment.

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 4 Report

The paper proposes a novel two-stream Transformer-based framework called RGBSformer for human action recognition using both RGB and skeleton modalities. The framework inputs skeleton heatmaps and RGB frames at different temporal and spatial resolutions, with fewer attention layers in the skeleton stream to capture motion information more accurately. Two ways to fuse the information from the two streams are proposed to make full use of the complementary nature of RGB and skeleton modalities in action recognition. The proposed framework achieves state-of-the-art performance on four widely used action recognition benchmarks. The goal of the paper is to capture motion information precisely by processing skeleton data with higher temporal resolution and lower spatial resolution.
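[Editor's note: as a rough illustration of the two-stream idea summarized above, the sketch below shows one Transformer encoder per modality, a deeper RGB stream than skeleton stream, and the two pooled classification tokens fused by simple concatenation. It is an assumption-laden toy, not the authors' RGBSformer; every module name, depth, and dimension here is hypothetical.]

```python
# Minimal illustrative sketch of a generic two-stream design: two Transformer
# encoders (one per modality) whose pooled features are fused for classification.
# This is NOT the authors' RGBSformer; sizes, names, and the fusion rule are
# hypothetical and chosen only to show the overall structure.
import torch
import torch.nn as nn

class StreamEncoder(nn.Module):
    def __init__(self, dim: int, depth: int, heads: int):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim); prepend a learnable classification token.
        cls = self.cls_token.expand(tokens.size(0), -1, -1)
        out = self.encoder(torch.cat([cls, tokens], dim=1))
        return out[:, 0]  # pooled classification token

class TwoStreamClassifier(nn.Module):
    def __init__(self, dim: int = 256, num_classes: int = 60):
        super().__init__()
        # Deeper RGB stream, shallower skeleton stream, loosely mirroring the
        # asymmetry described in the summary above (values are illustrative).
        self.rgb_stream = StreamEncoder(dim, depth=6, heads=8)
        self.skel_stream = StreamEncoder(dim, depth=2, heads=8)
        self.head = nn.Linear(2 * dim, num_classes)

    def forward(self, rgb_tokens: torch.Tensor, skel_tokens: torch.Tensor) -> torch.Tensor:
        # Late fusion by concatenating the two pooled classification tokens.
        fused = torch.cat([self.rgb_stream(rgb_tokens), self.skel_stream(skel_tokens)], dim=-1)
        return self.head(fused)

# Toy usage: 2 clips, 64 RGB tokens and 128 skeleton-heatmap tokens of width 256.
model = TwoStreamClassifier()
logits = model(torch.randn(2, 64, 256), torch.randn(2, 128, 256))
print(logits.shape)  # torch.Size([2, 60])
```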

This article is well organized and fluently written, so I would like to thank the authors. In addition, if the manuscript that reached me is not already a revised version, the open sharing of data and code makes the reviewers' work easier, which is very good. My opinion is that the paper can be accepted in its present form. When I checked it with the iThenticate plagiarism program, the similarity rate was 20%. I think this value is high for the article, and it would be good if the authors could reduce it a little.

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 5 Report

Dear authors,

I would like to thank you for your efforts in writing this paper. While the main contributions of your work have merit, I believe there are areas for revision, as follows:

- In Section 4.1, “We experiment on four popular action recognition datasets”: the authors are required to discuss their rationale for this specific choice.

- In Section 4.3, “We compared RGBSformer and the single pathway only to three kinds of models”: once again, write a statement justifying the choice of these models. This will enhance the logical flow of your ideas.

- The conclusion section needs some improvement. For instance: (1) What are the practical implications of your work? (2) What are the limitations of the proposed method? All these issues can be discussed in a very succinct manner and would enhance the overall quality of this manuscript.

I wish you all the best with your manuscript.

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Round 2

Reviewer 1 Report

The authors have revised the manuscript according to the comments of all reviewers. I have no other questions.

Reviewer 2 Report

Although the topic of the article is not very interesting and the contribution to the literature is modest, the authors have addressed most of the unclear points, and important improvements are reflected in the paper. Therefore, it looks ready for publication.
