Article
Peer-Review Record

Spatiotemporal Graph Autoencoder Network for Skeleton-Based Human Action Recognition

AI 2024, 5(3), 1695-1708; https://doi.org/10.3390/ai5030083
by Hosam Abduljalil 1,*, Ahmed Elhayek 2, Abdullah Marish Ali 1 and Fawaz Alsolami 1
Reviewer 1: Anonymous
Reviewer 2:
Reviewer 3: Anonymous
Submission received: 25 July 2024 / Revised: 13 September 2024 / Accepted: 19 September 2024 / Published: 23 September 2024
(This article belongs to the Special Issue Artificial Intelligence-Based Image Processing and Computer Vision)

Round 1

Reviewer 1 Report (Previous Reviewer 2)

Comments and Suggestions for Authors

This paper presents a spatiotemporal graph autoencoder network for skeleton-based human action recognition. The proposed framework modifies the method of [11] by adding a few skip connections to the model.
This is a re-submitted paper; however, the revision does not add new experimental results, and the experimental results remain insufficient. There is no point in presenting precision and recall for each individual category in Figures 4 and 6. More important is to compare the proposed method against existing methods using solid performance metrics, such as precision and recall, in Tables 2, 3, and 4. Furthermore, the proposed framework modifies the method of [11], yet its performance is worse than that of [11], as shown in Tables 3 and 4, so the contribution of the proposed method is not convincing.

Comments on the Quality of English Language

NA

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report (New Reviewer)

Comments and Suggestions for Authors

The paper presents a novel skeleton-based human action recognition (HAR) algorithm called GA-GCN, which utilizes a spatiotemporal graph autoencoder network to improve the accuracy of recognizing human activities. The study focuses on two widely used datasets, NTU RGB+D and its extension NTU RGB+D 120, which provide comprehensive skeletal motion data for various action classes. I would suggest revisions as listed below:

1. What strategies can be employed to further optimize the GA-GCN model for real-time applications in human action recognition?

2. Many recent works use GNNs and RNNs for spatiotemporal systems. Some references for building up the related work: SST-GNN: simplified spatio-temporal traffic forecasting model using graph neural network; Reduced-order digital twin and latent data assimilation for global wildfire prediction; Explainable Global Wildfire Prediction Models using Graph Neural Networks.

3. What specific additional modalities (e.g., audio, environmental context) could be integrated into the GA-GCN framework to enhance its recognition capabilities, and how would they affect the model's performance?

4. How can the GA-GCN model be extended to recognize long-term actions or sequences of actions over time, rather than isolated movements?

Author Response

Comment 1:  What strategies can be employed to further optimize the GA-GCN model for real-time applications in human action recognition?

Response 1: Thank you for your insightful feedback and suggestions. Regarding strategies for further optimizing the GA-GCN model for real-time applications: while techniques such as model pruning, quantization, and efficient graph convolution variants are promising, our current focus was on maximizing accuracy and validating the model's effectiveness on two widely used benchmark datasets, NTU RGB+D and NTU RGB+D 120.

We chose not to pursue real-time optimizations in this study because our primary objective was to demonstrate the model’s capabilities in terms of recognition accuracy rather than its computational efficiency. Implementing these strategies requires significant additional experimentation and tuning, which would have extended beyond the scope of our current research. However, we acknowledge the importance of these optimizations and plan to explore them in future work.
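As a minimal illustration of what such optimizations could look like (a sketch under assumed PyTorch tooling, not the pipeline used in this work), the snippet below applies magnitude pruning and post-training dynamic quantization to a small placeholder network standing in for GA-GCN:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Placeholder model standing in for a trained GA-GCN (hypothetical layers).
# Input: (N, 3, T, V) tensors of 3-D joint coordinates over T frames, V joints.
model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=(9, 1), padding=(4, 0)),  # temporal convolution
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),                               # pool over frames and joints
    nn.Flatten(),
    nn.Linear(64, 60),                                     # 60 NTU RGB+D action classes
)

# Unstructured L1 pruning: zero the 30% smallest-magnitude weights of each
# convolutional layer, then make the pruning permanent.
for m in model.modules():
    if isinstance(m, nn.Conv2d):
        prune.l1_unstructured(m, name="weight", amount=0.3)
        prune.remove(m, "weight")

# Post-training dynamic quantization: Linear weights stored as int8,
# activations quantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
```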

Thank you again for your valuable comments.

 

Comment 2: Many recent works use GNNs and RNNs for spatiotemporal systems. Some references for building up the related work: SST-GNN: simplified spatio-temporal traffic forecasting model using graph neural network; Reduced-order digital twin and latent data assimilation for global wildfire prediction; Explainable Global Wildfire Prediction Models using Graph Neural Networks.

Response 2: Thank you for your comment and for suggesting valuable references to enhance the related work section. We recognize the growing body of research on spatiotemporal systems utilizing GNN and RNN architectures, including the works you mentioned, such as SST-GNN and the explainable global wildfire prediction models based on graph neural networks. These studies contribute significantly to the understanding and application of GNNs in various domains.

In our related work section, we specifically concentrated on research directly related to skeleton-based human action recognition. This choice was made to maintain a clear focus on the works most pertinent to the context of our study. While the papers you mentioned are valuable, our intent was to highlight studies closely aligned with the skeleton-based action recognition domain. We appreciate your understanding in this matter.

Comment 3: What specific additional modalities (e.g., audio, environmental context) could be integrated into the GA-GCN framework to enhance its recognition capabilities, and how would they affect the model's performance?

Response 3: Thank you for your insightful question. Integrating additional modalities, such as audio or environmental context, could indeed enhance the GA-GCN framework's recognition capabilities by providing complementary information that helps disambiguate complex actions. However, our current work focused exclusively on skeleton-based data from the NTU RGB+D and NTU RGB+D 120 datasets, which do not include audio or environmental context; this constrained our ability to explore these additional modalities. Incorporating such features would also necessitate a significant redesign of the model to manage heterogeneous data inputs, which was beyond the scope of this study. We prioritized benchmarking GA-GCN on well-established skeleton-based datasets to ensure a fair comparison with existing state-of-the-art methods. While we have implemented multiple modalities in the sense of different skeleton data representations within the same framework, the approach remains open to future integration of audio or environmental context.

Thank you again for your valuable suggestion.

Comment 4: How can the GA-GCN model be extended to recognize long-term actions or sequences of actions over time, rather than isolated movements?

Response 4: Thank you for your insightful question. Extending the GA-GCN model to recognize long-term actions or sequences of actions over time is a valuable direction for enhancing its capabilities. Currently, our model focuses on isolated movements as provided by the NTU RGB+D and NTU RGB+D 120 datasets, which primarily consist of single actions per sequence.

Short-term actions are generally more challenging to recognize than long-term ones, since longer sequences provide more discriminative features. To extend the GA-GCN model to recognize long-term actions or sequences of actions over time, several strategies could be employed. Incorporating temporal aggregation mechanisms or hierarchical architectures could help capture extended sequences by integrating information over longer time spans. Attention mechanisms might also be used to focus on relevant parts of the sequence and to learn dependencies over time. Additionally, training the model on datasets that include longer sequences or multi-action scenarios could improve its ability to recognize complex action sequences. These approaches could enhance the GA-GCN model's ability to handle more complex temporal dynamics in human action recognition. We appreciate your suggestion and see it as a promising direction for future work.
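As a minimal sketch of one such strategy (illustrative only, not part of the current model), attention-based temporal aggregation over per-clip embeddings could look as follows in PyTorch; the embedding dimension and clip counts are arbitrary:

```python
import torch
import torch.nn as nn

class TemporalAttentionPool(nn.Module):
    """Aggregate per-clip features over a long sequence with learned
    attention weights; a minimal sketch, not the authors' architecture."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # one relevance score per time step

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, dim), e.g., embeddings of short clips
        weights = torch.softmax(self.score(feats), dim=1)  # (batch, time, 1)
        return (weights * feats).sum(dim=1)                # (batch, dim)

# Usage: embed each short clip with a backbone, stack along time, then
# pool into one sequence-level descriptor for classification.
pool = TemporalAttentionPool(dim=256)
clip_embeddings = torch.randn(4, 30, 256)   # 4 sequences, 30 clips each
sequence_descriptor = pool(clip_embeddings)  # shape: (4, 256)
```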

Thank you again for your valuable feedback.

Reviewer 3 Report (New Reviewer)

Comments and Suggestions for Authors

In line 37, the term prediction is “suddenly” introduced and comes as a surprise. Up to that point, action prediction is not mentioned and the discussion is about action recognition. Also, the title of that subsection (1.1) talks about recognition (not prediction), which is confusing. 

 

Lines 51-59. According to English dictionaries, activities such as laughing, waiting, thinking, listening, gesturing, observing, attending, etc. are human actions. The literature on total-body capture (see the DOIs below) includes gaze control and hand tracking and exhibits the potential to reveal such actions. Thus, my recommendation is to specify the classes or types of actions the proposed method covers (e.g., those that involve coarse skeletal body motion?). Relevant literature:

DOI:10.1109/CVPR.2019.01122

DOI:10.1109/CVPR.2019.01123

DOI:10.1109/CVPR46437.2021.00478

DOI:10.1109/ICCVW54120.2021.00201

DOI: 10.1109/TPAMI.2022.3197352

 

What are “skip connections” mentioned in line 103? Please provide a definition or description.

 

Each acronym needs to be defined once (at its first occurrence); e.g., the HAR acronym is defined in lines 1, 90, 166, 188, and 330.

 

The quality of Figure 3 and Figure 5 needs improvement. Some numbers are overlapping and not legible. Also, the values in the highlighted diagonal of the matrix are not fully shown.

 

In Figure 4 and Figure 6, the legend hides parts of the plotted data. Also, to make differences clearer you could consider starting the vertical axis from value 0.6.

Author Response

Comment 1: In line 37, the term prediction is “suddenly” introduced and comes as a surprise. Up to that point, action prediction is not mentioned and the discussion is about action recognition. Also, the title of that subsection (1.1) talks about recognition (not prediction), which is confusing. 

Response 1: Thank you for pointing out this inconsistency. We acknowledge that the sudden introduction of the term "prediction" in line 37 may cause confusion, especially since the discussion up to that point has focused on action recognition. The term "prediction" was used to provide historical context about action prediction before transitioning to a discussion on action recognition. We will revise the manuscript to ensure consistency, aligning the terminology throughout the text and clearly indicating that our primary focus is on action recognition. We appreciate your feedback and will make these necessary revisions to enhance clarity.

 

Comment 2: Lines 51-59. According to English dictionaries, activities such as laughing, waiting, thinking, listening, gesturing, observing, attending, etc. are human actions. The literature on total-body capture (see the DOIs below) includes gaze control and hand tracking and exhibits the potential to reveal such actions. Thus, my recommendation is to specify the classes or types of actions the proposed method covers (e.g., those that involve coarse skeletal body motion?). Relevant literature:

DOI:10.1109/CVPR.2019.01122

DOI:10.1109/CVPR.2019.01123

DOI:10.1109/CVPR46437.2021.00478

DOI:10.1109/ICCVW54120.2021.00201

DOI: 10.1109/TPAMI.2022.3197352

Response 2: Thank you for your observation. We appreciate the suggestion to clarify the types of actions covered by our proposed method. We agree with the recommendation and would like to clarify that the actions used are categorized by the dataset authors into three major categories: daily actions, mutual actions, and medical conditions. This categorization is commonly used in papers in the same field, and we followed the same approach for a fair comparison.

These categories include actions with clear and discernible skeletal patterns, which are suitable for our model's scope. Activities such as laughing, waiting, and thinking, which involve more subtle or internal processes, may not be included. In the revised manuscript, on page 6, paragraph 2, line 204, we define these action categories to provide a clearer understanding of our method's focus. Thank you for your valuable feedback.

 

Comment 3: What are “skip connections” mentioned in line 103? Please provide a definition or description.

Response 3: Thank you for raising this point. Skip connections, mentioned in line 103, are connections in a neural network that bypass one or more layers, linking a layer's output directly to a deeper layer. This helps mitigate issues such as vanishing gradients and allows for a better flow of information through the network. In the revised manuscript, on page 3, paragraph 7, line 106, we included a clear definition and description of skip connections to ensure readers understand their role and significance in our proposed model. Thank you for your valuable feedback.
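For illustration, a minimal residual block in PyTorch shows the generic idea; the layer sizes are arbitrary, and this is not the specific block used in GA-GCN:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Generic skip (residual) connection: the block's input is added to
    its output, so gradients can flow around the inner layers."""
    def __init__(self, dim: int):
        super().__init__()
        self.inner = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.inner(x)  # identity path skips the inner layers

x = torch.randn(8, 128)
y = ResidualBlock(128)(x)  # same shape as x: torch.Size([8, 128])
```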

 


Comment 4: Each acronym needs to be defined once (at its first occurrence). E.g. The HAR acronym is defined in lines 1, 90, 166, 188, and 330.

Response 4: Thank you for highlighting this issue. We acknowledge that the acronym HAR (Human Action Recognition) should be defined only once at its first occurrence to maintain clarity and avoid redundancy. We will revise the manuscript to ensure that each acronym is defined only once at its initial mention and remove any subsequent redundant definitions. We appreciate your attention to detail and will make these corrections to improve the readability of the manuscript.

 

Comment 5: The quality of Figure 3 and Figure 5 needs improvement. Some numbers are overlapping and not legible. Also, the values in the highlighted diagonal of the matrix are not fully shown.

Response 5: Thank you for pointing out the issue with the quality of Figure 3 and Figure 5. We have improved the formatting of these figures in the revised manuscript to ensure that all numbers are clearly legible and not overlapping. We appreciate your feedback and have made the necessary adjustments to enhance the clarity of the figures.

 

Comment 6: In Figure 4 and Figure 6, the legend hides parts of the plotted data. Also, to make differences clearer you could consider starting the vertical axis from value 0.6.

Response 6: Thank you for your helpful observation. We have repositioned the legend in Figures 4 and 6 to ensure that no parts of the plotted data are covered. Additionally, as per your suggestion, we have adjusted the vertical axis to start from 0.6 to make the differences clearer. These changes have been incorporated in the revised manuscript to improve the clarity and readability of the figures. We appreciate your valuable feedback.

Round 2

Reviewer 1 Report (Previous Reviewer 2)

Comments and Suggestions for Authors

The revision is fine; there are no further questions on my side.

Comments on the Quality of English Language

NA

Reviewer 2 Report (New Reviewer)

Comments and Suggestions for Authors

The paper has been improved.

This manuscript is a resubmission of an earlier submission. The following is a list of the peer review reports and author responses from that submission.


Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This manuscript introduces a novel spatiotemporal graph autoencoder network, GA-GCN, aimed at human action recognition (HAR) based on skeleton data. The proposed GA-GCN model appears to have advantages in learning both spatial and temporal patterns from human skeleton datasets, yielding better performance than most current methods. This reviewer provides several comments and suggestions that might improve the quality of this manuscript. If the authors can address these comments/suggestions, I would be happy to see this work published. Specific issues are listed below.

 

 

Major comments

 

• The authors use 'accuracy' as the metric for evaluating the performance of the various methods, but they should provide a clear definition of how the accuracy was calculated (standard definitions are sketched at the end of this report). Additionally, incorporating a range of other evaluation metrics, such as sensitivity and specificity, could offer a more comprehensive assessment of performance. I recommend that the authors expand their evaluation framework to include these additional metrics.

 

• In Tables 3 and 4, the results indicate that CTR-GCN and PSUMNet outperform the GA-GCN model. The authors need to provide a detailed analysis or discussion of why these models exhibit superior performance. This could help readers understand the comparative advantages and limitations of the three models (CTR-GCN, PSUMNet, and GA-GCN).

 

• To better understand the GA-GCN model, the authors are encouraged to discuss its potential limitations. For example, an examination of conditions under which the model may underperform, such as specific types of actions or datasets, would be particularly insightful.

 

• To enhance the interpretability of the GA-GCN model and provide insights into its decision-making processes, it would be beneficial for the authors to employ techniques such as SHAP (SHapley Additive exPlanations) or LIME (Local Interpretable Model-agnostic Explanations). The authors could further discuss how these methods might be used to reveal the 'black-box' nature of the GA-GCN model.
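For reference, the standard confusion-matrix definitions of the metrics requested in the first comment are given below (binary form; skeleton-based HAR papers typically report top-1 accuracy, the fraction of test samples whose predicted class matches the ground truth):

\[
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\mathrm{Sensitivity} = \frac{TP}{TP + FN}, \qquad
\mathrm{Specificity} = \frac{TN}{TN + FP}
\]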

Reviewer 2 Report

Comments and Suggestions for Authors


This paper presents a graph-based classification network based on human skeleton points. There are a few things that need further clarification for the current version of the paper.

 

One of the major concerns about this paper is that it is not clear what the contribution of the proposed approach is. What is the difference between the proposed network architecture in Figure 2 and the existing ones? What is the purpose of the few colored texts in Figure 2? What is the loss function used to train this model?
Secondly, the construction of the spatiotemporal graph is not clear, as shown at the input to the network in Figure 2 (a generic construction is sketched after this report).
Lastly, the experimental results are insufficient. Only accuracy is reported in Table 2 for this classification task; other performance metrics, such as precision and recall, should be reported. In addition, the experimental results are not convincing because there is very limited improvement (or even degradation), as reported in Tables 3 and 4.
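For context, here is a generic sketch of how such a spatiotemporal skeleton graph is commonly built in this literature, using a hypothetical 5-joint skeleton rather than the authors' exact construction:

```python
import numpy as np

# Hypothetical 5-joint skeleton: 0=hip, 1=spine, 2=head, 3=left hand, 4=right hand.
bones = [(0, 1), (1, 2), (1, 3), (1, 4)]
V = 5  # number of joints

# Spatial adjacency: joints linked by bones, plus self-loops on the diagonal.
A = np.eye(V, dtype=np.float32)
for i, j in bones:
    A[i, j] = A[j, i] = 1.0

# A skeleton sequence is a tensor of shape (C, T, V): C coordinate channels
# (e.g., x, y, z), T frames, V joints. Temporal edges connect the same joint
# across consecutive frames, turning A into a spatiotemporal graph.
x = np.random.randn(3, 64, V).astype(np.float32)
```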

Comments on the Quality of English Language

NA

Reviewer 3 Report

Comments and Suggestions for Authors

I have a question about classification. I see in Fig. 2 that there is action classification. However, it is not clear how many classes there are. What determines that a person is doing one action or another?

After reading the article, it is still not clear what the purpose of the new architecture is: to classify, to recognize only movement, etc.

What is missing is a (pseudo)algorithm of some kind and some graphs to illustrate the experiments and the results obtained.

What is the accuracy given in the tables? It is not clear what is being measured or what the metric is (i.e., what the formula is).

It would be useful to include a time metric as well, since the advantage in accuracy is very marginal.

Reviewer 4 Report

Comments and Suggestions for Authors

The manuscript "Spatiotemporal Graph Autoencoder Network for Skeleton-Based Human Action Recognition" has merits; however, I ask the authors to further expand on the following points:

  1. The proposed method is too succinctly presented. I could not actually understand the model and which parts are newly proposed.
  2. I would argue that Section 2 should actually be Section 1.
  3. In all the comparisons on the different datasets, the number of parameters and FLOPs used is missing.
  4. I think the conclusion is not developed enough for a journal paper.