Article
Peer-Review Record

Development of a Novel Lightweight CNN Model for Classification of Human Actions in UAV-Captured Videos

by Nashwan Adnan Othman 1,2,* and Ilhan Aydin 2
Submission received: 30 December 2022 / Revised: 9 February 2023 / Accepted: 19 February 2023 / Published: 21 February 2023

Round 1

Reviewer 1 Report

Development of a Novel Lightweight CNN Model for Classification of Human Actions in UAV-Captured Videos

 

Comments and Suggestions for Authors:

This paper presents a new deep-learning model based on depthwise separable convolutions, designed to be lightweight. The other parts of the HarNet model comprise convolutional, rectified linear unit, dropout, pooling, padding, and dense blocks.
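To illustrate why depthwise separable convolutions make such a model lightweight, here is a minimal back-of-the-envelope parameter count (the channel sizes and kernel size below are illustrative only, not taken from HarNet):

```python
def conv_params(in_ch, out_ch, k):
    """Parameters in a standard k x k convolution (weights + biases)."""
    return in_ch * out_ch * k * k + out_ch

def separable_params(in_ch, out_ch, k):
    """Depthwise k x k convolution (one filter per input channel)
    followed by a 1x1 pointwise convolution, each with biases."""
    depthwise = in_ch * k * k + in_ch
    pointwise = in_ch * out_ch + out_ch
    return depthwise + pointwise

# Example layer: 64 input channels, 128 output channels, 3x3 kernel
print(conv_params(64, 128, 3))       # → 73856
print(separable_params(64, 128, 3))  # → 8960 (roughly 8x fewer)
```

The roughly 8x reduction for a single layer is what allows such architectures to stay small enough for UAV-oriented deployment.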

I hope that my comments will help to improve the quality of the article:

 * Please recheck the English language throughout the paper.

 

Abstract:

a.    You need to support your claims with numerical outcomes from statistical calculations about the problem and the solutions.

b.    All keywords should be mentioned in the Abstract.

 

1.    Introduction:

a.    The main contributions of the paper are well explained in the Introduction section.

 

2.    Review of Related Works

a.    Please avoid relying on references older than 2018 in the Review of Related Works section, since you are working in 2023. Accordingly, you should replace references (22, 23, and 24) with newer ones.

b.    It would be better to add a separate paragraph at the end of the literature review section summarizing the advantages of your work over previous ones.

 

3.    Methodology

a.    In Figure 1, under what condition does the flow return from the rectangle "Validation of each epoch" to the previous step ("Epoch training of HarNet"), and when does it proceed to the next step ("Trained HarNet model")?

b.    What are the X- and Y-axis labels and units of Figure 3?

c.     Figure 6 is not clear.

d.    In lines 409 to 411, you state: “A new model has been proposed that utilizes depth wise separable convolution operations rather than standard convolution layers in order to reduce the number of parameters.” I think it would be fairer to call it a modified model.

e.     At the end of this section, you state: “In recent years, there has been a trend toward hybrid approaches in research that combine CNNs for feature extraction with other machine learning models for classification. While the softmax function is a commonly used method for classification, in this study, we used it as a baseline approach for comparison purposes. These hybrid approaches aim to take advantage of the strengths of various models in order to achieve a better performance in classification tasks.” Please explain this assumption in more detail.

 

4.    Results and discussion

a.    Add the % symbol to the Y-axis of the accuracy plots in Figure 9.

b.    The caption of Table 4, “Classification report of the proposed HarNet model”, is not correct: these are your own results, not a report output from something.

c.     Figure 10 is not clear, and its numerical values cannot be distinguished well; please make it clearer.

d.    Regarding your results in Table 4, you state: “Moreover, the proposed model had several advantages, including a high classification performance, a low complexity, and a small number of parameters.” How can you prove the high classification performance and low complexity?

e.     What are the units of the compared metrics in Table 5?

 

5.    Conclusion and future work

a.    The conclusion is written in a weak style. Please condense the first paragraph and go directly to the key points of your proposed system.

b.    Please add the significant outcomes extracted from your results in comparison with those of previous works.

c.     Add the important numerical values to the end of the conclusion.

d.    Rewrite the future work section. Avoid the expression “will be”; instead, say, for example, “we suggest using…”, etc.

 

 

Author Response

Please see the attachment.

 

Author Response File: Author Response.docx

Reviewer 2 Report

In this paper, the authors present a new lightweight CNN, named HarNet, to detect human actions from UAV video sequences. The proposed method was evaluated on the UCF-ARG dataset and showed high accuracy when identifying six common actions.

- The paper’s organization is perfect.

- The quality of the writing is excellent.

- The presentation of the methodology (i.e., methods, experiments, and analysis) is clear and understandable.

- The suggested approach's performance is well assessed using several experiments (e.g., ablation) and comparisons.

- The obtained results are acceptable, competitive, and surpass all compared works.

- The references are recent, of good quality, and follow the journal’s format.

For all these reasons, I recommend the acceptance of the paper.

Author Response

Dear Reviewer,

Thank you for your motivational comments.

 

Author Response File: Author Response.docx

Reviewer 3 Report

The review is based on the manuscript provided only. The authors did not provide the source code; therefore, it is impossible to perform an artifact evaluation.

1) The authors are suggested to provide the source code for artifact evaluation.

In this article, the authors propose a lightweight CNN to detect human actions from UAV video sequences. The proposed method was evaluated with the UCF-ARG dataset, and it achieved a relatively high accuracy when identifying six common actions. I appreciate the authors' effort to continue improving the performance of CNNs for the classification of human actions in UAV-captured videos.

After a careful review of the manuscript, I have the following concerns:

2) The proposed method achieved consistently higher accuracy and lower loss on validation than on training. It seems that the validation images were sampled from video clips of persons who appear in both training and validation, so the validation dataset is very similar to the training set and is not new to the model.

3) This validation setup may result in an unfairly high accuracy, as well as an unfair comparison to other research.

4) The UCF-ARG dataset contains 10 actions: Boxing, Carrying, Clapping, Digging, Jogging, Open-Close Trunk, Running, Throwing, Walking, and Waving. Presenting results for only 6 actions (Boxing, Digging, Running, Throwing, Walking, and Waving) leaves readers wondering how the proposed method would perform on actions such as Carrying, Clapping, Jogging, and Open-Close Trunk.

5) Considering that Waqas Sultani et al.'s GAN + DML method compared in Table 6 evaluated 8 actions, a minimum of 8 actions should be evaluated to show the proposed method is superior in accuracy. The authors should conduct the experiments more thoroughly.

6) The authors are suggested to use Leave-p-Out cross-validation to better evaluate the model's performance.

7) Figure 2 could use a horizontal layout to better present the various actions.

8) The samples of human actions in Figure 6 are low resolution due to the resolution of the dataset, but the red-colored annotations could be clearer.
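The Leave-p-Out scheme suggested in point 6 can be sketched in a few lines: every size-p subset of the samples serves as the test set exactly once, giving C(n, p) splits in total (a minimal illustration with toy indices, not the authors' evaluation code):

```python
from itertools import combinations

def leave_p_out_splits(n_samples, p):
    """Generate (train, test) index splits for leave-p-out
    cross-validation: every size-p subset is the test set once."""
    indices = set(range(n_samples))
    for test in combinations(range(n_samples), p):
        train = sorted(indices - set(test))
        yield train, list(test)

splits = list(leave_p_out_splits(5, 2))
print(len(splits))  # → 10, i.e. C(5, 2) splits
```

Because the number of splits grows combinatorially, in practice p is kept small, or the scheme is applied at the level of subjects rather than individual clips.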

Author Response

Dear reviewer,

Please see the attachment to see the responses to the comments.

 

Author Response File: Author Response.docx

Round 2

Reviewer 3 Report

Response to Responses 2, 3, 4 and 6:

Making claims of superiority for a proposed method over others is unacceptable unless evaluations are conducted with the same criteria. For instance, Hazar Mliki et al. [17] applied Leave-p-Out cross-validation, which is particularly crucial for evaluating models on datasets with similar characteristics. "Large" datasets refer to a variety of images, not just a high number of images. For example, a vast number of identical images would be unhelpful. "Large" implies a substantial quantity of diverse images.

The proposed method consistently showed higher accuracy and lower loss during validation compared to training. This might be because the validation images were likely drawn from video clips of individuals present in both the training and validation datasets, making the validation dataset similar to the training set and not challenging the model. This validation setup could lead to an overestimated accuracy and an unfair comparison to other studies.
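The leakage concern described above is usually avoided with a subject-wise split, where no person appears in both the training and validation sets. A minimal sketch (the `subject_wise_split` helper and the clip names are hypothetical, purely for illustration):

```python
def subject_wise_split(samples, val_subjects):
    """Split (subject_id, clip) pairs so that no subject appears in
    both training and validation, preventing identity leakage."""
    train = [s for s in samples if s[0] not in val_subjects]
    val = [s for s in samples if s[0] in val_subjects]
    return train, val

# Toy clip list: (subject_id, clip_name); names are illustrative only.
clips = [(1, "boxing_a"), (1, "waving_a"), (2, "boxing_b"),
         (3, "walking_a"), (3, "waving_b"), (4, "digging_a")]
train, val = subject_wise_split(clips, val_subjects={3, 4})
print([s for s, _ in train], [s for s, _ in val])  # subjects never overlap
```

With such a split, validation accuracy reflects generalization to unseen people rather than recognition of familiar ones.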

The UCF-ARG dataset comprises 10 actions: Boxing, Carrying, Clapping, Digging, Jogging, Open-Close Trunk, Running, Throwing, Walking, and Waving. However, the proposed method only assesses 6 actions: Boxing, Digging, Running, Throwing, Walking, and Waving. Please directly address this comment by presenting a performance evaluation of the proposed method on the actions Carrying, Clapping, Jogging, and Open-Close Trunk.

To claim superiority in accuracy, the authors of the proposed method MUST compare their results to the 8 actions evaluated in Waqas Sultani et al.'s GAN + DML method as presented in Table 6. The authors MUST conduct additional experiments to demonstrate the performance of their proposed method on all 8 actions.

Author Response

Please see the attachment.

 

Author Response File: Author Response.docx
