Article
Peer-Review Record

Human Pose Estimation Based on Efficient and Lightweight High-Resolution Network (EL-HRNet)

Sensors 2024, 24(2), 396; https://doi.org/10.3390/s24020396
by Rui Li 1,2, An Yan 1, Shiqiang Yang 1,*, Duo He 1, Xin Zeng 1 and Hongyan Liu 1
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 18 December 2023 / Revised: 29 December 2023 / Accepted: 30 December 2023 / Published: 9 January 2024

Round 1

Reviewer 1 Report (Previous Reviewer 2)

Comments and Suggestions for Authors

The authors have made significant improvements in this revision, but there are still several minor points to address.

1. The Abstract would benefit from additional specific details regarding the technical aspects of the model, such as the methodologies used to lighten the Basicblock module and introduce the CBAM mechanism.

2. In the Introduction, while traditional methods and deep learning algorithms are discussed, a specific problem or challenge in human pose estimation that the paper aims to address is not explicitly stated.

 

3. The Conclusion would benefit from a more definitive statement about the identified areas for improvement and potential directions for future research. Furthermore, linking these findings to broader implications and applications in the field of human pose estimation would provide a more comprehensive conclusion.

 

Author Response

1. The Abstract would benefit from additional specific details regarding the technical aspects of the model, such as the methodologies used to lighten the Basicblock module and introduce the CBAM mechanism.

We greatly appreciate your valuable advice, which has made our manuscript more comprehensive. In the revised abstract, we added a detailed description of the model modifications and shortened the text so that its length meets the requirements of Sensors.

The main modification is given below:

Lines 7-21: ”As an important direction in computer vision, human pose estimation has received extensive attention in recent years. High Resolution Network (HRNet) can achieve effective estimation results as a classical human pose estimation method. However, due to the complex structure of the model, it is not conducive to deploying such methods under limited computer resources. Therefore, an improved Efficient and Lightweight HRNet (EL-HRNet) model is proposed. In detail, the point-wise and grouped convolution were used to construct a lightweight residual module, replacing the original 3×3 module to reduce the parameters. To compensate for the information loss caused by lightweight, the Convolutional Block Attention Module (CBAM) is introduced after the new lightweight residual module to construct the Lightweight Attention Basicblock (LA-Basicblock) module to achieve high-precision human pose estimation. To verify the effectiveness of the proposed EL-HRNet, experiments are carried out using COCO2017 and MPII datasets respectively. Experimental results show that the EL-HRNet model requires only 5 million parameters and 2.0 GFlops calculations, and achieves an AP score of 67.1% on the COCO2017 validation set. In addition, PCKh@0.5 is 87.7% on the MPII validation set, and EL-HRNet shows a good balance between model complexity and human pose estimation accuracy.”
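To illustrate why replacing a standard 3×3 convolution with a grouped 3×3 convolution followed by a point-wise (1×1) convolution reduces parameters, the short sketch below counts the weights of each option. The channel width (32) and the number of groups (4) are assumptions chosen only for illustration; the record does not state the values used in EL-HRNet, and this is not the authors' implementation.

```python
def conv_params(c_in, c_out, k, groups=1, bias=False):
    """Count the learnable parameters of a 2D convolution with a k x k kernel."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    # Each output channel only sees c_in / groups input channels.
    weights = (c_in // groups) * c_out * k * k
    return weights + (c_out if bias else 0)


channels = 32  # hypothetical channel width, not taken from the paper
groups = 4     # hypothetical group count, not taken from the paper

# Standard 3x3 convolution, as in an ordinary residual block.
standard = conv_params(channels, channels, 3)

# Grouped 3x3 convolution followed by a point-wise 1x1 convolution.
grouped = conv_params(channels, channels, 3, groups=groups)
pointwise = conv_params(channels, channels, 1)
lightweight = grouped + pointwise

print(f"standard 3x3:        {standard} parameters")
print(f"grouped 3x3 + 1x1:   {lightweight} parameters")
print(f"fraction of original: {lightweight / standard:.2f}")
```

Under these assumed settings the grouped plus point-wise pair needs roughly a third of the weights of the standard layer. The exact saving depends on the channel width and group count actually used.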

2. In the Introduction, while traditional methods and deep learning algorithms are discussed, a specific problem or challenge in human pose estimation that the paper aims to address is not explicitly stated.

Thank you for your careful review and helpful advice, which improved the quality of our paper. We have revised the introduction section and added a more detailed description of the major challenges in the research field of human pose estimation. Moreover, the purpose of our study is now stated clearly.

The main modification is given below:

Lines 55-67: ”It is difficult to achieve accurate pose estimation when the background color is cluttered and complex, the body parts are occluded, or the body color is similar to the surrounding environment. Maintaining high-resolution information is very important for the detection of these key points. However, in each network structure that maintains high-resolution information, there are high network complexity and a large amount of calculation parameters. Therefore, a major challenge in pose estimation is how to have lower parameters and better performance while preserving high-resolution information. Among them, HRNet achieves high accuracy in the task of human pose estimation, but its parameter number and computational complexity are high. Thus, lightening the network is a major challenge in the field of pose estimation. It is challenging to balance the complexity and accuracy of the model because of the loss of accuracy caused by the lightweight of the model.”

3. The Conclusion would benefit from a more definitive statement about the identified areas for improvement and potential directions for future research. Furthermore, linking these findings to broader implications and applications in the field of human pose estimation would provide a more comprehensive conclusion.

Heartfelt thanks for your valuable time and careful review of our manuscript. We revised the conclusion to discuss the broader implications and applications in the field of human pose estimation and to summarize the article more completely. Thanks again for your advice.

The main modification is given below:

Lines 514-522: ” Although the model in this paper achieves a balance between the complexity and accuracy of human pose estimation, there is still much room for improvement in the accuracy index of the model. Due to the demand for human pose estimation networks on mobile terminals, the number of algorithm parameters and calculations should be considered when estimating the pose on mobile terminals, so a lightweight and accurate model is required. Therefore, the use of the pose estimation model on mobile terminals will be further studied in the future, and how to further improve the prediction accuracy and real-time detection effect of the network model will be studied.”

Reviewer 2 Report (Previous Reviewer 1)

Comments and Suggestions for Authors

The authors proposed Lightweight EL-HRNet for Human Pose Estimation. This is the resubmission of the paper. The authors have addressed my comments in this new submission. There are a few comments which are not addressed properly as follows,

1. Table 2: It is unclear how the proposed method is better than existing methods. In terms of the number of parameters, the proposed method is heavier than most of the listed methods. The number of FLOPs used is also the highest among these methods. Compared with the best method in terms of performance, there is no improvement. So please justify the efficacy of your method.

Also, include this paper in the comparison or other methods from (https://paperswithcode.com/sota/multi-person-pose-estimation-on-coco-test-dev)

Kan Z, Chen S, Li Z, He Z. Self-Constrained Inference Optimization on Structural Groups for Human Pose Estimation. In European Conference on Computer Vision 2022 Oct 23 (pp. 729-745). Cham: Springer Nature Switzerland.

2. Table 3: Results must be compared with other methods as well. Please use the website (http://human-pose.mpi-inf.mpg.de/#results) to get the latest results. For example,

Bulat A, Kossaifi J, Tzimiropoulos G, Pantic M. Toward fast and accurate human pose estimation via soft-gated skip connections. In 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020) 2020 Nov 16 (pp. 8-15). IEEE.

Comments on the Quality of English Language

English language is better

Author Response

1. Table 2: It is unclear how the proposed method is better than existing methods. In terms of the number of parameters, the proposed method is heavier than most of the listed methods. The number of FLOPs used is also the highest among these methods. Compared with the best method in terms of performance, there is no improvement. So please justify the efficacy of your method. Also, include this paper in the comparison or other methods from (https://paperswithcode.com/sota/multi-person-pose-estimation-on-coco-test-dev) Kan Z, Chen S, Li Z, He Z. Self-Constrained Inference Optimization on Structural Groups for Human Pose Estimation. In European Conference on Computer Vision 2022 Oct 23 (pp. 729-745). Cham: Springer Nature Switzerland.

Thank you for your serious and careful review of our manuscript. Our proposed method is mainly a lightweight treatment of the HRNet model. The original HRNet model has 28.5M parameters, whereas our model has only 5M after the lightweight treatment, a reduction of 82.5%. Compared with the SimpleBaseline model, our model is more advantageous in both the number of parameters and accuracy. Compared with Lightweight, SmallHRNet, and Lite-HRNet, our model is more advantageous in accuracy. Compared with ViPNAS, our model is more advantageous in the APM metric, and to provide a further comparison, we added the ScaleNAS model to Table 2. Compared with the ScaleNAS model, which has the highest accuracy, our accuracy still has room to improve, but our model has 5M parameters, far fewer than its 35.6M. The self-constrained inference optimization (SCIO) method is also an excellent model that can effectively improve the accuracy of pose estimation models, but the input sizes of the two models are different (384×288 for SCIO and 256×192 for ours), so it is difficult to make a fair comparison. We appreciate the reviewer’s suggestion; the excellent method in that article provides ideas and inspiration for our follow-up research, so we now discuss it in the introduction section. We cannot deny that our model has room for improvement compared with these excellent methods, and we will continue to improve it in the future.
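The 82.5% reduction quoted in this response can be checked from the two parameter counts it states (28.5M for HRNet and 5M for EL-HRNet). The snippet below is only an arithmetic check of those figures and of the ScaleNAS comparison (35.6M); the helper name `relative_reduction` is illustrative and does not come from the paper.

```python
def relative_reduction(before, after):
    """Return the percentage by which `after` is smaller than `before`."""
    return (before - after) / before * 100.0


# Parameter counts in millions, as stated in the author response.
hrnet_params_m = 28.5
el_hrnet_params_m = 5.0
scalenas_params_m = 35.6

reduction = relative_reduction(hrnet_params_m, el_hrnet_params_m)
scalenas_ratio = scalenas_params_m / el_hrnet_params_m

print(f"Reduction vs. HRNet: {reduction:.1f}%")
print(f"ScaleNAS has {scalenas_ratio:.1f}x the parameters of EL-HRNet")
```

The first figure rounds to 82.5%, which matches the value given in the response.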

The main modification is given below:

Lines 43-47: ”For example, Kan [10] et al. proposed to divide the body key points into six structural groups, each of which was further divided into terminal key points and base key points. And developed a self-constrained prediction-validation network to learn the structural correlations between these two subsets within each structural group.”

Lines 443-450: ”Also using the MobileNetV3 network backbone ViPNAS network model number and accuracy are excellent, but EL-HRNet on medium scale human detection index of accuracy is higher. It is well known that medium scale human detection in daily and industrial scenarios are more widely used. Moreover, compared with the best performing ScaleNAS model, our number of parameters is far smaller than that of the ScaleNAS model, and our FLOPS is only a quarter of the ScaleNAS. Overall, the experimental results demonstrated that the study of HRNet network structure still has its significance.”

2. Table 3: Results must be compared with other methods as well. Please use the website (http://human-pose.mpi-inf.mpg.de/#results) to get the latest results. For example, Bulat A, Kossaifi J, Tzimiropoulos G, Pantic M. Toward fast and accurate human pose estimation via soft-gated skip connections. In 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020) 2020 Nov 16 (pp. 8-15). IEEE.

Thank you very much for your suggestions, which make our article more complete and our conclusions more convincing. We added the Hourglass + U-Net model to the MPII experimental results for comparison. Compared with the Hourglass + U-Net model, our accuracy is lower, but our model has 5M parameters and a computation cost of 2G, far less than its 26M and 33.5G, which greatly reduces the computational and training costs. In the future, we will continue to study how to improve the accuracy of our model.

The main modification is given below:

Lines 477-485: ”The mean PCKh@0.5 increased by 0.2%, while the number of parameters decreased the 20.1M. Compared to the best performing hourglass + U-Net model, our accuracy performance is not good enough, but the number of parameters is 26M, while the number of our model parameters is only 5.0M, much lower than its 26M. Meanwhile, our calculated quantity is only 2.66G, which is much smaller than 33.5G. Therefore, the experimental results show that our proposed model has lower requirements for equipment and computing power, higher computational cost, and is more suitable for peripheral devices (e.g., robot control). The experimental results are compared with these models to prove the validity and rationality of the proposed model. In the future, we will continue to optimize the structure of the model to reduce the number of parameters and improve the precision.”

Lines 492-499: ”For the Lite-HRNet network model, which also adopts the HRNet structure, the accuracy was greatly improved, although the number of parameters is quite large. For the Hourglass + U-Net model and ScaleNAS model with the best accuracy, our number of parameters and computation greatly reduce the computational cost. These comparative results illustrate the validity and rationality of our modified method. By training, validating, and testing on the COCO2017 dataset and MPII dataset, the EL-HRNet model is demonstrated to have a good performance on human pose estimation tasks.”

Round 2

Reviewer 2 Report (Previous Reviewer 1)

Comments and Suggestions for Authors

The authors have revised the paper and addressed my comments.

Comments on the Quality of English Language

The English language is OK. A few minor changes can improve the readability. 

This manuscript is a resubmission of an earlier submission. The following is a list of the peer review reports and author responses from that submission.


Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The authors proposed a lightweight EL-HRNet for human pose estimation. The paper must be revised according to the following comments.

1. Lines 26-27: The sentence should be rewritten to remove language errors.

2. Lines 105-111: This paragraph should be removed.

3. Equation 2 includes /p/w/h. What is the purpose of using such an expression?

4. Equations 3 and 4: The font size in the equations should be the same.

5. Lines 243-263: This paragraph needs revision to make it more understandable for the readers. The introduction of some equations may simplify the text.

6. Line 383: It is better to write the list of all the models with the reference.

7. Equation 11: d should also be defined by an equation.

8. Line 406: Training setup should not be part of the dataset subsections. It should be included in the experimental setup.

9. Line 424: Why are different evaluation criteria used for the two datasets? Moreover, evaluation criteria should be included in materials and methods.

10. Line 438: A subsection should not be started after a section without any introduction of the section.

11. Table 1: Lightweight[29] and ViPNAS[30] have fewer parameters than the proposed method with better performance metric values. What is the significance of a 0.6% improvement in APM metric compared to the size of the model?

12. Table 2: Why are ViPNAS and other models not compared on the test set? Why is the list of methods in the validation dataset and test dataset different?

13. Table 3: A smaller set of methods is used to compare the results. Why?

14. Line 479: Results on validation and testing datasets for both datasets should be compared with a bigger model list, as in Table 1.

15. A more detailed discussion on a compromise on the accuracy of the model size should be provided to prove the efficacy of the proposed method.

Comments on the Quality of English Language

The paper requires extensive proofreading.

 

Reviewer 2 Report

Comments and Suggestions for Authors

To address the conflict between pose estimation performance and limited computer resources, this paper proposes EL-HRNet. It incorporates an L-Basicblock module to reduce parameters and computation. To tackle information loss in L-Basicblock, the paper introduces the CBAM mechanism. Additionally, the paper constructs the LA-Basicblock for accurate human pose estimation. Experimental validation on COCO2017 and MPII datasets confirms the effectiveness of EL-HRNet. While the paper is generally well-organized, there are some points that could be improved upon.

1. The abstract should be better organized and more concise. It should provide an accurate summary of the authors’ findings and their implications.

2. In the Introduction, it is essential for the authors to focus on defining the problem and providing background information. It is important to highlight the main challenges in the research area of Human Pose Estimation.

3. The first paragraph of the “Materials and Methods” section, which describes “How to write the materials and method,” should not be included in the paper.

4. Section 2.1 “HRNet Model” is a widely known concept in the related community. Hence, it is unnecessary to provide detailed explanations. Only a reference to the relevant literature is required.

5. When describing the proposed method, the authors should highlight the main innovation of their EL-HRNet. Although there is a substantial reduction in parameters in the model where the backbone is HR-Net, there are already many lightweight HR-Net-based models. Therefore, it is important to highlight the advantages of EL-HRNet.

6. The significance of the keyword “Lightweight” in the title should be further evaluated, as it did not play a significant role in the experiment.

7. The experimental results need to be further discussed. The authors should strive to clarify what sets their results apart from other Human Pose Estimation approaches and why they are significant.

8. The conclusion section could benefit from a more thorough discussion on the implications and limitations of the research findings.

Comments on the Quality of English Language

The English text still needs to be carefully checked for conciseness and objectivity.
