Article
Peer-Review Record

A Compact and Powerful Single-Stage Network for Multi-Person Pose Estimation†

Electronics 2023, 12(4), 857; https://doi.org/10.3390/electronics12040857
by Yabo Xiao 1, Xiaojuan Wang 1,*, Mingshu He 1, Lei Jin 1, Mei Song 1 and Jian Zhao 2,3
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 4 January 2023 / Revised: 23 January 2023 / Accepted: 31 January 2023 / Published: 8 February 2023
(This article belongs to the Section Artificial Intelligence)

Round 1

Reviewer 1 Report

General Comments

1. The reference system is quite confusing. Please order the references as they are cited in the paper.

2. There is some misuse of definite articles in the manuscript. Please check.

3. Figure 1 is NOT used. 

Questions/Suggestions

Lines 3-4, Abstract, Page 1

between the human distance and ... thus leading to the high computation cost ...

=> between human distance and ... thus leading to high computation cost ...

Line 9, Para 1, Section 1, Page 1

while suffers ...

=> while suffering ...

Lines 1~3, Para 1, Page 2

... as shown in Fig. 2(a).

Question: Is "Fig. 2(a)" a typo, or is a piece missing that illustrates the use of top-down and bottom-up methods?

Lines 6-7, Para 2, Page 2

The connections can be ...

=> Connections can be ...

Line 8, Para 2, Section 2, Page 4

For transformer-based network, ...

=> For transformer-based networks, ...

Lines -9~-8, Page 7

... set to 2 and 4, following [16,33].

Question: "following" ...??

Line 2, Para 1, Section 4.1, Page 9

MS COCO dataset [21] ...

=> The MS COCO dataset [21] ...

And more ... Please check.

 

Author Response

Response to Reviewer

General Comments

Q1. The reference system is quite confusing. Please order the references as they are cited in the paper.

A1: Thanks for your comments. In the revision, we have reordered the references to match the order in which they are cited in the paper.

Q2. There is some misuse of definite articles in the manuscript. Please check.

A2: Thanks. We have carefully revised the writing of the paper in the revision.

Q3. Figure 1 is NOT used. 

A3: Thanks for your valuable comment. We mention Figure 1 in Paragraph 4 of the Introduction.

Questions

Q4.

Lines 1~3, Para 1, Page 2

... as shown in Fig. 2(a).

Question: Is "Fig. 2(a)" a typo, or is a piece missing that illustrates the use of top-down and bottom-up methods?

A4: Thanks for your concern. Fig. 2(a) denotes the keypoint heatmap representation. Generally, both top-down and bottom-up methods use the heatmap representation to locate keypoints. Thus, we use Fig. 2(a) to illustrate the conventional heatmap representation in the top-down and bottom-up paradigms.

Suggestions

Q5.

Lines 3-4, Abstract, Page 1

between the human distance and ... thus leading to the high computation cost ...

=> between human distance and ... thus leading to high computation cost ...

Line 9, Para 1, Section 1, Page 1

while suffers ...

=> while suffering ...

Lines 6-7, Para 2, Page 2

The connections can be ...

=> Connections can be ...

Line 8, Para 2, Section 2, Page 4

For transformer-based network, ...

=> For transformer-based networks, ...

Lines -9~-8, Page 7

... set to 2 and 4, following [16,33].

Question: "following" ...??

Line 2, Para 1, Section 4.1, Page 9

MS COCO dataset [21] ...

=> The MS COCO dataset [21] ...

And more ... Please check.

A5: Thanks. We have addressed all of the above issues and carefully revised our paper in the revision.

Reviewer 2 Report

Multi-person pose estimation generally follows top-down and bottom-up paradigms. Both of them use an extra stage (e.g., human detection in the top-down paradigm or a grouping process in the bottom-up paradigm) to build the relationship between the human instance and the corresponding keypoints, thus leading to a high computation cost and a redundant two-stage pipeline. To address the above issue, this paper proposes to represent the human parts as adaptive points and introduces a fine-grained body representation method. The novel body representation is able to sufficiently encode the diverse pose information and effectively model the relationship between the human instance and the corresponding keypoints in a single forward pass. With the proposed body representation, the authors further deliver a compact single-stage multi-person pose regression network. Without any bells and whistles, they achieve the most competitive performance on MS COCO and CrowdPose in terms of accuracy and speed. Furthermore, the outstanding performance on MuCo-3DHP and MuPoTS-3D further demonstrates the effectiveness and generalizability on 3D scenes.

 

Review opinions:

 

1. In fact, the position of the foot can also be used as an important reference element. Why not express it as a relevant point?

 

2. Is it possible to design a correction backtracking unit? If there is a deviation in the central positioning, could it be adjusted to prevent the deviation of the relevant points?

 

3. It is suggested to explain why the data set is divided according to the ratio of 4:1:5. Is this the same as in the related work?

 

4. In Table 8, when comparing the methods proposed in the paper with the top-down and bottom-up methods, it seems that the methods mentioned in the table were all proposed before 2020. Is there any newer work to compare with?

 

5. Based on the visualization in Figure 8, it seems that the method cannot find all the people in the image, and some people are not successfully marked even though the body can be clearly seen.

 

6. Some related works should be discussed, such as 10.1109/TDSC.2020.3004708, 10.1109/TCSVT.2019.2896270, and 10.1109/TNSE.2021.3139671.

 

Author Response

Response to Reviewer

Q1. In fact, the position of the foot can also be used as an important reference element. Why not express it as a relevant point?

A1: Thanks for your comments. As mentioned in the initial manuscript, we divide the body according to the inherent structure of the human body. The grouped keypoints are adjacent, and each part is a rigid structure. We further conducted an experiment in which the ankle is expressed as an adaptive point and obtained a similar result (64.5 AP vs. 64.6 AP with the original partition strategy). We consider that adjacent keypoints can be sufficiently expressed by one human part-related point (e.g., the ankle and knee are grouped into one part and represented by its corresponding adaptive point). A more fine-grained division strategy only brings extra computational burden without performance gains.

 

Q2. Is it possible to design a correction backtracking unit? If there is a deviation in the central positioning, could it be adjusted to prevent the deviation of the relevant points?

A2: Thanks for your valuable comments. Previous methods directly use the center feature to regress the keypoint offsets, so a deviation of the center aggravates the feature misalignment and further leads to keypoint regression bias. In our work, we introduce adaptive part-related points to separately represent the keypoints in local parts. By using the adaptive part-related points, the Enhanced Center-Aware Branch adjusts the central positioning, and the Two-hop Regression alleviates the keypoint bias caused by the deviation of the center. In this manner, if there is a deviation in the central positioning, the adaptive part-related points can prevent keypoint regression bias accordingly. The proposed adaptive part-related points can therefore be regarded as a correction backtracking unit.

Besides that, the adaptive relevant points only act as intermediate representations, so there is no optimal target or metric to evaluate their deviation. As long as the keypoints can be accurately located from their features, the learned relevant points are reasonable.
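
The two-hop idea described in this response can be illustrated with a small geometric sketch. It is not the authors' network, which predicts offsets from learned features; it only shows how a keypoint position composes from a center, a part-related point, and a second offset. The function name, the part name, and all numbers below are hypothetical.

```python
from typing import Dict, Tuple

Point = Tuple[float, float]


def two_hop_decode(
    center: Point,
    part_offsets: Dict[str, Point],
    keypoint_offsets: Dict[str, Dict[str, Point]],
) -> Dict[str, Point]:
    """Compose keypoints as center -> part-related point -> keypoint."""
    keypoints: Dict[str, Point] = {}
    for part, (dx, dy) in part_offsets.items():
        # Hop 1: from the person center to the adaptive part-related point.
        px, py = center[0] + dx, center[1] + dy
        # Hop 2: from the part-related point to each keypoint in that part.
        for name, (kx, ky) in keypoint_offsets[part].items():
            keypoints[name] = (px + kx, py + ky)
    return keypoints


# Hypothetical values: one person, one part grouping the knee and the ankle.
pose = two_hop_decode(
    center=(100.0, 80.0),
    part_offsets={"left_leg": (-10.0, 40.0)},
    keypoint_offsets={
        "left_leg": {"left_knee": (0.0, -5.0), "left_ankle": (2.0, 30.0)}
    },
)
print(pose)
```

In this simplified picture, each keypoint is reached through a nearby part-related point rather than directly from the center, which is the intuition the response gives for why a center deviation is less harmful.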

 

Q3: It is suggested to explain why the data set is divided according to the ratio of 5:1:4. Is this the same as in the related work?

A3: Thanks for your concern. All related works divide CrowdPose in a 5:1:4 ratio for training, validation, and testing. We strictly follow the previous methods for fair comparison.
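
As a quick worked example of what a 5:1:4 ratio means in image counts, the sketch below splits a total into train, validation, and test sizes. The helper name and the total of 20,000 images are illustrative assumptions and are not stated in this record.

```python
from typing import Tuple


def split_sizes(total: int, ratio: Tuple[int, int, int] = (5, 1, 4)) -> Tuple[int, int, int]:
    """Return (train, val, test) counts for an integer ratio."""
    parts = sum(ratio)
    train = total * ratio[0] // parts
    val = total * ratio[1] // parts
    # Give any rounding remainder to the test set so the counts sum to total.
    test = total - train - val
    return train, val, test


# Illustrative total only; substitute the real dataset size.
print(split_sizes(20000))
```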

 

Q4: In Table 8, when comparing the methods proposed in the paper with the top-down and bottom-up methods, it seems that the methods mentioned in the table were all proposed before 2020. Is there any newer work to compare with?

A4: Thanks for your concern. AdaptivePose and AdaptivePose++ are designed to simplify the 2D multi-person pose estimation pipeline. We simply extend AdaptivePose to 3D scenes to verify its generality. Consequently, we only reported representative 3D top-down and bottom-up methods. In Table 8 of the revision, we add more work published after 2020 for a more comprehensive comparison.

 

Q5. Based on the visualization in Figure 8, it seems that the method cannot find all the people in the image, and some people are not successfully marked even though the body can be clearly seen.

A5: Thanks for your concern. For clear visualization, we simply set a visualization threshold (e.g., 0.5), and only human poses whose scores are higher than the threshold are visualized. We observe that only a few persons with serious occlusion are not visualized in Figure 8. The keypoints of these samples are mostly invisible, so their pose scores are lower than the threshold.
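
The thresholding step described in this response can be sketched as a simple filter. The data layout (a list of dictionaries with a "score" field), the function name, and the example scores are assumptions for illustration; only the threshold of 0.5 comes from the response.

```python
from typing import Dict, List


def select_poses_to_draw(poses: List[Dict], threshold: float = 0.5) -> List[Dict]:
    """Keep only poses whose score is strictly higher than the threshold."""
    return [pose for pose in poses if pose["score"] > threshold]


# Hypothetical detections: person C is heavily occluded and scores low.
detections = [
    {"person": "A", "score": 0.91},
    {"person": "B", "score": 0.62},
    {"person": "C", "score": 0.31},
]
drawn = select_poses_to_draw(detections)
print([pose["person"] for pose in drawn])
```

Under this reading, a person missing from the figure was still detected but fell below the display threshold, and lowering the threshold would make that pose appear.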

 

Q6. Some related works should be discussed, such as 10.1109/TDSC.2020.3004708, 10.1109/TCSVT.2019.2896270, and 10.1109/TNSE.2021.3139671.

A6: Thanks for your comments. We cite and discuss the above papers in the Introduction of the revision.

Reviewer 3 Report

 

The paper presents a novel body representation method for multi-person pose estimation that represents human parts as adaptive points. This method is able to effectively model the relationship between the human instance and corresponding keypoints in a single-forward pass. A compact single-stage multi-person pose regression network, called AdaptivePose++, is introduced that only requires a single-step decode operation during inference. The method is tested on 2D/3D multi-person pose estimation tasks and is found to achieve competitive performance on MS COCO and CrowdPose datasets in terms of accuracy and speed, and it shows its effectiveness and generalizability on 3D scenes.

 

However, it could be improved.

 

1-

Abstract:

Clarify the difference between the top-down and bottom-up paradigms in more detail.

Provide more specific information about the proposed body representation method.

Provide more context for the datasets (MS COCO, CrowdPose, MuCo-3DHP, MuPoTS-3D) used to evaluate the proposed method.

Provide more information about the performance improvements achieved by the proposed method.

Add some references to support the claims made in the abstract.

The abstract is a bit dense; consider breaking up the sentences into shorter and simpler ones to make the content easier to read.

2-

Errors:

 

"Multi-person pose estimation generally follows top-down and bottom-up paradigms. Both of them use an extra stage (e.g., human detection in top-down paradigm or grouping process in bottom-up paradigm) to build the relationship between the human instance and corresponding keypoints, thus leading to the high computation cost and redundant two-stage pipeline."

The sentence is too long and difficult to follow; consider breaking it up into shorter sentences.

"With the proposed body representation, we further deliver a compact single-stage multi-person pose regression network, termed as AdaptivePose++, which is the extended version of AAAI-22 paper [1]."

The word "deliver" might not be the correct word to use in this context; consider using alternative terms such as "propose" or "introduce".

"During inference, our proposed network only needs a single-step decode operation to form the multi-person pose without complex post-processes and refinements."

The word "form" might not be the right term to use here; consider using "estimate" or "infer" instead.

Overall, the abstract is a bit dense and uses technical terms. It might be helpful to simplify the language and add more details and examples to make the content clearer.

3-

Provide more specific information about the improvements achieved by the proposed method, such as the exact accuracy increases.

4-Add some references to support the claims made in the section.

5-

"The input image are resized to 832×512 for both training and testing process." The sentence should be "The input images are resized to 832x512 for both training and testing process."

6-

Provide more information about the specific input image size and the size used for training and testing.

7-Provide more specific information about the improvements achieved by the proposed method, such as the exact accuracy increases.

8-"The test images and the corresponding 3D multi-person pose predicted by our proposed AdaptivePose-3D" It should be "The test images and the corresponding 3D multi-person poses predicted by our proposed AdaptivePose-3D."

 

9-Conclusions

Summarize the main contributions of the paper in a clear and concise manner.

Provide more specific information about the improvements achieved by the proposed method, such as the exact accuracy increases.

Consider providing some future research directions or limitations of the proposed method.

 

 

 

Author Response

Response to Reviewer

Q1.

Abstract:

Clarify the difference between the top-down and bottom-up paradigms in more detail.

Provide more specific information about the proposed body representation method.

Provide more context for the datasets (MS COCO, CrowdPose, MuCo-3DHP, MuPoTS-3D) used to evaluate the proposed method.

Provide more information about the performance improvements achieved by the proposed method.

Add some references to support the claims made in the abstract.

The abstract is a bit dense; consider breaking up the sentences into shorter and simpler ones to make the content easier to read.

A1: Thanks for your valuable comments. We have revised the paper as follows: (1) We briefly clarify the difference between the top-down and bottom-up paradigms in the Abstract. (2) We give a brief description of the proposed body representation method. (3) We provide the context of the datasets and the accuracy increases to support the claims made in the Abstract. (4) We break up the sentences and reorganize the Abstract.

 

Q2.

Errors:

"Multi-person pose estimation generally follows top-down and bottom-up paradigms. Both of them use an extra stage (e.g., human detection in top-down paradigm or grouping process in bottom-up paradigm) to build the relationship between the human instance and corresponding keypoints, thus leading to the high computation cost and redundant two-stage pipeline."

The sentence is too long and difficult to follow; consider breaking it up into shorter sentences.

"With the proposed body representation, we further deliver a compact single-stage multi-person pose regression network, termed as AdaptivePose++, which is the extended version of AAAI-22 paper [1]."

The word "deliver" might not be the correct word to use in this context; consider using alternative terms such as "propose" or "introduce".

"During inference, our proposed network only needs a single-step decode operation to form the multi-person pose without complex post-processes and refinements."

The word "form" might not be the right term to use here; consider using "estimate" or "infer" instead.

Overall, the abstract is a bit dense and uses technical terms. It might be helpful to simplify the language and add more details and examples to make the content clearer.

A2: Thanks. We have corrected the above issues in the revision.

 

Q3. Provide more specific information about the improvements achieved by the proposed method, such as the exact accuracy increases.

A3: Thanks. In the revision, we provide the representative accuracy increases in the Abstract and give a detailed description of the accuracy increases in Section 4.3, Section 4.4, and the last paragraph of Section 4.5.

 

Q4. Add some references to support the claims made in the section.

A4: Thanks. We have added representative references to support the claims made in the Abstract.

 

Q5. "The input image are resized to 832×512 for both training and testing process." The sentence should be "The input images are resized to 832x512 for both training and testing process."

A5: Thanks. We have revised our paper carefully.

 

Q6. Provide more information about the specific input image size and the size used for training and testing.

A6: Thanks. We provide detailed information about the input image size in the Data Augmentation and Implementation Details of Section 4.1 and in the Implementation Details of Section 4.5.

 

Q7. Provide more specific information about the improvements achieved by the proposed method, such as the exact accuracy increases.

A7: Thanks. In the revision, we provide the representative accuracy increases in the Abstract and give a detailed description of the accuracy increases in Section 4.3, Section 4.4, and the last paragraph of Section 4.5.

 

Q8. "The test images and the corresponding 3D multi-person pose predicted by our proposed AdaptivePose-3D" It should be "The test images and the corresponding 3D multi-person poses predicted by our proposed AdaptivePose-3D."

A8: Thanks. We have corrected it in the revision.

 

Q9. Conclusions

Summarize the main contributions of the paper in a clear and concise manner.

Provide more specific information about the improvements achieved by the proposed method, such as the exact accuracy increases.

Consider providing some future research directions or limitations of the proposed method.

A9: We have reorganized the Conclusion, provided the representative accuracy improvements, and added a discussion of the limitations and future research directions.

Round 2

Reviewer 2 Report

The authors have successfully addressed my major concerns. I recommend accepting this manuscript.
