Article
Peer-Review Record

MSTPose: Learning-Enriched Visual Information with Multi-Scale Transformers for Human Pose Estimation

by Chengyu Wu *, Xin Wei, Shaohua Li and Ao Zhan
Electronics 2023, 12(15), 3244; https://doi.org/10.3390/electronics12153244
Submission received: 26 June 2023 / Revised: 26 July 2023 / Accepted: 26 July 2023 / Published: 27 July 2023
(This article belongs to the Section Computer Science & Engineering)

Round 1

Reviewer 1 Report

The paper presents a model architecture for Human Pose Estimation using Multi-scale Transformers and Convolutional Neural Networks. The paper is well written, the methods are good, and the results, although not state-of-the-art, are very close.

Here are some suggestions:

Can you compare it against state-of-the-art models? For example, the ones listed at https://paperswithcode.com/sota/pose-estimation-on-coco-test-dev and https://paperswithcode.com/sota/pose-estimation-on-mpii-human-pose

What is the contribution compared to those methods?

Author Response

Thanks for your suggestions.  

Compared to the state-of-the-art ViTPose model, our proposed model uses a different framework; therefore, we did not directly compare the two in the original manuscript.

In the revised manuscript, we have added a comparison with the state-of-the-art (SOTA) model in the conclusion section. Specifically, compared to the pure-Transformer ViTPose model, our model significantly reduces the computational complexity and resource requirements while only slightly compromising accuracy.

Reviewer 2 Report

Figure 1 should be enlarged, because the text in the figure is almost half the size of the paper's body text. Also, the blue arrows in Figure 1 are not clearly recognizable as blue arrows.

The text mentions HRNetW48-s, but it is not clear which part of the proposed architecture presented in Figure 1 corresponds to it. Please either mark HRNetW48-s in Figure 1 or clarify this in the text.

In Figure 1, the leftmost processing block (encoder) is labeled "Stem", but this term is not described anywhere in the text.

The coordinate attention mechanism is central to the proposed approach, so the intuition behind it needs elaboration. The paper only mentions that "Each attention map captures long-range dependencies along the spatial direction of the input feature maps."
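For orientation, a minimal PyTorch sketch of a generic coordinate-attention block, following the general design of Hou et al. (CVPR 2021); the class name, layer names, and reduction ratio are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Generic coordinate attention: pool along H and W separately so each
    attention map captures long-range context along one spatial direction."""
    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # (B, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # (B, C, 1, W)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        xh = self.pool_h(x)                      # direction-aware descriptor along H
        xw = self.pool_w(x).permute(0, 1, 3, 2)  # along W, transposed to (B, C, W, 1)
        y = self.act(self.bn(self.conv1(torch.cat([xh, xw], dim=2))))
        yh, yw = torch.split(y, [h, w], dim=2)
        ah = torch.sigmoid(self.conv_h(yh))                      # (B, C, H, 1)
        aw = torch.sigmoid(self.conv_w(yw.permute(0, 1, 3, 2)))  # (B, C, 1, W)
        return x * ah * aw  # reweight features with the two directional maps
```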

Section 3.3 states that "the Transformer is a sequential network", which is imprecise. It should be "the Transformer processes sequential data" (or similar).

Section 3.3 mentions that "Transformer lacks positional awareness, position encoding is then applied to the channel tokens and spatial tokens." Please specify, in one sentence, how this positional encoding is implemented.
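For illustration, one common implementation is a learnable per-position embedding added to the token sequence before the encoder; a minimal PyTorch sketch (the class name is hypothetical, and whether MSTPose uses learnable or fixed sinusoidal encodings is precisely what the authors should state):

```python
import torch
import torch.nn as nn

class TokensWithPosition(nn.Module):
    def __init__(self, num_tokens: int, dim: int):
        super().__init__()
        # One trainable vector per token position, broadcast over the batch.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_tokens, dim))
        nn.init.trunc_normal_(self.pos_embed, std=0.02)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim), e.g. spatial or channel tokens
        return tokens + self.pos_embed
```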

Section 3.3 refers to Figure 3. However, Figure 3 shows encoders labeled "Channel Encoder" and "Spatial Encoder" that are not referenced in the text. The description of the MST presented in Figure 3 should be clarified so that it corresponds to the figure. Also, the mathematical notation used in the text should appear in the figure as well (wherever possible).

Section 3.4 mentions, "In this paper, we first fuse the one-dimensional sequences output from three branches to generate ...". It is not clear how this fusion is implemented; this must be clarified.
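Purely as a hypothetical illustration of what such a fusion could look like, here is a PyTorch sketch of concatenation along the token axis followed by a linear projection; the class name and the design are assumptions, since the paper does not specify the actual operation:

```python
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)  # mix features after concatenation

    def forward(self, seqs: list[torch.Tensor]) -> torch.Tensor:
        # seqs: three tensors of shape (batch, tokens_i, dim), one per scale
        fused = torch.cat(seqs, dim=1)   # (batch, sum(tokens_i), dim)
        return self.proj(fused)
```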

The mathematical notation for the variables in Section 3.4, as well as their dimensions, should also be noted in Figure 4. It is difficult to trace the processing stages of each variable depicted in the figure.

It is not mentioned which deep learning framework is used for the implementation.

If claims of such large improvements are made, e.g., a "decrease of 39.7% in GFLOPs" and "improves the AP by 4.8%", then it is recommended that the source code be published publicly.

The English quality is good; only minor editing is needed.

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 3 Report

In this paper, the authors present their work on using multi-scale transformers for 2D human pose estimation.

Overall, the paper is well written, but I have minor concerns about the presentation. Specifically, as it currently stands, it is unclear what the key novelty is. The current presentation mentions the following as the novelty (at least that is how I read it): using transformers with a CNN backbone, using coordinate attention, and using multi-scale training. These are good strategies, but they have been done in the past to a certain extent. The "why" part of the research question is not clearly answered in the paper. The only contribution (again, as it is portrayed) that I feel is novel, and that is clearly motivated as to "why" it is being done, is the use of the VeR module to process the one-dimensional transformer output instead of a conventional heatmap. I urge the authors to state the novelty clearly in comparison to previous works and to present the motivation behind it.
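For context, a schematic PyTorch sketch of a regression head in the spirit of VeR: keypoint coordinates are read directly from the one-dimensional transformer output rather than decoded from a 2D heatmap, which avoids argmax quantization error. The class name, the joint count of 17 (COCO), and the sigmoid normalization are illustrative assumptions; the paper's actual VeR design may differ:

```python
import torch
import torch.nn as nn

class RegressionHead(nn.Module):
    def __init__(self, dim: int, num_joints: int = 17):
        super().__init__()
        # Map each keypoint token straight to an (x, y) pair in [0, 1].
        self.to_xy = nn.Linear(dim, 2)
        self.num_joints = num_joints

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_joints, dim) one-dimensional transformer output
        return torch.sigmoid(self.to_xy(tokens))  # (batch, num_joints, 2)
```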

Most of the sections of the paper, i.e., the introduction and methods, are adequately described. However, there are some issues with the experimental results section; I would request the authors to fix or clarify the following issues. Specifically, the way the ablation studies are performed is unclear.

What is the difference between Tables 4 and 5? The way I see it, the authors are trying to justify why they used all scales rather than only the higher resolutions, as others do. These are ablations for the scales, not for the ATTM or MSTPose. Also, one reason why others use only the higher resolutions is to reduce complexity; that needs to be part of the comparison. Please clarify.

Table 6 is good and clearly justifies the use of VeR. However, I am not sure why the first two columns are needed; you could remove them and the table would still convey the same meaning.

As for Table 7, this is the main ablation study for the proposed components of your model. However, as it currently stands, it does not justify VeR: it only tests the use of ATTM and MSTPose, since VeR is used in all the rows. The baseline should show nothing but the base model without any of your contributions. Then, each module should be added in turn to check whether it improves performance. As I understand it, this table should have 8 rows for the 3 different contributions you are claiming, i.e., every on/off combination, as enumerated in the sketch below. Another possibility is using just 4 rows, assuming VeR must be active to produce any output (the head must be either heatmap or VeR, and Table 6 already justifies VeR). Also, the first column can be removed here too.
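For concreteness, a tiny Python sketch of the 2^3 = 8 configurations such a full ablation would cover (the module names ATTM, MST, and VeR follow the review; the all-off row is the baseline):

```python
# Enumerate all on/off combinations of the three claimed contributions;
# each combination corresponds to one row of the full ablation table.
from itertools import product

for attm, mst, ver in product([False, True], repeat=3):
    print(f"ATTM={attm}  MST={mst}  VeR={ver}")
```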

Lastly, there has been some progress in this field, with works that can be compared against. Please update the literature review and the comparisons to include more recent works (some are listed below):
(1) (currently reference #36 in the manuscript) Yang, S.; Quan, Z.; Nie, M.; Yang, W. TransPose: Keypoint Localization via Transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 11802–11812.
(2) Ma, H.; Chen, L.; Kong, D.; Wang, Z.; Liu, X.; Tang, H.; Yan, X.; Xie, Y.; Lin, S.-Y.; Xie, X. TransFusion: Cross-View Fusion with Transformer for 3D Human Pose Estimation. In Proceedings of the British Machine Vision Conference (BMVC), 2021.
(3) Li, Y.; Zhang, S.; Wang, Z.; Yang, S.; Yang, W.; Xia, S.-T.; Zhou, E. TokenPose: Learning Keypoint Tokens for Human Pose Estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
(4) Ma, H.; Wang, Z.; Chen, Y.; Kong, D.; Chen, L.; Liu, X.; ...; Xie, X. PPT: Token-Pruned Pose Transformer for Monocular and Multi-View Human Pose Estimation. In Proceedings of the European Conference on Computer Vision (ECCV), 2022; pp. 424–442.
(5) Li, J.; Bian, S.; Zeng, A.; Wang, C.; Pang, B.; Liu, W.; Lu, C. Human Pose Regression with Residual Log-Likelihood Estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
(6) Mao, W.; Ge, Y.; Shen, C.; Tian, Z.; Wang, X.; Wang, Z.; van den Hengel, A. Poseur: Direct Human Pose Regression with Transformers. In Proceedings of the European Conference on Computer Vision (ECCV), 2022; pp. 72–88.
(7) Li, Y.; Yang, S.; Liu, P.; Zhang, S.; Wang, Y.; Wang, Z.; ...; Xia, S.-T. SimCC: A Simple Coordinate Classification Perspective for Human Pose Estimation. In Proceedings of the European Conference on Computer Vision (ECCV), 2022; pp. 89–106.

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Round 2

Reviewer 1 Report

The paper presents a model architecture for Human Pose Estimation using Multi-scale Transformers and Convolutional Neural Networks. The paper is well written, the methods are good, and the results, although not state-of-the-art, are very close.

Here are some suggestions:

The authors claim state-of-the-art results in the tables presented. However, there are other methods that obtain better performance. Please add those to the tables. They can be found here:

https://paperswithcode.com/sota/pose-estimation-on-coco-test-dev and https://paperswithcode.com/sota/pose-estimation-on-mpii-human-pose

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 3

Reviewer 1 Report

The authors responded to all the comments.
