A Novel Part Refinement Tandem Transformer for Human–Object Interaction Detection
Abstract
1. Introduction
- This study decomposes the rich HOI prediction targets and assigns multiple queries to them, so that each query focuses on the features relevant to its own element while maintaining an effective trade-off among elements. At the same time, it adopts a multiple tandem decoders strategy that avoids additional, time-consuming matching operations.
- We efficiently encode and integrate appearance features and part-state semantics through a pretrained BERT model, using human pose keypoints as cues.
- This study adopts a novel prior-feature-integrated cross-attention layer that efficiently introduces fine-grained part-state semantics and appearance features in the second stage to guide and refine the queries (illustrative sketches of these components follow this list).
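A minimal sketch (not the authors' implementation) of the tandem-decoder idea: two DETR-style decoders are chained so that the refined queries of the first (instance) decoder directly seed the second (interaction) decoder, which is what removes the need for an extra matching step between detected pairs and predicted interactions. All module names and hyperparameters below are illustrative assumptions.

```python
import torch.nn as nn

class TandemDecoders(nn.Module):
    """Two DETR-style decoders in tandem: stage-2 queries are stage-1 outputs."""
    def __init__(self, d_model=256, nhead=8, num_layers=3, num_queries=100):
        super().__init__()
        self.instance_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), num_layers)
        self.interaction_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), num_layers)
        self.query_embed = nn.Embedding(num_queries, d_model)

    def forward(self, memory):
        # memory: (B, HW, d_model) flattened encoder features from the backbone
        b = memory.size(0)
        queries = self.query_embed.weight.unsqueeze(0).expand(b, -1, -1)
        inst_q = self.instance_decoder(queries, memory)   # stage 1: human/object pair queries
        # Stage 2 reuses the stage-1 queries, so each interaction prediction is
        # already aligned with one detected pair -- no post-hoc matching step.
        inter_q = self.interaction_decoder(inst_q, memory)
        return inst_q, inter_q
```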
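A small sketch of how part-state phrases could be embedded with a pretrained BERT, assuming the HuggingFace transformers API; the phrase vocabulary shown is hypothetical, and the paper's exact part-state definitions (and the way pose keypoints select the relevant parts) may differ.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

# Hypothetical part-state phrases; the paper's part-state vocabulary may differ.
part_states = [
    "right hand holds something",
    "left foot kicks something",
    "head looks at something",
]

with torch.no_grad():
    tokens = tokenizer(part_states, padding=True, return_tensors="pt")
    out = bert(**tokens)
    # Use the [CLS] token embedding as the sentence-level part-state semantic vector.
    part_state_emb = out.last_hidden_state[:, 0]   # shape: (num_states, 768)
```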
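A hedged sketch of a prior-feature-integrated cross-attention layer, under the assumption that the part-state semantic vectors are projected and concatenated with the encoder memory so that second-stage queries attend to appearance features and part-state semantics jointly; the actual layer in the paper may fuse the priors differently.

```python
import torch
import torch.nn as nn

class PriorIntegratedCrossAttention(nn.Module):
    """Cross-attention whose key/value set is encoder memory plus prior tokens."""
    def __init__(self, d_model=256, nhead=8, prior_dim=768):
        super().__init__()
        self.prior_proj = nn.Linear(prior_dim, d_model)   # map BERT features to d_model
        self.cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, queries, memory, prior_feats):
        # queries: (B, Q, d_model), memory: (B, HW, d_model), prior_feats: (B, P, prior_dim)
        prior = self.prior_proj(prior_feats)
        kv = torch.cat([memory, prior], dim=1)            # appearance + part-state tokens
        attended, _ = self.cross_attn(queries, kv, kv)
        return self.norm(queries + attended)
```

In a second-stage decoder layer, this module would stand in for the plain cross-attention, with `prior_feats` taken from the part-state embeddings sketched above (gathered per person via pose keypoints, which is assumed here rather than shown).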
2. Related Work
2.1. Object Detection
2.2. Human–Object Interaction Detection
3. Method
3.1. Overview
3.2. Backbone
3.3. Tandem Transformer
3.4. Part State Feature Extraction
3.5. Inference and Loss Function
4. Experiments
4.1. Experimental Setup
4.2. Implementation Details
4.3. Results
4.4. Qualitative Visualization Comparison
4.5. Ablation Study
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Chao, Y.W.; Liu, Y.; Liu, X.; Zeng, H.; Deng, J. Learning to detect human-object interactions. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 381–389. [Google Scholar]
- Zhao, H.; Wildes, R.P. Spatiotemporal feature residual propagation for action prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7003–7012. [Google Scholar]
- Kong, Y.; Tao, Z.; Fu, Y. Deep sequential context networks for action prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1473–1481. [Google Scholar]
- Lin, X.; Ding, C.; Zeng, J.; Tao, D. Gps-net: Graph property sensing network for scene graph generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 3746–3753. [Google Scholar]
- Suhail, M.; Mittal, A.; Siddiquie, B.; Broaddus, C.; Eledath, J.; Medioni, G.; Sigal, L. Energy-based learning for scene graph generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13936–13945. [Google Scholar]
- Ulutan, O.; Iftekhar, A.; Manjunath, B.S. Vsgnet: Spatial attention network for detecting human object interactions using graph convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 13617–13626. [Google Scholar]
- Gao, C.; Zou, Y.; Huang, J.B. ican: Instance-centric attention network for human-object interaction detection. arXiv 2018, arXiv:1808.10437. [Google Scholar]
- Liu, X.; Li, Y.L.; Wu, X.; Tai, Y.W.; Lu, C.; Tang, C.K. Interactiveness field in human-object interactions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 20113–20122. [Google Scholar]
- Wan, B.; Zhou, D.; Liu, Y.; Li, R.; He, X. Pose-aware multi-level feature network for human object interaction detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9469–9478. [Google Scholar]
- Wu, X.; Li, Y.L.; Liu, X.; Zhang, J.; Wu, Y.; Lu, C. Mining cross-person cues for body-part interactiveness learning in hoi detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 121–136. [Google Scholar]
- Wang, X.; Gupta, A. Videos as space-time region graphs. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 399–417. [Google Scholar]
- Liao, Y.; Liu, S.; Wang, F.; Chen, Y.; Qian, C.; Feng, J. Ppdm: Parallel point detection and matching for real-time human-object interaction detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 482–490. [Google Scholar]
- Wang, T.; Yang, T.; Danelljan, M.; Khan, F.S.; Zhang, X.; Sun, J. Learning human-object interaction detection using interaction points. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4116–4125. [Google Scholar]
- Zou, C.; Wang, B.; Hu, Y.; Liu, J.; Wu, Q.; Zhao, Y.; Li, B.; Zhang, C.; Zhang, C.; Wei, Y.; et al. End-to-end human object interaction detection with hoi transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 11825–11834. [Google Scholar]
- Dong, Q.; Tu, Z.; Liao, H.; Zhang, Y.; Mahadevan, V.; Soatto, S. Visual relationship detection using part-and-sum transformers with composite queries. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 3550–3559. [Google Scholar]
- Tamura, M.; Ohashi, H.; Yoshinaga, T. Qpic: Query-based pairwise human-object interaction detection with image-wide contextual information. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 10410–10419. [Google Scholar]
- Kim, B.; Lee, J.; Kang, J.; Kim, E.S.; Kim, H.J. Hotr: End-to-end human-object interaction detection with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 74–83. [Google Scholar]
- Ning, S.; Qiu, L.; Liu, Y.; He, X. Hoiclip: Efficient knowledge transfer for hoi detection with vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 23507–23517. [Google Scholar]
- Zhang, A.; Liao, Y.; Liu, S.; Lu, M.; Wang, Y.; Gao, C.; Li, X. Mining the benefits of two-stage and one-stage hoi detection. Adv. Neural Inf. Process. Syst. 2021, 34, 17209–17220. [Google Scholar]
- Zhang, F.Z.; Campbell, D.; Gould, S. Efficient two-stage detection of human-object interactions with a novel unary-pairwise transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 20104–20112. [Google Scholar]
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229. [Google Scholar]
- Gupta, S.; Malik, J. Visual Semantic Role Labeling. arXiv 2015, arXiv:1505.04474. [Google Scholar]
- Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
- Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
- Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
- Reis, D.; Kupec, J.; Hong, J.; Daoudi, A. Real-time flying object detection with YOLOv8. arXiv 2023, arXiv:2305.09972. [Google Scholar]
- Li, Y.L.; Zhou, S.; Huang, X.; Xu, L.; Ma, Z.; Fang, H.S.; Wang, Y.; Lu, C. Transferable interactiveness knowledge for human-object interaction detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3585–3594. [Google Scholar]
- Su, Z.; Wang, Y.; Xie, Q.; Yu, R. Pose graph parsing network for human-object interaction detection. Neurocomputing 2022, 476, 53–62. [Google Scholar] [CrossRef]
- Su, Z.; Yu, R.; Zou, S.; Guo, B.; Cheng, L. Spatial-Aware Multi-Level Parsing Network for Human-Object Interaction. Int. J. Interact. Multimed. Artif. Intell. 2023, 1–10. [Google Scholar] [CrossRef]
- Li, Y.L.; Xu, L.; Liu, X.; Huang, X.; Xu, Y.; Chen, M.; Ma, Z.; Wang, S.; Fang, H.S.; Lu, C. Hake: Human activity knowledge engine. arXiv 2019, arXiv:1904.06539. [Google Scholar]
- Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. Centernet: Keypoint triplets for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6569–6578. [Google Scholar]
- Zhong, X.; Qu, X.; Ding, C.; Tao, D. Glance and gaze: Inferring action-aware points for one-stage human-object interaction detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13234–13243. [Google Scholar]
- Kim, B.; Choi, T.; Kang, J.; Kim, H.J. Uniondet: Union-level detector towards real-time human-object interaction detection. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 498–514. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5999. [Google Scholar]
- Yan, W.; Sun, Y.; Yue, G.; Zhou, W.; Liu, H. FVIFormer: Flow-guided global-local aggregation transformer network for video inpainting. IEEE J. Emerg. Sel. Top. Circuits Syst. 2024. [Google Scholar] [CrossRef]
- Lu, Y.; Fu, J.; Li, X.; Zhou, W.; Liu, S.; Zhang, X.; Wu, W.; Jia, C.; Liu, Y.; Chen, Z. Rtn: Reinforced transformer network for coronary ct angiography vessel-level image quality assessment. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Vancouver, BC, Canada, 8–12 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 644–653. [Google Scholar]
- Chen, M.; Liao, Y.; Liu, S.; Chen, Z.; Wang, F.; Qian, C. Reformulating hoi detection as adaptive set prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 9004–9013. [Google Scholar]
- Chen, Y.; Wang, Z.; Peng, Y.; Zhang, Z.; Yu, G.; Sun, J. Cascaded pyramid network for multi-person pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7103–7112. [Google Scholar]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
- Kuhn, H.W. The Hungarian method for the assignment problem. Nav. Res. Logist. Q. 1955, 2, 83–97. [Google Scholar] [CrossRef]
- Chen, X.; Fang, H.; Lin, T.Y.; Vedantam, R.; Gupta, S.; Dollár, P.; Zitnick, C.L. Microsoft coco captions: Data collection and evaluation server. arXiv 2015, arXiv:1504.00325. [Google Scholar]
- Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
- Liu, Y.; Chen, Q.; Zisserman, A. Amplifying key cues for human-object-interaction detection. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 248–265. [Google Scholar]
- Li, Y.L.; Liu, X.; Wu, X.; Li, Y.; Lu, C. Hoi analysis: Integrating and decomposing human-object interaction. Adv. Neural Inf. Process. Syst. 2020, 33, 5011–5022. [Google Scholar]
- Iftekhar, A.; Kumar, S.; McEver, R.A.; You, S.; Manjunath, B. Gtnet: Guided transformer network for detecting human-object interactions. arXiv 2021, arXiv:2108.00596. [Google Scholar]
- Liao, Y.; Zhang, A.; Lu, M.; Wang, Y.; Li, X.; Liu, S. Gen-vlkt: Simplify association and enhance interaction understanding for hoi detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 20123–20132. [Google Scholar]
- Yuan, H.; Wang, M.; Ni, D.; Xu, L. Detecting human-object interactions with object-guided cross-modal calibrated semantics. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 22 February–1 March 2022; Volume 36, pp. 3206–3214. [Google Scholar]
- Hou, Z.; Yu, B.; Qiao, Y.; Peng, X.; Tao, D. Affordance transfer learning for human-object interaction detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 495–504. [Google Scholar]
- Shen, L.; Yeung, S.; Hoffman, J.; Mori, G.; Fei-Fei, L. Scaling Human-Object Interaction Recognition Through Zero-Shot Learning. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018. [Google Scholar]
Method | Feature Backbone | Scenario 1 AP (%) | Scenario 2 AP (%)
---|---|---|---
UnionDet [35] | ResNet-50-FPN | 47.5 | 56.2 |
Wang et al. [13] | ResNet-50-FPN | 51.0 | - |
PGPN [30] | ResNet-50-FPN | 50.2 | - |
SMPNet [31] | ResNet-50-FPN | 52.8 | - |
HOI-Trans [14] | ResNet-101-FPN | 52.9 | - |
FCMNet [45] | ResNet-50 | 53.1 | - |
IDN [46] | ResNet-50 | 53.3 | 60.3 |
ASNet [39] | ResNet-50 | 53.9 | - |
GGNet [34] | Hourglass-104 | 54.7 | - |
HOTR [17] | ResNet-50 | 55.2 | 64.4 |
GTNet [47] | ResNet-50 | 56.2 | 60.1 |
QPIC [16] | ResNet-50 | 58.8 | 61.0 |
QPIC [16] | ResNet-101 | 58.3 | 60.7 |
Zhang et al. [20] | ResNet-101 | 60.7 | 66.2 |
Liu et al. [8] | ResNet-50 | 63.0 | 65.2 |
Wu et al. [10] | ResNet-50 | 63.0 | 65.1 |
HOICLIP [18] | ResNet-50 | 63.5 | 64.8 |
GEN-VLKT [48] | ResNet-101 | 63.6 | 65.9
CDN [19] | ResNet-101 | 63.9 | 65.9 |
OCN [49] | ResNet-50 | 64.2 | 66.3 |
Our method | ResNet-50 | 63.8 | 65.5 |
Our method | ResNet-101 | 65.2 | 66.8 |
Method | Full (mAP, %) | Rare (mAP, %) | Non-Rare (mAP, %)
---|---|---|---
UnionDet [35] | 17.58 | 11.72 | 19.33 |
Wang et al. [13] | 19.56 | 12.79 | 21.58 |
FCMNet [45] | 20.41 | 17.34 | 21.56 |
PGPN [30] | 17.40 | 13.84 | 18.45 |
SMPNet [31] | 20.31 | 17.14 | 21.26 |
PPDM [12] | 21.73 | 13.78 | 24.10 |
HOI-Trans [14] | 23.46 | 16.91 | 25.41 |
PST [15] | 23.93 | 14.98 | 26.60 |
HOTR [17] | 25.10 | 17.34 | 27.42 |
IDN [46] | 26.29 | 22.61 | 27.39 |
GTNet [47] | 26.78 | 21.02 | 28.50 |
ATL [50] | 28.53 | 21.64 | 30.59 |
ASNet [39] | 28.87 | 24.25 | 30.25 |
QPIC (ResNet-50) [16] | 29.07 | 21.85 | 31.23 |
QPIC (ResNet-101) [16] | 29.90 | 23.92 | 31.69 |
GGNet [34] | 29.17 | 22.13 | 30.84 |
CDN [19] | 32.07 | 27.19 | 33.53 |
Zhang et al. [20] | 32.31 | 28.55 | 33.44 |
HOICLIP [18] | 34.69 | 31.12 | 35.74
Liu et al. [8] | 33.51 | 30.30 | 34.46 |
OCN(ResNet-50) [49] | 30.91 | 25.56 | 32.51 |
GEN-VLKT [48] | 34.95 | 31.18 | 36.08
Our method (ResNet-50) | 34.47 | 32.43 | 33.78 |
Our method (ResNet-101) | 35.06 | 32.48 | 35.83 |
Method | Scenario 1 AP (%)
---|---
QPIC [16] | 58.8 |
Base model | 62.1 |
Base model + | 63.3 |
Base model + FR + | 63.6 |
Base model + FR + + IS | 64.1 |
Base model + + IS | 64.3 |
Base model + FR + + | 64.8 |
Base model + + + IS | 64.9 |
Our method (Base model + FR + + + IS) | 65.2 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Su, Z.; Yang, H. A Novel Part Refinement Tandem Transformer for Human–Object Interaction Detection. Sensors 2024, 24, 4278. https://doi.org/10.3390/s24134278