Multi-View Visual Relationship Detection with Estimated Depth Map
Abstract
1. Introduction
- To introduce multi-view information into the visual relationship detection task using only a limited set of monocular RGB images, we construct a novel multi-view VRD framework composed of three modules that generate and exploit multi-view images. This framework takes advantage of multi-view information to extract visual features that remain hidden in flat RGB images (a minimal sketch of the resulting pipeline follows this list).
- We improve the visual relationship classifier and its mask matrices into a novel balanced classifier that processes multi-view features and predicts relationships from the information available in the different views.
- Detailed experiments were conducted to demonstrate the effectiveness of the multi-view VRD framework. The comparison of results shows that the proposed framework achieves state-of-the-art generalization ability under specific depth conditions. We also analyze the effects of different multi-view feature normalization strategies.
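As a rough illustration of the three-module pipeline named in the first contribution, the snippet below wires together a view generator driven by an estimated depth map, a shared per-view feature extractor, and a relationship classifier that fuses the per-view features. All module names, tensor shapes, the depth-dependent warping rule, and the averaging fusion are illustrative assumptions rather than the authors' implementation; the predicate count of 70 simply matches the VRD dataset vocabulary.

```python
# Hedged sketch of a three-module multi-view VRD pipeline. Names, shapes, and the
# view-synthesis rule are assumptions for illustration, not the paper's exact code.
import torch
import torch.nn as nn


class MultiViewGenerator(nn.Module):
    """Warps a single RGB image into K views using an estimated depth map."""
    def __init__(self, num_views: int = 3, max_shift: int = 8):
        super().__init__()
        self.num_views = num_views
        self.max_shift = max_shift

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        # rgb: (B, 3, H, W), depth: (B, 1, H, W) with values in [0, 1]
        views = [rgb]
        for k in range(1, self.num_views):
            warped = rgb.clone()
            for b in range(rgb.size(0)):
                # Depth-dependent horizontal shift as a stand-in for view synthesis.
                s = int((depth[b].mean() * self.max_shift * k / self.num_views).item())
                warped[b] = torch.roll(rgb[b], shifts=s, dims=-1)
            views.append(warped)
        return torch.stack(views, dim=1)            # (B, K, 3, H, W)


class ViewFeatureExtractor(nn.Module):
    """Shared CNN backbone applied to every generated view."""
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )

    def forward(self, views: torch.Tensor) -> torch.Tensor:
        b, k, c, h, w = views.shape
        feats = self.backbone(views.view(b * k, c, h, w))
        return feats.view(b, k, -1)                 # (B, K, feat_dim)


class RelationshipClassifier(nn.Module):
    """Fuses per-view features and predicts predicate scores."""
    def __init__(self, feat_dim: int = 128, num_predicates: int = 70):
        super().__init__()
        self.fuse = nn.Linear(feat_dim, feat_dim)
        self.head = nn.Linear(feat_dim, num_predicates)

    def forward(self, view_feats: torch.Tensor) -> torch.Tensor:
        fused = self.fuse(view_feats).mean(dim=1)   # average over the K views
        return self.head(fused)                     # (B, num_predicates)


# Usage: one forward pass on a dummy image and its estimated depth map.
rgb = torch.rand(2, 3, 224, 224)
depth = torch.rand(2, 1, 224, 224)
views = MultiViewGenerator()(rgb, depth)
scores = RelationshipClassifier()(ViewFeatureExtractor()(views))
print(scores.shape)  # torch.Size([2, 70])
```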
2. Related Work
2.1. Object Detection
2.2. Relationship Triplet Learning
2.3. Depth Map Embedding
2.4. Multi-View Framework
2.5. Zero-Shot Learning
3. Methodology
3.1. Multi-View VRD Framework
3.2. Multi-View Features Generator
3.3. Visual Relationship Classifier
4. Experimental Procedure
4.1. Visual Relationship Detection
4.1.1. Dataset
4.1.2. Setup
- Predicate detection (Pre. Det.). The input was an original RGB image and a set of ground-truth object annotations. The task was to extract the possible predicates between the localized object pairs. This setting evaluates the performance of the visual relationship classifier without any influence from the object detector.
- Phrase detection (Phr. Det.). The input was only an original RGB image. The task was to output a visual relationship triplet (subject-predicate-object) together with a single bounding box that contains both the corresponding subject and object. This joint bounding box should have an IoU of at least 0.5 with the ground-truth region.
- Relationship detection (Rel. Det.). The input was only an original RGB image. The task was to output a visual relationship triplet (subject-predicate-object) and to localize the subject and the object separately in the image. Both bounding boxes should have an IoU of at least 0.5 with their ground-truth counterparts (a minimal sketch of the Recall@K computation follows this list).
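The sketch below makes the Recall@K protocol for these settings concrete. The 0.5 IoU threshold and the data layout (per-image lists of scored triplets with subject/object boxes) are standard assumptions, not code from the paper.

```python
# Minimal Recall@K sketch for the predicate/phrase/relationship detection settings.
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)


def recall_at_k(predictions, ground_truths, k=50, iou_thr=0.5):
    """predictions: per image, a list of (score, triplet, sub_box, obj_box).
    ground_truths: per image, a list of (triplet, sub_box, obj_box)."""
    hit, total = 0, 0
    for preds, gts in zip(predictions, ground_truths):
        top_k = sorted(preds, key=lambda p: p[0], reverse=True)[:k]
        total += len(gts)
        for gt_triplet, gt_sub, gt_obj in gts:
            for _, triplet, sub_box, obj_box in top_k:
                if (triplet == gt_triplet
                        and iou(sub_box, gt_sub) >= iou_thr
                        and iou(obj_box, gt_obj) >= iou_thr):
                    hit += 1
                    break  # each ground-truth triplet is matched at most once
    return hit / max(total, 1)
```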
4.1.3. Recall Performance Evaluation and Comparison
4.1.4. Ablation Study
4.2. Image Retrieval on UnRel Dataset
4.2.1. Dataset
4.2.2. Setup
- With GT. The input was a visual triplet described in natural language. The bounding boxes of the subject and object were given by the ground truth. The task was to retrieve images containing the queried visual triplet, ranked by retrieval confidence scores. These confidence scores were produced only by the relationship classifier, without any influence from the object detector.
- With candidates. The input was a visual triplet described in natural language, but the bounding boxes were generated by the object detector, so the retrieval confidence scores were computed by Equation (12). There were three sub-tasks in this setting: mAP-Union computes the overlap between the union of the detected subject and object bounding boxes and the ground-truth union; mAP-Sub computes the overlap between the detected subject bounding box and the ground-truth subject; and mAP-Sub-Obj computes the overlaps of the detected subject and object with their respective ground-truth bounding boxes (a hedged scoring sketch follows this list).
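The sketch below illustrates how a retrieval score and a per-query average precision could be computed in the "with candidates" setting. Since Equation (12) is not reproduced here, combining the detector and classifier confidences by multiplication is an assumption for illustration only.

```python
# Hedged sketch of retrieval scoring and per-query AP for the UnRel experiments.
def retrieval_score(subject_det_score, object_det_score, predicate_score):
    """Assumed combination of detector and classifier confidences (not Equation (12))."""
    return subject_det_score * object_det_score * predicate_score


def average_precision(ranked_relevance):
    """AP over a ranked list of 0/1 relevance labels for one triplet query."""
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)


# Example: images ranked by retrieval_score for one query; relevance is judged by
# the overlap criterion of the sub-task (mAP-Union, mAP-Sub, or mAP-Sub-Obj).
print(average_precision([1, 0, 1, 1, 0]))  # 0.805...
```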
4.2.3. Retrieval Performance Evaluation and Comparison
5. Conclusions
Author Contributions
Funding
Conflicts of Interest
References
1. Lu, C.; Krishna, R.; Bernstein, M.S.; Fei-Fei, L. Visual Relationship Detection with Language Priors. In Lecture Notes in Computer Science, Proceedings of the Computer Vision-ECCV 2016, 14th European Conference, Part I, Amsterdam, The Netherlands, 11–14 October 2016; Springer: New York, NY, USA, 2016; Volume 9905, pp. 852–869.
2. Zhang, H.; Kyaw, Z.; Chang, S.; Chua, T. Visual Translation Embedding Network for Visual Relation Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, IEEE Computer Society, Honolulu, HI, USA, 21–26 July 2017; pp. 3107–3115.
3. Yu, R.; Li, A.; Morariu, V.I.; Davis, L.S. Visual Relationship Detection with Internal and External Linguistic Knowledge Distillation. In Proceedings of the IEEE International Conference on Computer Vision, ICCV 2017, IEEE Computer Society, Venice, Italy, 22–29 October 2017; pp. 1068–1076.
4. Peyre, J.; Laptev, I.; Schmid, C.; Sivic, J. Weakly-Supervised Learning of Visual Relations. In Proceedings of the IEEE International Conference on Computer Vision, ICCV 2017, IEEE Computer Society, Venice, Italy, 22–29 October 2017; pp. 5189–5198.
5. Zhang, H.; Kyaw, Z.; Yu, J.; Chang, S. PPR-FCN: Weakly Supervised Visual Relation Detection via Parallel Pairwise R-FCN. In Proceedings of the IEEE International Conference on Computer Vision, ICCV 2017, IEEE Computer Society, Venice, Italy, 22–29 October 2017; pp. 4243–4251.
6. Liang, X.; Lee, L.; Xing, E.P. Deep Variation-Structured Reinforcement Learning for Visual Relationship and Attribute Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, IEEE Computer Society, Honolulu, HI, USA, 21–26 July 2017; pp. 4408–4417.
7. Liu, X.; Gan, M. RDBN: Visual relationship detection with inaccurate RGB-D images. Knowl. Based Syst. 2020, 204, 106142.
8. Sharifzadeh, S.; Baharlou, S.M.; Berrendorf, M.; Koner, R.; Tresp, V. Improving Visual Relation Detection using Depth Maps. In Proceedings of the 25th International Conference on Pattern Recognition, ICPR 2020, Milan, Italy, 10–15 January 2021; pp. 3597–3604.
9. Thomas, A.; Ferrari, V.; Leibe, B.; Tuytelaars, T.; Schiele, B.; Gool, L.V. Towards Multi-View Object Class Detection. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2006), New York, NY, USA, 17–22 June 2006; IEEE Computer Society: New York, NY, USA, 2006; pp. 1589–1596.
10. Zhou, H.; Liu, A.; Nie, W.; Nie, J. Multi-View Saliency Guided Deep Neural Network for 3-D Object Retrieval and Classification. IEEE Trans. Multim. 2020, 22, 1496–1506.
11. Rubino, C.; Crocco, M.; Bue, A.D. 3D Object Localisation from Multi-View Image Detections. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 1281–1294.
12. Tang, C.; Ling, Y.; Yang, X.; Jin, W.; Zheng, C. Multi-view object detection based on deep learning. Appl. Sci. 2018, 8, 1423.
13. Chen, X.; Ma, H.; Wan, J.; Li, B.; Xia, T. Multi-view 3D Object Detection Network for Autonomous Driving. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, IEEE Computer Society, Honolulu, HI, USA, 21–26 July 2017; pp. 6526–6534.
14. Yu, Y.; Hasan, K.S.; Yu, M.; Zhang, W.; Wang, Z. Knowledge Base Relation Detection via Multi-View Matching. In Communications in Computer and Information Science, Proceedings of the New Trends in Databases and Information Systems-ADBIS 2018 Short Papers and Workshops, AI*QA, BIGPMED, CSACDB, M2U, BigDataMAPS, ISTREND, DC, Budapest, Hungary, 2–5 September 2018; Springer: New York, NY, USA, 2018; Volume 909, pp. 286–294.
15. Zhang, H.; Xu, G.; Liang, X.; Zhang, W.; Sun, X.; Huang, T. Multi-view multitask learning for knowledge base relation detection. Knowl. Based Syst. 2019, 183, 104870.
16. Wang, C.; Fu, H.; Yang, L.; Cao, X. Text Co-Detection in Multi-View Scene. IEEE Trans. Image Process. 2020, 29, 4627–4642.
17. Roy, S.; Shivakumara, P.; Pal, U.; Lu, T.; Kumar, G.H. Delaunay triangulation based text detection from multi-view images of natural scene. Pattern Recognit. Lett. 2020, 129, 92–100.
18. Eigen, D.; Fergus, R. Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-scale Convolutional Architecture. In Proceedings of the 2015 IEEE International Conference on Computer Vision, ICCV 2015, IEEE Computer Society, Santiago, Chile, 7–13 December 2015; pp. 2650–2658.
19. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012, Lake Tahoe, NV, USA, 3–6 December 2012; pp. 1106–1114.
20. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.S.; et al. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. 2015, 115, 211–252.
21. Girshick, R.B.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, IEEE Computer Society, Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
22. Ren, S.; He, K.; Girshick, R.B.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
23. Plummer, B.A.; Mallya, A.; Cervantes, C.M.; Hockenmaier, J.; Lazebnik, S. Phrase Localization and Visual Relationship Detection with Comprehensive Image-Language Cues. In Proceedings of the IEEE International Conference on Computer Vision, ICCV 2017, IEEE Computer Society, Venice, Italy, 22–29 October 2017; pp. 1946–1955.
24. Sadeghi, F.; Divvala, S.K.; Farhadi, A. VisKE: Visual knowledge extraction and question answering by visual verification of relation phrases. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, IEEE Computer Society, Boston, MA, USA, 7–12 June 2015; pp. 1456–1464.
25. Qiu, Y.; Satoh, Y.; Suzuki, R.; Iwata, K.; Kataoka, H. Multi-View Visual Question Answering with Active Viewpoint Selection. Sensors 2020, 20, 2281.
26. Hu, R.; Xu, H.; Rohrbach, M.; Feng, J.; Saenko, K.; Darrell, T. Natural Language Object Retrieval. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, IEEE Computer Society, Las Vegas, NV, USA, 27–30 June 2016; pp. 4555–4564.
27. Johnson, J.; Karpathy, A.; Li, F.-F. DenseCap: Fully Convolutional Localization Networks for Dense Captioning. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, 27–30 June 2016; pp. 4565–4574.
28. Johnson, J.; Krishna, R.; Stark, M.; Li, L.; Shamma, D.A.; Bernstein, M.S.; Fei-Fei, L. Image retrieval using scene graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, IEEE Computer Society, Boston, MA, USA, 7–12 June 2015; pp. 3668–3678.
29. Girshick, R.B. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision, ICCV 2015, IEEE Computer Society, Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
30. Krishna, R.; Zhu, Y.; Groth, O.; Johnson, J.; Hata, K.; Kravitz, J.; Chen, S.; Kalantidis, Y.; Li, L.; Shamma, D.A.; et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. Int. J. Comput. Vis. 2017, 123, 32–73.
31. Fu, Y.; Hospedales, T.M.; Xiang, T.; Gong, S. Transductive Multi-View Zero-Shot Learning. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 2332–2345.
32. Gu, X.; Lin, T.; Kuo, W.; Cui, Y. Zero-Shot Detection via Vision and Language Knowledge Distillation. arXiv 2021, arXiv:2104.13921.
33. Wu, H.; Yan, Y.; Chen, S.; Huang, X.; Wu, Q.; Ng, M.K. Joint Visual and Semantic Optimization for zero-shot learning. Knowl. Based Syst. 2021, 215, 106773.
34. Hwang, S.J.; Ravi, S.N.; Tao, Z.; Kim, H.J.; Collins, M.D.; Singh, V. Tensorize, Factorize and Regularize: Robust Visual Relationship Learning. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Computer Vision Foundation/IEEE Computer Society, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1014–1023.
35. Yin, G.; Sheng, L.; Liu, B.; Yu, N.; Wang, X.; Shao, J.; Loy, C.C. Zoom-Net: Mining Deep Feature Interactions for Visual Relationship Recognition. In Lecture Notes in Computer Science, Proceedings of the Computer Vision-ECCV 2018, 15th European Conference, Part III, Munich, Germany, 8–14 September 2018; Springer: New York, NY, USA, 2018; Volume 11207, pp. 330–347.
36. Bin, Y.; Yang, Y.; Tao, C.; Huang, Z.; Li, J.; Shen, H.T. MR-NET: Exploiting Mutual Relation for Visual Relationship Detection. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, Honolulu, HI, USA, 27 January–1 February 2019; AAAI Press: Palo Alto, CA, USA, 2019; pp. 8110–8117.
37. Zhang, J.; Shih, K.J.; Elgammal, A.; Tao, A.; Catanzaro, B. Graphical Contrastive Losses for Scene Graph Parsing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Computer Vision Foundation/IEEE, Long Beach, CA, USA, 16–20 June 2019; pp. 11535–11543.
38. Zhan, Y.; Yu, J.; Yu, T.; Tao, D. On Exploring Undetermined Relationships for Visual Relationship Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Computer Vision Foundation/IEEE, Long Beach, CA, USA, 16–20 June 2019; pp. 5128–5137.
| Method | Pre. Det. Unseen | Pre. Det. All | Phr. Det. Unseen | Phr. Det. All | Rel. Det. Unseen | Rel. Det. All |
|---|---|---|---|---|---|---|
| VRD-Full [1] | 12.3 | 47.9 | 5.1/5.7 | 16.2/17.0 | 4.8/5.4 | 13.9/14.7 |
| VTransE [2] | - | 44.8 | 2.7/3.5 | 19.4/22.4 | 1.7/2.1 | 14.1/15.2 |
| Weakly [4] | 19.8 | 49.7 | 6.8/7.4 | 16.1/17.7 | 6.2/6.8 | 14.5/15.8 |
| PPR-FCN [5] | - | 47.4 | - | 19.6/23.2 | - | 14.4/15.7 |
| IELKD:S [3] | 17.0 | - | 10.4/10.9 | - | 8.9/9.1 | - |
| IELKD:T [3] | 8.8 | - | 6.5/6.7 | - | 6.1/6.4 | - |
| IELKD:S+T [3] | - | 55.2 | - | 23.1/24.0 | - | 19.2/21.3 |
| DVSRL [6] | - | - | 9.2/10.3 | 21.4/22.6 | 7.9/8.5 | 18.2/20.8 |
| Robust [34] | 17.3 | 52.3 | 5.8/7.1 | 17.4/19.1 | 5.3/6.5 | 15.2/16.8 |
| Zoom-Net [35] | - | 50.7 | - | 24.8/28.1 | - | 18.9/21.4 |
| CAI+SCA-M [35] | - | 56.0 | - | 25.2/28.9 | - | 19.5/22.4 |
| MR-NET [36] | - | 61.2 | - | - | - | 16.7/17.6 |
| RelDN [37] | - | - | - | 31.3/36.4 | - | 25.3/28.6 |
| MF-URLN [38] | 26.9 | 58.2 | 5.9/7.9 | 31.5/36.1 | 4.3/5.5 | 23.9/26.8 |
| MF-URLN-IM [38] | 27.2 | - | 6.2/9.2 | - | 4.5/6.4 | - |
| RDBN [7] | 22.5 | 50.1 | 7.4/8.2 | 16.1/17.8 | 6.8/7.3 | 14.4/15.8 |
| RDBN () [7] | 21.6 | 52.3 | 9.8/11.2 | 17.8/19.6 | 9.5/10.3 | 15.9/17.3 |
| RDBN () [7] | 42.6 | 55.2 | 11.5/11.5 | 19.6/20.6 | 6.6/6.6 | 14.1/15.0 |
| Multi-view[Img] | 22.7 | 50.2 | 7.4/8.2 | 16.1/17.9 | 6.8/7.3 | 14.4/15.9 |
| Multi-view[Img]() | 21.3 | 52.0 | 9.5/10.3 | 17.6/19.4 | 9.2/9.8 | 15.8/17.2 |
| Multi-view[Img]() | 44.3 | 55.8 | 11.5/11.5 | 19.6/20.9 | 6.6/6.6 | 14.1/15.3 |
| Multi-view[Reg] | 22.9 | 50.3 | 7.0/7.9 | 16.1/17.8 | 6.4/7.0 | 14.4/15.8 |
| Multi-view[Reg]() | 22.4 | 52.2 | 9.5/10.6 | 17.8/19.5 | 9.2/10.1 | 15.8/17.2 |
| Multi-view[Reg]() | 42.6 | 55.5 | 11.5/11.5 | 19.6/20.6 | 6.6/6.6 | 14.1/15.3 |
|  | Method | Pre. Det. Unseen | Pre. Det. All | Phr. Det. Unseen | Phr. Det. All | Rel. Det. Unseen | Rel. Det. All |
|---|---|---|---|---|---|---|---|
| a. | RDBN[without BC] [7] | 21.0 | 49.9 | 7.0/7.6 | 16.2/17.7 | 6.4/6.9 | 14.5/15.7 |
| b. | RDBN[without BC]() [7] | 19.3 | 51.7 | 9.2/10.1 | 17.7/19.3 | 8.6/9.5 | 15.8/17.2 |
| c. | RDBN[without BC]() [7] | 39.3 | 54.3 | 9.8/9.8 | 19.3/20.2 | 6.6/6.6 | 14.4/15.0 |
| d. | Multi-view[Img][without BC’] | 22.0 | 50.4 | 7.1/7.6 | 16.3/17.8 | 6.5/6.9 | 14.6/15.8 |
| e. | Multi-view[Img][without BC’]() | 19.8 | 52.0 | 9.2/10.1 | 17.9/19.3 | 8.9/9.5 | 15.9/17.2 |
| f. | Multi-view[Img][without BC’]() | 41.0 | 55.2 | 9.8/9.8 | 19.6/20.2 | 4.9/4.9 | 14.4/14.7 |
| g. | Multi-view[Img][with BC’] | 22.7 | 50.2 | 7.4/8.2 | 16.1/17.9 | 6.8/7.3 | 14.4/15.9 |
| h. | Multi-view[Img][with BC’]() | 21.3 | 52.0 | 9.5/10.3 | 17.6/19.4 | 9.2/9.8 | 15.8/17.2 |
| i. | Multi-view[Img][with BC’]() | 44.3 | 55.8 | 11.5/11.5 | 19.6/20.9 | 6.6/6.6 | 14.1/15.3 |
| j. | Multi-view[Reg][without BC’] | 22.0 | 50.4 | 7.0/7.5 | 16.3/17.8 | 6.4/6.8 | 14.5/15.9 |
| k. | Multi-view[Reg][without BC’]() | 21.0 | 52.2 | 9.2/10.1 | 17.7/19.4 | 8.9/9.5 | 15.7/17.3 |
| l. | Multi-view[Reg][without BC’]() | 39.3 | 54.3 | 9.8/9.8 | 19.9/20.6 | 4.9/4.9 | 14.7/15.0 |
| m. | Multi-view[Reg][with BC’] | 22.9 | 50.3 | 7.0/7.9 | 16.1/17.8 | 6.4/7.0 | 14.4/15.8 |
| n. | Multi-view[Reg][with BC’]() | 22.4 | 52.2 | 9.5/10.6 | 17.8/19.5 | 9.2/10.1 | 15.8/17.2 |
| o. | Multi-view[Reg][with BC’]() | 42.6 | 55.5 | 11.5/11.5 | 19.6/20.6 | 6.6/6.6 | 14.1/15.3 |
| Method | With GT | With Candidates: Union | With Candidates: Sub | With Candidates: Sub-Obj |
|---|---|---|---|---|
| : |  |  |  |  |
| Weakly [4] | 56.8 | 17.0 | 15.9 | 13.4 |
| RDBN [7] | 60.1 | 17.3 | 16.3 | 13.5 |
| Multi-view (with ImageNorm) | 60.7 | 17.3 | 16.3 | 13.5 |
| Multi-view (with RegionNorm) | 60.2 | 17.3 | 16.3 | 13.5 |
| : |  |  |  |  |
| Weakly [4] | 56.8 | 15.2 | 11.0 | 7.7 |
| RDBN [7] | 60.1 | 15.6 | 11.6 | 7.8 |
| Multi-view (with ImageNorm) | 60.7 | 15.6 | 11.6 | 7.8 |
| Multi-view (with RegionNorm) | 60.2 | 15.5 | 11.6 | 7.8 |
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).