Article

Multi-View Visual Relationship Detection with Estimated Depth Map

State Key Laboratory of Intelligent Control and Decision of Complex Systems, School of Automation, Beijing Institute of Technology, Beijing 100081, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(9), 4674; https://doi.org/10.3390/app12094674
Submission received: 9 April 2022 / Revised: 3 May 2022 / Accepted: 4 May 2022 / Published: 6 May 2022

Featured Application

Computer vision, visual scene understanding, and image retrieval.

Abstract

The abundant visual information contained in multi-view images is widely used in computer vision tasks. Existing visual relationship detection frameworks have extended the feature vector to improve model performance. However, single-view information cannot fully reveal the visual relationships in complex visual scenes. To solve this problem and explore multi-view information in a visual relationship detection (VRD) model, a novel multi-view VRD framework based on a monocular RGB image and an estimated depth map is proposed. The contributions of this paper are threefold. First, we construct a novel multi-view framework which fuses information from different views extracted from estimated RGB-D images. Second, a multi-view image generation method is proposed to transfer flat visual space to 3D multi-view space. Third, we redesign the balanced visual relationship classifier so that it can process multi-view feature vectors simultaneously. Detailed experiments were conducted on two datasets to demonstrate the effectiveness of the multi-view VRD framework. The experimental results showed that the multi-view VRD framework achieved state-of-the-art zero-shot learning performance in specific depth conditions.

1. Introduction

Visual relationship detection (VRD) focuses not only on detecting separate objects but also on extracting the interactions between them, which plays a vital role in the understanding of visual scenes. As illustrated in Figure 1, there are different visual relationships between separately detected objects. With the development of object detection, numerous studies on visual relationship detection [1,2,3,4,5,6,7] have been conducted in recent years. Peyre et al. [4] combined RGB appearance features with flat spatial features between detected bounding boxes to improve model performance on visual relationship detection and zero-shot learning. However, in practical scenes, objects are distributed in 3D space rather than on a 2D plane. Thus, it is important to investigate RGB-D visual relationship detection frameworks. Liu et al. [7] proposed a novel VRD framework which extended Peyre et al.’s work [4] from RGB features to RGB-D features and dramatically enhanced the generalization ability of a VRD model in complex visual scenes. Consequently, depth maps corresponding to RGB images are valuable in visual relationship detection because depth information can reveal additional semantic relationships which are not contained in RGB information.
Several papers have introduced depth maps into visual relationship detection to extract the additional visual information they provide about object interactions, such as RDBN [7] and the VG-DepthModel [8]. Once 3D visual information is available, multi-view methods can be utilized to further improve VRD performance. Multi-view methods have been widely used in visual scene understanding tasks, such as classification [9,10], object detection [11,12,13], knowledge-base relation detection [14,15] and image text detection [16,17]. However, to the best of our knowledge, multi-view methods based on RGB-D information have not been researched in visual relationship detection, as this novel task poses several difficulties.
The first challenge is how to generate corresponding multi-view images based only on monocular RGB images. As described in [7], existing datasets for visual relationship detection contain only RGB information. Furthermore, in complex visual scenes, obtaining accurate depth data with an RGB-D camera is also difficult. Thus, we first use a monocular depth estimation model [18] to generate inaccurate depth maps corresponding to the original RGB images. Then, we propose a novel VRD framework, including a multi-view image generation strategy, to transform the estimated RGB-D information into multi-view images which can reveal additional visual semantic relationships.
The second challenge is how to reduce the negative effects of the inaccurate information in estimated depth maps and generated multi-view images. In [7], depth information was mapped into feature vectors before being input to a visual relationship classifier, where RGB and depth information were processed so as to reduce the negative effects of inaccurate information. However, multi-view images generated from estimated depth maps contain even more inaccurate information than the depth maps themselves, because every depth map corresponds to several multi-view images. In this paper, we improve the balanced visual relationship classifier to combine flat RGB images with the corresponding multi-view images.
In summary, there are three main contributions of this paper:
  • To introduce the multi-view information, based only on limited monocular RGB images, into a visual relationship detection task, we construct a novel multi-view VRD framework composed of three modules to generate and utilize multi-view images. This framework can take advantage of multi-view information to extract potential visual features hidden in flat RGB images.
  • We improve the visual relationship classifier and mask matrices in a novel balanced classifier to process multi-view features, and then predict relationships with information in different views.
  • Detailed experiments were conducted to show the effectiveness of the multi-view VRD framework. A comparison of results demonstrated that our proposed novel framework achieved state-of-the-art generalization ability in specific depth conditions. We also analyzed the effects of different multi-view feature normalization strategies.
The rest of this paper is organized as follows: First, related studies on visual relationship detection and multi-view methods are discussed in Section 2. Second, a multi-view framework is constructed and the included modules are elaborated in Section 3. Then, a comparison and analysis of experimental results are conducted in Section 4. Finally, we conclude this paper in Section 5.

2. Related Work

To explore visual relationship detection with multi-view information, in this section we first introduce existing object detection methods and relationship triplet learning frameworks. Second, we analyze depth map embedding methods in the visual relationship research field. Then, multi-view and zero-shot learning methods for visual scene understanding tasks are introduced.

2.1. Object Detection

Object detection is the first step of a VRD framework; thus, the performance of the object detection module dramatically affects the performance of the whole framework. For an image VRD framework, a CNN [19] is the core module for extracting visual features. Because of its strong feature extraction ability, the CNN has been widely utilized in visual recognition tasks, such as the ImageNet challenge [20]. With the appearance of R-CNN [21], the object detection research field improved significantly: R-CNN processes every region of interest (ROI) to classify all object candidates, which greatly improves detection accuracy. Subsequent research has mainly focused on improving the processing speed of object detection frameworks, such as Faster R-CNN [22], which applies the CNN only once to extract the visual features of all candidate regions and thereby reduces the computational cost.
In this paper, we utilize Faster R-CNN as the object detector in the proposed multi-view VRD framework to locate objects in visual scenes and then extract the corresponding visual features.

2.2. Relationship Triplet Learning

A relationship triplet can be represented as subject, predicate, and object, where the subject and object are located by an object detector and the predicate is extracted by a visual relationship classifier. Relationship triplet learning is a core module in visual scene understanding tasks such as visual relationship detection [2,3,5,23], visual question answering [24,25] and image retrieval [26,27,28].
Generally, in a visual relationship detection framework, the first module is an object detector. Lu et al. [1] located objects with R-CNN [21], Peyre et al. [4] regarded Fast R-CNN [29] as the object detector, and Liu et al. [7] utilized Faster R-CNN [22] to generate candidate object pairs.
After obtaining the visual features generated by an object detector, the features need to be processed and then input into a visual relationship classifier to extract the categories of the predicates between the detected object pairs. Lu et al. [1] proposed a novel VRD framework which combined a visual module with a language module to share knowledge among visual relationship classifiers (each classifier corresponding to one kind of predicate). Thus, the relevance of object pairs which had already appeared in the training dataset was enhanced, and VRD accuracy was improved. However, in practical visual scenes, there are numerous combinations of relationship triplets: although the subject, object and predicate may each appear in the training dataset, their combination may not. We call this situation zero-shot learning. To improve zero-shot learning performance, Peyre et al. [4] extracted spatial features by encoding detected bounding box coordinates into spatial vectors and then combined appearance features with these spatial features. Furthermore, ref. [4] proposed the UnRel dataset of unusual visual relationships to evaluate the generalization ability and zero-shot learning performance of a VRD model. To extract more visual information in a VRD task, Liu et al. [7] introduced estimated depth maps alongside the original flat RGB images. In [7], the RDBN framework was proposed to combine RGB information with inaccurate depth features and to balance them when training the visual relationship classifier. Consequently, depth information was able to improve the accuracy and zero-shot learning performance of a VRD framework, especially in specific depth conditions.
In this paper, we extend the visual feature dimensions in the same limited VRD dataset. First, we estimate depth maps corresponding to monocular RGB images by depth estimation deep neural networks. Second, using a novel multi-view strategy, multi-view images are generated based on original RGB images and inaccurate estimated depth maps. Finally, using a proposed novel visual module, multi-view visual feature vectors are extracted and input into a novel relationship classifier which can process multi-view RGB and depth information simultaneously. The results of experiments undertaken showed that the coordinated multi-view information was able to improve the detection accuracy and zero-shot learning performance of a VRD framework.

2.3. Depth Map Embedding

To introduce depth information into original flat RGB images, the first step is monocular depth estimation. Eigen et al. [18] conducted depth estimation, surface normal estimation and semantic labeling simultaneously with a single multi-scale CNN. This depth estimation network was composed of three modules: (1) a full-image view module, which predicted the coarse visual features of the entire input image over the full-image field of view; (2) a predictions module, which produced predictions at mid-level resolution utilizing the full-image information output by the coarse network; and (3) a higher-resolution module, which refined the predictions to a higher resolution and provided the output of the overall network.
Based on [18], Liu et al. [7] were the first to combine RGB information with estimated depth maps in a visual relationship detection task. The RDBN framework [7] not only mapped the generated depth maps into depth appearance feature vectors, which revealed contour features of detected objects, but also utilized a fuzzy strategy to construct depth spatial feature vectors, which revealed the depth location relationships between detected objects. Using an existing RGB-D dataset, ref. [8] trained a mapping function from RGB images to corresponding estimated depth maps and, based on the trained depth estimation model, obtained estimated depth maps for the VG dataset [30]. In contrast to [7], ref. [8] did not consider the negative effects of inaccurate information in the estimated depth maps.
In this paper, we estimate the depth maps corresponding to the VRD dataset [1] with a monocular depth estimation model [18]. Then, multi-view images are generated using a specific strategy for RGB-D information. Finally, the multi-view images are mapped into feature vectors which are regarded as the input to a visual relationship classifier.

2.4. Multi-View Framework

In practical visual scenes, information from one specific view is limited. Therefore, multi-view information can supply more abundant visual features of different perspectives of objects. Multi-view information has been applied in numerous visual tasks, such as object detection [9,11,12,13] and text detection [16,17]. In earlier work, Thomas et al. [9] proposed an object detection model which could detect object instances from arbitrary viewpoints. With the appearance of CNN, further multi-view deep models for object detection were constructed. Chen et al. [13] proposed a framework which was composed of a 3D object proposal generation module and a multi-view features fusion module. Rubino et al. [11] utilized only 2D object detections from multi-view images to recover 3D object positions in generic visual scenes. Tang et al. [12] proposed a multi-view detection approach to improve the ability to extract small objects. To solve the problems of occlusion and perspective distortion, Wang et al. [16] constructed a novel text detection model based on multi-view images to identify co-occurring texts.
In our proposed multi-view visual relationship detection framework, three image views are generated based on original RGB images and estimated depth maps. These multi-view images are mapped into appearance features and spatial features before they are input to a visual relationship classifier which can process different visual components in multi-view images simultaneously.

2.5. Zero-Shot Learning

Generally, zero-shot learning [31] aims to learn knowledge, such as object classes and specific visual relationship combinations, which could be unseen in a training dataset. For an object detection task, Gu et al. [32] proposed a model, trained by a vision and language distillation method, to detect novel objects which were not annotated by bounding boxes or masks. To model the relations between visual and semantic spaces, Wu et al. [33] exploited the common features shared by both visual and semantic spaces to improve the generalization ability of instance classification. In visual relationship detection, for the first time, ref. [1] defined zero-shot learning in VRD. Furthermore, Peyre et al. [4] improved VRD zero-shot learning by introducing flat spatial feature vectors into original RGB features. In RDBN [7], a depth map made its first appearance in visual relationship detection tasks to further enhance zero-shot learning performance.
Previous studies have demonstrated that extending the dimensions of visual features can improve the generalization ability of visual relationship detection models. Similarly, we introduce, for the first time, multi-view information into VRD to exploit visual features which have not previously been explored. Moreover, in this paper, the functions of the different visual components are analyzed to show the effectiveness of the proposed VRD framework.

3. Methodology

In this section, we elaborate the proposed novel VRD framework, in which multi-view information is introduced into the original RGB images to utilize extended-dimensional visual features. First, we construct the multi-view VRD framework and introduce its main modules. Second, we propose the method for generating multi-view images based on estimated RGB-D information; moreover, the different visual components of the feature vectors are analyzed and fused. Third, an improved visual relationship classifier is proposed to process the different visual components simultaneously.

3.1. Multi-View VRD Framework

To utilize the multi-view information in visual relationship detection, based only on monocular RGB input images, we constructed a multi-view VRD framework composed of three modules. An overview of the proposed framework is illustrated in Figure 2.
The first module is the RGB-D object detector. Following the settings of [7], this module includes a depth estimation network [18] and two Faster R-CNN models [22] with different parameters. These deep neural networks are all trained on the VRD dataset [1]. First, the original monocular RGB image is input into the depth estimation network to generate the depth map. Then, the monocular RGB image and the corresponding depth map are input into the RGB and depth Faster R-CNN, respectively, to detect candidate object pairs in the visual scene. Furthermore, we obtain the RGB and depth appearance feature vectors from the fully connected layers of the Faster R-CNNs.
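As a rough illustration of this module, the following Python sketch uses torchvision rather than the exact detectors trained in [7]: it runs a Faster R-CNN on one RGB image and captures per-proposal appearance features with a forward hook, while the depth branch would repeat the same steps on the estimated depth map. The choice of backbone, the hook-based feature extraction, and the 1024-dimensional box head (instead of the 4096-dimensional VGG fc7 used in this paper) are assumptions of the sketch.

```python
# A minimal sketch (not the detectors of [7]): detect objects in an RGB image
# with torchvision's Faster R-CNN and capture per-proposal box-head features,
# which play the role of the fc7 appearance features described in the text.
import torch
import torchvision

# weights=None keeps the sketch offline; load pretrained weights in practice.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights=None)
model.eval()

box_features = []  # the hook stores one (num_proposals, 1024) tensor per call
model.roi_heads.box_head.register_forward_hook(
    lambda module, inputs, output: box_features.append(output.detach())
)

image = torch.rand(3, 480, 640)          # stand-in for a monocular RGB image
with torch.no_grad():
    detections = model([image])[0]       # dict with "boxes", "labels", "scores"

# Note: the hook returns features for all region proposals; in a full pipeline
# they still have to be matched to the detections kept after post-processing.
print(detections["boxes"].shape, box_features[0].shape)
```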
The second module is a multi-view features generator. A function of this module is to generate multi-view images according to a monocular RGB image, corresponding depth map and object bounding boxes. Moreover, after obtaining multi-view images based on specific proposed strategies, flat and depth spatial feature vectors in different views can be generated by a spatial feature extractor in this module. The authors of [4] introduced a flat spatial feature into monocular RGB information to extract the potential semantic relationships in visual features. After that, RDBN [7] combined a flat visual feature with estimated depth information to further extend the dimensions of feature vectors. For our proposed multi-view VRD framework, visual features in different views, which include flat visual features and depth visual features, are fused together to utilize more comprehensive visual information.
The third module is a visual relationship classifier. The final visual feature vectors, composed of appearance features and spatial features generated by the second module, are input into a visual relationship classifier to extract the visual relationship triplets. Based on the balanced classifier in RDBN [7], we construct an improved classifier which can process multi-view visual features simultaneously.
The settings of the first module are the same as RDBN [7]; thus, we elaborate the remaining two modules in Section 3.2 and Section 3.3, respectively.

3.2. Multi-View Features Generator

Since we have already obtained the candidate object bounding boxes in the RGB image and the corresponding depth values in the estimated depth map, the locations in 3D visual space can be represented. As illustrated in Figure 3, subject A and object B are the detected object candidates; the front view image represents the view of the original monocular RGB image.
To simplify the visual feature extraction, $w$ and $h$ are calculated from the centers of the bounding boxes of A and B, and Depth is the difference in depth values between A and B. The depth value of a detected object can be represented as:
$$\mathrm{Depth}_{n} = \frac{\sum_{i=1}^{w_n} \sum_{j=1}^{h_n} \mathrm{GrayValue}_{i,j}}{w_n h_n} \qquad (1)$$
where $w_n$ and $h_n$ are the width and height of the $n$th detected object bounding box, and $\mathrm{GrayValue}_{i,j}$ is the gray value of the point in the $i$th column and $j$th row of the corresponding depth map.
In the front view, the multi-view spatial relationship between A and B can be represented as $(w, h, \mathrm{Depth})$. Thus, in the left view, the multi-view spatial relationship is $(\mathrm{Depth}, h, w)$; note that, intuitively, $w$ plays the role of the "depth difference" in the left view. Similarly, in the top view, the multi-view spatial relationship is $(w, \mathrm{Depth}, h)$, where $h$ plays the role of the "depth difference" in the top view. Because the multi-view images are generated from a single 3D coordinate system, $w$, $h$ and $\mathrm{Depth}$ take the same values in the different views.
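As a minimal numerical sketch of Equation (1) and the three view tuples, the snippet below computes the mean gray value inside each bounding box of an estimated depth map and permutes $(w, h, \mathrm{Depth})$ into the front, left and top views. The signed center differences and the toy boxes are assumptions for illustration only.

```python
# Minimal sketch of Equation (1) and the three view tuples; the depth map is
# assumed to be a uint8 array whose gray values (0-255) encode estimated depth.
import numpy as np

def mean_depth(depth_map, box):
    """Mean gray value inside a bounding box (x, y, w, h) with top-left corner (x, y)."""
    x, y, w, h = box
    return float(depth_map[y:y + h, x:x + w].mean())

depth_map = (np.random.rand(480, 640) * 255).astype(np.uint8)   # stand-in estimate
box_a, box_b = (50, 80, 120, 200), (300, 100, 150, 220)         # subject A, object B

# Center-to-center differences; whether they are signed or absolute is not fixed
# by the text, so signed values are assumed here.
cx_a, cy_a = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
cx_b, cy_b = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
w, h = cx_a - cx_b, cy_a - cy_b
depth = mean_depth(depth_map, box_a) - mean_depth(depth_map, box_b)

front_view = (w, h, depth)   # the third entry is the 'depth difference' per view
left_view = (depth, h, w)
top_view = (w, depth, h)
```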
Based on the multi-view spatial relationships in the three kinds of view images, we can intuitively utilize a fuzzy strategy to map the depth difference between detected objects in each multi-view image into three feature vectors (each vector being 3-dimensional). For every multi-view image, the third term of the corresponding multi-view spatial relationship can be regarded as the depth difference between the objects. Thus, we process the last term of the multi-view spatial relationship with the following fuzzy membership function:
$$S_{Depth}^{X}\;(X \in \{F, L, T\}) = \begin{cases} (1,\, 0,\, 0) & D_X < -b \\ \left(-\dfrac{D_X}{b},\, \dfrac{D_X}{b} + 1,\, 0\right) & -b \le D_X < 0 \\ \left(0,\, -\dfrac{D_X}{b} + 1,\, \dfrac{D_X}{b}\right) & 0 \le D_X < b \\ (0,\, 0,\, 1) & D_X \ge b \end{cases} \qquad (2)$$
where $S_{Depth}^{X}$ is the depth spatial feature of the $X$ view image ($X \in \{F, L, T\}$), $D_X$ is the third term of the multi-view spatial relationship in the $X$ view image, and $b$ is the boundary point which distinguishes the three kinds of depth relationship (near, middle, far). Clearly, each dimension of $S_{Depth}^{X}$ represents the possibility that the depth difference in the $X$ view image belongs to one kind of depth relationship (near, middle, far). Because $w$ and $h$ are generated from the original RGB image but Depth is calculated from an estimated depth map, $S_{Depth}^{F}$ is not completely accurate, whereas $S_{Depth}^{L}$ and $S_{Depth}^{T}$ are accurate.
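A direct reading of Equation (2) is sketched below; the negative boundary $-b$ and the piecewise slopes are reconstructed from the continuity of the membership values, and the boundary value $b$ used in the example is purely illustrative.

```python
# Fuzzy membership of Equation (2): map a view's depth difference D_X to a 3-d
# vector of memberships over the three depth relationship classes.
def depth_membership(d_x, b):
    if d_x < -b:
        return (1.0, 0.0, 0.0)
    if d_x < 0:
        return (-d_x / b, d_x / b + 1.0, 0.0)
    if d_x < b:
        return (0.0, -d_x / b + 1.0, d_x / b)
    return (0.0, 0.0, 1.0)

# Example: normalized depth differences of the left, top and front views are
# concatenated into the 9-d multi-view depth spatial feature S_Depth.
d_l, d_t, d_f = 30.0, -120.0, 60.0       # illustrative values after normalization
s_depth = [m for d in (d_l, d_t, d_f) for m in depth_membership(d, b=85.0)]
```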
However, without normalization, the value ranges of the third term of the multi-view spatial relationship vary across the view images. For example, the range of Depth is from 0 to 255, but the ranges of $w$ and $h$ depend on the size of the original image. Thus, we need to rescale and normalize the terms which represent the depth information in the different views. Two normalization strategies are proposed, as follows:
ImageNorm: $$D_F = \mathrm{Depth}, \quad D_L = \frac{w}{\mathrm{ImgWidth}} \times \mathrm{MaxDepth}, \quad D_T = \frac{h}{\mathrm{ImgHeight}} \times \mathrm{MaxDepth} \qquad (3)$$
RegionNorm: $$D_F = \mathrm{Depth}, \quad D_L = \frac{w}{W_{R(B_{sub} \cup B_{obj})}} \times \mathrm{MaxDepth}, \quad D_T = \frac{h}{H_{R(B_{sub} \cup B_{obj})}} \times \mathrm{MaxDepth} \qquad (4)$$
where $\mathrm{MaxDepth} = 256$, ImgWidth and ImgHeight are the width and height of the original image, and $W_{R(B_{sub} \cup B_{obj})}$ and $H_{R(B_{sub} \cup B_{obj})}$ are the width and height of the region which exactly contains both the detected subject bounding box and the object bounding box. We call Equation (3) ImageNorm and Equation (4) RegionNorm. These two proposed normalization methods map the value range of $D_X$ ($X \in \{F, L, T\}$) into the gray value range of the depth map, so that $D_X$ can be processed by a single fuzzy membership function (Equation (2)). Note that the two normalization strategies lead to different model performance; the comparison is detailed in the experiment section. The multi-view depth spatial feature vector can be represented as:
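The two strategies can be sketched as follows; treating $R(B_{sub} \cup B_{obj})$ as the tightest box enclosing both detections is an assumption of the sketch, and the example values are illustrative.

```python
# Sketch of the ImageNorm (Equation (3)) and RegionNorm (Equation (4)) strategies.
MAX_DEPTH = 256.0

def image_norm(depth, w, h, img_width, img_height):
    # Normalize w and h by the full image size.
    return depth, w / img_width * MAX_DEPTH, h / img_height * MAX_DEPTH

def region_norm(depth, w, h, box_sub, box_obj):
    # Normalize w and h by the tightest region enclosing both (x, y, w, h) boxes.
    x1 = min(box_sub[0], box_obj[0])
    y1 = min(box_sub[1], box_obj[1])
    x2 = max(box_sub[0] + box_sub[2], box_obj[0] + box_obj[2])
    y2 = max(box_sub[1] + box_sub[3], box_obj[1] + box_obj[3])
    w_r, h_r = x2 - x1, y2 - y1
    return depth, w / w_r * MAX_DEPTH, h / h_r * MAX_DEPTH

# D_F, D_L, D_T under the two strategies for the same object pair.
d_f, d_l, d_t = image_norm(40.0, 180.0, 90.0, img_width=640, img_height=480)
d_f, d_l, d_t = region_norm(40.0, 180.0, 90.0, (50, 80, 120, 200), (300, 100, 150, 220))
```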
$$S_{Depth} = \left[ S_{Depth}^{L},\; S_{Depth}^{T},\; S_{Depth}^{F} \right] \qquad (5)$$
where $S_{Depth}$ is a 9-dimensional feature vector representing the multi-view depth features.
Having obtained the multi-view depth spatial feature vectors between the detected objects, as illustrated in Figure 4, the remaining part of the spatial feature is the RGB flat spatial feature. Following the strategies for flat spatial feature extraction based on bounding boxes in [4,7], the RGB flat spatial feature vector can be represented as:
$$S_{RGB}(B_s, B_o) = \left[ \underbrace{\frac{x_o - x_s}{\sqrt{w_s h_s}}}_{S_1},\; \underbrace{\frac{y_o - y_s}{\sqrt{w_s h_s}}}_{S_2},\; \underbrace{\sqrt{\frac{w_o h_o}{w_s h_s}}}_{S_3},\; \underbrace{\frac{B_s \cap B_o}{B_s \cup B_o}}_{S_4},\; \underbrace{\frac{w_s}{h_s}}_{S_5},\; \underbrace{\frac{w_o}{h_o}}_{S_6} \right] \qquad (6)$$
where $B_s = (x_s, y_s, w_s, h_s)$ and $B_o = (x_o, y_o, w_o, h_o)$ are the bounding boxes of a detected object pair, $(x, y)$ is the center of a bounding box, and $w$ and $h$ are its width and height, respectively. Then, a $k$-order Gaussian mixture model (GMM) is utilized to extend the dimensions of the RGB flat spatial feature vector. In the experiments, we set $k = 200$ to obtain a 200-dimensional RGB flat spatial feature vector. The RGB flat spatial feature $S_{RGB}$ can now be combined with the depth spatial feature $S_{Depth}$ to construct the "spatial feature" in Figure 4.
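A possible implementation of the 6-dimensional spatial feature and its GMM-based expansion is sketched below. The spatial terms follow the reconstruction of Equation (6) above, encoding each feature by its $k$ posterior responsibilities is only one plausible reading of the "$k$-order GMM" step, and the training data are random stand-ins.

```python
# Sketch of the flat spatial feature (Equation (6)) and its expansion to k = 200
# dimensions with a Gaussian mixture; boxes are (center x, center y, width, height).
import numpy as np
from sklearn.mixture import GaussianMixture

def box_iou(bs, bo):
    xs, ys, ws, hs = bs
    xo, yo, wo, ho = bo
    ix = max(0.0, min(xs + ws / 2, xo + wo / 2) - max(xs - ws / 2, xo - wo / 2))
    iy = max(0.0, min(ys + hs / 2, yo + ho / 2) - max(ys - hs / 2, yo - ho / 2))
    inter = ix * iy
    return inter / (ws * hs + wo * ho - inter)

def flat_spatial_feature(bs, bo):
    xs, ys, ws, hs = bs
    xo, yo, wo, ho = bo
    return np.array([
        (xo - xs) / np.sqrt(ws * hs),      # S1: normalized horizontal offset
        (yo - ys) / np.sqrt(ws * hs),      # S2: normalized vertical offset
        np.sqrt(wo * ho / (ws * hs)),      # S3: relative box scale
        box_iou(bs, bo),                   # S4: overlap of the two boxes
        ws / hs,                           # S5: subject aspect ratio
        wo / ho,                           # S6: object aspect ratio
    ])

# Fit a 200-component GMM on (stand-in) training spatial features, then encode
# each 6-d feature as its 200-d vector of posterior responsibilities.
train_feats = np.random.rand(5000, 6)
gmm = GaussianMixture(n_components=200, covariance_type="diag", random_state=0)
gmm.fit(train_feats)
s_rgb = gmm.predict_proba(
    flat_spatial_feature((100, 120, 80, 60), (180, 150, 90, 70)).reshape(1, -1))[0]
```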
For the “appearance feature” in Figure 4, the depth appearance feature $A_{Depth}$ and the RGB appearance feature $A_{RGB}$ are extracted from the fc7 layer of the corresponding Faster R-CNN [22] in the first module of Figure 2. Note that principal component analysis (PCA) is applied to reduce the dimension of the L2-normalized fc7 features from 4096 to 300. Thus, the depth appearance feature $A_{Depth}$ and the RGB appearance feature $A_{RGB}$ are both 600-dimensional vectors (each detected object corresponds to a 300-dimensional vector).
In conclusion, after processing by the multi-view features generator module, the visual features are mapped into a final combined feature vector which contains the RGB flat spatial feature $S_{RGB}$, the depth spatial feature $S_{Depth}$, the depth appearance feature $A_{Depth}$ and the RGB appearance feature $A_{RGB}$. The next step is to input the final combined feature, a 1409-dimensional vector, into the visual relationship classifier to predict the predicates and visual triplets.
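A sketch of how the 1409-dimensional combined vector could be assembled is given below; the PCA and L2 normalization follow the text, while the random arrays, the per-branch PCA fitting and the helper names are assumptions for illustration.

```python
# Sketch of assembling the 1409-d combined feature: L2-normalize the 4096-d fc7
# features, reduce them to 300-d with PCA, and concatenate with the spatial parts.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize

fc7_rgb = np.random.rand(2000, 4096)     # stand-in fc7 features, RGB branch
fc7_depth = np.random.rand(2000, 4096)   # stand-in fc7 features, depth branch

pca_rgb = PCA(n_components=300).fit(normalize(fc7_rgb))
pca_depth = PCA(n_components=300).fit(normalize(fc7_depth))

def appearance_pair(pca, fc7, subj_idx, obj_idx):
    # 600-d appearance feature: a 300-d PCA code for the subject and the object.
    return pca.transform(normalize(fc7[[subj_idx, obj_idx]])).ravel()

s_rgb = np.random.rand(200)              # GMM-encoded flat spatial feature
s_depth = np.random.rand(9)              # multi-view depth spatial feature
x = np.concatenate([s_rgb, s_depth,
                    appearance_pair(pca_depth, fc7_depth, 0, 1),
                    appearance_pair(pca_rgb, fc7_rgb, 0, 1)])
assert x.shape == (1409,)                # 200 + 9 + 600 + 600
```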

3.3. Visual Relationship Classifier

Based on the final combined feature vectors output by the multi-view features generator, the visual relationship classifier extracts the predicates between the object pairs corresponding to the input feature vectors. Because estimated depth maps are utilized, there is inaccurate information in the final features. The RGB information ($S_{RGB}$ and $A_{RGB}$) is accurate, as it is based on the original input images; however, the depth information ($S_{Depth}$ and $A_{Depth}$) is not completely accurate. Note that $S_{Depth}$ is relatively accurate because of the fuzzy processing. Compared with RDBN [7], these final combined visual features contain different information from the multi-view images; thus, we need an improved balanced relationship classifier with novel mask matrices to separately train the different visual components of the final feature vectors.
For the visual relationship classifier $W = [w_1, \ldots, w_P]$, every sub-classifier corresponds to one kind of predicate. $X$ is the matrix of final combined feature vectors; thus, $XW$ is the matrix of predicate classification results. To process the multi-view features, we utilize the loss function of [7] with novel mask matrices. The loss function can be represented as:
$$L = \frac{1 - \alpha}{N} \left\| Y - X_A W \right\|_F^2 + \frac{\alpha}{N} \left\| Y - X_B W \right\|_F^2 + \lambda \left\| W \right\|_F^2 \qquad (7)$$
where $N$ is the number of training samples, $Y$ is the matrix of ground truth labels with entries in $\{0, 1\}$, $X = [x_1, \ldots, x_N]^{T}$ is the matrix of final feature vectors in which $x_n$ corresponds to the $n$th detected object pair, and $\lambda$ is the coefficient of the regularization term. The remaining parameters are related to the mask matrices $A$ and $B$.
To train the visual relationship classifier with separated multi-view visual components, we redesign the mask matrices $A$ and $B$, which are illustrated in Figure 5. Masks $A$ and $B$ are both composed of four diagonal matrices and can be represented as:
$$A = \mathrm{Diag}\left( I_{S_{RGB}},\; I_{S_{Depth}^{X}}\,(X \in \{L, T, F\}),\; I_{A_{Depth}},\; O_{A_{RGB}} \right) \qquad (8)$$
$$B = \mathrm{Diag}\left( I_{S_{RGB}},\; I_{S_{Depth}^{L}},\; I_{S_{Depth}^{T}},\; O_{S_{Depth}^{F}},\; O_{A_{Depth}},\; I_{A_{RGB}} \right) \qquad (9)$$
Thus, $X_A$ and $X_B$ in Equation (7) can be represented as:
$$X_A = \left[ S_{RGB},\, S_{Depth}^{L},\, S_{Depth}^{T},\, S_{Depth}^{F},\, A_{Depth},\, A_{RGB} \right] \cdot \mathrm{Diag}\left( I_{S_{RGB}},\; I_{S_{Depth}^{X}}\,(X \in \{L, T, F\}),\; I_{A_{Depth}},\; O_{A_{RGB}} \right) \qquad (10)$$
$$X_B = \left[ S_{RGB},\, S_{Depth}^{L},\, S_{Depth}^{T},\, S_{Depth}^{F},\, A_{Depth},\, A_{RGB} \right] \cdot \mathrm{Diag}\left( I_{S_{RGB}},\; I_{S_{Depth}^{L}},\; I_{S_{Depth}^{T}},\; O_{S_{Depth}^{F}},\; O_{A_{Depth}},\; I_{A_{RGB}} \right) \qquad (11)$$
The visual components of the final combined feature which correspond to the zero blocks of a mask are eliminated. In particular, in mask $B$, the last three terms of the sub-diagonal matrix corresponding to $S_{Depth}$ are set to zero. Intuitively, this is because, in $S_{Depth} = [S_{Depth}^{L}, S_{Depth}^{T}, S_{Depth}^{F}]$, $S_{Depth}^{L}$ and $S_{Depth}^{T}$ are more accurate than $S_{Depth}^{F}$, which is based on an estimated depth map. Thus, in mask $B$, $S_{Depth}^{F}$ and $A_{Depth}$ are both inaccurate and are eliminated together. In Equation (7), together with the mask matrices, $\alpha$ balances the training weights of the depth information and the RGB information.
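The masks and the loss of Equations (7)-(11) can be sketched directly over the 1409-dimensional feature layout; the block sizes follow the text, while the random data and the regularization value are illustrative assumptions.

```python
# Sketch of the balanced training objective: build diagonal masks A and B over
# the feature layout [S_RGB(200), S_Depth^L(3), S_Depth^T(3), S_Depth^F(3),
# A_Depth(600), A_RGB(600)] and evaluate the masked loss of Equation (7).
import numpy as np

dims = {"s_rgb": 200, "s_d_l": 3, "s_d_t": 3, "s_d_f": 3, "a_depth": 600, "a_rgb": 600}

def diag_mask(active):
    # Ones on the diagonal for kept components, zeros for eliminated ones.
    return np.diag(np.concatenate([np.full(d, 1.0 if name in active else 0.0)
                                   for name, d in dims.items()]))

A = diag_mask({"s_rgb", "s_d_l", "s_d_t", "s_d_f", "a_depth"})   # Eq. (8): drops A_RGB
B = diag_mask({"s_rgb", "s_d_l", "s_d_t", "a_rgb"})              # Eq. (9): drops S_Depth^F, A_Depth

def balanced_loss(X, Y, W, alpha=0.1, lam=1e-3):                 # lam is illustrative
    n = X.shape[0]
    return ((1 - alpha) / n * np.linalg.norm(Y - X @ A @ W, "fro") ** 2
            + alpha / n * np.linalg.norm(Y - X @ B @ W, "fro") ** 2
            + lam * np.linalg.norm(W, "fro") ** 2)

X = np.random.rand(32, 1409)              # combined features of 32 object pairs
Y = (np.random.rand(32, 70) > 0.9) * 1.0  # ground-truth labels over 70 predicates
W = np.zeros((1409, 70))                  # classifier weights to be optimized
print(balanced_loss(X, Y, W))
```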
The visual triplet score, which evaluates the confidence of a detected visual relationship, can be constructed as:
$$\mathrm{TriScore}(B_s, B_o) = \mathrm{RelClassifier}(B_s, B_o) + \theta \cdot \mathrm{ObjDetector}(B_s, B_o) = X \cdot W_{trained} + \theta \cdot \left( D_{sub}(B_s) + D_{obj}(B_o) \right) \qquad (12)$$
Note that $\alpha = 0.1$ in Equation (7) and $\theta = 0.1$ in Equation (12) are optimized by grid search on the validation set of the VRD dataset [1].
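The triplet score of Equation (12) then reduces to one line; the detector scores and dimensions below are illustrative.

```python
# Sketch of Equation (12): classifier response for a candidate pair plus the
# detector confidences of the subject and object boxes, weighted by theta.
import numpy as np

def tri_score(x_pair, w_trained, d_sub, d_obj, theta=0.1):
    # x_pair: 1409-d combined feature; w_trained: (1409, P) trained classifier.
    return x_pair @ w_trained + theta * (d_sub + d_obj)

scores = tri_score(np.random.rand(1409), np.random.rand(1409, 70), d_sub=0.92, d_obj=0.87)
best_predicate = int(np.argmax(scores))
```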

4. Experimental Procedure

In this section, the performance of existing visual relationship detection frameworks on two datasets is compared and analyzed. First, we evaluated the visual relationship recall performance on the VRD dataset [1] to demonstrate the effectiveness of the proposed multi-view VRD framework. Then, we conducted an ablation study to further demonstrate the effectiveness of the multi-view information and the proposed balanced classifier. Finally, we applied the proposed multi-view VRD framework to an image retrieval task on the UnRel dataset [4].

4.1. Visual Relationship Detection

4.1.1. Dataset

The VRD dataset [1] is widely used for performance evaluation of visual relationship detection frameworks. This dataset includes 4000 training images and 1000 test images, in which the locations of the bounding boxes, the categories of objects and the categories of predicates are annotated. The VRD dataset [1] contains 100 kinds of objects, 70 kinds of predicates, and 6672 different kinds of visual relationship triplet annotations. Furthermore, there are 1877 specific visual relationship triplets which never occur in the training set, used to evaluate the zero-shot learning performance and generalization ability of visual relationship detection models.

4.1.2. Setup

Based on the recall evaluation setups in [4,7], the performance of the proposed multi-view VRD framework was evaluated in terms of recall@50 and recall@100 (recall@x is the fraction of times the correct relationships are extracted among the top x most confident relationship extractions; a minimal sketch of this computation is given after the task list below). There were three visual relationship detection tasks in the recall evaluation:
  • Predicate detection (Pre. Det.). The input was an original RGB image and a set of ground truth objects annotations. Our task was to extract the possible predicates between the located object pairs. This task condition was to evaluate the visual relationship classifier performance without effects of the object detector.
  • Phrase detection (Phr. Det.). The input was only an original RGB image. Our task was to output a visual relationship triplet (subject-predicate-object) and the entire bounding box which contained the corresponding subject and object simultaneously. This entire bounding box should have an IoU of at least 0.5 with the ground truth region.
  • Relationship detection (Rel. Det.). The input was only an original RGB image. Our task was to output a visual relationship triplet (subject-predicate-object) and to locate the subject and object, respectively, in the image. These two bounding boxes should each have an IoU of at least 0.3 with the corresponding ground truth bounding box.
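A minimal recall@x computation, under the assumption that predictions are already sorted by confidence and matched triplets can be compared for equality, is sketched below.

```python
# Minimal recall@x sketch: per-image predictions sorted by confidence are compared
# with ground-truth triplets; the IoU matching of each task is assumed to be done
# upstream, so triplets are compared here by simple equality.
def recall_at_x(predictions_per_image, ground_truth_per_image, x):
    hit, total = 0, 0
    for preds, gts in zip(predictions_per_image, ground_truth_per_image):
        top = preds[:x]                     # top-x most confident predictions
        hit += sum(1 for gt in gts if gt in top)
        total += len(gts)
    return hit / total if total else 0.0

# Toy usage: one image, two ground-truth triplets, one recovered in the top 50.
preds = [[("person", "ride", "horse"), ("person", "wear", "hat")]]
gts = [[("person", "ride", "horse"), ("hat", "on", "person")]]
print(recall_at_x(preds, gts, 50))          # 0.5
```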
The training parameters of all the Faster R-CNN models in the RGB-D object detector module were as follows: the initial learning rate was 0.001, the optimizer was SGD, the batch size was 256, the number of iterations of the RGB Faster R-CNN was 4 × 100,000, and the number of iterations of the depth Faster R-CNN was 4 × 90,000.

4.1.3. Recall Performance Evaluation and Comparison

In this section, we compare the multi-view VRD framework with existing strong baselines to demonstrate the effectiveness of multi-view information in a visual relationship detection framework. Two evaluation conditions were formulated for recall performance testing, corresponding to the two normalization strategies (ImageNorm in Equation (3) and RegionNorm in Equation (4)) for the multi-view depth spatial features. The recall performance of the different VRD frameworks is shown in Table 1.
First, we analyzed the performance of the multi-view VRD framework with the ImageNorm normalization strategy (Multi-view[Img]). Compared with RDBN [7], on which the multi-view framework is based, the proposed framework with multi-view information performed better in most of the recall evaluation tasks. In particular, for the zero-shot learning task (“Unseen” column) of predicate detection, the multi-view framework achieved state-of-the-art performance compared with existing VRD methods, especially in the specific depth condition ΔDepth > 100 (from 42.6% for RDBN to 44.3% for the multi-view framework). This showed that the multi-view information generated from the original RGB images and the corresponding estimated depth maps was able to enhance the generalization ability of the visual relationship detection framework. Intuitively, in the depth condition ΔDepth > 100, the subject is far from the object; thus, the estimated depth maps can provide relatively accurate depth information. Furthermore, with multi-view visual features generated from RGB-D images, the dimensions of the visual feature vectors were extended compared with single-view feature vectors. The performance comparison showed that feature vectors with more abundant visual information can improve the detection accuracy and generalization ability of the VRD model.
Then, we analyzed the performance of the multi-view VRD framework with the RegionNorm normalization strategy (Multi-view[Reg]). We found that, compared with the ImageNorm-based framework, the multi-view VRD framework with RegionNorm was better at visual relationship detection when 10 < ΔDepth < 30. For example, compared with the RDBN baseline on zero-shot learning in the predicate detection task, the multi-view model improved performance from 21.6% to 22.4%. Over all depth ranges, compared with RDBN, on which the multi-view model is based, the multi-view framework enhanced zero-shot learning performance in the predicate detection task from 22.5% to 22.9%. Intuitively, in the depth range 10 < ΔDepth < 30, the subject is close to the object; thus, in this depth condition, the RegionNorm strategy was better suited to the spatial relationship between the object pairs.
In conclusion, the multi-view visual relationship detection framework with more abundant visual information was able to improve the detection accuracy and generalization ability of the VRD model.

4.1.4. Ablation Study

In this section, we further demonstrate the effectiveness of the proposed multi-view VRD framework by means of an ablation study for multi-view visual information and the novel visual relationship balanced classifier. The performance in different conditions is shown in Table 2.
First, by comparing the performance of rows a, b and c with rows d, e and f, it is observed that the multi-view VRD framework outperformed RDBN [7] when both VRD frameworks used a traditional predicate classifier (without the balanced classifier). Thus, we can conclude that the multi-view information generated from monocular RGB images was able to enhance the VRD framework performance. Furthermore, in the ΔDepth > 100 predicate detection task, the zero-shot learning performance of the multi-view VRD framework increased from 39.3% to 41.0% compared with RDBN.
By comparing the performance of rows d, e and f with rows g, h and i, it is evident that the recall performance of the multi-view VRD framework can be further improved after applying the novel visual relationship balanced classifier and mask matrices. Specifically, in the ΔDepth > 100 predicate detection task, the zero-shot performance increased from 41.0% to 44.3%. Thus, it was demonstrated that our proposed novel visual relationship balanced classifier was able to improve the detection accuracy and generalization ability of the VRD framework.
Finally, we analyzed the performance (rows j to o) of the multi-view framework with the RegionNorm normalization strategy. Similarly, we can conclude that multi-view information and the novel balanced classifier can improve the detection accuracy and generalization ability of the VRD framework. Moreover, by comparing the performance of [Img] with [Reg], we found that the RegionNorm strategy was better at processing object candidates which are near each other in depth space. For example, in row n, the zero-shot learning performance of predicate detection was 22.4%, which was higher than the 21.3% in row h.
In conclusion, multi-view information and the novel balanced classifier can both improve the detection accuracy and generalization ability of the VRD framework.

4.2. Image Retrieval on UnRel Dataset

4.2.1. Dataset

To address the problem of missing annotations, the challenging UnRel dataset [4], which contains rare visual relationship triplets, was constructed to evaluate the generalization ability of VRD frameworks. The UnRel dataset includes 1000 images and 76 kinds of visual triplets for the image retrieval task.

4.2.2. Setup

Following the setup of [4,7], the UnRel dataset [4] and the VRD dataset [1] were mixed into a new dataset in which the samples of UnRel were regarded as positive object pairs for image retrieval. Based on the overlap between the retrieval region and the ground truth, the retrieval performance was evaluated under two requirements: IoU ≥ 0.3 and IoU ≥ 0.5. Furthermore, the retrieval performance was evaluated by mAP in two tasks:
  • With GT. The input was a visual triplet described in natural language. The bounding boxes of the subject and object were given by the ground truth. Our task was to retrieve images containing the input visual triplet description according to retrieval confidence scores. The retrieval confidence scores were generated only by the relationship classifier, without effects from the object detector.
  • With candidates. The input was a visual triplet described in natural language; however, the bounding boxes were generated by the object detector. Thus, the retrieval confidence scores were computed by Equation (12). There were three sub-tasks in this setup: mAP-Union computed the overlap between the union of the detected object pair’s bounding boxes and the ground truth union; mAP-Sub computed the overlap between the detected subject bounding box and the ground truth subject; and mAP-Sub-Obj computed the overlap of the subject and the object with their corresponding ground truth bounding boxes, respectively.

4.2.3. Retrieval Performance Evaluation and Comparison

In Table 3, the retrieval performance of the two baselines and our proposed multi-view VRD framework in the different retrieval tasks is listed and compared. In the “With candidates” task, the retrieval performance of the multi-view model was similar to that of the Weakly [4] and RDBN [7] frameworks. Because the retrieval performance without given object bounding boxes mainly depends on the performance of the object detector, the improvements in this task were not significant. We found that the novel multi-view VRD model achieved higher performance in the “With GT” task; that is, the multi-view features were able to reveal potential visual relationships in the original flat RGB images. In contrast, in complex practical visual scenes, the Weakly framework [4], which is based on flat RGB features, and RDBN [7], which is based on RGB-D features, did not extract the potential multi-view information related to the hidden visual relationships.

5. Conclusions

In this paper, a novel multi-view visual relationship detection framework is proposed to utilize the multi-view visual features hidden in monocular RGB information. The multi-view VRD framework achieved state-of-the-art zero-shot learning performance on the VRD dataset [1] and the UnRel dataset [4] in specific depth conditions.
Firstly, we constructed the novel multi-view VRD framework, composed of three main modules and based on estimated multi-view features, to improve the detection performance and generalization ability of the VRD model. Secondly, a novel multi-view image generation method, using only the original monocular RGB images, was proposed to transfer flat visual space to 3D multi-view space. Thirdly, we improved the visual relationship balanced classifier and mask matrices to enable the relationship predictor to adapt to multi-view features. Finally, detailed comparison experiments were conducted to demonstrate the effectiveness of the multi-view VRD framework, and we also analyzed the performance of different feature vector normalization methods.
However, although the multi-view VRD framework reached SOTA performance in zero-shot learning, it did not achieve SOTA performance in the normal VRD condition. How to further reduce the negative effects of inaccurate visual information and how to apply multi-view information to video visual relationship detection require investigation in future research.
To the best of our knowledge, this is the first time that multi-view information has been utilized in a visual relationship detection task. Our research has demonstrated the possibility that, even when the input is only flat monocular RGB information, multi-view features can be extracted and embedded in a VRD model.

Author Contributions

Conceptualization, X.L. and M.-G.G.; methodology, X.L.; software, X.L. and Y.H.; validation, X.L. and Y.H.; formal analysis, X.L.; investigation, X.L. and M.-G.G.; resources, X.L.; data curation, X.L.; writing—original draft preparation, X.L.; writing—review and editing, M.-G.G.; visualization, Y.H.; supervision, M.-G.G.; project administration, M.-G.G.; funding acquisition, M.-G.G. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Key R&D Program of China under Grant 2020YFB1708500.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Lu, C.; Krishna, R.; Bernstein, M.S.; Fei-Fei, L. Visual Relationship Detection with Language Priors. In Lecture Notes in Computer Science, Proceedings of the Computer Vision-ECCV 2016—14th European Conference, Part I, Amsterdam, The Netherlands, 11–14 October 2016; Springer: New York, NY, USA, 2016; Volume 9905, pp. 852–869. [Google Scholar]
  2. Zhang, H.; Kyaw, Z.; Chang, S.; Chua, T. Visual Translation Embedding Network for Visual Relation Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, IEEE Computer Society, Honolulu, HI, USA, 21–26 July 2017; pp. 3107–3115. [Google Scholar]
  3. Yu, R.; Li, A.; Morariu, V.I.; Davis, L.S. Visual Relationship Detection with Internal and External Linguistic Knowledge Distillation. In Proceedings of the IEEE International Conference on Computer Vision, ICCV 2017, IEEE Computer Society, Venice, Italy, 22–29 October 2017; pp. 1068–1076. [Google Scholar]
  4. Peyre, J.; Laptev, I.; Schmid, C.; Sivic, J. Weakly-Supervised Learning of Visual Relations. In Proceedings of the IEEE International Conference on Computer Vision, ICCV 2017, IEEE Computer Society, Venice, Italy, 22–29 October 2017; pp. 5189–5198. [Google Scholar]
  5. Zhang, H.; Kyaw, Z.; Yu, J.; Chang, S. PPR-FCN: Weakly Supervised Visual Relation Detection via Parallel Pairwise R-FCN. In Proceedings of the IEEE International Conference on Computer Vision, ICCV 2017, IEEE Computer Society, Venice, Italy, 22–29 October 2017; pp. 4243–4251. [Google Scholar]
  6. Liang, X.; Lee, L.; Xing, E.P. Deep Variation-Structured Reinforcement Learning for Visual Relationship and Attribute Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, IEEE Computer Society, Honolulu, HI, USA, 21–26 July 2017; pp. 4408–4417. [Google Scholar]
  7. Liu, X.; Gan, M. RDBN: Visual relationship detection with inaccurate RGB-D images. Knowl. Based Syst. 2020, 204, 106142. [Google Scholar] [CrossRef]
  8. Sharifzadeh, S.; Baharlou, S.M.; Berrendorf, M.; Koner, R.; Tresp, V. Improving Visual Relation Detection using Depth Maps. In Proceedings of the 25th International Conference on Pattern Recognition, ICPR 2020, Milan, Italy, 10–15 January 2021; pp. 3597–3604. [Google Scholar]
  9. Thomas, A.; Ferrari, V.; Leibe, B.; Tuytelaars, T.; Schiele, B.; Gool, L.V. Towards Multi-View Object Class Detection. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2006), New York, NY, USA, 17–22 June 2006; IEEE Computer Society: New York, NY, USA, 2006; pp. 1589–1596. [Google Scholar]
  10. Zhou, H.; Liu, A.; Nie, W.; Nie, J. Multi-View Saliency Guided Deep Neural Network for 3-D Object Retrieval and Classification. IEEE Trans. Multim. 2020, 22, 1496–1506. [Google Scholar] [CrossRef]
  11. Rubino, C.; Crocco, M.; Bue, A.D. 3D Object Localisation from Multi-View Image Detections. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 1281–1294. [Google Scholar] [CrossRef] [PubMed]
  12. Tang, C.; Ling, Y.; Yang, X.; Jin, W.; Zheng, C. Multi-view object detection based on deep learning. Appl. Sci. 2018, 8, 1423. [Google Scholar] [CrossRef] [Green Version]
  13. Chen, X.; Ma, H.; Wan, J.; Li, B.; Xia, T. Multi-view 3D Object Detection Network for Autonomous Driving. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, IEEE Computer Society, Honolulu, HI, USA, 21–26 July 2017; pp. 6526–6534. [Google Scholar]
  14. Yu, Y.; Hasan, K.S.; Yu, M.; Zhang, W.; Wang, Z. Knowledge Base Relation Detection via Multi-View Matching. In Communications in Computer and Information Science, Proceedings of the New Trends in Databases and Information Systems-ADBIS 2018 Short Papers and Workshops, AI*QA, BIGPMED, CSACDB, M2U, BigDataMAPS, ISTREND, DC, Budapest, Hungary, 2–5 September 2018; Springer: New York, NY, USA, 2018; Volume 909, pp. 286–294. [Google Scholar]
  15. Zhang, H.; Xu, G.; Liang, X.; Zhang, W.; Sun, X.; Huang, T. Multi-view multitask learning for knowledge base relation detection. Knowl. Based Syst. 2019, 183, 104870. [Google Scholar] [CrossRef]
  16. Wang, C.; Fu, H.; Yang, L.; Cao, X. Text Co-Detection in Multi-View Scene. IEEE Trans. Image Process. 2020, 29, 4627–4642. [Google Scholar] [CrossRef] [PubMed]
  17. Roy, S.; Shivakumara, P.; Pal, U.; Lu, T.; Kumar, G.H. Delaunay triangulation based text detection from multi-view images of natural scene. Pattern Recognit. Lett. 2020, 129, 92–100. [Google Scholar] [CrossRef]
  18. Eigen, D.; Fergus, R. Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-scale Convolutional Architecture. In Proceedings of the 2015 IEEE International Conference on Computer Vision, ICCV 2015, IEEE Computer Society, Santiago, Chile, 7–13 December 2015; pp. 2650–2658. [Google Scholar]
  19. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012, Lake Tahoe, NV, USA, 3–6 December 2012; pp. 1106–1114. [Google Scholar]
  20. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.S.; et al. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [Google Scholar] [CrossRef] [Green Version]
  21. Girshick, R.B.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, IEEE Computer Society, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  22. Ren, S.; He, K.; Girshick, R.B.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  23. Plummer, B.A.; Mallya, A.; Cervantes, C.M.; Hockenmaier, J.; Lazebnik, S. Phrase Localization and Visual Relationship Detection with Comprehensive Image-Language Cues. In Proceedings of the IEEE International Conference on Computer Vision, ICCV 2017, IEEE Computer Society, Venice, Italy, 22–29 October 2017; pp. 1946–1955. [Google Scholar]
  24. Sadeghi, F.; Divvala, S.K.; Farhadi, A. VisKE: Visual knowledge extraction and question answering by visual verification of relation phrases. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, IEEE Computer Society, Boston, MA, USA, 7–12 June 2015; pp. 1456–1464. [Google Scholar]
  25. Qiu, Y.; Satoh, Y.; Suzuki, R.; Iwata, K.; Kataoka, H. Multi-View Visual Question Answering with Active Viewpoint Selection. Sensors 2020, 20, 2281. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  26. Hu, R.; Xu, H.; Rohrbach, M.; Feng, J.; Saenko, K.; Darrell, T. Natural Language Object Retrieval. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, IEEE Computer Society, Las Vegas, NV, USA, 27–30 June 2016; pp. 4555–4564. [Google Scholar]
  27. Johnson, J.; Karpathy, A.; Li, F.-F. DenseCap: Fully Convolutional Localization Networks for Dense Captioning. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, 27–30 June 2016; pp. 4565–4574. [Google Scholar]
  28. Johnson, J.; Krishna, R.; Stark, M.; Li, L.; Shamma, D.A.; Bernstein, M.S.; Fei-Fei, L. Image retrieval using scene graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, IEEE Computer Society, Boston, MA, USA, 7–12 June 2015; pp. 3668–3678. [Google Scholar]
  29. Girshick, R.B. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision, ICCV 2015, IEEE Computer Society, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  30. Krishna, R.; Zhu, Y.; Groth, O.; Johnson, J.; Hata, K.; Kravitz, J.; Chen, S.; Kalantidis, Y.; Li, L.; Shamma, D.A.; et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. Int. J. Comput. Vis. 2017, 123, 32–73. [Google Scholar] [CrossRef] [Green Version]
  31. Fu, Y.; Hospedales, T.M.; Xiang, T.; Gong, S. Transductive Multi-View Zero-Shot Learning. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 2332–2345. [Google Scholar] [CrossRef] [PubMed]
  32. Gu, X.; Lin, T.; Kuo, W.; Cui, Y. Zero-Shot Detection via Vision and Language Knowledge Distillation. arXiv 2021, arXiv:2104.13921. [Google Scholar]
  33. Wu, H.; Yan, Y.; Chen, S.; Huang, X.; Wu, Q.; Ng, M.K. Joint Visual and Semantic Optimization for zero-shot learning. Knowl. Based Syst. 2021, 215, 106773. [Google Scholar] [CrossRef]
  34. Hwang, S.J.; Ravi, S.N.; Tao, Z.; Kim, H.J.; Collins, M.D.; Singh, V. Tensorize, Factorize and Regularize: Robust Visual Relationship Learning. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Computer Vision Foundation/IEEE Computer Society, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1014–1023. [Google Scholar]
  35. Yin, G.; Sheng, L.; Liu, B.; Yu, N.; Wang, X.; Shao, J.; Loy, C.C. Zoom-Net: Mining Deep Feature Interactions for Visual Relationship Recognition. In Lecture Notes in Computer Science, Proceedings of the Computer Vision-ECCV 2018-15th European Conference, Part III, Munich, Germany, 8–14 September 2018; Springer: New York, NY, USA, 2018; Volume 11207, pp. 330–347. [Google Scholar]
  36. Bin, Y.; Yang, Y.; Tao, C.; Huang, Z.; Li, J.; Shen, H.T. MR-NET: Exploiting Mutual Relation for Visual Relationship Detection. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, HI, USA, 27 January–1 February 2019; AAAI Press: Palo Alto, CA, USA, 2019; pp. 8110–8117. [Google Scholar]
  37. Zhang, J.; Shih, K.J.; Elgammal, A.; Tao, A.; Catanzaro, B. Graphical Contrastive Losses for Scene Graph Parsing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Computer Vision Foundation/IEEE, Long Beach, CA, USA, 16–20 June 2019; pp. 11535–11543. [Google Scholar]
  38. Zhan, Y.; Yu, J.; Yu, T.; Tao, D. On Exploring Undetermined Relationships for Visual Relationship Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Computer Vision Foundation/IEEE, Long Beach, CA, USA, 16–20 June 2019; pp. 5128–5137. [Google Scholar]
Figure 1. Visual relationship detection examples between different objects. A visual relationship is composed of three parts, i.e., subject-predicate-object.
Figure 2. An overview of the proposed multi-view visual relationship detection framework. This framework is composed of three main modules: (1) RGB-D object detector. For a monocular RGB image, first it is input into a depth estimation network [18] to obtain the corresponding depth map. Then, the original RGB and generated depth map are input into two different Faster-RCNNs [22] to locate and classify objects. (2) Multi-view features generator. After obtaining the locations of objects in the RGB image and estimated depth map, multi-view images (front view, left view and top view) are generated by specific strategies. Then the spatial feature vector is extracted by an extractor. Moreover, the RGB and depth appearance feature vector are both concatenated with a spatial feature vector. (3) Visual relationship classifier. The concatenated final feature vector is input into a visual relationship classifier to predict the visual relationship triplets.
Figure 3. Multi-view images of two objects in 3D space. In the front view image, $w$ and $h$ can be calculated from the object center locations extracted by the object detector. Depth is the difference of the mean depth values of the two detected bounding boxes. Note that $w$, $h$ and Depth in these three multi-view images have the same values.
Figure 4. An overview of the final combined feature of a pair of detected objects. The final visual feature vector is composed of four parts: $S_{RGB}$, $S_{Depth}$, $A_{Depth}$, $A_{RGB}$. The spatial feature is extracted by the second module of the proposed multi-view VRD framework, and the appearance feature is generated by the object detector. Note that the depth spatial feature $S_{Depth}$ is composed of $S_{Depth}^{L}$, $S_{Depth}^{T}$ and $S_{Depth}^{F}$.
Figure 4. An overview of the final combined feature of a pair of detected objects. The final visual feature vector is composed of four parts: S R G B , S D e p t h , A D e p t h , A R G B . The spatial feature is extracted by the second module of the proposed multi-view VRD framework, and the appearance feature is generated by the object detector. Note that the depth spatial feature S D e p t h is composed of S D e p t h L , S D e p t h T and S D e p t h F .
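A minimal sketch of how the final combined feature in Figure 4 might be assembled; only the ordering of the four parts [S_RGB, S_Depth (left/top/front), A_Depth, A_RGB] follows the figure, while the individual feature dimensions below are invented for illustration.

```python
import numpy as np

s_rgb   = np.zeros(400)                        # S_RGB: spatial feature from the RGB image
s_depth = np.concatenate([np.zeros(400)] * 3)  # S_Depth: left, top and front view spatial features
a_depth = np.zeros(4096)                       # A_Depth: appearance feature from the depth detector
a_rgb   = np.zeros(4096)                       # A_RGB: appearance feature from the RGB detector

final_feature = np.concatenate([s_rgb, s_depth, a_depth, a_rgb])
print(final_feature.shape)
```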
Figure 5. Improved mask matrices for the multi-view relationship classifier. The first matrix is mask A, the second is mask B, and the third is the part of mask B that processes the multi-view depth spatial feature vector S_Depth. Masks A and B can each be divided into four diagonal blocks, and every diagonal block processes the corresponding component of the final combined feature vector.
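A minimal sketch of the block-diagonal masking idea described in Figure 5: a 0/1 mask keeps each block of classifier weights connected only to its own feature component. The toy block sizes and the assumption of a square fully connected layer, whose input and output are both partitioned into the same four components, are for illustration only and do not reproduce the exact mask shapes of masks A and B.

```python
import numpy as np

# Toy block sizes standing in for the four components [S_RGB, S_Depth(L/T/F), A_Depth, A_RGB];
# the real feature dimensions would be much larger.
blocks = [4, 12, 8, 8]
dim = sum(blocks)

# Block-diagonal 0/1 mask: each diagonal block connects one feature component
# only to its own group of classifier units, as described for masks A and B.
mask = np.zeros((dim, dim))
start = 0
for size in blocks:
    mask[start:start + size, start:start + size] = 1.0
    start += size

weights = np.random.randn(dim, dim)
masked_weights = weights * mask  # connections outside the diagonal blocks are zeroed
print(int(masked_weights.astype(bool).sum()))  # number of connections kept by the mask
```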
Table 1. Recall performance comparison (%) with existing strong VRD baselines on the VRD dataset [1]. Multi-view[Img] and Multi-view[Reg] denote the multi-view VRD framework with the ImageNorm and RegionNorm normalization strategies, respectively. The "Unseen" columns show zero-shot learning performance, and the "All" columns show performance on the full dataset. "-" denotes that the result is unavailable. The best performances under each specific depth condition are highlighted in boldface. Note that, among existing methods, only the multi-view framework and RDBN [7] report performance within specific depth ranges.
| Method | Pre. Det. Unseen (R@50 & R@100) | Pre. Det. All (R@50 & R@100) | Phr. Det. Unseen (R@50/R@100) | Phr. Det. All (R@50/R@100) | Rel. Det. Unseen (R@50/R@100) | Rel. Det. All (R@50/R@100) |
|---|---|---|---|---|---|---|
| VRD-Full [1] | 12.3 | 47.9 | 5.1/5.7 | 16.2/17.0 | 4.8/5.4 | 13.9/14.7 |
| VTransE [2] | - | 44.8 | 2.7/3.5 | 19.4/22.4 | 1.7/2.1 | 14.1/15.2 |
| Weakly [4] | 19.8 | 49.7 | 6.8/7.4 | 16.1/17.7 | 6.2/6.8 | 14.5/15.8 |
| PPR-FCN [5] | - | 47.4 | - | 19.6/23.2 | - | 14.4/15.7 |
| IELKD:S [3] | 17.0 | - | 10.4/10.9 | - | 8.9/9.1 | - |
| IELKD:T [3] | 8.8 | - | 6.5/6.7 | - | 6.1/6.4 | - |
| IELKD:S+T [3] | - | 55.2 | - | 23.1/24.0 | - | 19.2/21.3 |
| DVSRL [6] | - | - | 9.2/10.3 | 21.4/22.6 | 7.9/8.5 | 18.2/20.8 |
| Robust [34] | 17.3 | 52.3 | 5.8/7.1 | 17.4/19.1 | 5.3/6.5 | 15.2/16.8 |
| Zoom-Net [35] | - | 50.7 | - | 24.8/28.1 | - | 18.9/21.4 |
| CAI+SCA-M [35] | - | 56.0 | - | 25.2/28.9 | - | 19.5/22.4 |
| MR-NET [36] | - | 61.2 | - | - | - | 16.7/17.6 |
| RelDN [37] | - | - | - | 31.3/36.4 | - | 25.3/28.6 |
| MF-URLN [38] | 26.9 | 58.2 | 5.9/7.9 | 31.5/36.1 | 4.3/5.5 | 23.9/26.8 |
| MF-URLN-IM [38] | 27.2 | - | 6.2/9.2 | - | 4.5/6.4 | - |
| RDBN [7] | 22.5 | 50.1 | 7.4/8.2 | 16.1/17.8 | 6.8/7.3 | 14.4/15.8 |
| RDBN (10 < ΔDepth < 30) [7] | 21.6 | 52.3 | 9.8/11.2 | 17.8/19.6 | 9.5/10.3 | 15.9/17.3 |
| RDBN (ΔDepth > 100) [7] | 42.6 | 55.2 | 11.5/11.5 | 19.6/20.6 | 6.6/6.6 | 14.1/15.0 |
| Multi-view[Img] | 22.7 | 50.2 | 7.4/8.2 | 16.1/17.9 | 6.8/7.3 | 14.4/15.9 |
| Multi-view[Img] (10 < ΔDepth < 30) | 21.3 | 52.0 | 9.5/10.3 | 17.6/19.4 | 9.2/9.8 | 15.8/17.2 |
| Multi-view[Img] (ΔDepth > 100) | 44.3 | 55.8 | 11.5/11.5 | 19.6/20.9 | 6.6/6.6 | 14.1/15.3 |
| Multi-view[Reg] | 22.9 | 50.3 | 7.0/7.9 | 16.1/17.8 | 6.4/7.0 | 14.4/15.8 |
| Multi-view[Reg] (10 < ΔDepth < 30) | 22.4 | 52.2 | 9.5/10.6 | 17.8/19.5 | 9.2/10.1 | 15.8/17.2 |
| Multi-view[Reg] (ΔDepth > 100) | 42.6 | 55.5 | 11.5/11.5 | 19.6/20.6 | 6.6/6.6 | 14.1/15.3 |
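The depth-conditioned rows in Tables 1 and 2 restrict evaluation to pairs whose mean-depth difference falls in a given range. The sketch below shows how such binning and a simplified Recall@K could be computed; the helper names are hypothetical, and the bounding-box IoU matching required for Phr. Det. and Rel. Det. is deliberately omitted for brevity.

```python
def depth_bin(delta_depth):
    """Assign a subject-object pair to one of the depth conditions reported in Tables 1 and 2."""
    if 10 < delta_depth < 30:
        return "10 < dDepth < 30"
    if delta_depth > 100:
        return "dDepth > 100"
    return "other"

def recall_at_k(gt_triplets, predictions, k):
    """Fraction of ground-truth triplets found in the top-k scored predictions for one image
    (box IoU matching, used in Phr./Rel. Det., is omitted here)."""
    top_k = {t for t, _ in sorted(predictions, key=lambda p: -p[1])[:k]}
    return len(gt_triplets & top_k) / max(len(gt_triplets), 1)

gt = {("person", "on", "horse")}
preds = [(("person", "on", "horse"), 0.9), (("person", "wear", "hat"), 0.8)]
print(depth_bin(57.0), recall_at_k(gt, preds, 50))
```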
Table 2. Ablation study for multi-view information and the proposed visual relationship balanced classifier BC’ (BC denotes the balanced classifier proposed in RDBN [7]). [without BC’] denotes the VRD framework with a traditional predicate classifier. Multi-view[Img] and Multi-view[Reg] denote the multi-view VRD framework with the ImageNorm and RegionNorm normalization strategies, respectively. The best performances of the framework under each of the two normalization strategies are highlighted in boldface.
| Method | Pre. Det. Unseen (R@50 & R@100) | Pre. Det. All (R@50 & R@100) | Phr. Det. Unseen (R@50/R@100) | Phr. Det. All (R@50/R@100) | Rel. Det. Unseen (R@50/R@100) | Rel. Det. All (R@50/R@100) |
|---|---|---|---|---|---|---|
| a. RDBN [without BC] [7] | 21.0 | 49.9 | 7.0/7.6 | 16.2/17.7 | 6.4/6.9 | 14.5/15.7 |
| b. RDBN [without BC] (10 < ΔDepth < 30) [7] | 19.3 | 51.7 | 9.2/10.1 | 17.7/19.3 | 8.6/9.5 | 15.8/17.2 |
| c. RDBN [without BC] (ΔDepth > 100) [7] | 39.3 | 54.3 | 9.8/9.8 | 19.3/20.2 | 6.6/6.6 | 14.4/15.0 |
| d. Multi-view[Img] [without BC’] | 22.0 | 50.4 | 7.1/7.6 | 16.3/17.8 | 6.5/6.9 | 14.6/15.8 |
| e. Multi-view[Img] [without BC’] (10 < ΔDepth < 30) | 19.8 | 52.0 | 9.2/10.1 | 17.9/19.3 | 8.9/9.5 | 15.9/17.2 |
| f. Multi-view[Img] [without BC’] (ΔDepth > 100) | 41.0 | 55.2 | 9.8/9.8 | 19.6/20.2 | 4.9/4.9 | 14.4/14.7 |
| g. Multi-view[Img] [with BC’] | 22.7 | 50.2 | 7.4/8.2 | 16.1/17.9 | 6.8/7.3 | 14.4/15.9 |
| h. Multi-view[Img] [with BC’] (10 < ΔDepth < 30) | 21.3 | 52.0 | 9.5/10.3 | 17.6/19.4 | 9.2/9.8 | 15.8/17.2 |
| i. Multi-view[Img] [with BC’] (ΔDepth > 100) | 44.3 | 55.8 | 11.5/11.5 | 19.6/20.9 | 6.6/6.6 | 14.1/15.3 |
| j. Multi-view[Reg] [without BC’] | 22.0 | 50.4 | 7.0/7.5 | 16.3/17.8 | 6.4/6.8 | 14.5/15.9 |
| k. Multi-view[Reg] [without BC’] (10 < ΔDepth < 30) | 21.0 | 52.2 | 9.2/10.1 | 17.7/19.4 | 8.9/9.5 | 15.7/17.3 |
| l. Multi-view[Reg] [without BC’] (ΔDepth > 100) | 39.3 | 54.3 | 9.8/9.8 | 19.9/20.6 | 4.9/4.9 | 14.7/15.0 |
| m. Multi-view[Reg] [with BC’] | 22.9 | 50.3 | 7.0/7.9 | 16.1/17.8 | 6.4/7.0 | 14.4/15.8 |
| n. Multi-view[Reg] [with BC’] (10 < ΔDepth < 30) | 22.4 | 52.2 | 9.5/10.6 | 17.8/19.5 | 9.2/10.1 | 15.8/17.2 |
| o. Multi-view[Reg] [with BC’] (ΔDepth > 100) | 42.6 | 55.5 | 11.5/11.5 | 19.6/20.6 | 6.6/6.6 | 14.1/15.3 |
Table 3. Image retrieval performance (% mAP) of different VRD frameworks on the UnRel dataset [4]. The higher the IoU threshold, the harder the retrieval task.
| Method | With GT | With Candidates: Union | With Candidates: Sub | With Candidates: Sub-Obj |
|---|---|---|---|---|
| IoU ≥ 0.3: | | | | |
| Weakly [4] | 56.8 | 17.0 | 15.9 | 13.4 |
| RDBN [7] | 60.1 | 17.3 | 16.3 | 13.5 |
| Multi-view (with ImageNorm) | 60.7 | 17.3 | 16.3 | 13.5 |
| Multi-view (with RegionNorm) | 60.2 | 17.3 | 16.3 | 13.5 |
| IoU ≥ 0.5: | | | | |
| Weakly [4] | 56.8 | 15.2 | 11.0 | 7.7 |
| RDBN [7] | 60.1 | 15.6 | 11.6 | 7.8 |
| Multi-view (with ImageNorm) | 60.7 | 15.6 | 11.6 | 7.8 |
| Multi-view (with RegionNorm) | 60.2 | 15.5 | 11.6 | 7.8 |
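The IoU thresholds in Table 3 decide whether a retrieved box counts as a match against the ground truth. A minimal sketch of the standard box IoU computation is given below; the (x1, y1, x2, y2) box format is an assumption for illustration.

```python
def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(round(box_iou((0, 0, 10, 10), (5, 5, 15, 15)), 3))  # 0.143: below both thresholds
```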