Article

A Novel Part Refinement Tandem Transformer for Human–Object Interaction Detection

School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
* Author to whom correspondence should be addressed.
Sensors 2024, 24(13), 4278; https://doi.org/10.3390/s24134278
Submission received: 19 May 2024 / Revised: 19 June 2024 / Accepted: 22 June 2024 / Published: 1 July 2024
(This article belongs to the Special Issue AI-Driven Sensing for Image Processing and Recognition)

Abstract

Human–object interaction (HOI) detection identifies a “set of interactions” in an image, involving both the recognition of interacting instances and the classification of interaction categories. The complexity and variety of image content make this task challenging. Recently, the Transformer has been applied in computer vision and has received attention in the HOI detection task. Therefore, this paper proposes a novel Part Refinement Tandem Transformer (PRTT) for HOI detection. Unlike previous Transformer-based HOI methods, PRTT utilizes multiple decoders to split and process the rich elements of HOI prediction and introduces a new part state feature extraction (PSFE) module to help improve the final interaction category classification. We adopt a novel prior feature integrated cross-attention (PFIC) layer to utilize the fine-grained part-state semantic and appearance features output by the PSFE module to guide queries. We validate our method on two public datasets, V-COCO and HICO-DET. Compared to state-of-the-art models, PRTT significantly improves the performance of human–object interaction detection.

1. Introduction

Given an image containing multiple humans and objects, human–object interaction (HOI) detection is the task of predicting a set of ⟨human, object, interaction⟩ triples in the image [1]. Because it underpins high-level, human-centric scene understanding, precise detection of human–object interactions can benefit numerous downstream tasks, including action detection [2,3] and scene graph generation [4,5], and it has therefore attracted considerable research interest.
Traditional HOI detection methods fall into two modes: two-phase and one-phase. In the basic two-phase detection framework, human and object features are first extracted using an object detection network, and interactions are then inferred from them. Within this mode, many researchers use additional features such as relative spatial configuration [6,7], the interactiveness field [8], human pose estimation [9], body part features [10], or scene graphs [11] to enhance these representations. However, exhaustively pairing detected human and object instances and post-processing steps such as NMS introduce additional time complexity. For faster detection, the one-phase mode performs interaction prediction and object detection in parallel [12,13]. However, detection becomes inaccurate in heavily overlapping crowd scenes or in special cases, such as when humans and objects are spatially far apart.
Recently, the Vision Transformer [14] has revolutionized tasks in the vision domain, overcoming the problems of traditional methods and offering a technique competitive in both accuracy and detection time for HOI detection. This article belongs to this Transformer-based line of work. The self-attention and cross-attention of the Transformer architecture [15] can better capture the contextual information between different instances, which is particularly appropriate for detecting HOIs. QPIC [16] and HOI-Transformer [14] define a set of learnable queries containing different types of elements to compose HOI triplet predictions. HOTR [17] sets two sets of queries for a pair of parallel decoders, but it uses a complex pointer mechanism to combine the outputs of the two tasks. Similarly, HOICLIP [18] uses matching to obtain the initial query group for interaction classification. CDN [19] adopts a dual-decoder design, but it only uses the previous query result as the input of the next query with a simple guidance strategy. Zhang et al. [20] exploited unary and pairwise representations for HOIs with the same Transformer. Most works follow the simple design of utilizing a single decoder to predict all HOI prediction elements directly. Although this architecture is successful, it also has drawbacks: (i) Because of the ambiguity of interactions in special cases (for example, when a human stands in front of a motorcycle, an interaction between the human and the motorcycle is very likely in image space, yet the human is not riding the motorcycle) and the large gap between pixels and activity concepts, simply initialized queries and the self-attention mechanisms of one-shot networks are not enough to find the relevant contextual features; interaction queries require additional guidance. (ii) Since an HOI prediction contains many elements (human location, object location, instance category, and interaction category), it is difficult to focus on all element-related features and achieve a good trade-off using only a single set of simple queries. Moreover, in some works [14,21], multiple queries require additional, time-consuming matching operations. Our work addresses these issues well.
We propose a novel end-to-end model, the Part Refinement Tandem Transformer (PRTT), to address the above drawbacks. It introduces a part state feature extraction (PSFE) module that improves on previous Transformer-based HOI detection designs in the intermediate stage. The local features between human body parts and objects are essential in interactions. For instance, when a photographer operates a camera, it is crucial to analyze how the photographer’s hands interact with the camera. The positioning, posture, and contact points between the hands and the camera deliver key insights into the mechanics, timing, and reasons behind the interaction. By leveraging these features as guiding elements, we can significantly bridge the gap from mere pixels to meaningful activity concepts, thus providing more profound and accurate contextual guidance for detecting human–object interactions (HOIs). The main innovative idea of PRTT is shown in Figure 1. It utilizes human pose key points as clues to extract the appearance and semantic features of human part states and encodes them to support and guide queries. Simultaneously, to address the second drawback, PRTT effectively focuses on all element-related features and achieves a good trade-off by disassembling the rich HOI prediction elements and querying them multiple times. Through the multiple tandem decoders strategy, the output of the previous decoder query is utilized as the input of the next decoder query. The two query results correspond one to one, avoiding the additional, time-consuming matching operations of [14,21]. Lastly, experiments on the HICO-DET [1] and V-COCO [22] datasets demonstrate the effectiveness of our method. The main contributions of our paper can be summarized as follows:
  • This study disassembles the rich HOI prediction elements and queries them multiple times to effectively focus on all element-related features and achieve a good trade-off. It also adopts the multiple tandem decoders strategy to avoid additional, time-consuming matching operations.
  • We efficiently encode and integrate appearance features and state semantics through a pretrained BERT model, with human pose key points as clues.
  • This study adopts a novel prior feature integrated cross-attention (PFIC) layer to efficiently introduce fine-grained part-state semantic and appearance features in the second stage to guide and improve queries.

2. Related Work

2.1. Object Detection

CNN-based object detection is divided into two-stage and one-stage approaches according to its structure. Girshick et al. [23] proposed R-CNN, which became the progenitor of CNN-based object detection. Girshick [24] then proposed Fast R-CNN, which builds on R-CNN by using an RoI pooling layer and adding the classification step and bounding box regression after feature extraction. Compared with the multi-stage training of R-CNN, the training of Fast R-CNN is more concise and efficient. He et al. [25] proposed the Mask R-CNN model for image segmentation, adding a parallel branch for predicting object masks to Faster R-CNN. Redmon et al. [26] proposed an object detection model named “You Only Look Once” (YOLO) that requires only one forward pass. At the beginning of 2018, YOLOv3 made major changes to the overall structure compared with its predecessor, replacing the softmax function with multiple independent logistic classifiers [27], and the series had developed into YOLOv8 by 2023 [28]. DETR (Detection Transformer) [21] first used a Transformer in computer vision to implement an end-to-end object detection method; its results on COCO are comparable to Faster R-CNN overall, and it outperforms Faster R-CNN on large objects. Numerous approaches to improving DETR performance have subsequently emerged.

2.2. Human–Object Interaction Detection

Many traditional HOI detection methods using CNN features have promoted the advancement of HOI detection, which is often divided into two paradigms: two-phase strategy and one-phase strategy.
Two-phase strategy: In the first stage, off-the-shelf object detectors are usually utilized to localize instances, including humans and objects. Interaction labels are then predicted in the second stage by incorporating additional features. Chao et al. [1] proposed a three-branch network, HO-RCNN, to extract features of human–object spatial relations. Gao et al. [7] proposed a human-centric attention module, iCAN, to emphasize regions of the image important to the interaction. Other works fuse pose estimation with visual features to provide accurate features of the human form, further improving network performance. Li et al. [29] used a pose estimation network and a human pose stream to extract human pose features. Many works [10,30,31] have used body part features as important auxiliary features in interaction detection. Li et al. [32] constructed a body-part-based dataset, HAKE, and proposed a multi-level pairwise feature network, PFNet. However, two-phase methods are usually inefficient because they handle many non-interactive detected objects and redundant combinations of human and object instances, and the accuracy of object detection greatly influences the network’s performance.
One-phase strategy: Recent works have attempted to address the problems faced by two-phase networks within a single-phase framework and have attracted widespread attention. Based on CenterNet [33], Liao et al. [12] proposed PPDM (parallel point detection and matching), in which the point-matching branch matches human and object points originating from the same interaction point. Such operations reduce the number of candidate interaction points to be screened and save computational cost. IP-Net [13] is similar. Zhong et al. [34] designed the single-stage GGNet (glance and gaze network), which adaptively models a set of action-aware points in the two steps of glancing and gazing to improve the performance of point-based policies. UnionDet [35] eliminates the extra inference stage by directly predicting the union box with an extra branch. Despite the great efficiency gains, this strategy couples two different tasks and thus imposes a notable limitation on performance.
Recently, the Transformer architecture [36] has been applied to various computer vision tasks, such as video inpainting [37] and medical image quality assessment [38], and has achieved state-of-the-art performance. Transformer-based one-phase methods treat HOI detection as a set prediction problem. More specifically, following DETR [21], HOI-Transformer [14] and QPIC [16] use a typical transformer with an encoder–decoder architecture and define the prediction of each learnable query as a <human, object, action> triple. In these works, matching instances before interaction classification is unnecessary. HOTR [17] and AS-Net [39] deploy two concurrent decoders for predicting HO pairs and classifying interactions, respectively, and then perform a matching operation on the outputs of the two decoders. Although these transformer-based methods achieve remarkable performance, they rely solely on self-attention mechanisms to discover prominent context features, and the queries used for interaction classification are initialized to zero. This results in a lack of guidance during the query process and a subsequent, time-consuming matching step. Instead, we introduce the semantic features of part states as additional features and input them into the PFIC layers together with the output of object detection to guide the queries for interaction category classification. There is a one-to-one correspondence between the outputs of the tandem structure, which removes the need for a matching step.

3. Method

3.1. Overview

Figure 2 shows the framework that implements the idea in Figure 1. Our proposed PRTT consists of four steps. Following the processing steps of DETR [21], we extract image features with a CNN backbone, add positional encodings, and feed them into a Transformer encoder. Then, a set of queries is input into the Interaction Instance Decoder to identify HO pair instance proposals and interaction scores. Next, the feature map extracted by the CNN and the proposals obtained in the previous step are input into the part state feature extraction module to obtain N part state features. After that, we use the output of the previous decoder, the N part state features, and the global memory as the input of the Interaction Category Decoder to query the interaction category corresponding to each HO pair. Finally, we combine the outputs of the two tasks to form HOI triples.

3.2. Backbone

The input is an RGB image of shape $(H_o, W_o, C_o)$, where $H_o$, $W_o$, and $C_o$ represent the picture's height, width, and RGB channels. We use a standard CNN feature extractor network to obtain a feature map denoted $F(x) \in \mathbb{R}^{H \times W \times D_c}$. Subsequently, we feed $F(x)$ into a convolution layer with a $1 \times 1$ kernel, which reduces its channel dimension $D_c$ to a smaller value $D_d$. The new feature map is $F_d(x) \in \mathbb{R}^{H \times W \times D_d}$, where $D_d$ defaults to 256. Next, we flatten $F_d(x)$ to generate the flattened feature $F_v(x) \in \mathbb{R}^{D_d \times HW}$. Following previous work [14,16,17,39], we add a fixed positional encoding $F_{pos}(x) \in \mathbb{R}^{D_d \times HW}$ to $F_v(x)$ to supply position details. The encoder follows the regular Transformer structure and is composed of several encoder layers, each primarily consisting of a self-attention layer and a feedforward network (FFN) layer. It aggregates global information to output a global memory of dimension $D_d$. The calculation process of the Transformer encoder is as follows:
$$F_{enc}(x) = f_{encoder}(F_v(x), F_{pos}(x))$$
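As a concrete illustration, the following is a minimal PyTorch sketch of this backbone and encoder stage under our own assumptions (the BackboneEncoder class, the use of torchvision's ResNet-50, and the sine/cosine positional encoding are illustrative choices, not the authors' released implementation): a 1 × 1 convolution reduces the CNN feature map to $D_d = 256$ channels, the map is flattened, a fixed positional encoding is added, and a standard Transformer encoder produces the global memory.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class BackboneEncoder(nn.Module):
    """Sketch: CNN feature map -> 1x1 conv -> flatten -> + positional encoding -> encoder."""
    def __init__(self, d_model=256, n_layers=6, n_heads=8):
        super().__init__()
        cnn = resnet50(weights=None)
        self.body = nn.Sequential(*list(cnn.children())[:-2])      # keep conv stages, drop pool/fc
        self.input_proj = nn.Conv2d(2048, d_model, kernel_size=1)  # D_c -> D_d
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=2048,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def positional_encoding(self, h, w, d_model):
        # Simple fixed sine/cosine encoding over the H*W flattened positions (illustrative).
        pos = torch.arange(h * w).unsqueeze(1)
        i = torch.arange(d_model).unsqueeze(0)
        angle = pos / torch.pow(10000, (2 * (i // 2)) / d_model)
        return torch.where(i % 2 == 0, torch.sin(angle), torch.cos(angle))  # (H*W, D_d)

    def forward(self, images):                       # images: (B, 3, H_o, W_o)
        f = self.body(images)                        # (B, 2048, H, W)
        f = self.input_proj(f)                       # (B, 256, H, W)
        b, d, h, w = f.shape
        f_v = f.flatten(2).transpose(1, 2)           # flattened feature, (B, H*W, 256)
        f_pos = self.positional_encoding(h, w, d).to(f_v.device)
        f_enc = self.encoder(f_v + f_pos)            # global memory, (B, H*W, 256)
        return f_enc, f_pos

if __name__ == "__main__":
    model = BackboneEncoder()
    memory, _ = model(torch.randn(1, 3, 512, 512))
    print(memory.shape)   # torch.Size([1, 256, 256]) for a 512x512 input (H*W = 16*16)
```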

3.3. Tandem Transformer

Multiple tandem decoders strategy: In our proposed tandem transformer architecture, HOI prediction is divided into HO pair recognition and interaction category identification, similar to the two-phase structure. The two-phase approach utilizes an off-the-shelf object recognizer for preprocessing, while the subsequent network focuses on interaction category classification using the feature maps obtained from the recognizer. This design allows two-phase HOI detection methods to perform better and more stably than traditional single-stage detection methods. Analogously, our two decoders can each focus on the task-related features within the global memory shared from the encoder output. The multiple tandem decoders strategy makes the query processes of the two decoders correspond one to one, so the output results of the two queries can be directly combined to form HOI predictions.
Interaction Instance Decoder: The Interaction Instance Decoder follows the basic structure of the transformer-based object detector DETR [21]. It is composed of several standard transformer decoder layers, each containing a self-attention component, an FFN, and a cross-attention layer. The cross-attention layer aggregates the embedding features $F_{enc}(x)$ output by the encoder into $N_q$ queries. We take the learnable queries $Q_z \in \mathbb{R}^{N_q \times D_d}$ and the global memory output by the encoder as input, and they are transformed into another set of output queries $Q_o \in \mathbb{R}^{N_q \times D_d}$. For each query, PRTT applies three FFN heads and one binary score head to predict human bounding boxes, object bounding boxes, object categories, and binary interaction scores, thus composing the set of interaction instance pair predictions $P_o$ and the corresponding interaction scores. The interaction score (IS) indicates whether the interaction instance pair produces an interaction; for samples without interaction, it reduces the final score. The Interaction Instance Decoder can therefore be expressed as follows:
$$P_o = f_{decoder}^{z}(F_{enc}(x), Q_z, F_{pos}(x))$$
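To make the query-to-prediction step concrete, the following hedged sketch shows how the prediction heads could be attached to the output queries $Q_o$; the head names, hidden sizes, and the DETR-style 3-layer MLP box heads are assumptions for illustration rather than the authors' exact code.

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    """3-layer FFN head with ReLU, as commonly used for box regression in DETR-style models."""
    def __init__(self, d_in, d_hidden, d_out, n_layers=3):
        super().__init__()
        dims = [d_in] + [d_hidden] * (n_layers - 1) + [d_out]
        self.layers = nn.ModuleList(nn.Linear(a, b) for a, b in zip(dims[:-1], dims[1:]))

    def forward(self, x):
        for i, layer in enumerate(self.layers):
            x = torch.relu(layer(x)) if i < len(self.layers) - 1 else layer(x)
        return x

class InteractionInstanceHeads(nn.Module):
    """Sketch of the heads applied to each output query Q_o of the Interaction Instance Decoder."""
    def __init__(self, d_model=256, num_obj_classes=80):
        super().__init__()
        self.human_box = MLP(d_model, d_model, 4)                 # (cx, cy, w, h), sigmoid-normalised
        self.object_box = MLP(d_model, d_model, 4)
        self.obj_class = nn.Linear(d_model, num_obj_classes + 1)  # +1 for "no object"
        self.interaction_score = nn.Linear(d_model, 1)            # binary: does this pair interact?

    def forward(self, q_o):                                       # q_o: (B, N_q, d_model)
        return {
            "human_boxes": self.human_box(q_o).sigmoid(),
            "object_boxes": self.object_box(q_o).sigmoid(),
            "object_logits": self.obj_class(q_o),
            "interaction_score": self.interaction_score(q_o).sigmoid(),
        }

if __name__ == "__main__":
    heads = InteractionInstanceHeads()
    out = heads(torch.randn(2, 64, 256))
    print(out["human_boxes"].shape, out["interaction_score"].shape)  # (2, 64, 4) (2, 64, 1)
```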
We send the output queries $Q_o$ to the Interaction Category Decoder. Simultaneously, PRTT performs part state feature extraction using the prediction $P_o$ of the Interaction Instance Decoder and the feature map of the CNN backbone. Here, we first perform pose estimation on the human region of the image, obtaining N key points for each HO pair predicted by the previous decoder. Next, PRTT crops the N body part area features and the object region features from the CNN feature map. The combinations of these features are input to our PSFE (part state feature extraction) module. Then, we utilize the human body part state semantic clues to generate the part state vectors as the supporting feature $F_{supp}(x)$. The implementation details of the PSFE module are provided in Section 3.4. Obtaining the supporting feature $F_{supp}(x)$ can thus be simply represented as follows:
$$F_{supp}(x) = f_{supp}(F(x), P_o)$$
Interaction Category Decoder: We utilize the Interaction Category Decoder to classify the interactions of the corresponding interaction instance pairs, a multi-label classification task. To better guide it to aggregate classification-related features, we use the supporting features $F_{supp}(x)$, the set of output queries $Q_o$, and the global memory as input to the Interaction Category Decoder. In this process, we project the supporting features $F_{supp}(x)$ into the same feature space as $Q_o$. The Interaction Category Decoder consists of M PFIC layers, and its structure is shown in Figure 3. $Q_o$ is processed through the multiple PFIC decoder layers and converted into another set of output queries $Q_f$. After passing through the FFN head, a collection of interaction classes $P_f = \{a_i \mid i \in \{1, 2, \ldots, N_d\}\}$ is generated. Thus, the Interaction Category Decoder can be represented as follows:
$$P_f = f_{decoder}^{o}(F_{supp}(x), Q_o, F_{enc}(x), F_{pos}(x))$$
In the PFIC layer, $Q_o$ is first updated by self-attention and then fed into two cross-attention modules that attend to $F_{enc}(x)$ and $F_{supp}(x)$, respectively, yielding two output features. These two outputs are then added and fed into a feedforward network, as shown in the following formulas:
$$Q_f' = \mathrm{SelfAttn}(Q_o)$$
$$C_f = \mathrm{CrossAttn}(Q_f', F_{enc}(x))$$
$$S_f = \mathrm{CrossAttn}(Q_f', F_{supp}(x))$$
$$Q_f = \mathrm{FFN}(C_f + S_f)$$
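A minimal sketch of one PFIC layer following these formulas is given below. The use of nn.MultiheadAttention, and the residual connections and layer normalization added in the style of standard Transformer layers, are our assumptions; only the self-attention / dual cross-attention / summed-FFN structure is taken from the formulas above.

```python
import torch
import torch.nn as nn

class PFICLayer(nn.Module):
    """Prior Feature Integrated Cross-attention layer (sketch).

    Self-attention updates the queries, then two parallel cross-attentions attend to the
    encoder memory F_enc and the projected part-state features F_supp; their outputs are
    summed and passed through an FFN.
    """
    def __init__(self, d_model=256, n_heads=8, d_ffn=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_enc = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_supp = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ffn), nn.ReLU(), nn.Linear(d_ffn, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, q, f_enc, f_supp):
        # q:      (B, N_q, d)   queries from the previous decoder / previous PFIC layer
        # f_enc:  (B, HW, d)    global memory from the shared encoder
        # f_supp: (B, N_q, d)   projected part-state supporting features
        q = self.norm1(q + self.self_attn(q, q, q, need_weights=False)[0])
        c_f = self.cross_enc(q, f_enc, f_enc, need_weights=False)[0]
        s_f = self.cross_supp(q, f_supp, f_supp, need_weights=False)[0]
        return self.norm2(q + self.ffn(c_f + s_f))

if __name__ == "__main__":
    layer = PFICLayer()
    out = layer(torch.randn(2, 64, 256), torch.randn(2, 256, 256), torch.randn(2, 64, 256))
    print(out.shape)   # torch.Size([2, 64, 256])
```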

3.4. Part State Feature Extraction

Based on the CNN feature map and the predictions of the Interaction Instance Decoder, PRTT obtains the part state features (appearance visual feature and semantic feature) of the human in each HO pair as additional features. The process is shown in Figure 4. Firstly, we crop the human region according to the prediction result of the Interaction Instance Decoder and use CPN [40] to perform pose estimation on it, obtaining the coordinates of N key points $L_{p_n} = \{x_{p_n}, y_{p_n}\}$. Then, centered on the N key points, regions $R_{p_n} = \{h_{p_n}, w_{p_n}, x_{p_n}, y_{p_n}\}$ are generated, where $h_{p_n}$ and $w_{p_n}$ are the height and width of the square region of the human part. The calculation is as follows:
$$h_{p_n} = w_{p_n} = \left[\gamma \sqrt{S_{human}}\right], \quad 1 \le n \le N$$
where $S_{human}$ represents the area of the human region, $[\cdot]$ denotes a rounding operation, and $\gamma$ denotes a scale parameter empirically set to 0.1. Thus, we have $N + 1$ regions covering the human parts and the object. We then apply ROI Align, a residual block, and global average pooling (GAP) to $F(x)$ to produce $N + 1$ region features.
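Under our assumptions, the part-region cropping and pooling step can be sketched as follows; the square part boxes use the side length from the formula above, torchvision.ops.roi_align stands in for ROI Align, and the residual block is omitted for brevity (the helper names and the 512-pixel input size are hypothetical).

```python
import torch
from torchvision.ops import roi_align

def part_regions(keypoints, human_box, gamma=0.1):
    """Build N square part regions centred on pose keypoints (sketch).

    keypoints: (N, 2) tensor of (x, y) keypoint coordinates in image space.
    human_box: (4,) tensor (x1, y1, x2, y2) of the detected human region.
    Returns (N, 4) boxes in (x1, y1, x2, y2) format.
    """
    area = (human_box[2] - human_box[0]) * (human_box[3] - human_box[1])
    side = torch.round(gamma * torch.sqrt(area))   # h_pn = w_pn = [gamma * sqrt(S_human)]
    half = side / 2
    x, y = keypoints[:, 0], keypoints[:, 1]
    return torch.stack([x - half, y - half, x + half, y + half], dim=1)

def region_features(feature_map, boxes, out_size=7):
    """ROI-Align the CNN feature map on each region and global-average-pool it (sketch)."""
    # feature_map: (1, C, H, W); spatial_scale maps image coordinates onto the feature grid.
    rois = torch.cat([torch.zeros(len(boxes), 1), boxes], dim=1)   # prepend batch index 0
    pooled = roi_align(feature_map, rois, output_size=out_size,
                       spatial_scale=feature_map.shape[-1] / 512.0)  # assumes a 512-px input
    return pooled.mean(dim=(2, 3))                                   # GAP -> (N, C)

if __name__ == "__main__":
    kps = torch.rand(17, 2) * 512                  # e.g. 17 COCO keypoints from CPN
    human = torch.tensor([100.0, 50.0, 300.0, 450.0])
    boxes = part_regions(kps, human)
    feats = region_features(torch.randn(1, 256, 16, 16), boxes)
    print(boxes.shape, feats.shape)                # torch.Size([17, 4]) torch.Size([17, 256])
```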
Feature refinement: Following PGPN [30], the features are propagated through one GCN layer according to the graph structure of the Feature Refinement Component in Figure 4. The refined part features collect feature information from the object, which is equivalent to using the object's high-level features. By avoiding redundancy between human and part information, this refinement process increases interaction detection accuracy. We concatenate the N refined part features $\{f_{p_n}\}_{n=1}^{N}$ with the refined object feature $f_{object}$ to form N combined feature vectors and input them into the PSFE module to calculate the part state vectors.
As presented in Figure 5, the N combined features are first input to the weight generator. The weight of a human part represents the importance of that part to the interactions between the corresponding HO pair instances. For example, the state of the head has little effect on judging the interaction of the “ride bicycle” label; on the contrary, it is very important for the interaction of the “talk on a phone” label. This component refers to the related work of HAKE-Action [32], and our weight labels are directly converted from the labels in the HAKE dataset. If the label of the part and the corresponding object in HAKE is “no interaction”, its value is set to 0; otherwise, the body part provides clues for inferring interactions, and its value is set to 1. With the weight labels as supervision, we utilize a weight generator consisting of fully connected layers and sigmoids to generate a set of weights $\{\alpha_n\}_{n=1}^{N}$ for every HO pair. The calculation of the weights between a human part and the object can therefore be expressed as follows:
$$\alpha_n = f_{weight}(f_{p_n}, f_{object})$$
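A minimal sketch of such a weight generator is shown below; the layer widths are illustrative assumptions, and the HAKE-derived binary weight labels described above would supervise its output with a binary cross-entropy loss.

```python
import torch
import torch.nn as nn

class PartWeightGenerator(nn.Module):
    """Predicts a scalar importance weight alpha_n for each human part-object feature pair (sketch)."""
    def __init__(self, d_part=256, d_obj=256, d_hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_part + d_obj, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, 1),
            nn.Sigmoid(),          # alpha_n in (0, 1)
        )

    def forward(self, f_part, f_object):
        # f_part: (N, d_part) refined part features; f_object: (d_obj,) refined object feature.
        f_obj = f_object.unsqueeze(0).expand(f_part.size(0), -1)
        return self.net(torch.cat([f_part, f_obj], dim=1)).squeeze(-1)   # (N,)

if __name__ == "__main__":
    gen = PartWeightGenerator()
    alpha = gen(torch.randn(17, 256), torch.randn(256))
    print(alpha.shape, bool(alpha.min() >= 0.0))   # torch.Size([17]) True
```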
We multiply the original combined feature by the weight to obtain a new feature $f_{p_n}^{*}$. After that, relationship recognition is performed for each human part–object pair, and its output is a triple of the form <part, verb, object>. Relationship recognition is a multi-label classification task; for example, <hand, hold, ball> and <hand, throw, ball> may both be correct simultaneously, so we handle it with a relationship classifier consisting of multiple fully connected layers and multiple sigmoids. The loss used for weight generation and relationship recognition is as follows:
$$L_{ps} = \sum_{n=1}^{N} \left( L_{weight}^{n} + L_{r}^{n} \right)$$
where $L_{weight}$ is the cross-entropy loss for generating the weight $\alpha_n$, and $L_r$ is the cross-entropy loss of relationship recognition. From this we obtain a relational triplet set for the N human part–object pairs. From this set of triples, we extract the corresponding semantic feature ($\mathrm{PSFE}_s$) through a BERT-based pretrained model [41]. Specifically, a human part–object pair contains m relational triples, and each word in each relational triple is mapped to a 768-dimensional feature by the pretrained BERT model. PRTT then concatenates the word features of each triple and multiplies them by the corresponding probability score of that triple from the relationship classifier, yielding a (2304 × m)-dimensional semantic feature for this part. After that, we apply pooling and resizing operations to reduce it to 3584 dimensions. Finally, we concatenate the semantic feature with the 512-dimensional appearance visual feature ($\mathrm{PSFE}_v$) extracted directly from the last FC layer of the relationship classifier to obtain the 4096-dimensional additional feature $f_{ps}$.
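The dimensions quoted above can be traced in a short sketch like the one below, which uses Hugging Face's bert-base-uncased as the semantic extractor (768-dimensional word features); the per-word sub-token averaging and the adaptive pooling used for the resize step are our assumptions rather than the authors' exact procedure.

```python
import torch
import torch.nn.functional as F
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
bert.eval()

def word_embedding(word):
    """768-d BERT feature for a single word (mean over its sub-word tokens, CLS/SEP excluded)."""
    tokens = tokenizer(word, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**tokens).last_hidden_state[0]      # (T, 768)
    return hidden[1:-1].mean(dim=0)                        # drop [CLS]/[SEP]

def part_semantic_feature(triples, probs, visual_feat):
    """Assemble the 4096-d part state feature from m <part, verb, object> triples (sketch).

    triples: list of m (part, verb, object) word triples from the relationship classifier.
    probs:   (m,) classifier probabilities for those triples.
    visual_feat: (512,) appearance feature from the classifier's last FC layer.
    """
    per_triple = [torch.cat([word_embedding(w) for w in t]) * p      # (2304,), weighted by prob
                  for t, p in zip(triples, probs)]
    sem = torch.cat(per_triple)                                       # (2304 * m,)
    sem = F.adaptive_avg_pool1d(sem.view(1, 1, -1), 3584).flatten()   # resize to 3584-d
    return torch.cat([sem, visual_feat])                              # (4096,)

if __name__ == "__main__":
    feats = part_semantic_feature([("hand", "hold", "ball"), ("hand", "throw", "ball")],
                                  torch.tensor([0.9, 0.4]), torch.randn(512))
    print(feats.shape)   # torch.Size([4096])
```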

3.5. Inference and Loss Function

The loss calculation consists of two steps: a bipartite matching step between predictions and ground truths, and a loss calculation step for the matched pairs. For bipartite matching, we pad the ground truth set of HO pairs with $\phi$ (no pair), expanding it to size $N_q$. Our work follows the training process of QPIC [16] and matches each ground truth with its best-matching prediction using the Hungarian algorithm [42]. A loss is computed between the matched predictions and the corresponding ground truths; here, the prediction contains the outputs of the two tandem decoders. In addition to QPIC's box-regression loss $L_b$, intersection-over-union (IoU) loss $L_u$, object-class loss $L_c$, and action loss $L_a$, the PRTT loss also includes the interaction score loss $L_s$ corresponding to the interaction score output by the Interaction Instance Decoder.
$$L_f = \sum_{k \in (h, o)} \left( \lambda_b L_b^{k} + \lambda_u L_u^{k} \right) + \lambda_c L_c + \lambda_a L_a + \lambda_s L_s$$
where $\lambda_b$, $\lambda_u$, $\lambda_c$, $\lambda_a$, and $\lambda_s$ are the hyperparameters for adjusting the weights of each loss.
After the tandem decoders generate their outputs, the query output by the Interaction Instance Decoder, refined by the additional features, serves as the learnable query input of the Interaction Category Decoder, so the two sets of outputs correspond one to one. We can therefore combine them into a set of five-tuples <human bounding box, object bounding box, object class, interaction score, interaction class>, where $\langle b_j^h, b_j^o, \arg\max_k s_j^{hoi}(k) \rangle$ is the j-th prediction. We set the prediction score $s_j^{hoi}$ to $s_j^a \cdot s_j^o \cdot s_j^i$, where $s_j^a$ and $s_j^o$ are the scores of interaction classification and object classification, respectively, and $s_j^i$ is the score indicating whether the HO pair produces an interaction. The prediction scores are also used to sort the prediction set. We adopt the pairwise non-maximum suppression method of [19] to filter repeated predictions and take the top K HOI predictions after sorting. In this process, the threshold, the parameter $\alpha$, and the parameter $\beta$ are set to 0.7, 1, and 0.5, respectively.
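As an illustration of this scoring and ranking step, the sketch below combines $s^a$, $s^o$, and $s^i$ into $s^{hoi}$ and keeps the top-K predictions; the pairwise non-maximum suppression of [19] is omitted, and the function name and tensor layout are hypothetical.

```python
import torch

def score_and_rank(action_probs, object_probs, interaction_scores, k=100):
    """Combine per-query scores into final HOI scores and keep the top-k predictions (sketch).

    action_probs:       (N_q, num_actions)  s^a, interaction-classification probabilities
    object_probs:       (N_q,)              s^o, probability of the predicted object class
    interaction_scores: (N_q,)              s^i, binary "does this pair interact" score
    Returns (query index, action index, score) for the k highest-scoring HOI predictions.
    """
    # s^{hoi} = s^a * s^o * s^i, broadcast over the action dimension
    hoi_scores = action_probs * object_probs.unsqueeze(1) * interaction_scores.unsqueeze(1)
    flat = hoi_scores.flatten()
    top = torch.topk(flat, k=min(k, flat.numel()))
    query_idx = torch.div(top.indices, action_probs.size(1), rounding_mode="floor")
    action_idx = top.indices % action_probs.size(1)
    return query_idx, action_idx, top.values

if __name__ == "__main__":
    q, a, s = score_and_rank(torch.rand(64, 29), torch.rand(64), torch.rand(64), k=10)
    print(q.shape, a.shape, s[:3])
```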

4. Experiments

In this section, we summarize our experimental results to demonstrate the superiority of our proposed model. First, Section 4.1 briefly introduces the datasets and experimental metrics we use. Then, Section 4.2 describes the implementation details of our model. We then evaluate the performance of PRTT by comparing it with previous state-of-the-art methods (Sections 4.3 and 4.4). Finally, in Section 4.5, we perform an ablation study to validate our design choices.

4.1. Experimental Setup

Datasets: We evaluate PRTT on two publicly available datasets: V-COCO [22] and HICO-DET [1]. HICO-DET has a total of 47,776 images: 38,118 for training and 9658 for testing. It contains 600 HOI categories (Full), composed of 117 interaction classes and 80 object categories. Based on the number of training instances, the HOI categories are further divided into 138 Rare (fewer than ten samples) and 462 Non-Rare (at least ten samples) categories. V-COCO is a relatively small dataset, a subset of COCO [43]. It consists of 2533 training images, 2867 validation images, and 4946 test images. The images are annotated with 80 object classes and 29 action classes. Of its 29 action classes, four lack object annotations, and one has only 21 images in its sample pool. In Section 3.4, we use the labels of the HAKE dataset, following HAKE-Action. HAKE is a knowledge-driven dataset that includes 26 M+ body part-level atomic action labels (part states), logic rules over part states, object knowledge labels (category, attribute, affordance), and their causal relationships.
Evaluation metric: A predicted HOI triplet <human, action, object> is counted as correct if the predicted human and object bounding boxes overlap their respective ground truth boxes with an IoU greater than 0.5 and the predicted interaction class is correct. We follow the protocol recommended by both datasets [1,22] and report results using mean average precision ($\mathrm{mAP}_{role}$) [1]. For V-COCO, there are two protocols: Scenario 1 and Scenario 2. For interactions without any object (human only), Scenario 1 applies a strict criterion that requires predicting an empty bounding box at the origin, whereas Scenario 2 handles such cases by ignoring the predicted object bounding box during evaluation.
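The box-overlap part of this criterion can be written as a short sketch (the box format and helper names are our assumptions):

```python
import torch

def box_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = torch.max(a[0], b[0]), torch.max(a[1], b[1])
    x2, y2 = torch.min(a[2], b[2]), torch.min(a[3], b[3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-6)

def triplet_is_correct(pred, gt, iou_thr=0.5):
    """A predicted <human, action, object> triplet counts as a true positive only if both
    boxes overlap their ground truths with IoU > 0.5 and the interaction class matches."""
    return (box_iou(pred["human_box"], gt["human_box"]) > iou_thr
            and box_iou(pred["object_box"], gt["object_box"]) > iou_thr
            and pred["action"] == gt["action"])

if __name__ == "__main__":
    pred = {"human_box": torch.tensor([10., 10., 110., 210.]),
            "object_box": torch.tensor([120., 40., 220., 140.]),
            "action": "ride"}
    gt = {"human_box": torch.tensor([12., 8., 108., 205.]),
          "object_box": torch.tensor([118., 42., 225., 150.]),
          "action": "ride"}
    print(triplet_is_correct(pred, gt))   # True
```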

4.2. Implementation Details

As the backbone of PRTT, we select ResNet-101 with a 6-layer Transformer encoder as the visual feature extractor, based on its performance on the V-COCO and HICO-DET training sets. We deploy the PSFE module to obtain additional features; this module uses CPN to estimate human pose, and both are pretrained on the HAKE dataset. During the training phase, we follow the strategy of previous work [14,16,17,39] and initialize the network with the parameters of DETR [21] pretrained on the MS COCO dataset. The query dimension $D_d$ is set to 256. $N_q$ is set to 64 for HICO-DET and 100 for V-COCO. Using the AdamW [44] optimizer with a weight decay of 0.0001, we train the entire model for 100 epochs, setting the learning rate to 0.0001 for the first 60 epochs and then reducing it to one-tenth of that value. Our network is trained on eight NVIDIA GeForce RTX 2080 Ti GPUs (NVIDIA, Santa Clara, CA, USA) with a batch size of 16. Each decoder in our work has 6 Transformer layers. In the two decoders, the FFN heads that output the human and object boxes have three linear layers with ReLU, while the FFN heads that output the object and interaction categories have only one linear layer. For the loss function, we follow QPIC and set $\lambda_b$, $\lambda_u$, $\lambda_c$, $\lambda_a$, and $\lambda_s$ to 2.5, 1, 1, 1, and 1, respectively.
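The optimizer and learning-rate schedule described above correspond to a setup along the following lines; this is a sketch with a placeholder model, where the drop after 60 epochs is implemented with a step scheduler (the training loop body is elided).

```python
import torch
import torch.nn as nn

# Placeholder network standing in for PRTT; only the optimisation settings are the point here.
model = nn.Linear(256, 256)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
# Learning rate 1e-4 for the first 60 epochs, then reduced to one tenth (1e-5).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=60, gamma=0.1)

num_epochs, batch_size = 100, 16
for epoch in range(num_epochs):
    # ... one pass over the training set with the Hungarian-matched loss would go here ...
    optimizer.step()       # placeholder step so the scheduler advances in this sketch
    scheduler.step()
print(optimizer.param_groups[0]["lr"])   # ~1e-5 after the drop at epoch 60
```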

4.3. Results

In this section, we compare the performance of PRTT with current state-of-the-art methods, as shown in Table 1 and Table 2. As described in Section 4.1, we follow the proposed evaluation protocol on the V-COCO [22] and HICO-DET [1] datasets.
As shown in Table 1, our proposed model outperforms previous state-of-the-art methods by a large margin on the V-COCO dataset. Specifically, compared with PGPN [30], which also uses pose-guided local feature extraction, and SMPNet [31], which uses a multi-level feature fusion strategy including part features, our method has significant advantages, improving by 15 mAP and 12.4 mAP in Scenario 1, respectively. The two-phase method FCMNet [45] with ResNet-50 as the backbone achieves an $\mathrm{mAP}_{role}$ of 53.1 in Scenario 1. Using two parallel decoders to process the two tasks and then pairing the results with pointers, HOTR achieves an $\mathrm{mAP}_{role}$ of 55.2. Compared with ASNet [39], which uses a strategy similar to HOTR and reaches 53.9 mAP, our method obtains an improvement of 11.3 mAP. GTNet [47] proposed an object semantic-guided model trained with relative spatial configuration, which achieves $\mathrm{mAP}_{role}$ values of 56.2 and 60.1 in Scenarios 1 and 2. Among recent work, the Transformer-based one-phase method QPIC [16] achieves 58.3 $\mathrm{mAP}_{role}$ in Scenario 1 with ResNet-101. CDN, which uses a simple guiding strategy, achieves 63.9 $\mathrm{mAP}_{role}$ and 65.9 $\mathrm{mAP}_{role}$ in Scenarios 1 and 2, respectively, while our complete model achieves the top results of 65.2 $\mathrm{mAP}_{role}$ and 66.8 $\mathrm{mAP}_{role}$ in the two scenarios.
The results of each method on the HICO-DET dataset are shown in Table 2. As mentioned above, we adopt the evaluation metrics proposed in the work of Chao et al. [1]. Our proposed model is evaluated with the default setting on the three HOI category sets: Full, Rare, and Non-Rare. Specifically, compared to QPIC [16], PRTT achieves a gain of 5.16 $\mathrm{mAP}_{role}$ on the Full set of HICO-DET. Our method outperforms the state-of-the-art methods, achieving an $\mathrm{mAP}_{role}$ of 35.06 on the Full set. Compared with PGPN [30] and SMPNet [31], which use pose-guided body features for HOI detection, PRTT achieves improvements of 17.66 mAP and 14.75 mAP, respectively. Compared with HOTR [17] and ASNet [39], which use parallel decoders, PRTT improves by 9.96 mAP and 6.19 mAP, respectively. With the same ResNet-50 backbone, our model performs slightly worse than OCN [49] on the V-COCO dataset (by 0.4 mAP in Scenario 1) but performs much better than OCN, by 3.56 mAP, on the HICO-DET dataset with its fine-grained interaction labels. This is because the additional features extracted by the PSFE module have a significant guiding effect in distinguishing similar interaction categories.

4.4. Qualitative Visualization Comparison

To better analyze the model's behavior, we compare the attention maps of the last decoder layer of the traditional one-phase method QPIC with those of the two decoders of PRTT in Figure 6. The figure shows that the attention weights of the Interaction Instance Decoder and QPIC tend toward similar regions (boundaries and intersection regions of the HO pair), so the decoder can effectively collect relevant features to guide the localization of the HO pair by focusing on these regions. In contrast, the weights of the Interaction Category Decoder tend to focus on the representative parts and regions of the human, guiding the identification of the interaction type. Taking the first column of Figure 6 as an example, the two PRTT decoders focus on the laptop's outline and on the hand of the person operating the laptop, respectively. Likewise, for the rider in the fourth column of Figure 6, the decoder handling the localization task focuses more on the edges of the horse and the rider, while the decoder handling interaction classification pays more attention to the rider's bent knees and hands, because these human parts serve as additional features that guide the decoder. Compared to PRTT, QPIC extracts both tendencies of features with a single decoder, but neither is prominent.
We also show qualitative results of PRTT and compare them to the baseline (QPIC). Figure 7 presents examples chosen from various interaction categories. We find that PRTT produces more reliable scores for HOIs that have an indicative body part, which shows that the part-level features effectively guide the decoder to collect more relevant information.

4.5. Ablation Study

In this section, we explore how each component of PRTT contributes to the final performance. All experiments are performed on the V-COCO dataset. Table 3 reports the performance on the V-COCO test set when individual components of PRTT are excluded. We use a plain model as the base model, which has two tandem decoders processing the two tasks separately without additional post-processing operations.
Multiple tandem decoders strategy: As shown in Table 3, we change the strategy of querying both tasks with a single decoder so that the two tasks are processed separately by two tandem decoders; this base model achieves 62.1 mAP, an improvement of 3.3 mAP over the one-phase method QPIC. The result is close to the 62.29 $\mathrm{mAP}_{role}$ of the CDN-B variant [19]. In this variant, we directly input the output of the previous decoder as the learnable query of the latter decoder without using additional features as guidance, and the Interaction Category Decoder consists of standard transformer decoder layers.
Appearance visual feature ($\mathrm{PSFE}_v$): This is an important additional feature extracted by the PSFE module before the interaction classification task: the appearance features of human body parts obtained from the CNN feature maps and the Interaction Instance Decoder's predictions. Two simplified versions of our model were run to evaluate the impact of this component. Comparing the variant lacking the appearance visual feature with the complete model shows that FR + $\mathrm{PSFE}_v$ improves performance by 0.9 mAP. Compared with the base model, guiding the query with visual features as additional features in the PFIC layer increases the mAP from 62.1 to 63.3.
Feature refinement (FR): As presented in Table 3, according to the graph structure in Figure 4, PRTT uses a GCN layer to update the part and object features, so that the refined part features collect feature information from the object. By using the object's high-level features, this refinement operation avoids redundancy between human and part features. Comparing the variant without feature refinement of the appearance visual feature against the complete model shows that FR marginally enhances the model's effectiveness, by 0.3 mAP.
Semantic feature ($\mathrm{PSFE}_s$): In the PSFE module, we use HAKE labels and BERT to obtain semantic representations of human body part states as additional features to guide the query. A simplified version of our approach is run without this branch; the comparison shows a gain of 1.1 mAP from the semantic feature.
Interaction score (IS): Comparable to the work of Shen et al. [51] but different from QPIC, we add an output value $s_j^i$ to the Interaction Instance Decoder to measure whether there is an interaction between the human and the object; it lowers the final score when there is no interaction. Performance is improved by 0.4 mAP with IS.

5. Conclusions

In this paper, we introduce PRTT, a novel Transformer-based set prediction method for detecting human–object interactions. The model utilizes multiple decoders to split and process the elements of HOI prediction so that each decoder focuses on the features related to its elements. In the intermediate stage, we utilize a pretrained BERT model to encode part-state semantic and appearance features, which guide and improve the queries through the PFIC layer. PRTT exhibits superior performance in detecting HOIs, achieving state-of-the-art results on both the V-COCO and HICO-DET datasets and demonstrating the effectiveness of our solution. In future research, we plan to explore the integration of multimodal data, such as combining depth and textual information, to further enhance the accuracy of human–object interaction detection.

Author Contributions

Conceptualization, Z.S. and H.Y.; methodology, Z.S. and H.Y.; software, Z.S. and H.Y.; validation, Z.S. and H.Y.; investigation, H.Y.; resources, Z.S.; data curation, H.Y.; writing—original draft preparation, Z.S.; writing—review and editing, Z.S. and H.Y.; visualization, Z.S.; supervision, Z.S.; project administration, Z.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research is funded by the National Natural Science Foundation of China (62072094), the LiaoNing Revitalization Talents Program (XLYC2005001), the Key Research and Development Project of Liaoning Province (2020JH2/10100046), the Fundamental Research Funds for the Central Universities (DUT24RC(3)046).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The authors confirm that the data supporting the findings of this study are available within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Chao, Y.W.; Liu, Y.; Liu, X.; Zeng, H.; Deng, J. Learning to detect human-object interactions. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 381–389. [Google Scholar]
  2. Zhao, H.; Wildes, R.P. Spatiotemporal feature residual propagation for action prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7003–7012. [Google Scholar]
  3. Kong, Y.; Tao, Z.; Fu, Y. Deep sequential context networks for action prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1473–1481. [Google Scholar]
  4. Lin, X.; Ding, C.; Zeng, J.; Tao, D. Gps-net: Graph property sensing network for scene graph generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 3746–3753. [Google Scholar]
  5. Suhail, M.; Mittal, A.; Siddiquie, B.; Broaddus, C.; Eledath, J.; Medioni, G.; Sigal, L. Energy-based learning for scene graph generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13936–13945. [Google Scholar]
  6. Ulutan, O.; Iftekhar, A.; Manjunath, B.S. Vsgnet: Spatial attention network for detecting human object interactions using graph convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 13617–13626. [Google Scholar]
  7. Gao, C.; Zou, Y.; Huang, J.B. ican: Instance-centric attention network for human-object interaction detection. arXiv 2018, arXiv:1808.10437. [Google Scholar]
  8. Liu, X.; Li, Y.L.; Wu, X.; Tai, Y.W.; Lu, C.; Tang, C.K. Interactiveness field in human-object interactions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 20113–20122. [Google Scholar]
  9. Wan, B.; Zhou, D.; Liu, Y.; Li, R.; He, X. Pose-aware multi-level feature network for human object interaction detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9469–9478. [Google Scholar]
  10. Wu, X.; Li, Y.L.; Liu, X.; Zhang, J.; Wu, Y.; Lu, C. Mining cross-person cues for body-part interactiveness learning in hoi detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 121–136. [Google Scholar]
  11. Wang, X.; Gupta, A. Videos as space-time region graphs. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 399–417. [Google Scholar]
  12. Liao, Y.; Liu, S.; Wang, F.; Chen, Y.; Qian, C.; Feng, J. Ppdm: Parallel point detection and matching for real-time human-object interaction detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 482–490. [Google Scholar]
  13. Wang, T.; Yang, T.; Danelljan, M.; Khan, F.S.; Zhang, X.; Sun, J. Learning human-object interaction detection using interaction points. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4116–4125. [Google Scholar]
  14. Zou, C.; Wang, B.; Hu, Y.; Liu, J.; Wu, Q.; Zhao, Y.; Li, B.; Zhang, C.; Zhang, C.; Wei, Y.; et al. End-to-end human object interaction detection with hoi transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 11825–11834. [Google Scholar]
  15. Dong, Q.; Tu, Z.; Liao, H.; Zhang, Y.; Mahadevan, V.; Soatto, S. Visual relationship detection using part-and-sum transformers with composite queries. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 3550–3559. [Google Scholar]
  16. Tamura, M.; Ohashi, H.; Yoshinaga, T. Qpic: Query-based pairwise human-object interaction detection with image-wide contextual information. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 10410–10419. [Google Scholar]
  17. Kim, B.; Lee, J.; Kang, J.; Kim, E.S.; Kim, H.J. Hotr: End-to-end human-object interaction detection with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 74–83. [Google Scholar]
  18. Ning, S.; Qiu, L.; Liu, Y.; He, X. Hoiclip: Efficient knowledge transfer for hoi detection with vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 23507–23517. [Google Scholar]
  19. Zhang, A.; Liao, Y.; Liu, S.; Lu, M.; Wang, Y.; Gao, C.; Li, X. Mining the benefits of two-stage and one-stage hoi detection. Adv. Neural Inf. Process. Syst. 2021, 34, 17209–17220. [Google Scholar]
  20. Zhang, F.Z.; Campbell, D.; Gould, S. Efficient two-stage detection of human-object interactions with a novel unary-pairwise transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 20104–20112. [Google Scholar]
  21. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229. [Google Scholar]
  22. Gupta, S.; Malik, J. Visual Semantic Role Labeling. arXiv 2015, arXiv:1505.04474. [Google Scholar]
  23. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  24. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  25. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  26. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  27. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  28. Reis, D.; Kupec, J.; Hong, J.; Daoudi, A. Real-time flying object detection with YOLOv8. arXiv 2023, arXiv:2305.09972. [Google Scholar]
  29. Li, Y.L.; Zhou, S.; Huang, X.; Xu, L.; Ma, Z.; Fang, H.S.; Wang, Y.; Lu, C. Transferable interactiveness knowledge for human-object interaction detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3585–3594. [Google Scholar]
  30. Su, Z.; Wang, Y.; Xie, Q.; Yu, R. Pose graph parsing network for human-object interaction detection. Neurocomputing 2022, 476, 53–62. [Google Scholar] [CrossRef]
  31. Su, Z.; Yu, R.; Zou, S.; Guo, B.; Cheng, L. Spatial-Aware Multi-Level Parsing Network for Human-Object Interaction. Int. J. Interact. Multimed. Artif. Intell. 2023, 1–10. [Google Scholar] [CrossRef]
  32. Li, Y.L.; Xu, L.; Liu, X.; Huang, X.; Xu, Y.; Chen, M.; Ma, Z.; Wang, S.; Fang, H.S.; Lu, C. Hake: Human activity knowledge engine. arXiv 2019, arXiv:1904.06539. [Google Scholar]
  33. Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. Centernet: Keypoint triplets for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6569–6578. [Google Scholar]
  34. Zhong, X.; Qu, X.; Ding, C.; Tao, D. Glance and gaze: Inferring action-aware points for one-stage human-object interaction detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13234–13243. [Google Scholar]
  35. Kim, B.; Choi, T.; Kang, J.; Kim, H.J. Uniondet: Union-level detector towards real-time human-object interaction detection. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 498–514. [Google Scholar]
  36. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5999. [Google Scholar]
  37. Yan, W.; Sun, Y.; Yue, G.; Zhou, W.; Liu, H. FVIFormer: Flow-guided global-local aggregation transformer network for video inpainting. IEEE J. Emerg. Sel. Top. Circuits Syst. 2024. [Google Scholar] [CrossRef]
  38. Lu, Y.; Fu, J.; Li, X.; Zhou, W.; Liu, S.; Zhang, X.; Wu, W.; Jia, C.; Liu, Y.; Chen, Z. Rtn: Reinforced transformer network for coronary ct angiography vessel-level image quality assessment. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Vancouver, BC, Canada, 8–12 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 644–653. [Google Scholar]
  39. Chen, M.; Liao, Y.; Liu, S.; Chen, Z.; Wang, F.; Qian, C. Reformulating hoi detection as adaptive set prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 9004–9013. [Google Scholar]
  40. Chen, Y.; Wang, Z.; Peng, Y.; Zhang, Z.; Yu, G.; Sun, J. Cascaded pyramid network for multi-person pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7103–7112. [Google Scholar]
  41. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  42. Kuhn, H.W. The Hungarian method for the assignment problem. Nav. Res. Logist. Q. 1955, 2, 83–97. [Google Scholar] [CrossRef]
  43. Chen, X.; Fang, H.; Lin, T.Y.; Vedantam, R.; Gupta, S.; Dollár, P.; Zitnick, C.L. Microsoft coco captions: Data collection and evaluation server. arXiv 2015, arXiv:1504.00325. [Google Scholar]
  44. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
  45. Liu, Y.; Chen, Q.; Zisserman, A. Amplifying key cues for human-object-interaction detection. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 248–265. [Google Scholar]
  46. Li, Y.L.; Liu, X.; Wu, X.; Li, Y.; Lu, C. Hoi analysis: Integrating and decomposing human-object interaction. Adv. Neural Inf. Process. Syst. 2020, 33, 5011–5022. [Google Scholar]
  47. Iftekhar, A.; Kumar, S.; McEver, R.A.; You, S.; Manjunath, B. Gtnet: Guided transformer network for detecting human-object interactions. arXiv 2021, arXiv:2108.00596. [Google Scholar]
  48. Liao, Y.; Zhang, A.; Lu, M.; Wang, Y.; Li, X.; Liu, S. Gen-vlkt: Simplify association and enhance interaction understanding for hoi detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 20123–20132. [Google Scholar]
  49. Yuan, H.; Wang, M.; Ni, D.; Xu, L. Detecting human-object interactions with object-guided cross-modal calibrated semantics. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 22 February–1 March 2022; Volume 36, pp. 3206–3214. [Google Scholar]
  50. Hou, Z.; Yu, B.; Qiao, Y.; Peng, X.; Tao, D. Affordance transfer learning for human-object interaction detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 495–504. [Google Scholar]
  51. Shen, L.; Yeung, S.; Hoffman, J.; Mori, G.; Fei-Fei, L. Scaling Human-Object Interaction Recognition Through Zero-Shot Learning. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018. [Google Scholar]
Figure 1. We equip the same encoder with multiple tandem decoders to handle the HOI prediction composed of human–object (HO) pair recognition and interaction category detection, respectively. In the intermediate stage, we utilize the PSFE module to extract appearance features of part states based on human pose key points and further generate semantic features of part states to refine the representation of queries in the second stage.
Figure 2. Overview of PRTT: Interaction Instance Decoder and Interaction Category Decoder are run in tandem, sharing the same Transformer encoder. In the intermediate stage, we utilize the PSFE module to generate N vectors representing semantic and appearance features of part states to guide the queries in the next stage. Then, the interaction and HO pair representations are obtained separately in the two concatenated decoders and combined into HOI triples. Here, ⊕ represents concatenation procedures.
Figure 3. Structure of prior feature-integrated cross-attention layer.
Figure 4. Illustration of the process of part state feature extraction. Here, GAP is global average pooling, and Residual denotes residual block.
Figure 5. Illustration of the process of the PSFE module. Here, ⨂ represents element-wise multiplication, ⊕ is the concatenation process, FCs denote two fully connected layers, and the semantic extractor is a BERT-based pretrained model.
Figure 6. Attention visualization. The attention map is extracted from the decoder’s final layer. In each subgraph, from top to bottom, are (a) the original image with ground truth; (b) the attention map of QPIC; (c) the attention map of the Interaction Instance Decoder; and (d) the attention map of the Interaction Category Decoder.
Figure 7. Comparison of the qualitative results of PRTT and QPIC. For the same image, the first row shows the prediction results of QPIC, and the second row shows the prediction results of our proposed method. The prediction scores of the two methods are shown in the captions: PRTT's detection scores are labeled in green, and QPIC's scores are labeled in red.
Table 1. Performance comparison on the V-COCO dataset.
| Method | Feature Backbone | Scenario 1 | Scenario 2 |
|---|---|---|---|
| UnionDet [35] | ResNet-50-FPN | 47.5 | 56.2 |
| Wang et al. [13] | ResNet-50-FPN | 51.0 | - |
| PGPN [30] | ResNet-50-FPN | 50.2 | - |
| SMPNet [31] | ResNet-50-FPN | 52.8 | - |
| HOI-Trans [14] | ResNet-101-FPN | 52.9 | - |
| FCMNet [45] | ResNet-50 | 53.1 | - |
| IDN [46] | ResNet-50 | 53.3 | 60.3 |
| ASNet [39] | ResNet-50 | 53.9 | - |
| GGNet [34] | Hourglass-104 | 54.7 | - |
| HOTR [17] | ResNet-50 | 55.2 | 64.4 |
| GTNet [47] | ResNet-50 | 56.2 | 60.1 |
| QPIC [16] | ResNet-50 | 58.8 | 61.0 |
| QPIC [16] | ResNet-101 | 58.3 | 60.7 |
| Zhang et al. [20] | ResNet-101 | 60.7 | 66.2 |
| Liu et al. [8] | ResNet-50 | 63.0 | 65.2 |
| Wu et al. [10] | ResNet-50 | 63.0 | 65.1 |
| HOICLIP [18] | ResNet-50 | 63.5 | 64.8 |
| GEN-VLKTl [48] | ResNet-101 | 63.6 | 65.9 |
| CDN [19] | ResNet-101 | 63.9 | 65.9 |
| OCN [49] | ResNet-50 | 64.2 | 66.3 |
| Our method | ResNet-50 | 63.8 | 65.5 |
| Our method | ResNet-101 | 65.2 | 66.8 |
Table 2. Performance comparison on the HICO-DET dataset.
| Method | Full | Rare | Non-Rare |
|---|---|---|---|
| UnionDet [35] | 17.58 | 11.72 | 19.33 |
| Wang et al. [13] | 19.56 | 12.79 | 21.58 |
| FCMNet [45] | 20.41 | 17.34 | 21.56 |
| PGPN [30] | 17.40 | 13.84 | 18.45 |
| SMPNet [31] | 20.31 | 17.14 | 21.26 |
| PPDM [12] | 21.73 | 13.78 | 24.10 |
| HOI-Trans [14] | 23.46 | 16.91 | 25.41 |
| PST [15] | 23.93 | 14.98 | 26.60 |
| HOTR [17] | 25.10 | 17.34 | 27.42 |
| IDN [46] | 26.29 | 22.61 | 27.39 |
| GTNet [47] | 26.78 | 21.02 | 28.50 |
| ATL [50] | 28.53 | 21.64 | 30.59 |
| ASNet [39] | 28.87 | 24.25 | 30.25 |
| QPIC (ResNet-50) [16] | 29.07 | 21.85 | 31.23 |
| QPIC (ResNet-101) [16] | 29.90 | 23.92 | 31.69 |
| GGNet [34] | 29.17 | 22.13 | 30.84 |
| CDN [19] | 32.07 | 27.19 | 33.53 |
| Zhang et al. [20] | 32.31 | 28.55 | 33.44 |
| HOICLIP [18] | 34.69 | 31.12 | 35.74 |
| Liu et al. [8] | 33.51 | 30.30 | 34.46 |
| OCN (ResNet-50) [49] | 30.91 | 25.56 | 32.51 |
| GEN-VLKTl [48] | 34.95 | 31.18 | 36.08 |
| Our method (ResNet-50) | 34.47 | 32.43 | 33.78 |
| Our method (ResNet-101) | 35.06 | 32.48 | 35.83 |
Table 3. Ablation studies of PRTT in the V-COCO test set.
| Method | $\mathrm{mAP}_{role}$ |
|---|---|
| QPIC [16] | 58.8 |
| Base model | 62.1 |
| Base model + $\mathrm{PSFE}_v$ | 63.3 |
| Base model + FR + $\mathrm{PSFE}_v$ | 63.6 |
| Base model + FR + $\mathrm{PSFE}_v$ + IS | 64.1 |
| Base model + $\mathrm{PSFE}_s$ + IS | 64.3 |
| Base model + FR + $\mathrm{PSFE}_v$ + $\mathrm{PSFE}_s$ | 64.8 |
| Base model + $\mathrm{PSFE}_v$ + $\mathrm{PSFE}_s$ + IS | 64.9 |
| Our method (Base model + FR + $\mathrm{PSFE}_v$ + $\mathrm{PSFE}_s$ + IS) | 65.2 |