1. Introduction
HOI detection is a visual relation detection task. It aims to identify the interactions between people and objects from the image features of the human body, the object, and the human–object pair, in which human pose also plays a crucial role, and thereby to classify the action depicted in an image (Figure 1). This task requires not only accurate localization of people and objects in the image but also identification and classification of object categories and discrimination of the influence of human pose on behavior, so as to accurately infer the interactive behavior.
With the development of society and the acceleration of urbanization, the types of public places, such as communities, fairs, and shopping malls, are gradually increasing, and the composition of residents is becoming more complex. As a result, preventing dangerous events in these places has become increasingly important [1,2]. For example, open robbery (such as attacks with knives or other dangerous tools), theft (such as pickpocketing of pedestrians), and violent conflicts caused by drinking (such as fighting with wine bottles) may all occur in a community. To assess HOI instances more accurately, effectively gauge the relationships between people and objects, and provide technical support for the judgment of abnormal behavior, an advanced HOI detection method is needed. This approach should not only focus on fine-grained human-centered action interactions [3,4,5,6,7,8] (e.g., armed threats or imminent attacks) but should also consider multiple complex actions that occur simultaneously [9,10,11,12] (e.g., a bicycle hitting a pedestrian). By accurately identifying and analyzing these interactions, we can fill the gaps in the ability of community surveillance to identify, analyze, and prevent dangerous behaviors, so as to effectively prevent them in advance and accurately trace them after the fact.
In view of the problems in existing research, and in order to strengthen the perception of interaction relationships, the human–object interaction region should be highlighted. In this paper, we design a Pose-Aware Interaction Network (PAIN) based on the transformer architecture and human posture (see Figure 2), which effectively integrates human pose information and HOI features to improve detection accuracy. Specifically, a human pose estimation method is used to obtain the coordinates of two-dimensional skeleton key points, and an adaptive graph convolutional network is used to extract the pose features and fuse them with the image features, so that the model combines the visual information in the image with the spatial relationships provided by the pose to form a comprehensive feature representation. A three-branch decoder replaces the traditional single transformer decoder, and the pose features are deeply fused into the human detection branch so as to better capture the potential action features of the human body. In addition, the Cross-Attention Relation Fusion Module (CARM) is designed to unify the information from all three branches, delivering more elaborate interaction details and improving the grasp of interaction relationships. The main contributions of this paper are as follows:
We design a pose-aware human interaction network to provide a more comprehensive interpretation of human interaction behaviors, including the influence of human pose on the determination of interactive actions, so as to realize effective human interaction behavior detection.
In PAIN, we propose a novel feature fusion method to address the limitations of existing HOI methods: 2D human pose features and image features are fused early, before the encoder, to improve feature expressivity, and individual motion-related features are creatively added to the human branch, a practice that has not been explicitly explored in existing HOI methods.
In PAIN, we develop the Cross-Attention Relation Fusion Module (CARM), which uses the cross-attention mechanism to interact with information from multiple inputs; captures the detailed relationship between the three branches of human, object and interaction; and improves the prediction accuracy.
We conduct experiments to quantitatively and qualitatively verify that our proposed PAIN achieves role AP scores of 64.51% (Scenario 1) and 66.42% (Scenario 2) on the public dataset V-COCO and 30.83% AP on HICO-DET.
2. Related Work
At present, the main HOI detection techniques include traditional one-stage HOI detection, two-stage HOI detection, and transformer-based HOI detection. Two-stage HOI detection methods [3,4,5,6,7,13,14] usually decompose the detection task into an object detection task and an interaction classification task. First, a trained, fine-tuned object detector generates bounding boxes and categories for people and objects. Then, all detected people and objects are paired, and all pairs are passed to a separate neural network for training and interaction classification. Kim et al. [15] proposed multiple relational networks to perform rich context exchange in three decoder branches. Gao et al. [14] first proposed a two-channel binary image representation to encode spatial relations. Chao et al. [16] proposed the widely used dataset HICO-DET and adopted a two-stage method to build the HO-RCNN model, which integrated the spatial location information of human–object pairs for the first time to improve detection. Zhou et al. [17] decouple triplet prediction from human–object pair detection and interaction classification through an instance encoder–decoder stream and an interactive encoder–decoder stream. In addition to spatial relations, graph neural networks [3,5] have been proposed to explicitly model the interaction between people and objects, which indeed improves the representational power of the model. Park et al. [18] proposed a novel feature extraction method, overlapping region masking, combined with a pose-conditioned self-loop structure to effectively solve the quantization problem in vision transformers.
Compared with the two-stage methods, the time complexity of one-stage methods [19,20,21,22] is greatly reduced because not all human–object pair combinations need to be trained, but they still require complex post-processing to combine object detection results and interaction predictions. In these methods, designing a reasonable matching pattern is the key to matching the object detection and interaction detection results. Liao et al. [20] and Wang et al. [21] treat HOI as a point detection task, enabling the direct identification of interactions in a single-stage process by introducing a new definition of interaction points. Kim et al. [19] proposed a one-stage anchor-based interaction detection framework in which the network directly captures the interaction region to detect the interaction, without a matching phase: a joint-level detection framework directly captures the interaction region, and an instance-level detector performs object detection and action classification. Fang et al. [23] proposed a dense interaction region selection framework (DIRV), which focuses on the interaction regions of human–object pairs.
At present, some transformer-based end-to-end HOI detection algorithms [9,10,11,12,15,24,25] regard HOI detection as a set prediction problem. Aaron et al. [26] decode N objects in parallel by replacing the LSTM with a transformer and using parallel transformer decoding. Cheng et al. [11] proposed an HOI transformer to handle HOI detection in an end-to-end manner. It simplifies the HOI pipeline and eliminates the need for many hand-designed components. The HOI transformer reasons about object and human relationships from the global image context, introduces a quintuple matching loss to enforce HOI prediction in a unified manner, and directly predicts HOI instances in parallel. Chen et al. [27] improve upon the transformer-based approach by utilizing query-based anchors to derive HOI embeddings and forecast HOI instances. In this method, interaction queries with random parameters (a learnable position-embedding sequence under a learnable query mechanism) are fed into the transformer decoder to directly map out a set of HOI predictions, which makes the model tend to detect high-confidence target regions and ignore regions where interactive actions occur. At the same time, such interaction queries are usually unknown at the beginning of prediction, lacking intuitive interaction relationships and the strongly topological human pose. Ma et al. [28] introduced a novel staged training strategy to reduce the training pressure caused by the complexity of the task and proposed the HOI-SDC dataset to address the challenges in HOI detection. Chen et al. [29] proposed an uncertainty-aware robust HOI learning method, which aims to refine detection and interaction prediction by estimating prediction uncertainty during training.
Many one-stage HOI detection methods [10,30,31,32] tend to divide the original task into multiple subtasks by using a two-branch or three-branch network structure. Fang et al. [30] separate the detection processes for people and objects and emphasize the importance of human features in interaction by introducing a human-guided link method; at the same time, they adopt a stop-gradient mechanism to manage the influence of the interaction on detection so as to optimize the detection of people and objects. Chan et al. [31] proposed a one-stage three-branch parallel HOI detection method that mitigates the noise generated during fusion with a noise suppression module. Wu et al. [32] integrate two key branches, a time-enhanced recurrent graph network (TRGN) and a parallel transformer encoder (PTE), aiming to extract rich hierarchical temporal features from video data. However, not enough attention has been paid to the information exchange between these branches, and a fully parallel approach destroys the association between different pieces of information.
3. Method
This study aims to better capture the diversity of individual interactions in HOI detection. By strengthening the network structure and optimizing the feature extraction method, we aim to increase the algorithm's attention to the human pose in HOI instances and thereby further improve detection accuracy. In this paper, we design a Pose-Aware Interaction Network (PAIN) based on the transformer architecture and human posture; the network structure is shown in Figure 2. The architecture of PAIN consists of four components: (1) pose feature extraction, where OpenPose obtains 2D skeleton key-point coordinates and an AGCN extracts the pose features; (2) image encoding, where image features are fused with pose features and deep feature learning is performed by a transformer encoder; (3) a three-branch transformer decoder for the human, object, and interaction subtasks, with pose features added to the human branch; and (4) relation fusion and prediction, where the CARM generates cross-relation contexts for relation inference, an attention fusion module passes the output of the CARM to each subtask for context exchange, and a quintuple HOI detection head directly outputs HOI instances.
3.1. Pose Feature Extraction
To extract posture features more accurately, OpenPose is used to generate the two-dimensional key-point coordinates of the human body [33]. The generated data follow the COCO format: for each human body, 25 key points are provided, and each key point consists of two-dimensional coordinates (x, y) [34]. A spatial graph (Figure 3) is then constructed: each key point of the human body is represented as a node, while the bones connecting these points are treated as edges. By employing this approach, multiple connected graphs can be derived, thereby leveraging the interdependencies among all nodes, as illustrated in Figure 3.
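As a concrete illustration, the sketch below builds a symmetric, degree-normalized adjacency matrix for such a skeleton graph. The edge list is a hypothetical subset of the OpenPose BODY_25 bones, and the $D^{-1/2} A D^{-1/2}$ normalization is our assumption rather than a detail fixed by the text above.

```python
import numpy as np

def build_skeleton_adjacency(edges, num_nodes):
    """Symmetric, degree-normalized adjacency for a skeleton graph
    whose nodes are keypoints and whose edges are bones."""
    A = np.eye(num_nodes, dtype=np.float32)      # self-loops
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    d_inv_sqrt = A.sum(axis=1) ** -0.5
    return A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]   # D^-1/2 A D^-1/2

# Hypothetical subset of BODY_25 bones (index order is illustrative).
EDGES = [(0, 1), (1, 2), (2, 3), (3, 4), (1, 5), (5, 6), (6, 7),
         (1, 8), (8, 9), (9, 10), (8, 12), (12, 13)]
A = build_skeleton_adjacency(EDGES, num_nodes=25)
```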
The adaptive graph convolutional network is used to extract the posture features from the position coordinates of all key points of the human body. The skeleton data are generally expressed as a second-order tensor $X \in \mathbb{R}^{C \times V}$, where $C$ and $V$ are the dimension of the key-point coordinates (here $C = 2$) and the total number of key points, respectively. The coordinates and the number of key points are taken as the data input $X$ to encode and embed the skeleton points. A ReLU activation function and an FC layer produce the graph-convolution input, denoted by $f_{in}$:

$$f_{in} = \sigma\big(\mathrm{FC}(X)\big), \qquad f_{in} \in \mathbb{R}^{C_{in} \times V},$$

where $C_{in}$ indicates the number of input channels of the graph convolutional network, $\sigma$ represents the ReLU activation function, and $\mathrm{FC}$ represents the fully connected layer.
Since the number of human bodies varies across HOI samples and the size of $X$ varies accordingly, the graph convolution module is computed similarly to 2s-AGCN [13], but the partition set of the spatial skeleton key points is set to 1, and the adjacency matrix is learned automatically from the data by the graph convolutional network. The input is then passed through three graph convolution layers in order, where each layer computes

$$f_{out} = W f_{in} \left( A + B + C \right),$$

where $W$ is the parameter matrix of the graph convolution, a weight vector obtained by a $1 \times 1$ convolution operation, and $A$ is the adjacency matrix. $B$ is a fully learnable adjacency matrix whose entries are initialized to 1 and adjusted adaptively during training; it represents the attention to the connection strength between two nodes.

The matrix $C$ is a data-dependent adjacency matrix, represented by the similarity between nodes, which determines whether two nodes are connected and how strong the connection is. The similarity between nodes is calculated with a normalized Gaussian function:

$$C = \mathrm{softmax}\big( f_{in}^{T} W_{\theta}^{T} W_{\phi} f_{in} \big),$$

where the normalized Gaussian calculation is equivalent to a SoftMax matrix operation, and $W_{\theta}$ and $W_{\phi}$ are the parameters of the Gaussian embedding functions. After three graph convolution operations, the result is the pose feature embedding $P$.
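A minimal PyTorch sketch of one such adaptive layer is given below, assuming the single-partition form $f_{out} = W f_{in}(A + B + C)$ derived above; the embedding width and the module interface are illustrative assumptions, not details fixed by the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveGraphConv(nn.Module):
    """One adaptive graph-convolution layer in the spirit of 2s-AGCN with a
    single spatial partition: f_out = ReLU(W f_in (A + B + C)).  A is the
    fixed skeleton adjacency, B is fully learnable (initialized to 1, as in
    the text), and C is a data-dependent adjacency from an embedded-Gaussian
    (softmax) similarity."""

    def __init__(self, in_ch, out_ch, A, embed_ch=16):
        super().__init__()
        self.register_buffer("A", A)                  # (V, V) fixed adjacency
        self.B = nn.Parameter(torch.ones_like(A))     # learnable adjacency
        self.theta = nn.Conv1d(in_ch, embed_ch, 1)    # Gaussian embedding W_theta
        self.phi = nn.Conv1d(in_ch, embed_ch, 1)      # Gaussian embedding W_phi
        self.W = nn.Conv1d(in_ch, out_ch, 1)          # 1x1-conv weight matrix W

    def forward(self, x):                             # x: (N, C_in, V)
        # C: pairwise node similarity, normalized with softmax over neighbors
        C = torch.softmax(
            torch.einsum("nev,new->nvw", self.theta(x), self.phi(x)), dim=-1)
        adj = self.A + self.B + C                     # broadcast to (N, V, V)
        return F.relu(self.W(torch.einsum("ncv,nvw->ncw", x, adj)))
```

Stacking three such layers in sequence (e.g., widening the channels from 2 to the model dimension) would yield the pose feature embedding $P$; the channel schedule is our choice.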
3.2. Image Encoding
The CNN backbone ResNet is used to extract visual features [35]: a 3-channel RGB image is fed into the backbone to generate a feature map of shape $C \times H_0 \times W_0$, where $H_0$ and $W_0$ represent the height and width of the feature map, respectively, and $C$ is the feature dimension. To reduce the feature dimension from $C$ to $d$, a $1 \times 1$ projection convolutional layer is used, generating a feature map of shape $d \times H \times W$, which has $d$ channels and spatial dimensions $H \times W$. To meet the input requirements of the subsequent transformer encoder for a feature sequence, we apply a flattening operator to collapse the spatial dimensions into a single dimension, yielding the flattened feature $Y \in \mathbb{R}^{d \times HW}$.
The encoder layer is built on top of the standard transformer architecture [17], and to enable it to distinguish relative positions in a sequence, positional encodings are added to the input of each attention layer. The sum of the flattened image features, the fused pose feature embedding $P$, and the positional encoding is fed into the transformer encoder to summarize the global information, generating the image tokens $T$ for the subsequent network.
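The following sketch illustrates this early-fusion step under our own shape assumptions: how $P$ is aligned with the token grid is not specified above, so the code simply broadcasts a per-image pose vector over all tokens.

```python
import torch
import torch.nn as nn

class ImagePoseEncoder(nn.Module):
    """Sketch: project backbone features to d channels, flatten them into a
    token sequence, add the pose embedding and positional encodings, and run
    a standard transformer encoder to produce the image tokens T."""

    def __init__(self, c_backbone=2048, d=256, nhead=8, num_layers=6):
        super().__init__()
        self.proj = nn.Conv2d(c_backbone, d, kernel_size=1)    # 1x1 projection
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, nhead, batch_first=True), num_layers)

    def forward(self, fmap, pose_emb, pos_emb):
        # fmap: (N, c_backbone, H, W) -> flattened tokens Y: (N, H*W, d)
        y = self.proj(fmap).flatten(2).transpose(1, 2)
        # Early fusion: pose_emb (N, 1, d) is broadcast over all tokens and
        # summed with the positional encodings pos_emb (N, H*W, d).
        return self.encoder(y + pose_emb + pos_emb)            # image tokens T
```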
3.3. Inference on Three-Branch HOI Decoding
3.3.1. HOI Decoder
The original single decoder can only deal with a single input feature; it can neither capture the complex interactions between people and objects simultaneously nor effectively use the interaction information between them. We therefore adopt a three-branch architecture [15] in which the branches are responsible for human detection, object detection, and interaction classification, respectively. Different from previous three-branch architectures, additional pose features are added to the human detection branch to better capture the influence of human pose on the instance. The human detection branch $D^{h}$ has $L$ layers, and its inputs are the query vector $Q^{h}$ and the fused feature $F$ of the image feature $Y$ and the pose feature embedding $P$. In each layer, the transformer decoder updates the input query vector through attention, generating an output $O^{h}_{l}$ that contains the contextual information for the prediction subtask of this branch:

$$O^{h}_{l} = f^{h}_{dec}\big(O^{h}_{l-1}, F\big), \qquad O^{h}_{0} = Q^{h}.$$
The object detection and interaction classification branches $D^{o}$ and $D^{act}$ also have $L$ layers; their inputs are the query vectors $Q^{o}$ and $Q^{act}$ and the image feature $Y$. In each layer, the transformer decoder updates the input query vector with attention to produce outputs that contain the contextual information for the prediction subtask of each branch:

$$O^{o}_{l} = f^{o}_{dec}\big(O^{o}_{l-1}, Y\big), \qquad O^{act}_{l} = f^{act}_{dec}\big(O^{act}_{l-1}, Y\big).$$
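A minimal sketch of the three parallel branches, using standard PyTorch decoder modules as stand-ins for $f_{dec}$; the layer internals and how the fused memory $F$ is formed are assumptions:

```python
import torch
import torch.nn as nn

def make_decoder(d=256, nhead=8, num_layers=6):
    return nn.TransformerDecoder(
        nn.TransformerDecoderLayer(d, nhead, batch_first=True), num_layers)

class ThreeBranchDecoder(nn.Module):
    """Sketch of the three parallel decoder branches; only the human branch
    attends to the pose-augmented memory F, the others attend to Y."""

    def __init__(self, d=256, num_queries=100):
        super().__init__()
        self.q_h = nn.Embedding(num_queries, d)    # per-branch learnable queries
        self.q_o = nn.Embedding(num_queries, d)
        self.q_a = nn.Embedding(num_queries, d)
        self.dec_h, self.dec_o, self.dec_a = (make_decoder(d) for _ in range(3))

    def forward(self, Y, F_fused):
        n = Y.size(0)
        expand = lambda q: q.weight.unsqueeze(0).expand(n, -1, -1)
        O_h = self.dec_h(expand(self.q_h), F_fused)   # human branch sees pose
        O_o = self.dec_o(expand(self.q_o), Y)         # object branch
        O_a = self.dec_a(expand(self.q_a), Y)         # interaction branch
        return O_h, O_o, O_a
```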
3.3.2. HOI Relationship Fusion
To fuse the output information generated by the three branches for their different tasks and to combine the relation information they carry, we propose the CARM, as shown in Figure 4. In this module, pairwise relations are formed by concatenating the separately generated unary relations, and a cross-attention mechanism performs relation reasoning over the unary, pairwise, and ternary relation groups so as to fully mine the useful information in the context.
Specifically, the outputs $O^{h}_{i}$, $O^{o}_{i}$, and $O^{act}_{i}$ of the decoder branches are concatenated and passed through MLP layers to generate the binary and ternary relation groups of the $i$-th HOI instance:

$$R^{xy}_{i} = \mathrm{MLP}\big([O^{x}_{i}; O^{y}_{i}]\big), \qquad R^{hoa}_{i} = \mathrm{MLP}\big([O^{h}_{i}; O^{o}_{i}; O^{act}_{i}]\big),$$

where $[\,\cdot\,;\,\cdot\,]$ represents the concatenation operation and $(x, y)$ ranges over the branch pairs. To exploit the detailed understanding of the subtasks carried by the unary relations, the joint attention of the pairwise groups over two subtasks, and the overall understanding of the HOI task carried by the ternary group, self-attention is first applied to the unary relations and the pairwise groups to explore the internal information of their sequences:

$$\tilde{O}^{x} = \mathrm{SA}\big(O^{x}\big), \qquad \tilde{R}^{xy} = \mathrm{SA}\big(R^{xy}\big).$$

The cross-attention mechanism is then used to fuse the unary and binary relations with the ternary relation context, respectively:

$$\hat{O}^{x} = \mathrm{CA}\big(\tilde{O}^{x}, R^{hoa}\big), \qquad \hat{R}^{xy} = \mathrm{CA}\big(\tilde{R}^{xy}, R^{hoa}\big).$$

The same cross-attention mechanism is used to fuse $\hat{O}^{x}$ and $\hat{R}^{xy}$ to produce $G$. Finally, by attending to the image feature $Y$, the transformed $G$ is used to produce the joint context output $C$:

$$C = \mathrm{CA}\big(G, Y\big).$$
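The sketch below mirrors this flow with standard attention modules; sharing one MLP across all pairwise groups and the exact fusion order are our simplifications, not details fixed by the equations above.

```python
import torch
import torch.nn as nn

class CARM(nn.Module):
    """Minimal sketch of the cross-attention relation fusion: unary branch
    outputs are concatenated into pairwise and ternary relation groups,
    refined by self-attention, fused with the ternary context by
    cross-attention, and finally attended over the image feature Y."""

    def __init__(self, d=256, nhead=8):
        super().__init__()
        self.pair_mlp = nn.Linear(2 * d, d)    # shared across pairs (simplification)
        self.tri_mlp = nn.Linear(3 * d, d)
        self.sa = nn.MultiheadAttention(d, nhead, batch_first=True)
        self.ca = nn.MultiheadAttention(d, nhead, batch_first=True)

    def fuse(self, q, kv):
        return self.ca(q, kv, kv)[0]           # cross-attention

    def forward(self, O_h, O_o, O_a, Y):
        R_ho = self.pair_mlp(torch.cat([O_h, O_o], -1))        # pairwise groups
        R_ha = self.pair_mlp(torch.cat([O_h, O_a], -1))
        R_oa = self.pair_mlp(torch.cat([O_o, O_a], -1))
        R_hoa = self.tri_mlp(torch.cat([O_h, O_o, O_a], -1))   # ternary group
        sa = lambda x: self.sa(x, x, x)[0]                     # intra-group SA
        fused = [self.fuse(sa(x), R_hoa)                       # CA with ternary
                 for x in (O_h, O_o, O_a, R_ho, R_ha, R_oa)]
        G = self.fuse(fused[0], torch.cat(fused[1:], dim=1))   # merge groups
        return self.fuse(G, Y)                                 # joint context C
```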
3.4. HOI Inference
An MLP combines the features of each specific task with the joint context output, so that the corresponding context information can be propagated according to the requirements of each subtask. We use channel attention to select the necessary context information for each subtask, and detailed features are then generated by propagating this context into the task-specific tokens. The channel attentions $A^{h}$, $A^{o}$, $A^{act}$ and the detailed features $\hat{T}^{h}$, $\hat{T}^{o}$, $\hat{T}^{act}$ generated for the human detection branch, the object detection branch, and the interaction branch are given by

$$A^{x} = \sigma\big(\mathrm{MLP}\big([O^{x}; C]\big)\big), \qquad \hat{T}^{x} = O^{x} + A^{x} \odot C, \qquad x \in \{h, o, act\},$$

where $\odot$ denotes element-wise multiplication, $\sigma$ denotes the sigmoid function, $A^{x}$ denotes the channel attention, and $\hat{T}^{x}$ denotes the refined tokens; $O^{x}$ and $C$ come from Equations (4) and (5).
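A short sketch of this gating step, directly following the equation above (the MLP width is an assumption):

```python
import torch
import torch.nn as nn

class ContextRefine(nn.Module):
    """Channel-attention gating of the joint context C for one task token,
    following  A = sigmoid(MLP([O; C])),  T_hat = O + A * C."""

    def __init__(self, d=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, O_x, C):
        A_x = torch.sigmoid(self.mlp(torch.cat([O_x, C], dim=-1)))  # channel attention
        return O_x + A_x * C                                        # refined token
```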
We define each HOI instance as a quintuple of (human, interaction, object, human box, object box) [11]. The last outputs $\hat{T}^{h}$, $\hat{T}^{o}$, and $\hat{T}^{act}$ of the branches are decoded into HOI instances. Five FFNs predict the HOI quintuple: the human confidence and human box are predicted from $\hat{T}^{h}$, the object confidence and object box from $\hat{T}^{o}$, and the interaction classification confidence from $\hat{T}^{act}$.
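For concreteness, a sketch of the five prediction heads; the class counts (80 objects plus background, 117 actions, matching HICO-DET) and the box parameterization are illustrative assumptions:

```python
import torch.nn as nn

class QuintupleHeads(nn.Module):
    """Five FFN heads decoding the refined tokens into an HOI quintuple:
    (human conf, human box, object conf, object box, interaction conf)."""

    def __init__(self, d=256, n_obj=80, n_act=117):
        super().__init__()
        def box_head():
            return nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 4))
        self.h_cls, self.h_box = nn.Linear(d, 2), box_head()   # person vs background
        self.o_cls, self.o_box = nn.Linear(d, n_obj + 1), box_head()
        self.a_cls = nn.Linear(d, n_act)

    def forward(self, T_h, T_o, T_a):
        return (self.h_cls(T_h), self.h_box(T_h).sigmoid(),    # boxes in [0, 1]
                self.o_cls(T_o), self.o_box(T_o).sigmoid(),
                self.a_cls(T_a))
```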
3.5. Loss Functions
We set the HOI instance as a quintuple $\big(c^{h}, c^{act}, c^{o}, b^{h}, b^{o}\big)$ [11], where $c^{h}$, $c^{act}$, and $c^{o}$ represent the confidences of the human, interaction, and object classes, and $b^{h}$ and $b^{o}$ are the bounding boxes of the human and the object. The interaction class is approximated by its predicted probability in a given dataset.
We denote the predicted HOIs as $\hat{\mathcal{H}}$ and the ground-truth HOIs as $\mathcal{H}$, with $M$ denoting the number of actual interactions in the image. The two sets can be made equal in length by padding the ground-truth set with background (no-interaction) elements.
In each training step, the best one-to-one match between the set of ground truths and the current set of predictions is found first. The following matching cost is designed for HOI:

$$\mathcal{L}_{match} = \alpha\, \mathcal{L}_{cls} + \beta\, \mathcal{L}_{box},$$

where $\mathcal{L}_{cls}$ denotes the classification loss; the losses for the human, object, and interaction classes are calculated with the standard SoftMax cross-entropy. $\mathcal{L}_{box}$ is the regression loss of the human and object boxes, a weighted sum of the GIoU loss and the $L_1$ loss. $\alpha$ and $\beta$ are the loss-weight hyperparameters that govern the classification weight and the localization weight, respectively. In the matching process, classification plays a more important role than localization, so $\alpha > \beta$. Since a human appears in every HOI instance, person classification is assumed to be the easiest and is given the smallest class weight, while the interaction is weighted more heavily than the object.
Denote the match as an injective function $\sigma: \{1, \ldots, M\} \to \{1, \ldots, N\}$, where $\sigma(i)$ is the index of the predicted HOI assigned to the $i$-th ground truth. The optimal matching is defined as

$$\hat{\sigma} = \arg\min_{\sigma} \sum_{i=1}^{M} \mathcal{L}_{match}\big(\mathcal{H}_{i}, \hat{\mathcal{H}}_{\sigma(i)}\big),$$

where $\mathcal{L}_{match}$ is the matching cost between the true label and the prediction. Finally, the Hungarian algorithm [36] is used to find the bipartite matching.
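In practice, the bipartite matching can be computed with an off-the-shelf solver. The toy sketch below uses SciPy's linear_sum_assignment on a precomputed cost matrix; building the cost matrix itself from the losses above is omitted.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(cost):
    """cost[i, j]: matching cost between ground-truth HOI i and prediction j.
    Returns the injective assignment sigma minimizing the total cost."""
    gt_idx, pred_idx = linear_sum_assignment(cost)
    return dict(zip(gt_idx.tolist(), pred_idx.tolist()))

# Toy example: 2 ground-truth HOIs, 3 predictions.
cost = np.array([[0.9, 0.2, 0.7],
                 [0.4, 0.8, 0.1]])
print(hungarian_match(cost))   # {0: 1, 1: 2}
```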
4. Experiments
4.1. Dataset and Evaluation Metrics
Experiments are carried out on HICO-DET [16] and V-COCO [37], two datasets that are widely used in HOI detection. V-COCO is a subset of MS-COCO and consists of 5400 images in the trainval set and 4946 images in the test set. It annotates binary labels for twenty-nine different action categories (five of which do not involve an associated object) and also contains 80 object categories. The proposed method is evaluated under the Scenario 1 and Scenario 2 settings, and the role average precision in both scenarios ($AP_{role}^{\#1}$ in Scenario 1 and $AP_{role}^{\#2}$ in Scenario 2) is reported. In Scenario 1, the model must also predict the bounding box of an occluded object, whereas in Scenario 2 the predicted bounding box of an occluded object is ignored. HICO-DET contains 47,776 images (38,118 for training and 9658 for testing) and includes more than 150 K annotated human–object pairs, covering 80 object categories, 117 action categories, and 600 HOI triplets. The full split contains all 600 HOI classes; the 138 classes with fewer than 10 training instances form the rare split, and the remaining 462 classes form the non-rare split.
In accordance with standard evaluation guidelines, we use the commonly adopted role mean average precision (mAP) to analyze the performance of the model. From the P–R curve, a numerical metric is obtained by averaging the precision values corresponding to each recall value: the Average Precision (AP) measures how well the trained model detects the class of interest. In the human–object interaction task, an HOI detection result is counted as a true positive only if the Intersection over Union (IoU) between each predicted bounding box and its corresponding ground-truth bounding box is greater than 0.5 and both the object class and the action class are predicted correctly.
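A minimal sketch of this true-positive criterion, assuming boxes in (x1, y1, x2, y2) format and dictionary-style predictions (the field names are hypothetical):

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(pred, gt, thr=0.5):
    """True positive only if both human and object boxes exceed the IoU
    threshold and the object and action classes both match."""
    return (iou(pred["h_box"], gt["h_box"]) > thr
            and iou(pred["o_box"], gt["o_box"]) > thr
            and pred["obj"] == gt["obj"] and pred["act"] == gt["act"])
```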
4.2. Implementation Details
The experiments on PAIN are carried out with the ResNet-50 backbone. The model is trained with AdamW [35], with the transformer learning rate set to 1e−4, the backbone learning rate to 1e−5, and the weight decay to 1e−4. The numbers of encoding and decoding layers are both set to six; the number of queries N is set to 64 for HICO-DET and 100 for V-COCO. The model is trained for 150 epochs, and the learning rate decays at epoch 90. The DETR [38] model pre-trained on MS-COCO is used to initialize the weights of the backbone and the transformer encoder–decoder. The batch size is set to four.
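These settings correspond to a standard two-group optimizer configuration; the sketch below reproduces them, assuming backbone parameters are identifiable by name and a decay factor of 0.1 (the factor is our assumption, following common DETR practice):

```python
import torch

def build_optimizer(model, lr=1e-4, lr_backbone=1e-5, weight_decay=1e-4):
    """AdamW with separate learning rates for the backbone and the rest of
    the network, plus a step decay at epoch 90."""
    backbone, rest = [], []
    for name, p in model.named_parameters():
        (backbone if "backbone" in name else rest).append(p)
    optimizer = torch.optim.AdamW(
        [{"params": rest, "lr": lr},
         {"params": backbone, "lr": lr_backbone}],
        weight_decay=weight_decay)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=90, gamma=0.1)
    return optimizer, scheduler
```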
4.3. Results Analysis
In the experimental part of this paper, we conduct a systematic evaluation on two important benchmark datasets, V-COCO and HICO-DET, to verify the effectiveness and superiority of the proposed method. The experimental results on the V-COCO dataset are shown in Table 1, where the comparison with the baseline method HOITrans [11] (the first end-to-end human–object interaction model) and other recent methods is listed in detail. Similarly, Table 2 lists the experimental results on the HICO-DET dataset. To ensure comparability, all experiments adopt ResNet-50 as the backbone network. For the HICO-DET dataset, the object detector is pre-trained on MS-COCO.
Through the integration of posture and the introduction of the three-branch structure of human, object, and interaction, the model can better capture the subtle changes related to interaction and more complex interaction patterns. With ResNet-50 as the backbone, the proposed method improves over the single-branch decoder method HOI-Trans on the V-COCO dataset by 13.36 percentage points in Scenario 1 (51.15→64.51) and reaches 66.42% in Scenario 2, and it improves by 7.37 percentage points (23.46→30.83) on HICO-DET. In particular, there is a significant improvement in the rare category, which indicates a potential advantage of the model when dealing with sparse data: different actions may have similar pose or visual features, and in some scenes the human body is occluded, so combining pose features improves the robustness of the model in these cases. The method in this paper thus makes substantial improvements on the two benchmarks, confirming the model's effectiveness.
The comparison between the proposed model and the baseline model on the V-COCO validation set is shown in Figures 5 and 6. The results show that during training, the total training loss, GIoU loss, bounding-box loss, and behavioral cross-entropy loss of the proposed model all converge faster than those of the original model; the proposed method is more efficient in learning and adapts quickly to the characteristics of the data. In the inference stage, four images are randomly selected for interactive action detection after the improvement. The bar-chart comparison shows that the confidence scores for the recognized actions in the four images improve considerably, and the randomly selected samples show a significant improvement in action recognition, indicating that the proposed method performs better in complex HOI inference.
4.4. Ablation Study
We conducted ablation experiments on the V-COCO dataset to evaluate the performance of the model under different feature fusion strategies and to verify the effectiveness of early fusion, removing or replacing the key components of feature fusion one by one to analyze their impact on the final performance of the model.
4.4.1. Feature Fusion Verification
To verify the effectiveness of different ways of fusing the posture features, the posture features are added at different positions and their influence on the results is observed. The results in Table 3 demonstrate that the configuration represented in Figure 7f is the most effective. For HOI detection, timely use of pose information is crucial to understanding complex scenes and action sequences. Passing pose features to the encoder in advance enables the self-attention mechanism to conduct more accurate context modeling based on these features, and adding posture features to the human branch rather than the other branches enhances the expression of human-related features and reduces redundant features, thereby improving the overall performance of the model.
4.4.2. Component Impact Analysis
In our ablation experiments, the impact of adding each component separately on the accuracy of the model was first evaluated. The experimental results are shown in Table 4: adding only the posture features improves the accuracy of the model by 0.87%, adding only the CARM improves it by 1.8%, and adding the pose features and the CARM together improves the accuracy by 2.57% in total.
These results show that pose features make a significant contribution in capturing action details, and the CARM module further enhances the feature representation ability of the model. Combining the capabilities of the two modules, the overall performance of the model is significantly improved. This further verifies the effectiveness and importance of the feature fusion strategy in improving complex action recognition tasks.
5. Conclusions
This paper designs a Pose-Aware Interaction Network (PAIN) to improve HOI detection. Some previous methods tend to detect target regions with high confidence while ignoring the regions where interactive actions occur; at the same time, they lack intuitive interaction relationships and the strongly topological human pose. To solve these problems, pose features are extracted and fused with image features, a three-branch decoder is used to obtain richer relation contexts, and the posture features are integrated into the human detection branch to better capture the potential action features of the human body. The CARM is designed to fuse the information of the three branches and provide more detailed HOI information. Experimental results on the V-COCO and HICO-DET datasets show that PAIN is superior to all two-stage methods and common one-stage methods, achieving 64.51% $AP_{role}^{\#1}$ and 66.42% $AP_{role}^{\#2}$ on the public dataset V-COCO and 30.83% AP on HICO-DET.
HOI detection can be computationally demanding, especially when processing high-definition video streams from multiple cameras. In future work, we plan to further extend the proposed method for HOI detection and explore the application of 3D human pose information, which can provide a richer context to help the model better understand and recognize complex interactive behaviors. Effectively fusing 2D image features with 3D human pose information is technically difficult, as the correspondence between them must be ensured; at the same time, introducing 3D pose data will increase the computational burden, especially in real-time applications, where accuracy and processing speed need to be balanced. Edge-computing strategies can be used to distribute the computational load and minimize network latency by performing pre-processing and preliminary processing near the data source. Through these enhancements, we hope to reconstruct dynamic scenes and provide a richer environmental context for the community so as to realize more comprehensive instance analysis and understanding.