Article

Linguistic-Driven Partial Semantic Relevance Learning for Skeleton-Based Action Recognition

1 College of Information Science and Technology, Nanjing Forestry University, Nanjing 210037, China
2 School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
* Author to whom correspondence should be addressed.
Sensors 2024, 24(15), 4860; https://doi.org/10.3390/s24154860
Submission received: 21 June 2024 / Revised: 16 July 2024 / Accepted: 23 July 2024 / Published: 26 July 2024
(This article belongs to the Section Biomedical Sensors)

Abstract

Skeleton-based action recognition, renowned for its computational efficiency and robustness to lighting variations, has become a focal point in motion analysis. However, most current methods extract only global skeleton features, overlooking the potential semantic relationships among partial limb motions. For instance, subtle differences between actions such as “brush teeth” and “brush hair” are mainly distinguished by specific body parts. Although combining limb movements provides a more holistic representation of an action, relying solely on skeleton points proves inadequate for capturing these nuances. This motivates us to integrate fine-grained linguistic descriptions into the learning process of skeleton features to capture more discriminative skeleton behavior representations. To this end, we introduce a new Linguistic-Driven Partial Semantic Relevance Learning (LPSR) framework. We use a state-of-the-art large language model to generate linguistic descriptions of local limb motions, which further constrain the learning of local motions, and we also aggregate global skeleton point representations with the LLM-generated textual representations to obtain a more generalized cross-modal behavioral representation. On this basis, we propose a cyclic attention interaction module to model the implicit correlations between partial limb motions. Extensive ablation experiments demonstrate the effectiveness of the proposed method, which also obtains state-of-the-art results.

1. Introduction

Action recognition [1,2,3,4,5] constitutes a pivotal branch within the computer vision field, dedicated to identifying human or object behaviors and actions through the analysis of visual information contained in video sequences or real-time video streams. This technology plays a crucial role in diverse applications such as human–computer interaction [6,7,8,9,10], health rehabilitation [11,12,13], and sports analysis [14,15,16]. The advent of depth sensors, exemplified by the Kinect [17], has facilitated easy access to human skeleton joint data. Currently, skeleton-based action recognition has garnered substantial interest for its computational efficiency and inherent robustness against variations in lighting conditions, viewpoints, and background noise.
Research on skeleton-based action recognition [18,19,20,21] can be broadly categorized by network architecture into four types: methods based on Recurrent Neural Networks (RNNs) [22,23,24], methods based on Convolutional Neural Networks (CNNs) [15,25], methods based on Graph Convolutional Networks (GCNs) [26,27,28], and Transformer-based methods [9,19,29]. A frequently employed pipeline converts raw skeleton data into point-sequence or graph-structured formats and then applies the aforementioned deep learning techniques for feature extraction. RNN-based methods [30,31] recursively process data sequences and effectively capture temporal dependencies, but may struggle with complex spatio-temporal data and long-term dependencies. CNN-based methods [18,32] perform convolutional operations within designated spatial or spatio-temporal windows to progressively extract higher-level features, exhibiting translation invariance. GCN-based methods [33,34,35,36] leverage the graph topology of the human skeleton to capture the relationships between different nodes. However, this approach is constrained in its ability to identify relationships between nodes that are not directly edge-connected (e.g., “head” and “feet”). Transformer-based methods [20,29] benefit from the self-attention mechanism, offering advantages in modeling long-distance dependencies and non-adjacent nodes, and have gradually become one of the most popular research frameworks in the community. Consequently, this work aims to explore a more effective Transformer-based skeleton activity representation (Figure 1).
To enhance the skeleton-based activity representation, researchers often introduce additional modalities, such as video (RGB) and depth image sequences [37,38,39], as supplementary information. Nevertheless, processing these additional modalities incurs extra computational overhead. Therefore, we seek a learning strategy that balances performance and cost while effectively representing skeleton activity. Xiang et al. [21] proposed a cross-modal skeleton activity recognition method called Generative Action-description Prompts (GAP), which introduces a pre-trained large language model to generate textual descriptions of body-part actions and uses them as supervisory information to constrain the optimization of different body parts in the skeleton modality. On the one hand, GAP prompts further reflection on the role of textual descriptions in skeleton-based action recognition. There are visual semantic similarities among different body actions; for instance, “side kick” and “kicking” both involve leg movements, but skeleton data alone fails to effectively capture the nuanced motion patterns of these fine-grained behaviors [4]. Language, however, can provide a more nuanced and discerning form of guidance. On the other hand, there is implicit synergy among local body movements when a specific action occurs; for instance, the “head” and “hands” undergo simultaneous spatio-temporal displacements during the action “sneeze”. Consequently, how to sufficiently mine the semantic associations among these local body movements poses a significant challenge.
To alleviate the above two problems, we propose a fine-grained cross-modal skeleton action recognition approach, namely Linguistic-Driven Partial Semantic Relevance Learning (LPSR), which consists of two major components: the Partial Semantic Consistency Constraints (PSCC) and the Cyclic Attention Interaction Module (CAIM). In PSCC, we leverage a current state-of-the-art large language model to generate detailed local body movement descriptions, as well as a global description of the action, using skeleton point visualizations and text labels as inputs. Multiple local body descriptions guide the model to learn finer-grained representations of skeleton body movements, where a Kullback–Leibler (KL) consistency loss is used to construct local semantic consistency associations across modalities. The global textual description is then associated (as key and value) with the global skeleton feature via cross-attention to learn a more discriminative action feature. Furthermore, considering the semantic synergy between local body movements, we design the CAIM module to model the implicit relations between them. The local body parts studied in this paper are the “head”, “arm”, “hand”, “hip”, “leg”, and “foot”; this selection follows the dataset's division of the human body into 25 joints, which we use to segment the body into local parts. In summary, the main contributions of this paper are summarized as follows:
  • We propose a novel Linguistic-Driven Partial Semantic Relevance Learning framework (LPSR) for skeleton-based action recognition. The framework leverages the powerful zero-shot capability of multi-modal large language models to generate global and local textual descriptions of skeleton actions, and further constructs cross-modal partial semantic consistency constraints to guide the model toward learning a more discriminative representation;
  • We propose a Cyclic Attention Interaction Module (CAIM) to mine the implicit semantic associations between different body movements, fully exploiting the potential of synergistic relationships of local body movements in global action understanding.
  • We conduct extensive ablation studies on two popular benchmarks, NTU-60 and NTU-120, and the experimental results demonstrate the effectiveness of the proposed method. In addition, compared with previous Transformer-based methods, our method achieves state-of-the-art results under the same setup conditions.

2. Related Works

Skeleton-based Action Recognition. Skeleton-based Action Recognition [35,40,41,42] is a technique for recognizing human movements by capturing and analyzing the movements of human skeleton joints. Human joint trajectories [27,43] offer a detailed perspective on human movement, largely due to the spatial information they encompass and their strong correlation with adjacent joint nodes. However, representing skeleton information has its challenges: it is often sparse and noisy. This sparsity becomes evident when distinguishing between similar actions, like ‘brushing teeth’ and ‘brushing hair’, which are almost identical in body movement and heavily rely on hand movements for accurate identification [4]. Recently, deep learning, propelled by advances in high-performance computing and technology, has shown remarkable capabilities in extracting complex features. One area where deep learning is particularly effective is in processing time-series data through Recurrent Neural Networks (RNNs) [44,45,46,47]. RNNs excel in learning dynamic dependencies within such data. However, they face limitations in modeling spatial dependencies among skeleton joints. To address this, Du et al. [24] proposed an innovative solution: an end-to-end hierarchical RNN framework. Complementing this approach, Yang et al. [48] introduced the concept of group sparse regularization. This technique centers on investigating the concurrent characteristics of skeleton joints, providing a more profound comprehension of their interrelations.
In addition to RNN-based approaches, Convolutional Neural Networks (CNNs) [25,32,43,49] are well-regarded for their excellent capability in extracting features and learning spatial dimensions, and have been successfully utilized to process spatio-temporal data in skeleton analyses. Wang et al. [43] and Li et al. [18] encode the skeleton sequence data into an image and then feed it into a CNN for action recognition, giving a skeleton spectrogram and a joint trajectory map, respectively. Wang et al. [50] converted skeleton joints into multiple 2D pseudo-images to suit the CNN’s input needs, enabling the network to capture spatio-temporal characteristics. Additionally, Xu et al. [49] introduced a solely CNN-based structure known as Topology-aware CNN, designed to enhance the modeling of irregular skeleton topologies by CNNs.
Yet, the aforementioned techniques struggle to grasp inter-joint correlations. Yan et al. [26] depict the human body as a graph, characterizing joint connections with an adjacency matrix, and introduce the Spatio-Temporal Graph Convolutional Network (ST-GCN), which applies convolution along both the temporal and spatial dimensions to model skeleton data efficiently. In addition, combining semantic information of human joints and frames [21,51] has been shown to enrich the expressiveness of skeleton features, thus improving recognition accuracy. Diverging from these graph-centric methods, our approach models skeleton data using Linguistic-Driven Partial Semantic Relevance Learning, offering a distinctive outlook that could yield novel insights and advancements in action recognition and pose estimation.
Transformer-based Action Recognition. In recent years, there has been a notable shift in Natural Language Processing (NLP) [1,51,52] towards the adoption of Transformer structures [53] as a replacement for traditional network architectures. Due to the powerful long-range temporal modeling capabilities of Transformers with self-attention modules, there has been a growing interest in utilizing Transformers for action recognition tasks. While most existing approaches in this area utilize video frames as input tokens [54,55], a limited number of techniques integrate skeleton data [9,19] within the Transformer architecture. Nonetheless, the computational demands for Transformer-based action recognition are substantial, given the self-attention mechanism’s application to numerous 3D tokens in videos. Self-attention is becoming increasingly popular in computer vision and has been applied to a variety of tasks, including image classification and segmentation [56,57], object detection [58], and action recognition [20,52]. In video action recognition, ref. [52] used self-attention to learn spatio-temporal features from frame-level patch sequences. Ref. [20] uses self-attention in skeleton-based action recognition instead of regular graph convolution. In contrast, our approach solely relies on self-attention to model skeleton data and calculates the correlation of all joints across multiple consecutive frames simultaneously.
Language Model in Skeleton-Based Action Recognition. Significant progress has been made in advanced natural language processing systems based on deep learning techniques with the introduction of models such as Bidirectional Encoder Representations from Transformers (BERT) [59]. These models are pre-trained to understand and generate complex text [60,61,62,63], capturing linguistic nuances and deeper meanings. Despite its effectiveness, the application of BERT was initially constrained to single-task adaptations, which limited its efficiency. In response to this limitation, the concept of Prompt Learning (PL) was introduced. This technique [63,64] enhances the adaptability of pre-trained LLMs to multiple tasks by adding specific textual parameters to the model’s input.
The principles of PL and transformer-based learning have been extended to skeleton-based action recognition. A notable example is GAP [21], which uses the Contrastive Language–Image Pretraining (CLIP) training paradigm for skeleton action recognition and incorporates an additional transformer layer that significantly improves skeleton-based action recognition. In this framework, a prompt learning (PL) technique is employed to construct skeleton-to-text correspondences, i.e., textual prompts are used to let GPT-3 [61] generate detailed descriptions for different skeleton action categories for multi-modal representation learning. This advancement demonstrates the great potential of transformer-based modeling and PL techniques for enhancing human action understanding and recognition using skeleton data. In contrast, we use GPT-4 [60] as a knowledge engine to enhance the understanding of actions: textual prompts and intuitive motion dynamics diagrams are input to generate global descriptions of human motion and local descriptions of the different limb motions within an action, further guiding local behavioral learning and thus improving the quality of the learned representations. In addition, we aggregate global skeleton point representations and textual representations to form a cross-modal behavioral representation with broader applicability.

3. Methods

In this section, we first introduce the general framework for Linguistic-Driven Partial Semantic Relevance Learning in the sub-section Overview. Then, we will elaborate on the Cyclic Attention Interaction Module (CAIM) and Partial Semantic Consistency Constraints (PSCC) in detail, respectively.

3.1. Overview

In this work, we propose a novel Linguistic-Driven Partial Semantic Relevance Learning framework for skeleton action recognition (shown in Figure 2), which contains two major sub-components: Cyclic Attention Interaction Module (CAIM) and Partial Semantic Consistency Constraints (PSCC).
For a given skeleton input $X_{\mathrm{org}}$, in CAIM we first extract the global skeleton feature $S_g$ through a skeleton encoder and obtain the local partial features $S_l$ based on node information. A cyclic attention strategy is designed to mine the potential relationships between partial limb motions, and the output after local feature interaction is $f_l$; the partial limb features are then aggregated to obtain $\tilde{f}_l$. In PSCC, we use the text labels $T$ as well as $X_{\mathrm{org}}$ as inputs to generate global and local descriptions $T_g$ and $T_l$, which are passed through a pre-trained text encoder to obtain the encoded features $f_g$ and $f_l$. We exploit these more discriminative textual descriptions to guide the learning of partial limb motions; specifically, a KL loss is used to construct local consistency constraints across modalities. In addition, we correlate the global textual feature with the global skeleton feature by cross-attention to obtain $\hat{f}_g$, which is fused with $\tilde{f}_l$ to obtain $f_{gl}$, and $f_{gl}$ is used to compute the classification objective. Finally, the total optimization objective $\mathcal{L}_{total}$ is obtained.

3.2. Cyclic Attention Interaction Module

Specifically, the original skeleton sequence input is $X_{\mathrm{org}} \in \mathbb{R}^{C \times T \times V}$ with $T$ frames and $V$ joints. Following [19], we first expand $X_{\mathrm{org}}$ to $X_1 \in \mathbb{R}^{C_1 \times T \times V}$ in the channel dimension using a feature mapping layer (implemented by a Conv2d layer + a BatchNorm layer + a LeakyReLU layer). The expanded $X_1$ is then fed into a spatio-temporal tuple encoding layer after a sequence division operation, with output $X \in \mathbb{R}^{C_1 \times T \times V_1}$. Next, the global skeleton feature $S_g$ is extracted by
$$ S_g = \Lambda(\Upsilon(X) + \mathrm{PE}) \quad (1) $$
where $\mathrm{PE}$ is a sine-and-cosine positional embedding, $\Lambda(\cdot)$ represents the ViT-based skeleton encoder, and $\Upsilon(\cdot)$ converts $X$ into the Query, Key, and Value inputs of $\Lambda$.
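To make the feature mapping step concrete, the following is a minimal PyTorch sketch of a Conv2d + BatchNorm + LeakyReLU channel-expansion layer consistent with the description above; the kernel size, channel widths, and module name are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class FeatureMapping(nn.Module):
    """Channel-expansion layer for the raw skeleton input X_org of shape
    (B, C, T, V): Conv2d + BatchNorm + LeakyReLU, as described above.
    Kernel size and channel widths are illustrative assumptions."""
    def __init__(self, in_channels: int = 3, out_channels: int = 64):
        super().__init__()
        self.mapping = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1),
            nn.BatchNorm2d(out_channels),
            nn.LeakyReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, C, T, V) -> (batch, C1, T, V)
        return self.mapping(x)

# Example: a 64-frame, 25-joint skeleton sequence with 3D joint coordinates.
x_org = torch.randn(2, 3, 64, 25)
x1 = FeatureMapping()(x_org)   # -> (2, 64, 64, 25)
```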
As mentioned earlier, there is an implicit connection between different body parts during an action being performed. Therefore, further exploration of the potential relationships between these local movements of body parts may contribute to a better understanding of skeleton action representations. To this end, we first utilize node information to refine the global feature S g to K local partial features, which can be formulated as
$$ S_l = \mathrm{RF}(S_g, \mathrm{Info}_K) \quad (2) $$
where $\mathrm{Info}_K$ denotes the set of body parts, i.e., {head, hands, arms, hips, legs, feet}, and $\mathrm{RF}(\cdot)$ denotes the refinement operation. The output $S_l = \{S_l^1, S_l^2, \ldots, S_l^K\}$ with $K = 6$ is the set of partial limb motion features, where each $S_l^i \in \mathbb{R}^{C \times T \times V_i}$, $i \in \{1, 2, \ldots, K\}$.
Furthermore, to mine the implicit synergies between partial limb motions, we design a cyclic attention strategy to learn the relation between each partial limb motion and others, shown as
$$ f_l = \mathrm{CycAttn}(S_l^i, S_l^{j \neq i}, \eta), \quad (i, j) \in [1, K] \quad (3) $$
where $\mathrm{CycAttn}(\cdot)$ is a cyclic attention operation implemented by several cross-attention layers with a cyclic mechanism in which each limb motion in turn serves as the query while the others serve as the key and value, and $\eta$ denotes the number of attention layers. The specific process is shown in Algorithm 1. As a result, the interacted local features are refined as $f_l = \{f_l^1, f_l^2, \ldots, f_l^K\}$, $K = 6$.
Algorithm 1: Cyclic Attention $\mathrm{CycAttn}(S_l)$
Input: Partial limb motion features $S_l = \{S_l^1, S_l^2, \ldots, S_l^K\}$
for each $i \in [1, K]$ do:
     1. Calculate $S_{rest} = \mathrm{Concatenate}(S_l \setminus \{S_l^i\})$;
     2. Calculate Query, Key, and Value: $S_{query} = W_q S_l^i$, $S_{key} = W_k S_{rest}$, $S_{value} = W_v S_{rest}$;
     3. Calculate $f_l^i = \mathrm{CrossAttn}(S_{query}, S_{key}, S_{value}) = \mathrm{SoftMax}\!\left(\frac{S_{query} S_{key}^{T}}{\sqrt{d}}\right) S_{value}$;
     4. $i \leftarrow i + 1$.
end
where $\mathrm{Concatenate}$ splices the partial limb motion features other than $S_l^i$, $W_q$, $W_k$, and $W_v$ denote the projection weights, and $d$ is the channel dimension of $S_{query}$.
Output: Interacted local features $f_l = \{f_l^1, f_l^2, \ldots, f_l^K\}$.
Each local feature in (3) thus captures the local motions most relevant to its own limb; next, we aggregate these local skeleton features by
$$ \tilde{f}_l = \frac{1}{K} \sum_{i=1}^{K} \mathrm{AvgPool}(f_l^i \mid T, V), \quad i \in [1, K] \quad (4) $$
where $\mathrm{AvgPool}(\cdot)$ is a fusion function that aggregates each partial limb feature over the temporal ($T$) and joint ($V$) dimensions, and $\tilde{f}_l$ is involved in the calculation of the final classification loss.
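For illustration, a hedged PyTorch sketch of the cyclic attention interaction (Algorithm 1) and the aggregation of Equation (4) is given below; the use of nn.MultiheadAttention, the head count, the number of layers, and the feature shapes are assumptions made for clarity and are not claimed to match the authors' implementation.

```python
import torch
import torch.nn as nn

class CyclicAttentionInteraction(nn.Module):
    """Sketch of CAIM (Algorithm 1): each partial limb feature serves in turn
    as the query and attends to the concatenation of the remaining limb
    features (key/value). Shapes, head count, and eta are assumptions."""
    def __init__(self, dim: int = 256, num_heads: int = 4, eta: int = 1):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(eta)                    # eta = number of attention layers
        ])

    def forward(self, parts):
        # parts: list of K tensors, each (B, N_i, dim) with N_i limb-specific tokens
        interacted = []
        for i, query in enumerate(parts):
            rest = torch.cat([p for j, p in enumerate(parts) if j != i], dim=1)
            f_i = query
            for attn in self.layers:
                f_i, _ = attn(f_i, rest, rest)     # cross-attention over the other limbs
            interacted.append(f_i)
        # Eq. (4): average-pool each part over its tokens, then average over the K parts.
        f_l_tilde = torch.stack([f.mean(dim=1) for f in interacted]).mean(dim=0)
        return interacted, f_l_tilde

# Example with K = 6 body parts and toy per-part token counts.
parts = [torch.randn(2, n, 256) for n in (3, 4, 4, 2, 5, 4)]
f_l, f_l_tilde = CyclicAttentionInteraction()(parts)   # f_l_tilde: (2, 256)
```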

3.3. Partial Semantic Consistency Constraints

Although existing large language models demonstrate impressive zero-shot generation capabilities, purely text-based models are constrained to the generation and expansion of the linguistic modality and are unable to generate reasonably accurate captions for specific visual content. Therefore, in this work, we introduce a multi-modal large model that takes a dynamic visualization of the skeleton data as its visual input. By designing specific linguistic prompts, we generate descriptions related to the global action and to partial limb motions, respectively. Specifically, for a given skeleton sequence input $X_{\mathrm{org}}$, we first convert it into a dynamic 3D visualization $X_{\mathrm{dg}}$ that shows the motion process intuitively (refer to the visual input in Figure 2).
On this basis, we design two specific linguistic prompts targeting global actions and local limb motions, respectively, to generate more precise textual descriptions, which can be formulated as follows:
$$ T_g = G(X_{\mathrm{dg}}, T, P_{\mathrm{global}}) \quad (5) $$
and
$$ T_l = G(X_{\mathrm{dg}}, T, P_{\mathrm{local}}) \quad (6) $$
where $G(\cdot)$ indicates a multi-modal description generator, implemented by GPT-4 in this work, which offers substantially stronger multi-modal understanding than previous versions. $T$ denotes the corresponding text labels, and $P_{\mathrm{global}}$ and $P_{\mathrm{local}}$ represent the global action prompt and the local limb motion prompt, respectively. The local output $T_l = \{\psi_1, \psi_2, \ldots, \psi_k\}$ ($k \in [1, 6]$) corresponds to the body parts [“head”, “arm”, “hand”, “hip”, “leg”, “foot”]. The detailed contents of $T_g$, $T_l$, $P_{\mathrm{global}}$, and $P_{\mathrm{local}}$ are shown in Figure 3, exemplified by the action category “opening a bottle”.
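The exact prompts are shown in Figure 3; since they are not reproduced in the running text, the snippet below only sketches the general shape that a global and a local prompt could take. The wording, variable names, and the one-sentence constraint are hypothetical illustrations, not the authors' actual prompts.

```python
# Hypothetical prompt templates illustrating the roles of P_global and P_local;
# the prompts actually used with GPT-4 are those shown in Figure 3.
ACTION_LABEL = "opening a bottle"
BODY_PARTS = ["head", "arm", "hand", "hip", "leg", "foot"]

P_GLOBAL = (
    "Given the attached 3D skeleton motion visualization of the action "
    f"'{ACTION_LABEL}', describe the overall body movement in one sentence."
)

P_LOCAL = [
    f"For the action '{ACTION_LABEL}', describe in one short sentence how the "
    f"{part} moves during the action."
    for part in BODY_PARTS
]
```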
The generated global and local detailed textual descriptions $T_g$ and $T_l$ are then encoded by
$$ f_g = \mathrm{LE}_g(T_g), \quad f_l = \mathrm{LE}_l(T_l) \quad (7) $$
where $\mathrm{LE}_g(\cdot)$ and $\mathrm{LE}_l(\cdot)$ are frozen pre-trained language encoders that share parameters with each other, and $f_g$ and $f_l$ are the global and local textual features, respectively.
Considering the similarities or ambiguities in the visual semantics between different actions, we introduce a partial semantic consistency strategy that utilizes the generated fine-grained local limb description as supervisory signals to guide the model in learning more discriminative representations of the partial limb motions:
$$ \mathcal{L}_{cts} = \frac{1}{K} \sum_{i=1}^{K} \mathrm{KL}(S_l^i, f_l^i, y_l^i) \quad (8) $$
where $\mathcal{L}_{cts}$ represents the partial semantic consistency constraint, $\mathrm{KL}(\cdot)$ is a standard KL contrastive loss, $S_l^i$ and $f_l^i$ denote the $i$-th partial skeleton and textual features, respectively, and $y_l^i$ is the corresponding label for the KL function. We employ the KL divergence to enforce cross-modal alignment for the partial limb motions.
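The paper does not spell out the exact form of $\mathrm{KL}(\cdot)$ beyond calling it a KL contrastive loss, so the following is only a plausible sketch in the spirit of CLIP/GAP-style training: per body part, the skeleton-to-text similarity distribution over the batch is matched against a label distribution via KL divergence. The temperature, feature normalization, and target construction are assumptions.

```python
import torch
import torch.nn.functional as F

def partial_consistency_loss(skel_parts, text_parts, targets, temperature=0.1):
    """Hedged sketch of Eq. (8). skel_parts / text_parts: lists of K tensors of
    shape (B, D) holding pooled per-part skeleton and text features; targets:
    (B, B) label distribution (rows sum to 1, e.g., one-hot per sample)."""
    loss = 0.0
    for s, t in zip(skel_parts, text_parts):
        s = F.normalize(s, dim=-1)
        t = F.normalize(t, dim=-1)
        logits = s @ t.T / temperature              # (B, B) skeleton-to-text similarities
        log_pred = F.log_softmax(logits, dim=-1)
        loss = loss + F.kl_div(log_pred, targets, reduction="batchmean")
    return loss / len(skel_parts)

# Example: K = 6 parts, batch of 4, 256-dim features, hard one-hot targets.
K, B, D = 6, 4, 256
skel = [torch.randn(B, D) for _ in range(K)]
text = [torch.randn(B, D) for _ in range(K)]
l_cts = partial_consistency_loss(skel, text, torch.eye(B))
```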

3.4. Total Objective

In the CAIM and the PSCC modules, we discuss and explore the skeleton and language representation of the partial limb motions in detail. In addition, we also introduce the global language description to improve the comprehensiveness by
$$ \hat{f}_g = \mathrm{CrossAttn}(S_g W_q, f_g W_k, f_g W_v, \delta) \quad (9) $$
where $\mathrm{CrossAttn}(\cdot)$ indicates cross-attention, $W_q$, $W_k$, and $W_v$ are the projection weights for the query, key, and value inputs, respectively, $S_g$ serves as the query input, the global textual feature $f_g$ serves as the key and value inputs, and $\delta$ denotes the number of $\mathrm{CrossAttn}$ layers.
Subsequently, the final representation is obtained by
$$ f_{gl} = \mathrm{Fus}(\hat{f}_g, \tilde{f}_l) \quad (10) $$
where $\mathrm{Fus}(\cdot)$ is an aggregation function that fuses the global and local features, which can be implemented by a single MLP. The output $f_{gl}$ of (10) is then used to calculate the classification objective,
$$ \mathcal{L}_{cls} = \mathrm{CEL}(f_{gl}, y) \quad (11) $$
where $\mathrm{CEL}(\cdot)$ is a standard cross-entropy loss and $y$ is the corresponding action label. Therefore, the final optimization objective of this work, $\mathcal{L}_{total}$, is the combination of $\mathcal{L}_{cls}$ and $\mathcal{L}_{cts}$ (obtained in (8)),
$$ \mathcal{L}_{total} = \frac{1}{2} \mathcal{L}_{cts} + \mathcal{L}_{cls} \quad (12) $$
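Tying Equations (9)–(12) together, a hedged sketch of the final fusion and loss combination follows; the single-MLP fusion and the 1/2 weighting mirror the equations above, while the token pooling, feature dimensions, class count, and the folding of $W_q$, $W_k$, $W_v$ into the attention module are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionHead(nn.Module):
    """Sketch of Eqs. (9)-(11): cross-attention between the global skeleton
    feature (query) and the global text feature (key/value), fusion with the
    aggregated local feature via a single MLP, then classification."""
    def __init__(self, dim: int = 256, num_classes: int = 60, num_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)        # Fus(.) implemented as a single MLP
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, s_g, f_g, f_l_tilde):
        # s_g: (B, N, dim) skeleton tokens; f_g: (B, M, dim) text tokens; f_l_tilde: (B, dim)
        f_g_hat, _ = self.cross_attn(s_g, f_g, f_g)                  # Eq. (9)
        f_g_hat = f_g_hat.mean(dim=1)                                # pool tokens -> (B, dim)
        f_gl = self.fuse(torch.cat([f_g_hat, f_l_tilde], dim=-1))    # Eq. (10)
        return self.classifier(f_gl)                                 # logits for Eq. (11)

# Total objective, Eq. (12): L_total = 0.5 * L_cts + L_cls.
head = FusionHead()
logits = head(torch.randn(4, 30, 256), torch.randn(4, 16, 256), torch.randn(4, 256))
l_cls = F.cross_entropy(logits, torch.randint(0, 60, (4,)))
l_cts = torch.tensor(0.3)                           # placeholder value from Eq. (8)
l_total = 0.5 * l_cts + l_cls
```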

4. Experiments

In this section, extensive comparative experiments are conducted to demonstrate the effectiveness of our proposed method. The evaluation begins with a detailed description of the datasets utilized in our study. Following this, we outline the experimental setup. Subsequently, we conduct ablation studies using the NTU RGB+D skeleton data to determine the individual contributions of each component of our method. The final phase of our evaluation involves a comparison of the proposed method with existing state-of-the-art approaches, utilizing both NTU RGB+D 60 and NTU RGB+D 120 skeleton data sets.

4.1. Datasets

NTU RGB+D 60. The NTU RGB+D 60 dataset [65], a comprehensive resource for 3D human activity analysis, was developed and released by researchers at Nanyang Technological University, Singapore. This large-scale dataset comprises a diverse array of data types, including RGB, depth, infrared, and skeleton data. It encompasses 56,880 samples covering 60 human activity categories. The extensive size and varied nature of this dataset facilitate rigorous cross-subject (X-Sub) and cross-view (X-View) evaluations: X-Sub divides the dataset according to person ID, with the training and test sets each containing 20 subjects, while X-View divides the dataset according to camera ID. These protocols have substantially contributed to advancements in 3D human activity analysis.
NTU RGB+D 120. The NTU RGB+D 120 dataset [66] represents an extension of the NTU RGB+D 60 dataset, encompassing all the data from the NTU RGB+D 60 and incorporating an additional 60 categories. This expansion results in a comprehensive collection of 120 categories, with a total of 57,600 newly added video samples, bringing the aggregate number of samples in the dataset to 114,480. It features high-resolution RGB videos at 1920 × 1080 pixels, while the depth maps and IR videos are captured at a resolution of 512 × 424. The 3D skeleton data includes the coordinates of 25 body joints per frame. For experimental assessment, the dataset offers two benchmarks: (1) cross-subject (X-sub) and (2) cross-setup (X-Set), catering to a wide range of research needs in the field. For X-Sub, the 106 subjects are split into training and testing groups. Each group consists of 53 subjects. The X-Set takes samples with even collection setup IDs as the training set and samples with odd setup IDs as the test set.

4.2. Experimental Setup

We follow the data processing procedure of [34] for NTU RGB+D 60 and NTU RGB+D 120. The skeleton encoder uses STTFormer as the backbone network to extract skeletal features and is trained with the Stochastic Gradient Descent (SGD) optimizer with a momentum of 0.9, a standard cross-entropy classification loss, a weight decay of 0.0004, and a batch size of 110. The learning rate is set to 0.1 initially and reduced by a factor of 10 at epochs 60 and 80. For the text encoder, we load pre-trained weights and only perform inference on the text descriptions (without training) to encode the text features. The temperature for the contrastive loss is set to 0.1. Additionally, a warm-up strategy is applied during the first five epochs. We use PyTorch, and all experiments are conducted on 2× Titan RTX 3090 GPUs. For a fair comparison, all settings are kept identical except for the components under investigation.
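For reference, a minimal sketch of the optimizer and schedule described above is given below; the use of MultiStepLR, the simple linear warm-up, and the total number of epochs are assumptions consistent with the stated hyper-parameters, not a reproduction of the authors' training code.

```python
import torch
import torch.nn as nn

model = nn.Linear(256, 60)   # placeholder for the trainable skeleton branch of LPSR

optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=0.0004)
# Reduce the learning rate by a factor of 10 at epochs 60 and 80.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[60, 80], gamma=0.1)

warmup_epochs, base_lr = 5, 0.1
for epoch in range(90):
    if epoch < warmup_epochs:                      # linear warm-up over the first 5 epochs
        for group in optimizer.param_groups:
            group["lr"] = base_lr * (epoch + 1) / warmup_epochs
    # ... one training epoch over the NTU RGB+D batches (batch size 110) ...
    scheduler.step()
```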

4.3. Ablation Study

In this section, we investigate the effectiveness of the proposed method through several experiments on the bone mode of the NTU RGB+D 60 skeleton dataset.
Ablation study for Cyclic Attention Interaction Module (CAIM). To validate the potential synergy of limb motions, we design the CAIM module and perform ablation validation; the results are recorded in Table 1. The notation “partial features (mean)” indicates that the global skeleton features are decoupled (to obtain head, hands, arms, hips, legs, and feet) and then directly fused with an average pooling layer that aggregates each partial limb feature over the temporal (T) and joint (V) dimensions. The experimental results validate the effectiveness of the proposed CAIM module. In contrast, the direct fusion of multiple partial limb features (mean) yields only a limited performance improvement. Using CAIM to mine the synergy of each limb’s motion with the other parts has a positive impact on skeleton-sequence action recognition.
Ablation study for Partial Semantic Consistency Constraints (PSCC). To verify the consistency-constraint effect of local language descriptions on limb motions and the enhancement that global descriptions bring to the global skeleton representation, several ablation experiments are conducted. Firstly, the outcomes of the experiments utilizing partial and global descriptions, respectively, are presented in Table 2. The skeleton model without any accompanying description information yields the lowest accuracy. Following the introduction of partial descriptions, we observe a significant performance improvement, indicating that more detailed descriptive information about partial motions can effectively guide the model to learn more discriminative skeleton representations. The utilization of global descriptions also enhances recognition performance. Notably, the optimal result is achieved by combining partial and global descriptions.
Furthermore, we assess the validity of different partial descriptions for the prediction, as shown in Table 3. The results obtained using a single local description are marginally higher than the baseline. The highest gain is achieved by using all six local partial descriptions corresponding to limb motions.
Finally, we compare distinct text encoders and record the results in Table 4. Four text encoders are compared: BERT [59], DistilBERT [67], RoBERTa [68], and CLIP [63]. The results indicate that RoBERTa exhibits the best performance. Given its commendable balance between efficiency and accuracy, RoBERTa was selected as the text encoder for this study.
Ablation studies for different modules. The prior experiments perform separate ablations on the sub-components of the Cyclic Attention Interaction Module (CAIM) and the Partial Semantic Consistency Constraints (PSCC); as a complement, this part provides ablation confirmation of the overall framework, as shown in Table 5. Integrating the CAIM module into the baseline model enhances its performance, indicating that cyclic attention interaction improves the model’s effectiveness. This improvement can be attributed to the CAIM module’s capacity to effectively explore the implicit semantic relationships between different limb motions, thereby fully leveraging the synergistic potential of local limb motions within the global action context. Furthermore, the PSCC module improves performance by capitalizing on linguistic supervision and domain-specific knowledge of the global action and local limb motions, enabling the model to learn more discriminative representations of skeleton actions. The complete LPSR approach achieves optimal performance across both X-Sub and X-View. While each component of LPSR contributes differently to the overall performance, their combined effect significantly enhances the model’s accuracy when processing skeleton data.
Visualization results. To illustrate the efficacy of our methodology in a more visually compelling manner, we selected 20 action categories each from NTU60 and NTU120 to compare the baseline and our method using confusion matrices, as illustrated in Figure 4. In NTU60, actions such as “reading”, “taking off a shoe”, “playing with a phone”, and “typing on a keyboard” exhibited poorer classification performance. Our method significantly outperforms the baseline for these actions owing to the text branch, which generates descriptions for the different body parts involved. However, the performance of our method degrades for actions such as “tear up paper”, “phone call”, and “cutting paper”, probably due to the difficulty of recognizing objects from the skeleton alone. The generated text descriptions are mainly related to objects and local limbs; for example, “cutting paper” and “tearing up paper” both involve paper and hand descriptions, but due to the fine-grained nature of the skeletal data (small discriminative differences between actions), the final predictions may be guided by the text toward incorrect categories that share the same objects or local limbs. Overall, our language-assisted action recognition method shows a marked improvement.

4.4. Comparison with the State-of-the-Art Methods

We compare the performance of our LPSR method with current state-of-the-art methods on two datasets, NTU RGB+D 60 and NTU RGB+D 120; the recognition accuracies are shown in Table 6. In our study, four different data streams were used: bone, bone motion, joint, and joint motion. We compare against other state-of-the-art methods, including LSTM-based, GCN-based, and Transformer-based approaches.
In comparing LSTM-based approaches, it is evident that our proposed LPSR framework shows a marked improvement over traditional LSTM-based models when applied to the dataset in question. The core limitation of LSTM-based methods lies in their struggle to effectively capture the spatial relationships between joints and bodily segments. On the other hand, GCN-based methods adeptly leverage the spatio-temporal characteristics of skeleton data, leading to superior recognition capabilities. When juxtaposed with a specific GCN-based approach, our LPSR methodology demonstrates distinct advantages, primarily due to the employment of linguistic supervision that steers the recognition of behavior. This supervision harnesses actionable insights from the interplay of movements and body parts, enriching the model’s representational power. Moreover, LPSR sets a new benchmark against Transformer-based counterparts. Ultimately, the consistent outperformance of LPSR across varied datasets underscores its efficacy and robustness as a state-of-the-art method in behavior recognition.

5. Conclusions

This study proposes a novel Linguistic-Driven Partial Semantic Relevance Learning framework (LPSR) for skeleton-based action recognition, which contains two major sub-modules: the Cyclic Attention Interaction Module (CAIM) and the Partial Semantic Consistency Constraints (PSCC). In comparison to previous methods, we introduce a more capable multi-modal large language model to generate more detailed linguistic descriptions of global actions and partial limb motions. In PSCC, multiple local body descriptions guide the model to learn finer-grained representations of skeleton body motions. In addition, considering the semantic synergy between partial body motions, we propose the CAIM module to model the implicit relations between them. Extensive ablation experiments demonstrate the efficacy of the method presented in this paper, which achieves performance comparable to the current state-of-the-art methods.
One limitation of our current approach to skeletal action recognition is its reliance on fully supervised conditions, which constrains its applicability in real-world scenarios where annotated data may be scarce. Future research will explore recognizing skeletal behaviors under weakly supervised or unsupervised conditions to broaden the practical utility of our methods. Another limitation is the small difference between the training and test set distributions in our skeletal action recognition task, which hampers the model’s performance when generalizing to new, unseen action classes. Consequently, enhancing the classification performance and generalization capabilities of our model in zero-shot skeletal behavior recognition will be a primary focus of our future work.

Author Contributions

Conceptualization, Q.C. and P.H.; methodology, Q.C. and P.H.; software, Q.C.; validation, Q.C., P.H. and J.H.; formal analysis, Q.C. and P.H.; data curation, Q.C.; writing—original draft preparation, Q.C.; writing—review and editing, P.H. and Y.L.; supervision, Y.L.; funding acquisition, Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 62372240.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Huang, P.; Yan, R.; Shu, X.; Tu, Z.; Dai, G.; Tang, J. Semantic-Disentangled Transformer with Noun-Verb Embedding for Compositional Action Recognition. IEEE Trans. Image Process. 2023, 33, 297–309. [Google Scholar] [CrossRef]
  2. Huang, P.; Shu, X.; Yan, R.; Tu, Z.; Tang, J. Appearance-Agnostic Representation Learning for Compositional Action Recognition. IEEE Trans. Circuits Syst. Video Technol. 2024. [Google Scholar] [CrossRef]
  3. Zhou, B.; Wang, P.; Wan, J.; Liang, Y.; Wang, F. A unified multimodal de-and re-coupling framework for rgb-d motion recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 11428–11442. [Google Scholar] [CrossRef] [PubMed]
  4. Wang, J.; Liu, Z.; Wu, Y.; Yuan, J. Mining actionlet ensemble for action recognition with depth cameras. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 June 2012; pp. 1290–1297. [Google Scholar]
  5. Hussein, M.E.; Torki, M.; Gowayyed, M.A.; El-Saban, M. Human action recognition using a temporal hierarchy of covariance descriptors on 3d joint locations. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Beijing, China, 3–9 August 2013. [Google Scholar]
  6. Vemulapalli, R.; Arrate, F.; Chellappa, R. Human action recognition by representing 3d skeletons as points in a lie group. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014; pp. 588–595. [Google Scholar]
  7. Vahdat, A.; Gao, B.; Ranjbar, M.; Mori, G. A discriminative key pose sequence model for recognizing human interactions. In Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCV), Barcelona, Spain, 6–13 November 2011; pp. 1729–1736. [Google Scholar]
  8. Aggarwal, J.K.; Ryoo, M.S. Human activity analysis: A review. ACM Comput. Surv. 2011, 43, 1–43. [Google Scholar] [CrossRef]
  9. Pang, Y.; Ke, Q.; Rahmani, H.; Bailey, J.; Liu, J. Igformer: Interaction graph transformer for skeleton-based human interaction recognition. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 605–622. [Google Scholar]
  10. Banerjee, B.; Baruah, M. Attention-Based Variational Autoencoder Models for Human–Human Interaction Recognition via Generation. Sensors 2024, 24, 3922. [Google Scholar] [CrossRef] [PubMed]
  11. Li, Y.; Li, Y.; Nair, R.; Naqvi, S.M. Skeleton-based action analysis for ADHD diagnosis. arXiv 2023, arXiv:2304.09751. [Google Scholar]
  12. Tang, Y.; Li, X.; Chen, Y.; Zhong, Y.; Jiang, A.; Wang, C. High-accuracy classification of attention deficit hyperactivity disorder with l 2,1-norm linear discriminant analysis and binary hypothesis testing. IEEE Access 2020, 8, 56228–56237. [Google Scholar] [CrossRef]
  13. Zhang, Y.; Liu, X.; Chang, M.C.; Ge, W.; Chen, T. Spatio-temporal phrases for activity recognition. In Proceedings of the European Conference on Computer Vision (ECCV); Springer: Berlin/Heidelberg, Germany, 2012; pp. 707–721. [Google Scholar]
  14. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803. [Google Scholar]
  15. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  16. Chen, Z.; Huang, W.; Liu, H.; Wang, Z.; Wen, Y.; Wang, S. ST-TGR: Spatio-Temporal Representation Learning for Skeleton-Based Teaching Gesture Recognition. Sensors 2024, 24, 2589. [Google Scholar] [CrossRef] [PubMed]
  17. Zhang, Z. Microsoft kinect sensor and its effect. IEEE Multimed. 2012, 19, 4–10. [Google Scholar] [CrossRef]
  18. Li, B.; Dai, Y.; Cheng, X.; Chen, H.; Lin, Y.; He, M. Skeleton based action recognition using translation-scale invariant image mapping and multi-scale deep CNN. In Proceedings of the IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Hong Kong, China, 10–14 July 2017; pp. 601–604. [Google Scholar]
  19. Qiu, H.; Hou, B.; Ren, B.; Zhang, X. Spatio-temporal tuples transformer for skeleton-based action recognition. arXiv 2022, arXiv:2201.02849. [Google Scholar]
  20. Plizzari, C.; Cannici, M.; Matteucci, M. Skeleton-based action recognition via spatial and temporal transformer networks. Comput. Vis. Image Underst. 2021, 208, 103219. [Google Scholar] [CrossRef]
  21. Xiang, W.; Li, C.; Zhou, Y.; Wang, B.; Zhang, L. Generative action description prompts for skeleton-based action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 10276–10285. [Google Scholar]
  22. Zhu, W.; Lan, C.; Xing, J.; Zeng, W.; Li, Y.; Shen, L.; Xie, X. Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Phoenix, AZ, USA, 12–17 February 2016; Volume 30. [Google Scholar]
  23. Liu, J.; Shahroudy, A.; Xu, D.; Wang, G. Spatio-temporal lstm with trust gates for 3d human action recognition. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; pp. 816–833. [Google Scholar]
  24. Du, Y.; Wang, W.; Wang, L. Hierarchical recurrent neural network for skeleton based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1110–1118. [Google Scholar]
  25. Ding, Z.; Wang, P.; Ogunbona, P.O.; Li, W. Investigation of different skeleton features for cnn-based 3d action recognition. In Proceedings of the IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Hong Kong, China, 10–14 July 2017; pp. 617–622. [Google Scholar]
  26. Yan, S.; Xiong, Y.; Lin, D. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
  27. Zhang, D.; Vien, N.A.; Van, M.; McLoone, S. Non-local graph convolutional network for joint activity recognition and motion prediction. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; pp. 2970–2977. [Google Scholar]
  28. Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Non-Local Graph Convolutional Networks for Skeleton-Based Action Recognition. arXiv 2018, arXiv:1805.07694. [Google Scholar]
  29. Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Decoupled spatial-temporal attention network for skeleton-based action-gesture recognition. In Proceedings of the Asian Conference on Computer Vision (ACCV), Kyoto, Japan, 30 November–4 December 2020. [Google Scholar]
  30. Perez, M.; Liu, J.; Kot, A.C. Interaction relational network for mutual action recognition. IEEE Trans. Multimedia. 2021, 24, 366–376. [Google Scholar] [CrossRef]
  31. Liu, J.; Shahroudy, A.; Xu, D.; Kot, A.C.; Wang, G. Skeleton-based action recognition using spatio-temporal LSTM network with trust gates. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 3007–3021. [Google Scholar] [CrossRef] [PubMed]
  32. Xu, Y.; Cheng, J.; Wang, L.; Xia, H.; Liu, F.; Tao, D. Ensemble one-dimensional convolution neural networks for skeleton-based action recognition. IEEE Signal Process. Lett. 2018, 25, 1044–1048. [Google Scholar] [CrossRef]
  33. Zhu, Y.; Xu, Y.; Yu, F.; Liu, Q.; Wu, S.; Wang, L. Graph contrastive learning with adaptive augmentation. In Proceedings of the Web Conference, Ljubljana, Slovenia, 19–23 April 2021; pp. 2069–2080. [Google Scholar]
  34. Chen, Y.; Zhang, Z.; Yuan, C.; Li, B.; Deng, Y.; Hu, W. Channel-wise topology refinement graph convolution for skeleton-based action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 13359–13368. [Google Scholar]
  35. Li, M.; Chen, S.; Chen, X.; Zhang, Y.; Wang, Y.; Tian, Q. Actional-structural graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 3595–3603. [Google Scholar]
  36. Chi, H.g.; Ha, M.H.; Chi, S.; Lee, S.W.; Huang, Q.; Ramani, K. Infogcn: Representation learning for human skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 20186–20196. [Google Scholar]
  37. Wang, Y.; Wu, Y.; Tang, S.; He, W.; Guo, X.; Zhu, F.; Bai, L.; Zhao, R.; Wu, J.; He, T.; et al. Hulk: A Universal Knowledge Translator for Human-Centric Tasks. arXiv 2023, arXiv:2312.01697. [Google Scholar]
  38. Duan, H.; Zhao, Y.; Chen, K.; Lin, D.; Dai, B. Revisiting skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 2969–2978. [Google Scholar]
  39. Bruce, X.; Liu, Y.; Zhang, X.; Zhong, S.h.; Chan, K.C. Mmnet: A model-based multimodal network for human action recognition in rgb-d videos. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 3522–3538. [Google Scholar]
  40. Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Skeleton-based action recognition with directed graph neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 7912–7921. [Google Scholar]
  41. Si, C.; Chen, W.; Wang, W.; Wang, L.; Tan, T. An attention enhanced graph convolutional lstm network for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 1227–1236. [Google Scholar]
  42. Tian, Y.; Liang, Y.; Yang, H.; Chen, J. Multi-Stream Fusion Network for Skeleton-Based Construction Worker Action Recognition. Sensors 2023, 23, 9350. [Google Scholar] [CrossRef] [PubMed]
  43. Wang, P.; Li, Z.; Hou, Y.; Li, W. Action recognition based on joint trajectory maps using convolutional neural networks. In Proceedings of the ACM International Conference on Multimedia, Amsterdam, The Netherlands, 15–19 October 2016; pp. 102–106. [Google Scholar]
  44. Li, C.; Zhong, Q.; Xie, D.; Pu, S. Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation. arXiv 2018, arXiv:1804.06055. [Google Scholar]
  45. Liu, J.; Wang, G.; Duan, L.Y.; Abdiyeva, K.; Kot, A.C. Skeleton-based human action recognition with global context-aware attention LSTM networks. IEEE Trans. Image Process. 2017, 27, 1586–1599. [Google Scholar] [CrossRef]
  46. Wei, S.; Song, Y.; Zhang, Y. Human skeleton tree recurrent neural network with joint relative motion feature for skeleton based action recognition. In Proceedings of the IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 91–95. [Google Scholar]
  47. Si, C.; Jing, Y.; Wang, W.; Wang, L.; Tan, T. Skeleton-based action recognition with spatial reasoning and temporal stack learning. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 103–118. [Google Scholar]
  48. Yang, Z.; An, G.; Zhang, R.; Zheng, Z.; Ruan, Q. SRI3D: Two-stream inflated 3D ConvNet based on sparse regularization for action recognition. IET Image Process. 2023, 17, 1438–1448. [Google Scholar] [CrossRef]
  49. Xu, K.; Ye, F.; Zhong, Q.; Xie, D. Topology-aware convolutional neural network for efficient skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Vancouver, BC, Canada, 28 February–1 March 2022; Volume 36, pp. 2866–2874. [Google Scholar]
  50. Xi, W.; Devineau, G.; Moutarde, F.; Yang, J. Generative model for skeletal human movements based on conditional DC-GAN applied to pseudo-images. Algorithms 2020, 13, 319. [Google Scholar] [CrossRef]
  51. Zhang, P.; Lan, C.; Zeng, W.; Xing, J.; Xue, J.; Zheng, N. Semantics-guided neural networks for efficient skeleton-based human action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 1112–1121. [Google Scholar]
  52. Bertasius, G.; Wang, H.; Torresani, L. Is space-time attention all you need for video understanding? In Proceedings of the International Conference on Machine Learning (ICML), Virtual Event, 18–24 July 2021; Volume 2, p. 4. [Google Scholar]
  53. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  54. Zhang, Y.; Zhu, H.; Song, Z.; Koniusz, P.; King, I. COSTA: Covariance-preserving feature augmentation for graph contrastive learning. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 14–18 August 2022; pp. 2524–2534. [Google Scholar]
  55. Zhang, Y.; Zhu, H.; Song, Z.; Koniusz, P.; King, I. Spectral feature augmentation for graph contrastive learning and beyond. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 11289–11297. [Google Scholar]
  56. Wu, H.; Xiao, B.; Codella, N.; Liu, M.; Dai, X.; Yuan, L.; Zhang, L. Cvt: Introducing convolutions to vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 22–31. [Google Scholar]
  57. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  58. Zheng, M.; Gao, P.; Zhang, R.; Li, K.; Wang, X.; Li, H.; Dong, H. End-to-end object detection with adaptive clustering transformer. arXiv 2020, arXiv:2011.09315. [Google Scholar]
  59. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  60. Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. Gpt-4 technical report. arXiv 2023, arXiv:2303.08774. [Google Scholar]
  61. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
  62. Wang, M.; Xing, J.; Liu, Y. Actionclip: A new paradigm for video action recognition. arXiv 2021, arXiv:2109.08472. [Google Scholar]
  63. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning (ICML), Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  64. Jia, C.; Yang, Y.; Xia, Y.; Chen, Y.T.; Parekh, Z.; Pham, H.; Le, Q.; Sung, Y.H.; Li, Z.; Duerig, T. Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the International Conference on Machine Learning (ICML), Virtual, 18–24 July 2021; pp. 4904–4916. [Google Scholar]
  65. Shahroudy, A.; Liu, J.; Ng, T.T.; Wang, G. Ntu rgb+ d: A large scale dataset for 3d human activity analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 1010–1019. [Google Scholar]
  66. Liu, J.; Shahroudy, A.; Perez, M.; Wang, G.; Duan, L.Y.; Kot, A.C. Ntu rgb+ d 120: A large-scale benchmark for 3d human activity understanding. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 2684–2701. [Google Scholar] [CrossRef]
  67. Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv 2019, arXiv:1910.01108. [Google Scholar]
  68. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
Figure 1. Idea of this work. Most previous methods (as shown above left) employ a single encoder to extract global features, or (as shown above middle) introduce the text information to conduct extra contrast loss. Nevertheless, there are instances where the visual semantic similarities or ambiguities between different actions make it challenging to distinguish between them. In contrast, as shown above right, we generate local feature descriptions of actions to learn finer-grained representations of skeleton limb motion. Meanwhile, the cyclic attention interaction module is proposed to mine the implicit association between partial limb motions.
Figure 2. Overview of this approach. The novel framework is composed of two components: the Cyclic Attention Interaction Module (CAIM) and the Partial Semantic Consistency Constraints (PSCC). For a given raw skeleton input X org , we design a cyclic attention strategy to mine the potential relationships between partial limb motions in CAIM and output f ˜ l . In PSCC, we use the text labels T as well as X org as inputs to generate global and local descriptions T g and T l ; they are then encoded into common space to guide the learning of global action and local skeleton motions features using cross-modal aggregation and KL scatter alignment, respectively.
Figure 3. Textual action descriptions generated by GPT-4 from the two prompt inputs.
Figure 4. Confusion matrices for unimodal baseline and our methods.
Table 1. Effect of CAIM evaluated on NTU RGB+D 60 Skeleton dataset in the bone mode. We record the recognition accuracy (%) for different settings X-Sub and X-View. Best results are in bold.
Methods                            | X-Sub (%) | X-View (%)
Baseline                           | 88.8      | 93.3
Baseline + partial features (mean) | 88.9      | 93.3
Baseline + CAIM                    | 89.4      | 93.7
Table 2. Influences of textual description types on the NTU RGB+D 60 Skeleton dataset in bone mode. We record the recognition accuracy (%) for different settings X-Sub and X-View. Best results are in bold.
Description Type                | X-Sub (%) | X-View (%)
Baseline                        | 88.8      | 93.3
 + Partial Description          | 89.5      | 93.9
 + Global Description           | 89.1      | 93.5
 + Partial + Global Description | 89.6      | 94.1
Table 3. Comparison of different body parts description on NTU RGB+D 60 Skeleton dataset in bone mode. We record the recognition accuracy (%) for different settings X-Sub and X-View. Best results are in bold.
Method   | Text Partial Description              | X-Sub (%) | X-View (%)
Baseline | —                                     | 88.8      | 93.3
         | Head                                  | 89.0      | 93.3
         | Hand                                  | 88.9      | 93.5
         | Arm                                   | 89.0      | 93.3
         | Hip                                   | 89.1      | 93.4
         | Leg                                   | 88.9      | 93.3
         | Foot                                  | 88.8      | 93.3
         | Head + Hand + Arm + Hip + Leg + Foot  | 89.5      | 93.9
Table 4. Effect of text encoders evaluated on NTU RGB+D 60 Skeleton dataset in bone mode. We record the recognition accuracy (%) for different settings X-Sub and X-View. Best results are in bold.
Methods    | X-Sub (%) | X-View (%)
Baseline   | 88.8      | 93.3
BERT       | 89.1      | 93.8
DistilBERT | 89.2      | 93.9
CLIP       | 89.5      | 94.0
RoBERTa    | 89.6      | 94.1
Table 5. Ablation studies for different modules on the NTU RGB+D 60 Skeleton dataset in bone mode. We record the recognition accuracy (%) for different settings X-Sub and X-View. Best results are in bold.
Methods               | X-Sub (%) | X-View (%)
Baseline              | 88.8      | 93.3
 + CAIM               | 89.3      | 93.7
 + PSCC               | 89.5      | 93.9
 + LPSR (CAIM + PSCC) | 89.8      | 94.2
Table 6. Comparison of recognition accuracy with state-of-the-art methods on NTU RGB+D 60 and NTU RGB+D 120 Skeleton dataset. We record the NTU RGB+D 60 recognition accuracy (%) for different settings of X-Sub and X-View, and NTU RGB+D 120 recognition accuracy (%) for different settings of X-Sub and X-Set, respectively. Best results are in bold.
Type        | Methods   | NTU RGB+D 60 X-Sub (%) | NTU RGB+D 60 X-View (%) | NTU RGB+D 120 X-Sub (%) | NTU RGB+D 120 X-Set (%)
LSTM        | ST-LSTM   | 83.0 | 87.3 | 63.0 | 66.6
            | GCA       | 85.9 | 89.0 | 70.6 | 73.7
            | 2s-GCA    | 87.2 | 89.9 | 73.0 | 73.3
            | LSTM-IRN  | 90.5 | 93.5 | 77.7 | 79.6
GCN         | ST-GCN    | 81.5 | 88.3 | 78.9 | 76.1
            | AS-GCN    | 89.3 | 93.0 | 82.9 | 83.7
            | 2s-AGCN   | 88.5 | 95.1 | 82.9 | 84.9
            | MS-G3D    | 91.5 | 96.2 | 86.9 | 88.4
            | CTR-GCN   | 92.4 | 96.8 | 88.9 | 90.6
            | InfoGCN   | 92.7 | 96.9 | 89.4 | 90.7
Transformer | DSTA-Net  | 91.5 | 96.4 | 86.6 | 89.0
            | IGFormer  | 93.6 | 96.5 | 85.4 | 86.5
            | STTFormer | 92.3 | 96.5 | 88.3 | 89.2
            | LPSR      | 92.8 | 96.9 | 89.2 | 90.8
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
