
Auxiliary Task Graph Convolution Network: A Skeleton-Based Action Recognition for Practical Use

Department of AI Convergence, Chonnam National University, Gwangju 61186, Republic of Korea
SafeMotion, Gwangju 61011, Republic of Korea
Authors to whom correspondence should be addressed.
Appl. Sci. 2025, 15(1), 198;
Submission received: 3 December 2024 / Revised: 26 December 2024 / Accepted: 27 December 2024 / Published: 29 December 2024


Graph convolution networks (GCNs) have been extensively studied for action recognition from human skeletons estimated in video clips. However, their frame sampling methods are impractical because they require the video length to be known before sampling. In this study, we propose an Auxiliary Task Graph Convolution Network (AT-GCN) with low- and high-frame-rate pathways that supports a new sampling method. AT-GCN learns actions at a defined frame rate within a defined range using three losses: fuse, slow, and fast. Two auxiliary tasks handle the slow and fast losses, while the main task handles the fuse loss. AT-GCN outperforms the original state-of-the-art models on the NTU RGB+D, NTU RGB+D 120, and NW-UCLA datasets while maintaining the same inference time. In top-1 accuracy, AT-GCN achieves 90.3% (cross-subject) and 95.2% (cross-view) on NTU RGB+D, 86.5% (cross-subject) and 87.6% (cross-setup) on NTU RGB+D 120, and 93.5% on NW-UCLA.

1. Introduction

Always-on cameras such as CCTV are widely used, mostly for safety management. Recently, researchers have applied action recognition to CCTV systems so that they can automatically recognize the behavior of passersby and improve safety management [1,2]. Action recognition can also be used to understand people in a video, for example, their behavior in a classroom [3].
Graph convolution networks (GCNs) [4,5,6,7,8] are typical networks that recognize human actions from skeleton information. Skeleton-based recognition removes noise from the image background, which reduces both errors and inference time, and GCN models therefore achieve accurate action recognition. However, previous GCNs [9,10,11,12] are impractical because of how they sample frames (images): they require the video length to be known before frames are sampled. Video from an always-on camera (i.e., CCTV), to which GCN-based action recognition would be applied, has no defined length; hence, previous GCNs are difficult to apply directly.
A simple solution to this sampling issue is to clip the CCTV video to a fixed length (e.g., four seconds). This solution, however, is not compatible with previous GCNs. Most GCN datasets contain videos of different lengths for training and testing; hence, a clip of fixed length may not include all the movements of an action. Additionally, previous GCNs may not recognize real-world human actions effectively because real-world actions have different pose sequences from the actions in the datasets. For example, real-world actions are performed at diverse speeds, whereas the actions in the datasets are staged and generally performed at a similar speed.
Therefore, we propose an Auxiliary Task Graph Convolution Network (AT-GCN) structure for the practical use of action recognition. AT-GCN adopts the low and high pathways introduced in the SlowFast network [13], sampling frames in parallel at low and high rates for slow and fast action recognition, respectively. As a result, AT-GCN can recognize movements of diverse speeds by capturing semantic information at low and high frame rates in parallel (Figure 1).
AT-GCN differs from both previous GCNs and SlowFast [13]. Previous GCNs use only one frame rate, which may not suit the recognition of both slow and fast actions, whereas AT-GCN uses two frame rates. SlowFast has only one fuse loss, and a single loss may not suit learning a variety of features, such as features at different frame rates. AT-GCN therefore uses two additional losses, a low-frame-rate loss and a high-frame-rate loss, alongside the fuse loss, which helps it learn a greater variety of features. To support the three loss functions, AT-GCN includes three end branches: a fuse loss function in the main task and slow and fast loss functions in the auxiliary tasks [14,15,16].
In this paper, we compare two variations of GCNs, traditional GCNs and AT-GCNs, on the NTU RGB+D 60, NTU RGB+D 120, and NW-UCLA datasets [17,18,19]. The experimental results show that learning slow and fast actions does not increase the inference time of AT-GCNs. Moreover, AT-GCNs outperform traditional GCNs on all three datasets.
Our contributions are summarized below:
  • We propose AT-GCN, a structure that can be applied to existing GCNs to add low and high pathways to the main branch, with additional losses handled as auxiliary tasks.
  • AT-GCN supports practical action recognition: it is designed to work with fixed-length clips from a CCTV video without performance degradation.
  • We propose a T ensemble for AT-GCN that enhances performance with a static sample range (a fixed-length video). Models trained with a given static range show specialized performance on actions of that length.
  • On the benchmark datasets, AT-GCN outperforms traditional GCNs at the same inference time.
The remainder of this paper is organized as follows. Section 2 reviews related work on multi-stream models and graph convolution networks for action recognition. Section 3 presents the proposed architecture. Section 4 describes the experimental process and results. Finally, Section 5 presents the conclusions and future work.

2. Related Works

2.1. Learning Feature with Multi-Stream

A video is a sequence of frames. Each frame differs from the others and therefore carries different information, but some information does not change across frames. For example, if a person is “running” in a video, each frame shows a different pose, but a foot keeps the same identity throughout. Few frames are therefore needed to extract such identity information. On the other hand, extracting temporal information, such as the action “running” itself, requires more frames because the pose sequence changes. SlowFast [13] builds this observation into the model with a multi-stream architecture [20,21,22] consisting of a slow and a fast pathway. The slow pathway targets the unchanging identity information; hence, it uses a low frame rate and many channels to extract identity information more easily. The fast pathway targets temporal information; hence, it uses a high frame rate and few channels to keep temporal feature extraction lightweight. Auxiliary task models [14,15,16] also have a multi-stream architecture, but they introduce a sub-feature that supports the learning of the main feature. Such a model is divided into a main task and an auxiliary task. The main task takes the main input, such as text to be understood, and the auxiliary task takes a secondary input, such as time, that supports the main task. Because we use the original action recognition model as the feature extractor, we adopt only the frame rate rule from SlowFast. SlowFast uses a single loss computed from the fused feature, but a single-loss structure may not suit training a variety of features. To solve this problem, AT-GCN adds one loss from each pathway; the added losses promote more varied feature extraction than the single loss in SlowFast. Our model consists of two pathways: the low pathway operates at the low frame rate, and the high pathway operates at the high frame rate. Each pathway extracts its own feature, and these features are used to compute the added losses as auxiliary tasks.

2.2. Action Recognition with Skeleton

A human skeleton contains information that is important for recognizing human actions. In action recognition models that use images [23,24,25,26,27], data unrelated to the action (e.g., objects, background) may negatively affect inference time and accuracy. ST-GCN [4] proposes a graph convolution network (GCN) that uses skeletons to solve these problems. Human behavior is closely related to skeleton movement: when we walk, the knees move according to the position of the feet. To exploit this, ST-GCN builds a graph from the skeleton joints. Each joint is assigned a position number, such as number 13 for the “left foot”, and ST-GCN connects the joints by these position numbers. The connected skeleton graph then supports action recognition. Previous GCNs [28,29,30,31] proposed a variety of graph convolution methods structured by the relationships between joints. However, using clips at only one frame rate may not be effective for training action recognition. To solve this problem, AT-GCN uses two frame rates so that a greater variety of features can be learned than with previous GCNs. InfoGCN and HD-GCN [32,33] also propose ensemble methods that use the connections between joints. Unlike the existing GCNs, AT-GCN builds its two pathways on the sample rate rather than on the relation between graph and skeleton. We also propose a T ensemble that combines models specialized for short and long actions.

3. Methods

In order, we describe training the model to match the static sample range in Section 3.1, learning multiple features with auxiliary tasks in Section 3.2, ensembling multiple models trained with different sampling ranges in Section 3.3, and a new sampling method for slow and fast actions in Section 3.4. All experiments use the NTU RGB+D 60, NTU RGB+D 120, and NW-UCLA datasets [17,18,19]. In these datasets, each video is labeled with one action. All videos have skeleton labels; hence, we use these data as input for the model. A more specific explanation is given in Section 4.2.

3.1. Static Sampling from Datasets

Existing training methods can cause performance degradation when frames are sampled from a static range at test time. Our new training method aims to maintain similar performance when testing with a static sample range.
  • Static Sampling: Previous studies trained their architectures with video samples that include the full sequence of an action. In practice, however, a video sample from a CCTV is not guaranteed to include the full sequence (e.g., it may include only the first half of a sitting action). Therefore, for practical use with CCTV videos, we suggest a static sampling method that randomly samples a fixed-length clip from an existing video in the dataset. The existing video includes the full sequence of the action, but the sampled clip is not guaranteed to. Figure 2 shows an example of static sampling. The method fixes a static range, places it at a random location in the existing video, and samples the frames inside it. Because training uses only randomly placed static ranges, every training sample is a cropped action from a video. Training on such uncontrolled clips prevents performance degradation when the model is later applied to uncontrolled data. If the static range is larger than the video length, dummy poses are used to extend the clip to the static range.
  • Dummy poses: Figure 3 shows the use of dummy poses. To fill a static range from a shorter video, the first and last frames of the existing video are used as dummy poses. The first frame of the existing video is repeated back to the first frame of the new video, and the last frame of the existing video is repeated forward to the last frame of the new video, so that the existing video is centered within the new video. This has the same effect as a person standing still before starting an action and standing still after ending it.
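The static sampling and dummy-pose steps described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the array layout (frames, joints, coordinates), and the 25-joint toy data are assumptions made for the example.

```python
import numpy as np


def pad_with_dummy_poses(video: np.ndarray, static_range: int) -> np.ndarray:
    """Center `video` in a clip of `static_range` frames.

    The first and last poses are repeated as dummy poses.
    `video` is assumed to have frames on axis 0.
    """
    missing = static_range - video.shape[0]
    if missing <= 0:
        return video  # already long enough, nothing to pad
    before = missing // 2
    after = missing - before
    pad_width = [(before, after)] + [(0, 0)] * (video.ndim - 1)
    return np.pad(video, pad_width, mode="edge")


def static_sample(video: np.ndarray, static_range: int, rng) -> np.ndarray:
    """Return a randomly placed clip of exactly `static_range` frames."""
    video = pad_with_dummy_poses(video, static_range)
    start = int(rng.integers(0, video.shape[0] - static_range + 1))
    return video[start:start + static_range]


rng = np.random.default_rng(0)
# Toy data: 200 frames, 25 joints, 3D coordinates; each frame holds its index.
long_video = np.arange(200.0)[:, None, None] * np.ones((1, 25, 3))
short_video = long_video[:100]

clip = static_sample(long_video, 120, rng)     # random 120-frame crop
padded = static_sample(short_video, 120, rng)  # 10 dummy poses on each side
```

Because the crop location is random, a clip from the long video may start or end in the middle of the action, which is the uncontrolled condition the training method targets.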

3.2. Multi-Stream with Pathway

We propose an AT-GCN model that uses a low pathway and a high pathway with the SlowFast [13] sample rate and auxiliary tasks [14,15,16]. The overall framework is shown in Figure 4. AT-GCN aims to improve performance without increasing the inference time of existing models.
  • 2 Streaming: The 2 Streaming structure consists of a low pathway and a high pathway that differ in the number of sampled frames. The sample rates of the two pathways are defined by the ratio $\alpha = 8$, following the previously proposed parameter [13]: $t$ frames are fed to the low pathway and $t\alpha$ frames to the high pathway. The 2 Streaming model uses a fuse feature that combines the slow and fast features as follows:
    $$\mathrm{Fuse} = \mathrm{Pool}(G(M_t)) + \mathrm{Pool}(G(M_{t\alpha}))$$
$M$ denotes all frames of the clip. The low-pathway sample $M_t$ and the high-pathway sample $M_{t\alpha}$ are passed through the GCN network $G(M)$ to generate the slow and fast features. Each feature is passed to the Header, which applies global average pooling and combines the results to create the fuse feature. As shown in Figure 5, the Header uses no additional modules beyond the layer that fuses the features, which prevents an increase in inference time.
  • Auxiliary Task: AT-GCN uses auxiliary tasks to add the slow and fast losses. The 2 Streaming (2S) model generates only the score of the fuse feature. AT-GCN generates two more scores, one from the low pathway alone and one from the high pathway alone. Therefore, AT-GCN has three losses: slow loss (low pathway), fast loss (high pathway), and fuse loss.
    $$\mathrm{Loss} = w_m L(\mathrm{Fuse}) + w_n L(\mathrm{Slow}) + w_n L(\mathrm{Fast})$$
    $$\mathrm{Loss}(\mathrm{Feature}) = GT\,\mathrm{softmax}(FC(\mathrm{Feature}))$$
Each loss has a weight $w$. The fuse feature is used in the main task and is therefore the most important. For this reason, the fuse loss has a higher weight than the other losses: $w_m = 2w_n$, $w_n = (1 - w_m)/2$.
The slow and fast losses are computed only during training; hence, the inference process is the same as in the 2S model.
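The fuse feature and the weighted three-part loss can be sketched as follows. Solving the two weight relations together gives $w_n = 1/4$ and $w_m = 1/2$, so the three weights sum to one. The sketch makes several assumptions: the per-feature loss is taken to be softmax cross-entropy against the ground-truth label, the logits stand in for the output of the FC layer, and the feature-map shape (channels, frames, joints), the channel count, and all function names are illustrative only.

```python
import numpy as np

ALPHA = 8      # frame ratio between the high and low pathways (from the text)
W_N = 0.25     # solves w_m = 2 * w_n together with w_n = (1 - w_m) / 2
W_M = 2 * W_N  # 0.5, weight of the fuse (main-task) loss


def pool(feature_map: np.ndarray) -> np.ndarray:
    """Global average pooling over every axis except the channel axis 0."""
    return feature_map.reshape(feature_map.shape[0], -1).mean(axis=1)


def fuse(slow_map: np.ndarray, fast_map: np.ndarray) -> np.ndarray:
    """Sum of the pooled slow and fast features (inputs are GCN outputs)."""
    return pool(slow_map) + pool(fast_map)


def cross_entropy(logits: np.ndarray, label: int) -> float:
    """Softmax cross-entropy for one sample (assumed form of the loss)."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-log_probs[label])


def at_gcn_loss(fuse_logits, slow_logits, fast_logits, label: int) -> float:
    """Weighted sum of the fuse, slow, and fast losses (training only)."""
    return (W_M * cross_entropy(fuse_logits, label)
            + W_N * cross_entropy(slow_logits, label)
            + W_N * cross_entropy(fast_logits, label))


# Toy example: t = 4 low-pathway frames, t * ALPHA = 32 high-pathway frames.
t = 4
slow_map = np.ones((64, t, 25))
fast_map = np.full((64, t * ALPHA, 25), 2.0)
fused = fuse(slow_map, fast_map)  # one value per channel

logits = np.array([2.0, 0.5, -1.0])
loss = at_gcn_loss(logits, logits, logits, label=0)
```

At inference time only the fuse branch would be evaluated, so the two auxiliary terms add no cost outside training.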

3.3. Ensemble with Different Sampling Range

The sampling range is important for action recognition. With a short sample range, performance on short actions was higher than with a long sample range; on long actions, however, a short sample range performed worse than a long one. This is because a short sample range contains few noise frames when the action is short, but cuts off part of the action when the action is long. Figure 6 shows the performance by sample range. Moreover, because the static range is fixed during training, a model trained with a given static range achieves high performance on actions whose length matches that range, as shown in Figure 7. The model trained with a range of 120 frames shows specialized performance on 120-frame actions, and the model trained with a range of 150 frames shows specialized performance on 150-frame actions. Thus, a model trained with a short range is suitable for recognizing short actions, and a model trained with a long range is suitable for recognizing long actions. The T ensemble combines models with different static ranges that are specialized for short and long actions.
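One simple way to combine models trained with different static ranges is to average their class scores. The text does not state how the T ensemble merges scores, so the averaging rule below is an assumption, as are the function name and the toy score values.

```python
import numpy as np


def t_ensemble(scores_by_range: dict) -> np.ndarray:
    """Average class scores of models trained with different static ranges.

    `scores_by_range` maps a static range in frames (e.g. 120, 150) to an
    array of shape (num_videos, num_classes) produced by that model.
    Averaging is an assumed fusion rule, not taken from the paper.
    """
    stacked = np.stack(
        [np.asarray(s, dtype=float) for s in scores_by_range.values()]
    )
    return stacked.mean(axis=0)


# Toy scores for two videos and two classes.
scores = {
    120: np.array([[0.7, 0.3], [0.4, 0.6]]),  # model trained on 120 frames
    150: np.array([[0.6, 0.4], [0.1, 0.9]]),  # model trained on 150 frames
}
ensembled = t_ensemble(scores)
predictions = ensembled.argmax(axis=1)
```

No extra module is needed for this kind of fusion: the models differ only in the sample range they were trained with.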

3.4. Setting Sample Range by Pathway

In SlowFast [13], both pathways sample from the same range. However, slow actions consist of slowly changing pose sequences; if they are sampled from a short range, it may be hard to capture the sequence of poses. Fast actions, on the other hand, consist of rapidly changing pose sequences; if they are sampled from a long range, the critical pose frames needed to recognize the action may be missed. For this reason, sampling each pathway from the same range can cause performance degradation. To solve these problems, AT-GCN samples the low pathway from a wider range than the high pathway.
$$R(L) = R(H) + rf$$
$R(L)$ and $R(H)$ are the sample ranges of the low and high pathways, and $f$ is the frame rate of the video in frames per second, so the low pathway covers $rf$ additional frames, i.e., $r$ additional seconds. In this experiment, we used $r = 2$.
$$L_b = H_b - (rf - k)$$
$L_b$ and $H_b$ are the sample starting points of the low and high pathways. The sample range of the low pathway always contains the sample range of the high pathway, with $k \in \{0, 1, \ldots, rf - 1, rf\}$. The value of $k$ can vary with the type of video. In the datasets used in this experiment, the information important for action recognition lies towards the center of the video; hence, we apply $k = rf/2$.
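As a worked example of the two range formulas, the helper below computes the frame intervals of both pathways. It is a sketch under stated assumptions: the function name and the half-open (start, end) interval convention are hypothetical, and the minus signs in the starting-point formula are read from the requirement that the low range contains the high range.

```python
def pathway_ranges(high_start: int, high_range: int, fps: int,
                   r: int = 2, k=None):
    """Return ((low_start, low_end), (high_start, high_end)) in frames.

    Implements R(L) = R(H) + r*f and L_b = H_b - (r*f - k).
    Intervals are half-open: [start, end).
    """
    extra = r * fps                # r additional seconds, in frames
    if k is None:
        k = extra // 2             # centered choice, k = r*f / 2
    if not 0 <= k <= extra:
        raise ValueError("k must lie in {0, ..., r*f}")
    low_range = high_range + extra
    low_start = high_start - (extra - k)
    return ((low_start, low_start + low_range),
            (high_start, high_start + high_range))


# NTU RGB+D: 30 fps, high range of 120 frames -> low range of 180 frames.
ntu_low, ntu_high = pathway_ranges(high_start=100, high_range=120, fps=30)

# NW-UCLA: 12 fps, high range of 48 frames -> low range of 72 frames.
ucla_low, ucla_high = pathway_ranges(high_start=50, high_range=48, fps=12)
```

With the centered choice of `k`, the low range extends one second before and one second after the high range. A negative `low_start` means the low range reaches before the first frame of the video, in which case dummy poses would be needed.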

4. Experiments and Results

4.1. Validation Process

The structure of the proposed validation is shown in Figure 8. The new validation splits a video into five static-length clips, whose length equals the range used to train the model. The score of each clip is weighted, with greater weight given to clips considered more important for recognizing the action. For example, in the case of CCTV, the more recent clips receive a higher weight because they matter most for recognizing the current action. In our case, owing to the datasets [17,18,19], the middle of the video is the most important part for recognizing the action; hence, the weight increases towards the middle. For evenly spaced sampling, if the video is shorter than the total sampling range + 2 s, the proposed method creates a new video with dummy poses by the same process as in Section 3.1.
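The validation procedure can be sketched as evenly spaced clip extraction followed by a weighted sum of per-clip scores. The text does not give the actual weight values, so the triangular weights (1, 2, 3, 2, 1) are an assumption chosen only to increase towards the middle; the function names are likewise hypothetical.

```python
import numpy as np


def clip_starts(num_frames: int, static_range: int, num_clips: int = 5):
    """Start indices of evenly spaced clips of `static_range` frames."""
    if num_frames < static_range:
        raise ValueError("pad the video with dummy poses first")
    last_start = num_frames - static_range
    return np.linspace(0, last_start, num_clips).round().astype(int)


def weighted_clip_score(clip_scores, weights=(1, 2, 3, 2, 1)) -> np.ndarray:
    """Combine per-clip class scores with weights that peak in the middle.

    `clip_scores` has shape (num_clips, num_classes). The default weights
    are an assumed example, normalized to sum to one.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return (w[:, None] * np.asarray(clip_scores, dtype=float)).sum(axis=0)


starts = clip_starts(num_frames=300, static_range=120)

# Toy scores: clip i votes only for class i, so the result equals the weights.
video_score = weighted_clip_score(np.eye(5))
```

For a CCTV stream, the same function could be called with weights that increase towards the most recent clip instead.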

4.2. Datasets

  • NTU RGB+D 60: NTU RGB+D 60 [17] is an action dataset containing 3D joint values. It consists of 56,880 videos at 30 frames per second, covering 60 classes and 40 subjects. The dataset creators define two benchmarks. (1) Cross-Subject (X-sub): of the 40 subjects, 20 are used for training and the rest for testing. (2) Cross-View (X-view): the videos are captured from three views with Microsoft Kinect v2; two views are used for training and the remaining view for testing.
  • NTU RGB+D 120: NTU RGB+D 120 [18] extends the NTU 60 dataset with new data, for a total of 114,480 videos at 30 frames per second with 3D joint values. It consists of 120 classes, 106 subjects, and 32 setups. The authors define two benchmarks. (1) Cross-Subject (X-sub): 53 subjects are used for training and the remaining subjects for testing. (2) Cross-Setup (X-setup): even-numbered setups are used for training and odd-numbered setups for testing.
  • Northwestern-UCLA: NW-UCLA [19] is an action dataset consisting of 1494 videos at 12 frames per second with 3D joints. It consists of 10 classes, 10 subjects, and three views captured by Kinect cameras. Experiments follow the NW-UCLA protocol, using two views for training and one view for testing.

4.3. Implementation Details

All experiments used a single RTX 4090. The NTU 60 and 120 [17,18] and NW-UCLA [19] datasets are preprocessed as in [34]. All models apply the warm-up strategy [35] during the first five epochs. The number of epochs, optimizer, learning rate, and decay follow the parameters suggested for each model [4,5,32,33,36]. The training and test batch sizes are 64 and 16, respectively. For the NTU 60 and 120 datasets, a total of 36 samples are taken: the original model takes 36 samples in the 120 range, and AT-GCN takes 4 samples in the 180 range and 32 samples in the 120 range. For the NW-UCLA dataset, a total of 27 samples are taken: the original model takes 27 samples in the 48 range, and AT-GCN takes 3 samples in the 72 range and 24 samples in the 48 range.

4.4. Comparison with Original Models

We compare the original models [4,5,32,33,36] with versions that add the auxiliary task (AT) and the method of different sample ranges by pathway (D). Table 1 shows the results on the NTU RGB+D 60, NTU RGB+D 120, and NW-UCLA datasets [17,18,19]. On all three datasets, the auxiliary task (AT) outperforms the existing models. Using different sample ranges by pathway (D) makes no significant difference in performance. These results show that the proposed methods are more effective for action recognition than the existing state-of-the-art methods. We note that the proposed architecture learns a variety of features by sampling frames at different rates, with added losses from the auxiliary tasks.

4.5. Ablation Study

To show the stepwise contribution of AT-GCN and the static sample training method, the ablation study uses the X-sub and X-set benchmarks of the NTU RGB+D 120 dataset [18] with InfoGCN and HD-GCN [32,33].
  • Setting a static sample range: Starting from the 210 range as a baseline, we decrease the static range in 1 s steps to find the range that performs best with the original model. As shown in Table 2, a static range that is wider than necessary leads to poor performance because it contains too many dummy poses. The 120 range gives the highest performance with the original model.
  • Effect of the Auxiliary Task: The performance of the original model is used as a baseline. To show the impact of the auxiliary task (AT-GCN), we also compare against 2 Streaming (2S), which has no slow and fast losses. As shown in Table 3, the 2 Streaming model performs worse than the original model because it is difficult to learn from the fuse feature alone. In contrast, the auxiliary task (AT-GCN) outperforms the original model. This result indicates that the proposed method is more effective for model training than the SlowFast structure [13], which uses a single loss.
  • Ensemble for Auxiliary Task: To see how well the T ensemble fits the auxiliary task, the performance of the AT-GCN model is used as a baseline, and the ensemble method proposed in [32,33] is used for comparison. As shown in Table 4, the existing ensemble method yields a smaller performance gain than the T ensemble because it is not designed with the auxiliary task in mind. With AT-GCN, the T ensemble shows the same or higher performance than the traditional ensemble method. Table 5 shows that the performance gain of the T ensemble is similar for the other networks, except for MS-G3D [36].
  • Effect of static range training methods: The traditional training and validation method is used as a baseline, and the results are shown in Table 6. The performance of traditional training drops when static validation is performed, whereas static training with static validation scores close to the traditional baseline. As shown in Figure 9, moving from traditional training with traditional testing to traditional training with static-range testing causes a significant drop in performance for actions up to the static length. In contrast, as shown in Figure 10, static sampling training with static-range testing remains stable for actions up to the static length. The static test crops an action video into several clips, so each clip contains a cropped action sequence. In the static test, the proposed training method outperforms the existing training method, which shows that it is more effective for cropped actions.

4.6. Comparison of Inference Time

  • Parameters and number of samples: The original model and AT-GCN are compared at the same inference time. The inference time is measured on the NTU RGB+D 120 test set with the same batch size of 16. To compare performance and inference time fairly, the existing and proposed methods are both measured on the same skeletons taken from the dataset labels. AT-GCN has two backbones and therefore twice the parameters of the original model. However, Figure 4 shows that AT-GCN has the same data flow as the original model when the number of samples is the same. In other words, AT-GCN matches the inference time of the original model simply by matching the number of samples. As shown in Table 7, with the same number of samples, the proposed methods achieve higher performance than the existing model at the same inference time.
  • Inference time and performance: The performance and inference time of the original model are used as a baseline. The inference time of each model is measured using the skeletons from the dataset labels. As shown in Table 8, the proposed methods achieve higher performance than the existing models at the same inference time. In the case of CTR-GCN [5], the inference time is higher than that of the original model because its channel-wise structure passes through the model twice.

5. Conclusions and Future Works

This study presents an Auxiliary Task Graph Convolution Network (AT-GCN) for practical skeleton-based action recognition. AT-GCN is a multi-stream model with multiple pathways, each operating at a different frame rate to produce a variety of features. The proposed method also uses auxiliary tasks: the feature from each frame rate is converted into its own loss, and these additional losses help the model learn a wider variety of features during training. AT-GCN achieves higher performance while maintaining the inference time of a traditional GCN. In addition, the static sampling method achieves higher performance in static range validation than the conventional training method. The T ensemble combines models that perform best for different action lengths. The proposed methods can be ensembled without additional modules, using only differences in sample range. When applied to AT-GCN, the T ensemble shows similar or higher performance gains than traditional ensemble methods.
The proposed methods were trained and tested with videos in which no human joints were occluded. In practical CCTV videos, however, human joints may sometimes be occluded, which can degrade performance because joints are an important resource for action recognition. In addition, the proposed methods divide the pathways by frame rate, but the way an action sequence changes is also closely related to action speed. For example, “looking at the phone” has a long duration, whereas “throwing” has a short duration. Future work will include different pathway structures to prevent performance degradation from an insufficient number of joints. It may also include dividing action types by action speed and using this feature to further support the multi-stream action recognition model.

Author Contributions

Conceptualization, J.C. and J.-M.P.; methodology, J.C.; writing—review and editing, J.C. and S.K.; supervision, S.K., C.-M.O. and J.-M.P.; project administration, S.K. and C.-M.O.; funding acquisition, S.K. and C.-M.O. All authors have read and agreed to the published version of the manuscript.


This work was supported by the Institute for Information and Communications Technology Planning and Evaluation (IITP) grant, funded by the Korean government (MSIT) (No. IITP-RS-2022-II221203), as part of the Regional Strategic Industry Convergence Security Core Talent Training Program (50%). It was also partially supported by the Innovative Human Resource Development for Local Intellectualization Program through the IITP grant funded by the Korean government (MSIT) (IITP-2024-RS-2022-00156287). Additionally, support was provided in part by the IITP under the Artificial Intelligence Convergence Innovation Human Resources Development Program (IITP-2023-RS-2023-00256629), funded by the Korean government (MSIT). It was also partially supported by the Ministry of SMEs and Startups (RS-2024-00435114).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are openly available in rose1 and github at 10.48550/arXiv.1604.02808, 10.48550/arXiv.1905.04757 and 10.1109/CVPR.2014.339, reference numbers [17,18,19].

Conflicts of Interest

Authors Chi-Min Oh and Jeong-Min Park were employed by the company SafeMotion. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.


  1. Zhang, S.; Zhang, L. Construction site safety monitoring and excavator activity analysis system. Constr. Robot. 2022, 6, 151–161.
  2. Sun, H.; Chen, Y. Real-time elderly monitoring for senior safety by lightweight human action recognition. In Proceedings of the 2022 IEEE 16th International Symposium on Medical Information and Communication Technology (ISMICT), Lincoln, NE, USA, 2–4 May 2022; pp. 1–6.
  3. Zhou, H.; Jiang, F.; Lu, H. Student Dangerous Behavior Detection in School. arXiv 2022, arXiv:2202.09550.
  4. Yan, S.; Xiong, Y.; Lin, D. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018.
  5. Chen, Y.; Zhang, Z.; Yuan, C.; Li, B.; Deng, Y.; Hu, W. Channel-wise topology refinement graph convolution for skeleton-based action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 13359–13368.
  6. Liu, Y.; Zhang, H.; Li, Y.; He, K.; Xu, D. Skeleton-based human action recognition via large-kernel attention graph convolutional network. IEEE Trans. Vis. Comput. Graph. 2023, 29, 2575–2585.
  7. Song, Y.-F.; Zhang, Z.; Shan, C.; Wang, L. Constructing stronger and faster baselines for skeleton-based action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 1474–1488.
  8. Liu, M.; Meng, F.; Chen, C.; Wu, S. Novel motion patterns matter for practical skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; pp. 1701–1709.
  9. Huang, X.; Zhou, H.; Wang, J.; Feng, H.; Han, J.; Ding, E.; Wang, J.; Wang, X.; Liu, W.; Feng, B. Graph contrastive learning for skeleton-based action recognition. arXiv 2023, arXiv:2301.10900.
  10. Li, S.; He, X.; Song, W.; Hao, A.; Qin, H. Graph diffusion convolutional network for skeleton based semantic recognition of two-person actions. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 8477–8493.
  11. Hu, H.; Fang, Y.; Han, M.; Qi, X. Multi-scale adaptive graph convolution network for skeleton-based action recognition. IEEE Access 2024, 12, 16868–16880.
  12. Pang, C.; Lu, X.; Lyu, L. Skeleton-based action recognition through contrasting two-stream spatial-temporal networks. IEEE Trans. Multimed. 2023, 25, 8699–8711.
  13. Feichtenhofer, C.; Fan, H.; Malik, J.; He, K. SlowFast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6202–6211.
  14. Jiang, J.; Chen, B.; Pan, J.; Wang, X.; Liu, D.; Jiang, J.; Long, M. ForkMerge: Mitigating negative transfer in auxiliary-task learning. Adv. Neural Inf. Process. Syst. 2024, 36.
  15. Cai, Y.; Sui, X.; Gu, G. Multi-modal multi-task feature fusion for RGBT tracking. Inf. Fusion 2023, 97, 101816.
  16. Candito, M. Auxiliary tasks to boost biaffine semantic dependency parsing. arXiv 2024, arXiv:2402.07682.
  17. Shahroudy, A.; Liu, J.; Ng, T.-T.; Wang, G. NTU RGB+D: A large scale dataset for 3D human activity analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1010–1019.
  18. Liu, J.; Shahroudy, A.; Perez, M.; Wang, G.; Duan, L.-Y.; Kot, A.C. NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 2684–2701.
  19. Wang, J.; Nie, X.; Xia, Y.; Wu, Y.; Zhu, S.-C. Cross-view action modeling, learning and recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 2649–2656.
  20. Ren, Y.; Lan, Z.; Liu, L.; Yu, H. EMSIN: Enhanced multi-stream interaction network for vehicle trajectory prediction. IEEE Trans. Fuzzy Syst. 2024, 1–15.
  21. Mao, K.; Zhu, J.; Su, L.; Cai, G.; Li, Y.; Dong, Z. FinalMLP: An enhanced two-stream MLP model for CTR prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; pp. 4552–4560.
  22. Yan, S.; Xiong, X.; Arnab, A.; Lu, Z.; Zhang, M.; Sun, C.; Schmid, C. Multiview transformers for video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 3333–3343. [Google Scholar]
  23. Yang, T.; Zhu, Y.; Xie, Y.; Zhang, A.; Chen, C.; Li, M. Aim: Adapting image models for efficient video action recognition. arXiv 2023, arXiv:2302.03024. [Google Scholar]
  24. Wu, W.; Wang, X.; Luo, H.; Wang, J.; Yang, Y.; Ouyang, W. Bidirectional cross-modal knowledge exploration for video recognition with pre-trained vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 6620–6630. [Google Scholar]
  25. Xing, Z.; Dai, Q.; Hu, H.; Chen, J.; Wu, Z.; Jiang, Y.-G. Svformer: Semi-supervised video transformer for action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 18816–18826. [Google Scholar]
  26. Wu, W.; Sun, Z.; Ouyang, W. Revisiting classifier: Transferring vision-language models for video recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; pp. 2847–2855. [Google Scholar]
  27. Wu, C.-Y.; Li, Y.; Mangalam, K.; Fan, H.; Xiong, B.; Malik, J.; Feichtenhofer, C. Memvit: Memory-augmented multiscale vision transformer for efficient long-term video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 13587–13597. [Google Scholar]
  28. Xiong, X.; Min, W.; Wang, Q.; Zha, C. Human skeleton feature optimizer and adaptive structure enhancement graph convolution network for action recognition. IEEE Trans. Circuits Syst. Video Technol. 2022, 33, 342–353. [Google Scholar] [CrossRef]
  29. Wei, C.; Deng, Z. A Novel Contrastive Diffusion Graph Convolutional Network for Few-Shot Skeleton-Based Action Recognition. In Proceedings of the ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 5780–5784. [Google Scholar]
  30. Tu, Z.; Zhang, J.; Li, H.; Chen, Y.; Yuan, J. Joint-bone fusion graph convolutional network for semi-supervised skeleton action recognition. IEEE Trans. Multimed. 2022, 25, 1819–1831. [Google Scholar] [CrossRef]
  31. Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Action recognition via pose-based graph convolutional networks with intermediate dense supervision. Pattern Recognit. 2022, 121, 108170. [Google Scholar] [CrossRef]
  32. Chi, H.-g.; Ha, M.H.; Chi, S.; Lee, S.W.; Huang, Q.; Ramani, K. Infogcn: Representation learning for human skeleton-based action recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 20186–20196. [Google Scholar]
  33. Lee, J.; Lee, M.; Lee, D.; Lee, S. Hierarchically decomposed graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 10444–10453. [Google Scholar]
  34. Zhang, P.; Lan, C.; Zeng, W.; Xing, J.; Xue, J.; Zheng, N. Semantics-guided neural networks for efficient skeleton-based human action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 1112–1121. [Google Scholar]
  35. Loshchilov, I.; Hutter, F. Sgdr: Stochastic gradient descent with warm restarts. arXiv 2016, arXiv:1608.03983. [Google Scholar]
  36. Liu, Z.; Zhang, H.; Chen, Z.; Wang, Z.; Ouyang, W. Disentangling and unifying graph convolutions for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 143–152. [Google Scholar]
Figure 1. Auxiliary task part of AT-GCN. The low and high pathways are defined by their frame rates. Each pathway outputs features through the GCN backbone. Via the low and high headers, the features of each frame rate are converted into scores, and the slow and fast losses are computed by comparing these scores with the ground truth.
Figure 2. Static range training process. Frames are sampled only within a static range of the entire video, and the location of the static range is randomized.
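The static range sampling in Figure 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `sample_static_range`, the even spacing of indices inside the window, and the handling of videos shorter than the window are all hypothetical choices.

```python
import random


def sample_static_range(num_frames, range_len, num_samples, rng=random):
    """Pick `num_samples` frame indices from a randomly placed window of `range_len` frames."""
    # Assumption: a video shorter than the window shrinks the window to the video length.
    window = min(range_len, num_frames)
    # The location of the static range is randomized (as stated in the caption).
    start = rng.randint(0, num_frames - window)
    # Assumption: indices are spread evenly across the window.
    step = window / num_samples
    return [start + int(i * step) for i in range(num_samples)]


if __name__ == "__main__":
    # Example values only: a 300-frame video, a 120-frame range, 36 samples.
    print(sample_static_range(num_frames=300, range_len=120, num_samples=36))
```

Because the window length is fixed, the sampler does not need the total video length to decide the sampling stride, only to place the window.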
Figure 3. How to use dummy poses. The original video is placed in the center of the new video. The first (violet) and last (orange) frames of the original video are used as dummy poses.
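The dummy-pose padding in Figure 3 can be sketched as below. The function name `pad_with_dummy_poses`, the even split of padding between the two sides, and the behavior for videos that are already long enough are assumptions made for illustration.

```python
def pad_with_dummy_poses(frames, target_len):
    """Center `frames` in a sequence of `target_len` frames.

    Copies of the first frame fill the front and copies of the last frame
    fill the back. `frames` must be non-empty.
    """
    missing = target_len - len(frames)
    if missing <= 0:
        # Assumption: videos that are already long enough are left unchanged.
        return list(frames)
    front = missing // 2    # dummy poses copied from the first frame
    back = missing - front  # dummy poses copied from the last frame
    return [frames[0]] * front + list(frames) + [frames[-1]] * back


if __name__ == "__main__":
    print(pad_with_dummy_poses(["a", "b", "c"], 8))
    # ['a', 'a', 'a', 'b', 'c', 'c', 'c', 'c']
```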
Figure 4. (a) Original model: creates one feature and one loss for one action; (b) The 2 Streaming model: the pathways are divided by sample rate, and the features generated by the GCN from each pathway are concatenated to create a fused feature; (c) AT-GCN: generates three losses from the slow, fast, and fused features. The training process uses the main task plus the auxiliary tasks, and the test process uses only the main task.
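The three-loss training objective in Figure 4c can be sketched as follows. The caption does not state how the fuse, slow, and fast losses are combined, so the use of cross-entropy, the simple weighted sum, and the `aux_weight` parameter are assumptions made for illustration only.

```python
import math


def cross_entropy(scores, label):
    """Negative log-likelihood of `label` under softmax(scores)."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - scores[label]


def at_gcn_training_loss(fuse_scores, slow_scores, fast_scores, label, aux_weight=1.0):
    """Main-task (fuse) loss plus the two auxiliary (slow, fast) losses."""
    fuse_loss = cross_entropy(fuse_scores, label)  # main task
    slow_loss = cross_entropy(slow_scores, label)  # auxiliary task 1
    fast_loss = cross_entropy(fast_scores, label)  # auxiliary task 2
    # Assumption: the losses are summed, with one shared weight on the auxiliary terms.
    return fuse_loss + aux_weight * (slow_loss + fast_loss)


if __name__ == "__main__":
    print(at_gcn_training_loss([2.0, 0.5], [1.0, 0.2], [1.5, 0.1], label=0))
```

At test time only the fused scores would be used, which matches the caption's statement that the test process runs the main task alone.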
Figure 5. The structure of the 1S and 2S Headers. In the 1S Header, features pass through global average pooling and softmax to create a score. The 2S Header concatenates the slow and fast features to create a fused feature.
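The two header operations named in Figure 5 can be sketched in plain Python. This follows only what the caption states: any classifier layer between pooling and softmax is omitted because the caption does not describe one, and concatenation along the channel axis is an assumption. All function names are hypothetical.

```python
import math


def global_average_pool(feature_map):
    """Average a list of per-position feature vectors into a single vector."""
    n = len(feature_map)
    return [sum(column) / n for column in zip(*feature_map)]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def header_1s(feature_map):
    """1S Header as captioned: global average pooling followed by softmax."""
    return softmax(global_average_pool(feature_map))


def fuse_2s(slow_feature, fast_feature):
    """2S Header fusion: concatenate the slow and fast feature vectors."""
    # Assumption: concatenation is along the channel axis.
    return list(slow_feature) + list(fast_feature)


if __name__ == "__main__":
    print(header_1s([[1.0, 3.0], [3.0, 1.0]]))
    print(fuse_2s([0.1, 0.2], [0.3, 0.4]))
```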
Figure 6. Performance depends on the sample range. Test performance is measured with different sample ranges. "Video frame" denotes the length of the video in frames, showing for which video lengths each sample range achieves higher accuracy.
Figure 7. The performance of the model depends on the sample range used during training. Results of training the original model with different sample ranges. "Video frame" denotes the length of the video in frames, showing for which video lengths each model achieves higher accuracy.
Figure 8. Scoring in the testing process. The entire video is split into five clips, and the clip length equals the sampling range used during training. The final score is the sum of the scores from each clip.
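The clip-based test scoring in Figure 8 can be sketched as below. The caption states that there are five clips and that their scores are summed; the even spacing of clip start positions and the helper names `clip_starts` and `fuse_clip_scores` are assumptions.

```python
def clip_starts(num_frames, clip_len, num_clips=5):
    """Start indices of `num_clips` clips of length `clip_len` (assumed evenly spaced)."""
    last_start = max(num_frames - clip_len, 0)
    if num_clips == 1:
        return [0]
    return [round(i * last_start / (num_clips - 1)) for i in range(num_clips)]


def fuse_clip_scores(clip_scores):
    """Sum per-class scores over all clips; return (summed scores, predicted class index)."""
    total = [sum(column) for column in zip(*clip_scores)]
    predicted = max(range(len(total)), key=total.__getitem__)
    return total, predicted


if __name__ == "__main__":
    print(clip_starts(num_frames=300, clip_len=100))
    print(fuse_clip_scores([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]))
```

Here each inner list in `clip_scores` stands for the per-class score the model produces for one clip.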
Figure 9. Performance when applying the static range at test time. Red: traditional training with the traditional test method. Blue: traditional training with the static range test method.
Figure 10. Performance when applying the static range to training. Blue: traditional training with the static range test method. Green: static range training with the static range test method.
Table 1. Comparison of top-1 accuracy on the NTU RGB+D 60, NTU RGB+D 120, and NW-UCLA datasets. AT indicates the auxiliary task module. D indicates the method that uses a different sample range for each pathway. Bold indicates the best performance of each GCN model.

| Methods | NTU-RGB+D 60 X-Sub (%) | NTU-RGB+D 60 X-View (%) | NTU-RGB+D 120 X-Sub (%) | NTU-RGB+D 120 X-Set (%) | NW-UCLA (%) |
|---|---|---|---|---|---|
| ST-GCN [4] | 87.4 | 92.9 | 82.2 | 83.0 | 90.5 |
| ST-GCN w/AT | 88.0 | 93.3 | 82.6 | 83.5 | 91.2 |
| ST-GCN w/AT w/D | 88.0 | 93.9 | 82.7 | 83.2 | 91.4 |
| MS-G3D [36] | 89.4 | 94.5 | 83.9 | 85.4 | 91.6 |
| MS-G3D w/AT | 89.6 | 94.7 | 84.2 | 85.6 | 92.9 |
| MS-G3D w/AT w/D | 89.5 | 94.9 | 84.2 | 85.7 | 92.7 |
| CTR-GCN [5] | 89.3 | 94.5 | 83.9 | 84.7 | 91.6 |
| CTR-GCN w/AT | 89.5 | 95.0 | 84.0 | 85.3 | 92.7 |
| CTR-GCN w/AT w/D | 89.6 | 95.0 | 84.2 | 85.3 | 92.9 |
| InfoGCN [32] | 90.0 | 94.4 | 86.0 | 87.1 | 92.9 |
| InfoGCN w/AT | 90.6 | 94.9 | 86.2 | 88.0 | 93.8 |
| InfoGCN w/AT w/D | 90.3 | 94.7 | 86.5 | 87.6 | 93.5 |
| HD-GCN [33] | 89.0 | 94.7 | 83.8 | 84.9 | 91.4 |
| HD-GCN w/AT | 89. | | | | |
| HD-GCN w/AT w/D | 89.2 | 95.2 | 84.3 | 85.5 | 92.2 |
Table 2. Comparison of performance by sample range. Shows how the performance of the original model varies with the sample range (Window). Bold indicates the best performance.

| Window | InfoGCN [32] (%) | HD-GCN [33] (%) |
|---|---|---|
Table 3. Comparison of Original, 2 Streaming, and AT-GCN. 2S indicates 2 Streaming. Bold indicates the best performance of each GCN model.

| Methods | X-Sub (%) | X-Set (%) |
|---|---|---|
| InfoGCN [32] | 86.0 | 87.1 |
| HD-GCN [33] | 83.8 | 84.9 |
Table 4. Comparison of AT-GCN by ensemble. Original is the ensemble method of the original model. T 2-en is an ensemble of models trained with the [120, 180] and [150, 210] sample ranges. 3-en adds the models trained on the [180, 240] range. Bold indicates the best performance.

| Methods | InfoGCN X-Sub (%) | InfoGCN X-Set (%) | HD-GCN X-Sub (%) | HD-GCN X-Set (%) |
|---|---|---|---|---|
| Original, 3-en | 87.4 | 88.7 | 85.3 | 86.7 |
| T, 2-en | 87.4 | 88.5 | 85.1 | 86.4 |
| T, 3-en | 87.6 | 89.0 | 85.5 | 86.8 |
Table 5. Performance on T ensemble of AT-GCNs. AT-GCN is the performance of AT-GCNs. 3-en is the ensemble results of AT-GCN models trained with [120, 180], [150, 210], and [180, 240] static ranges.

| Methods | X-Sub (%) | X-Set (%) |
|---|---|---|
| AT-GCN T 3-en | | |
Table 6. Comparison between traditional and static range training methods. "Original" denotes traditional training with the traditional test method, and "Full" denotes traditional training with the static test method; the remaining setting is static sampling training with the static test method. X-Sub and X-Set are the top-1 accuracy.

| Methods | InfoGCN [32] (%) | HD-GCN [33] (%) |
|---|---|---|
Table 7. Comparison of inference time and performance with parameters and number of samples. P is the number of model parameters (in millions), N is the number of samples, and Time is the inference time. AT indicates the auxiliary task module. D indicates the method that uses a different sample range for each pathway. X-Sub and X-Set are the top-1 accuracy. Bold indicates the best performance of each GCN model.

| Methods | X-Sub (%) | X-Set (%) | P (M) × N | Time (ms) |
|---|---|---|---|---|
| InfoGCN [32] | | | × 36 | 30 |
| InfoGCN w/AT w/D | 86.5 | 87.6 | 3.44 × 36 | 30 |
| HD-GCN [33] | 83.8 | 84.9 | 1.87 × 36 | 54 |
| HD-GCN w/AT w/D | 84.3 | 85.5 | 4.04 × 36 | 54 |
Table 8. Comparison of inference time and performance between the original models and AT-GCN. Time is the inference time. AT indicates the auxiliary task module. D indicates the method that uses a different sample range for each pathway. X-Sub and X-Set are the top-1 accuracy. Bold indicates the best performance of each GCN model.

| Methods | X-Sub (%) | X-Set (%) | Time (ms) |
|---|---|---|---|
| ST-GCN [4] | 82.2 | 83.0 | 16 |
| ST-GCN w/AT w/D | 82.7 | 83.2 | 16 |
| MS-G3D [36] | 83.9 | 85.4 | 59 |
| MS-G3D w/AT w/D | 84.2 | 85.7 | 60 |
| CTR-GCN [5] | 83.9 | 84.7 | 48 |
| CTR-GCN w/AT w/D | 84.2 | 85.3 | 72 |
| InfoGCN [32] | 86.0 | 87.1 | 30 |
| InfoGCN w/AT w/D | 86.5 | 87.6 | 30 |
| HD-GCN [33] | 83.8 | 84.9 | 54 |
| HD-GCN w/AT w/D | 84.3 | 85.5 | 54 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Cho, J.; Kim, S.; Oh, C.-M.; Park, J.-M. Auxiliary Task Graph Convolution Network: A Skeleton-Based Action Recognition for Practical Use. Appl. Sci. 2025, 15, 198.