1. Introduction
Gait recognition refers to identifying individuals by their distinct walking patterns, making it one of the most promising biometric technologies for identity recognition. In contrast to other biometric identification methods such as fingerprints, DNA, facial recognition, and vein recognition, gait recognition works with regular or low-resolution cameras at long range and does not require explicit cooperation from the subjects of interest. Therefore, gait recognition has demonstrated great development potential in the fields of crime prevention, video surveillance, and social security. Invariance to disturbances in gait recognition signifies the capability of achieving high recognition accuracy without being affected by various external factors, such as bag-carrying, clothing changes, cross-view conditions, and speed changes [1,2,3]. However, the performance of gait recognition is susceptible to the aforementioned external factors in real-world scenarios, which poses substantial challenges to achieving invariance to disturbances. Therefore, a multitude of state-of-the-art works have focused on achieving invariance to disturbances in gait recognition to ensure the reliability of gait recognition systems.
Benefiting from recent developments in deep learning, numerous powerful gait recognition algorithms have been developed [4,5,6,7,8,9,10,11] and proven effective under challenging conditions. Liao et al. [9] employed long short-term memory (LSTM) networks [12] to extract spatial and temporal features from human skeleton points. PoseGait [10] transformed 2D poses into 3D poses to enhance recognition accuracy by extracting additional gait features from the 3D poses. Shiraga et al. [11] utilized the gait energy image (GEI) to train their networks and learn covariate features for gait recognition. GaitSet [4] used 2D convolutional neural networks (CNNs) at the frame level to extract global features and treated gait silhouettes as a set to capture temporal features. GaitPart [7] extracted local gait feature representations by dividing the feature maps horizontally and utilized micro-motion features to focus on short-term temporal patterns. Huang et al. [8] introduced a 3D local CNN to extract sequence information from specific human body parts.
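For intuition, the set-based aggregation at the core of GaitSet [4] can be sketched as follows; the tensor shapes and the choice of max pooling over the frame dimension are illustrative assumptions rather than the exact published configuration.

```python
import torch

def set_pool(frame_feats: torch.Tensor) -> torch.Tensor:
    """Sketch of set pooling: frame-level 2D CNN feature maps are
    aggregated over the (unordered) frame dimension, yielding a
    representation invariant to frame order and sequence length.

    frame_feats: [N, T, C, H, W] per-frame feature maps.
    Returns: [N, C, H, W] set-level feature map.
    """
    return frame_feats.max(dim=1).values

feats = torch.randn(2, 30, 64, 16, 11)   # 2 sequences, 30 frames each
print(set_pool(feats).shape)             # torch.Size([2, 64, 16, 11])
```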
During walking, the human body exhibits distinctly different visual appearances and movement patterns across individuals. Spatial-temporal representations refer to the features derived from modeling silhouettes, which can effectively capture a pedestrian's visual appearance and movement patterns: spatial feature representations describe the visual appearance of the silhouettes, while temporal feature representations reflect their movement patterns. Together, spatial feature extraction and temporal modeling can yield rich, discriminative spatial-temporal representations that distinctly characterize an individual's walking process. Since individuals differ significantly in both visual appearance and movement patterns during walking, these representations play a pivotal role in effective gait recognition. Combining spatial and temporal representations captures both the visual appearance and the motion patterns of pedestrians, whereas relying on either alone leads to poor recognition accuracy. Therefore, jointly investigating motion learning and spatial mining can substantially improve the accuracy of gait recognition. Despite considerable efforts, the aforementioned methods still face the following challenges: (1) multi-scale feature extraction is needed to capture more robust spatial-temporal representations and thereby enhance accuracy, especially under appearance camouflage; (2) few methods explicitly take the view angle into consideration, and the detection or estimation of viewpoint has been somewhat overlooked, even though it can substantially improve the recognition ability of existing approaches.
With these considerations, we propose an advanced gait recognition network comprising a two-path spatial-temporal feature fusion module and a view embedding module. The two-path spatial-temporal feature fusion module consists of multi-scale feature extraction (MSFE), frame-level spatial feature extraction (FLSFE), and multi-scale temporal feature extraction (MSTFE). Firstly, MSFE is deployed to extract shallow features effectively, extending the receptive field and enabling the extraction of multiple internal features within different regions. Subsequently, we introduce a two-path parallel structure containing FLSFE and MSTFE, which aims to extract multi-scale spatial-temporal information across various granularities. In FLSFE, we develop an innovative residual convolutional block (R-conv) that captures both global and local gait features through a specially designed residual operation. Meanwhile, in MSTFE, we design independent temporal feature extraction branches with varying scales and integrate the temporal features in an attention-based way. To refine the extracted features, a view embedding module is constructed to reduce the negative impact of viewpoint differences: it uses view prediction learning to determine the best view and embeds the view information into the multi-scale spatial-temporal features to obtain the final representation. Extensive experiments on the two public datasets CASIA-B and OU-MVLP demonstrate that our method outperforms other state-of-the-art methods in gait recognition.
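To make the pipeline concrete, the following is a minimal, hypothetical sketch of how the proposed components could be wired together; the module internals, channel sizes, view count, and fusion scheme are placeholders of our own choosing, not the exact configuration of the paper.

```python
import torch
import torch.nn as nn

class TwoPathGaitNet(nn.Module):
    """Hypothetical wiring: MSFE -> two parallel paths (FLSFE and MSTFE)
    -> view embedding. Each sub-module is a stand-in placeholder."""
    def __init__(self, feat_dim=128, num_views=11):
        super().__init__()
        self.msfe = nn.Conv2d(1, 32, 3, padding=1)          # stands in for MSFE
        self.flsfe = nn.Conv2d(32, feat_dim, 3, padding=1)  # stands in for FLSFE (R-conv stack)
        self.mstfe = nn.Conv1d(32, feat_dim, 3, padding=1)  # stands in for MSTFE
        self.view_head = nn.Linear(feat_dim, num_views)     # view prediction
        self.view_embed = nn.Embedding(num_views, feat_dim)

    def forward(self, sils):                                # sils: [N, T, 1, H, W]
        n, t = sils.shape[:2]
        x = self.msfe(sils.flatten(0, 1))                   # shallow features per frame
        spat = self.flsfe(x).mean((2, 3)).view(n, t, -1).max(1).values
        temp = self.mstfe(x.mean((2, 3)).view(n, t, -1).transpose(1, 2)).max(-1).values
        feat = spat + temp                                  # fuse the two paths
        view = self.view_head(feat).argmax(-1)              # predicted view angle
        return feat + self.view_embed(view)                 # embed view into features

net = TwoPathGaitNet()
out = net(torch.randn(2, 30, 1, 64, 44))  # 2 silhouette sequences of 30 frames
print(out.shape)                          # torch.Size([2, 128])
```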
2. Related Work
Gait recognition: Current deep-learning-based gait recognition methods can be broadly classified into two categories: model-based [13,14,15,16] and appearance-based [4,5,6,7,8,17,18,19,20,21,22,23,24,25,26,27,28,29,30]. Model-based methods leverage the relationships between bone joints and pose information to create models of walking patterns and human body structures [13], such as OpenPose [14], HRNet [15], and DensePose [16]. These methods exhibit stronger robustness to clothing variation and carried articles. However, model-based methods depend on accurate joint detection and pose estimators, which can significantly increase computational complexity and may lead to inferior performance in certain scenarios. Conversely, appearance-based methods use gait silhouettes (the binary images shown in Figure 1) as the model's input and capture spatial and temporal characteristics from the silhouettes with CNNs [17]. Gait silhouettes describe the body state in a single frame at low computational cost, and detailed information in each silhouette image can be preserved directly from the original silhouette sequences. The silhouettes are the basis of appearance-based methods, and their quality directly affects the performance of a gait recognition system. Moreover, spatial feature representations can be obtained from the silhouette of an individual frame, representing appearance characteristics, while temporal feature representations can be captured from consecutive silhouettes, in which the relationship between adjacent frames reflects the temporal characteristics and motion patterns. Therefore, some appearance-based methods have overcome the challenges of pose estimation and achieved competitive performance [4,5,6,7,8,21,22,29]. Notably, the first open-source gait recognition framework, named OpenGait (https://github.com/ShiqiYu/OpenGait, accessed on 29 January 2023) [18], encompasses a series of state-of-the-art appearance-based methods for gait recognition. In this paper, our approach is categorized as appearance-based, and we rely on binary silhouettes as input data without the need for pose estimation or joint detection. By focusing on silhouettes, we aim to reduce the influence of variations in subjects' appearance, thereby enhancing the accuracy and robustness of our gait recognition approach.
Spatial feature extraction modeling: Regarding the range of feature representations in spatial feature extraction modeling, two main approaches are commonly used: global-based and local-based methods. Specifically, global-based methods explore gait silhouettes as a whole to generate global feature representations [4,5,12,17,19,20]. For instance, Shiraga et al. [12] proposed the GEI and utilized 2D CNNs to obtain global gait feature representations from GEIs. Similarly, GaitSet [4] and GLN [5] extracted global gait features at the frame level with 2D CNNs. Conversely, local-based methods usually segment the silhouettes into multiple parts to establish local feature representations [7,21,22] and focus more on learning the local information of different body parts. For example, Zhang et al. [21] split the human gait into four distinct segments and adopted 2D CNNs to capture more detailed information from each of these segments. Fan et al. [7] introduced the focal convolution layer (FConv), a novel convolution layer that divides the feature map into several parts to obtain part-based gait features. Qin et al. [22] developed RPNet to discover the intricate interconnections between the parts of gait silhouettes and then integrated them by a straightforward splicing process. GaitGL [23] and GaitStrip [24] both used 3D CNNs to construct multi-level frameworks that extract richer and more discriminative spatial features. In this paper, we design an innovative residual convolutional block (R-conv) in the spatial feature extraction module using 2D CNNs, which combines regular convolution with FConv to extract both global and local gait features and enhance the discriminative capacity of the feature representations.
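To illustrate, here is a minimal sketch of an FConv-style layer and of how a residual combination of regular convolution and FConv might look; the part count, kernel sizes, and the exact residual form are our assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn

class FConv(nn.Module):
    """Focal-convolution-style layer in the spirit of GaitPart [7]:
    split the feature map into horizontal parts and convolve each part
    independently with shared weights, restricting the receptive field."""
    def __init__(self, in_c, out_c, parts=4):
        super().__init__()
        self.parts = parts
        self.conv = nn.Conv2d(in_c, out_c, 3, padding=1)

    def forward(self, x):                       # x: [N, C, H, W]
        chunks = x.chunk(self.parts, dim=2)     # split along height
        return torch.cat([self.conv(p) for p in chunks], dim=2)

class RConv(nn.Module):
    """Sketch of an R-conv-style block: a regular (global) convolution
    and an FConv (local) branch combined residually. The combination
    actually used in the paper may differ."""
    def __init__(self, in_c, out_c):
        super().__init__()
        self.global_conv = nn.Conv2d(in_c, out_c, 3, padding=1)
        self.local_conv = FConv(in_c, out_c)

    def forward(self, x):
        return torch.relu(self.global_conv(x) + self.local_conv(x))
```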
Temporal feature extraction modeling: As a crucial cue for gait tasks, temporal feature extraction models generally employ approaches such as 1D convolutions, LSTMs, and 3D convolutions [25]. For example, Fan et al. [7] and Wu et al. [26] utilized 1D convolutions to model temporal dependencies and aggregated temporal information by concatenation or summation. Additionally, LSTM networks were built in [21,27] to preserve the temporal variation of gait sequences and fuse temporal information by accumulation. Moreover, some studies have employed 3D convolutions [28,29,30,31] to simultaneously extract spatial and temporal information from gait silhouette sequences. Lin et al. [32] introduced the MT3D network, which used 3D CNNs to extract spatial-temporal features at multiple temporal scales. However, 3D CNNs often bring complex calculations and encounter challenges during training. In this paper, we present a novel approach for temporal feature extraction modeling with 2D CNNs, which aggregates temporal information at different scales. By incorporating multi-scale temporal branches, our approach can capture rich temporal clues and empower the network to learn more discriminative motion representations adaptively.
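As an illustration, a multi-scale temporal branch of this kind could be sketched as follows; the kernel sizes and the softmax-attention fusion are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class MultiScaleTemporal(nn.Module):
    """Sketch of MSTFE-style temporal modeling: parallel 1D convolutions
    with different kernel sizes capture motion at several temporal
    scales, and an attention weighting fuses the branches."""
    def __init__(self, channels, scales=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(channels, channels, k, padding=k // 2) for k in scales)
        self.attn = nn.Conv1d(channels, len(scales), 1)

    def forward(self, x):                                         # x: [N, C, T]
        outs = torch.stack([b(x) for b in self.branches], dim=1)  # [N, S, C, T]
        w = self.attn(x).softmax(dim=1).unsqueeze(2)              # [N, S, 1, T]
        return (w * outs).sum(dim=1)                              # [N, C, T]

mst = MultiScaleTemporal(64)
print(mst(torch.randn(2, 64, 30)).shape)  # torch.Size([2, 64, 30])
```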
View-invariant modeling: Viewpoint change poses a formidable challenge in biometrics, particularly in face recognition and gait recognition. In contrast to face recognition, fewer methods in gait recognition have taken the viewpoint into consideration. He et al. [33] introduced a multitask generative adversarial network (GAN) trained with viewpoint labels as supervision. Chai et al. [34] adopted a different projection matrix for each view as a view embedding method and achieved considerable gains on multiple backbones. However, these methods often involve a large number of parameters, making them overly complex for effective cross-view gait recognition. Therefore, we propose a concise view model that applies view prediction learning to determine the best view and embeds the view information into our two-path spatial-temporal feature fusion module, which significantly enhances the robustness of our network to view changes and improves gait recognition performance across varying viewpoints.
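The general idea can be sketched roughly as follows; treating view prediction as classification over a discrete set of angles and injecting the embedding additively are our assumptions about the overall approach, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class ViewEmbedding(nn.Module):
    """Sketch of a view embedding module: predict the most likely view
    from the gait feature, embed that view, and inject it back into the
    spatial-temporal feature. Discretizing views into classes and
    additive injection are illustrative assumptions."""
    def __init__(self, feat_dim=128, num_views=11):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_views)  # view prediction head
        self.embed = nn.Embedding(num_views, feat_dim)

    def forward(self, feat):              # feat: [N, feat_dim]
        logits = self.classifier(feat)    # can be supervised by view labels
        view = logits.argmax(dim=-1)      # best (most probable) view
        return feat + self.embed(view), logits
```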
5. Discussion
Based on extensive comparative experimental results, our proposed model exhibits significant improvements in two key aspects: (1) Enhanced accuracy under the BG and CL conditions on the CASIA-B dataset. This improvement is attributed to the utilization of MSFE, FLSFE, and MSTFE. MSFE can expand the perceptual field and observe more detailed gait features in the spatial and temporal domains. In addition, FLSFE and MSTFE in a two-path parallel structure can extract multi-scale discriminative spatial and temporal features, which enhances the robustness of the proposed method, particularly under unfavorable conditions. (2) Enhanced accuracy on the OU-MVLP dataset. The proposed method demonstrates outstanding performance on this large-scale dataset because the view embedding module can predict the best view and embed the view angle into the multi-scale spatial-temporal features, which helps mitigate the intra-class variations resulting from view differences and enhances recognition ability.
In the future, we will further improve the performance of our proposed method in more complex test scenarios and on in-the-wild gait datasets, such as the GREW [43] and Gait3D [44] datasets. At the same time, whether our method overfits the CASIA-B and OU-MVLP datasets also needs to be verified on in-the-wild datasets and in real-world scenarios. Additionally, as silhouettes are easily disturbed by pedestrians' clothes and carried objects, it is essential to explore multi-modal gait recognition approaches that combine silhouettes, skeletons [45], and pose heatmaps [46].