1. Introduction
In recent years, there has been a shift towards implementing automated monitoring systems to augment human identification efforts and mitigate criminal activity [1]. Biometric-based techniques have emerged as prominent methods for identification and authentication, broadly categorized into behavioral and physical domains [2]. Physical biometrics utilize bodily attributes such as the face, ears, palm, fingerprint, retina, and iris, while behavioral biometrics encompass one's voice, signature, handwriting, keystrokes, and stride patterns [3]. Human gait recognition (HGR) stands out as a reliable biometric technique, reflecting distinctive walking patterns without requiring explicit consent from a subject [4,5]. The appeal of HGR lies in its ability to capture 24 different components of the human gait, including the angular displacement of body parts and joint angles [6]. This method has applications in various domains, including criminal identification and visual surveillance.
Machine learning (ML) and computer vision (CV) techniques play a crucial role in leveraging HGR for identification purposes [7]. HGR methods are generally classified as model-free or model-based approaches [8]. While model-based methods rely on static gait data and create high-level human models [1], model-free techniques extract features based on shifting body contours [2], offering computational efficiency and adaptability to covariates such as shadows and clothing changes [3].
To address the challenge of high-dimensional feature sets, various dimensionality reduction techniques such as entropy-based analysis and principal component analysis are employed [9]. Additionally, classifiers like decision trees (DTs) [10], support vector machines (SVMs) [11], and convolutional neural networks (CNNs) are utilized for gait classification tasks [11]. Despite challenges such as time-consuming silhouette-based feature extraction and problem-specific traditional feature extraction techniques, HGR continues to evolve through the integration of deep learning and classical methods, offering a diverse range of approaches for accurate individual identification [12].
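To make this classical pipeline concrete, the following minimal sketch (an illustration only, not the method proposed in this work) reduces a hypothetical high-dimensional gait feature matrix with principal component analysis and classifies it with a support vector machine; the feature dimensions, subject count, and hyperparameters are assumed for demonstration.

```python
# Minimal sketch of a classical gait pipeline: PCA for dimensionality reduction
# followed by an SVM classifier. X stands in for silhouette-derived gait features.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1024))    # 200 gait samples with 1024-D features (assumed)
y = rng.integers(0, 10, size=200)   # 10 subject identities (assumed)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

clf = make_pipeline(
    StandardScaler(),
    PCA(n_components=64),           # shrink the high-dimensional feature set
    SVC(kernel="rbf", C=10.0),      # classify the reduced features
)
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```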
In CNN models, features are typically extracted from either the fully connected layer or the average pooling layer [13]. Since many models are trained on raw images, noisy or irrelevant features may be retrieved. Upon examining this process, a small group of researchers has concluded that feature extraction occurs significantly later than model training [14]. Traditionally, static hyperparameters such as the learning rate, number of epochs, and momentum govern the training of CNN models. Additionally, several studies have employed feature selection techniques, including entropy-based selection [11], distance-based selection [15], and evolutionary algorithm-based selection [10], in an attempt to mitigate this issue. However, recent research has highlighted that certain significant traits may be eliminated during the selection process [16].
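As a generic illustration of extracting features from a CNN's average pooling layer (the backbone, input size, and weights here are assumptions, not the network used in this work), a torchvision model can simply be truncated before its fully connected classifier:

```python
# Sketch: pulling frame-level features from the global average-pooling layer of a
# generic CNN backbone (illustrative only; not the backbone used in this paper).
import torch
import torchvision.models as models

backbone = models.resnet18(weights=None)   # swap in pretrained weights if desired
backbone.eval()

# Keep every layer up to and including the average pooling, drop the final
# fully connected classifier.
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-1])

frames = torch.randn(8, 3, 224, 224)       # a batch of 8 gait frames (dummy data)
with torch.no_grad():
    feats = feature_extractor(frames).flatten(1)   # (8, 512) pooled feature vectors
print(feats.shape)
```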
The aforementioned studies relied on silhouettes and utilized data-driven networks to extract motion characteristics from silhouette sequences [17]. However, these approaches often overlook instantaneous motion properties or short-term motion aspects [18]. To tackle this issue, this study introduces dynamic stride flow representation (DSFR), which combines the original gait silhouettes with motion intensity and the actual instantaneous motion direction [19]. Additionally, DSFR leverages spatial and temporal contextual cues. While spatial conditions such as clothing may alter the body's apparent shape [20], the instantaneous motion remains consistent with that of the body. The DSFR method improves frame quality by applying histogram equalization before optical flow estimation with the Lucas–Kanade algorithm [8]. Enhancing the contrast in this way sharpens the motion cues and thereby improves motion estimation accuracy, which is crucial for applications such as human gait recognition [21]. The Lucas–Kanade method assumes constant pixel brightness over time and approximates motion as translational within local neighborhoods between consecutive frames [6]. Using a Taylor series expansion, the brightness constancy equation is linearized to estimate optical flow; the resulting overdetermined linear system is solved via least squares optimization, in which spatial and temporal gradients are used to compute motion vectors accurately.
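A minimal OpenCV sketch of these two steps, histogram equalization followed by pyramidal Lucas–Kanade optical flow, is given below; the file names, tracked-point selection, and window parameters are placeholders rather than the exact configuration used for DSFR.

```python
# Sketch: contrast enhancement followed by Lucas–Kanade optical flow between two
# consecutive gait frames (parameters and file names are placeholders).
import cv2
import numpy as np

prev = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)
curr = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)

# 1. Histogram equalization to sharpen contrast before motion estimation.
prev_eq = cv2.equalizeHist(prev)
curr_eq = cv2.equalizeHist(curr)

# 2. Points to track, e.g., corners along the silhouette boundary.
pts = cv2.goodFeaturesToTrack(prev_eq, maxCorners=200, qualityLevel=0.01, minDistance=5)

# 3. Pyramidal Lucas–Kanade: solves the linearized brightness-constancy equation
#    by least squares within each local window.
nxt, status, err = cv2.calcOpticalFlowPyrLK(
    prev_eq, curr_eq, pts, None, winSize=(15, 15), maxLevel=2
)

flow = (nxt - pts)[status.flatten() == 1]   # per-point motion vectors (dx, dy)
print("mean motion magnitude:", np.linalg.norm(flow, axis=-1).mean())
```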
An innovative approach in gait recognition is channel-wise attention and keypoint-based embedding. This method incorporates spatial dynamic features to enhance silhouette embeddings alongside comprehensive consideration of spatial–temporal dynamic appearance features using transformer-based attention [22,23,24,25]. By integrating spatial features with temporal data and a channel-wise attention mechanism, GaitSTAR effectively addresses the limitations associated with temporal pooling. Furthermore, to tackle challenging scenarios such as individuals wearing coats that obscure much of the leg region, GaitSTAR leverages model-based techniques utilizing human pose information. This integration enhances the fusion of spatial and temporal features. GaitSTAR harnesses the power of global and local convolutional neural networks [26,27] alongside human pose data and temporal attention mechanisms [28,29,30] to generate embeddings across multiple frames.
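The channel-wise reweighting idea can be sketched with a small squeeze-and-excitation-style module over set-level features; this is a simplified illustration with assumed tensor shapes, not GaitSTAR's exact attention blocks.

```python
# Simplified sketch of channel-wise attention reweighting over set-level gait
# features (illustrative only; not the exact GaitSTAR modules).
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation-style reweighting of feature channels."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels) frame-level features from the backbone.
        squeeze = x.mean(dim=1)                  # temporal squeeze -> (batch, channels)
        weights = self.fc(squeeze).unsqueeze(1)  # (batch, 1, channels) channel scores
        return x * weights                       # reweight every frame's channels

feats = torch.randn(4, 30, 256)                  # 4 sequences, 30 frames, 256 channels (assumed)
print(ChannelAttention(256)(feats).shape)        # torch.Size([4, 30, 256])
```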
For example, walking speed is a temporal condition that affects the gait cycle and phase while maintaining the same instantaneous motion direction. The optical-flow images and the actual silhouette sequence are cropped and merged, along with the outlines of the corresponding silhouettes, in a specific ratio to generate DSFRs. Furthermore, this study proposes Spatial–Temporal Attention-Based Feature Reweighting (GaitSTAR), built on DSFRs, to mitigate the effects of silhouette distortion commonly observed in gait sequences. The feature set transformation (FST) module plays a pivotal role in integrating image-level features into set-level representations derived from DSFRs. By facilitating the capture of long-range interactions, it serves as a potent mechanism for acquiring contextual information, which is particularly beneficial for enriching the representation of distant objects and bolstering confidence in discerning false negatives. Moreover, the dynamic feature reweighting (DFR) module offers a sophisticated approach to scaling the decoding space of features. The computation of attention distributions across the key embedding of each channel-wise dimension substantially augments the interaction and decoding of query and key elements across both temporal and spatial domains. According to our experimental results, GaitSTAR demonstrates superior performance on three established gait recognition benchmarks, namely the CASIA-B [31], CASIA-C [32], and Gait3D [33] datasets. The following are this work's main contributions:
We propose Spatial–Temporal Attention-Based Feature Reweighting (GaitSTAR) incorporating dynamic-feature weighting via the discriminant analysis of temporal and spatial features by passing them through a channel-wise architecture.
We introduce DSFRs to enhance the video frame quality, aiding with object feature extraction for improved motion estimation. Our FST architecture integrates image-level features into set-level representations, capturing long-range interactions. DFR further enhances feature decoding by enabling attention computation across key embedding channels, boosting query–key interactions in both temporal and spatial contexts.
The efficacy of GaitSTAR is substantiated through comprehensive tests and comparisons performed with the CASIA-B, CASIA-C, and Gait3D gait datasets, unveiling its superior performance over preceding techniques across both cross-view and identical-view scenarios.
The subsequent sections of this document are structured as follows: Section 2 delves into the interconnected works and fundamental concepts that inform our research. Section 3 offers an in-depth exploration of the DSFR, FST, and DFR components integral to GaitSTAR. Section 4 delineates the dataset specifics, parameterization, and comprehensive experimental analyses. Finally, Section 5 synthesizes the key findings derived from our research endeavors.
2. Related Works
This section first reviews the current research on using convolutional neural networks (CNNs) for human gait identification, and then it concentrates on techniques that are unique to gait feature extraction using enhanced motion estimation and attention-based detections.
Human gait recognition (HGR) has gained significant popularity in the domains of computer vision (CV) and deep learning in recent years [22,24,26,34]. Technological advancements have led to the development of numerous deep learning-based models aimed at mitigating covariate effects [14,33]. Ling et al. [35] presented a multi-temporal-scale gait identification approach that blends temporal characteristics at different temporal scales, using a 3D CNN and a frame pooling approach to handle mismatched inputs between video sequences and 3D networks. Ateep et al. [20] introduced a method for pose estimation that enhances model-based gait recognition by directly estimating optimal skeletal postures from RGB images. They effectively extracted gait characteristics and incorporated spatiotemporal techniques, proposing an improved model-based HGR strategy by combining a gait graph with skeletal poses using graph convolutional networks (GCNs). Zou et al. [36] also targeted human gait angles using MobileNet architectures in real-world scenarios, gathering axisymmetric gait data in unrestricted environments. Empirical evaluations using the CASIA-B dataset demonstrated improved performance compared to state-of-the-art techniques.
Comparing the outcomes of experiments using current approaches with the publicly accessible CASIA-B dataset reveals encouraging results. Hou et al. [37] proposed the Gait Lateral Network (GLN), an HGR approach that focuses on discovering discriminative and compact representations using silhouettes. GLN provides set-level and silhouette-level features by fusing deep CNNs with retrieved attributes. Experiments using the OUMVLP and CASIA-B datasets confirmed that the proposed strategy is effective, demonstrating higher accuracy than other approaches.
To achieve accurate human recognition performance, deep learning methods, particularly hybrid deep CNN models [29,30], are employed. Wen et al. [38] introduced a view-fusion approach and generative adversarial network (GAN) for human gait detection. The GAN is used to transform gait images, while a fusion model combines results from multiple views. Ghosh et al. [39] proposed a faster method based on region convolutional neural networks (R-CNNs) for object extraction and identification from video frames, utilizing walking patterns from gait sequences. Additionally, efficient HGR techniques have been explored using multi-stream approaches, part-fused graph convolutional networks, and graph convolutional networks [3]. Furthermore, Sharif et al. [2] introduced a deep learning-based method for feature extraction and classification that leverages transfer learning and kurtosis-controlled entropy.
While previous approaches have focused on feature reduction, classification, and feature extraction from silhouette images, they did not apply preprocessing steps to enhance the visual data in video frames, even though such steps could improve a model's learning capacity. To address these limitations, this research presents a deep learning-based HGR framework with automated Bayesian optimization.
4. Experiments and Results
This section presents the performance evaluation of Spatial–Temporal Attention-Based Feature Reweighting conducted using the publicly accessible CASIA-B [31], CASIA-C [32], and Gait3D [33] gait datasets. Detailed descriptions of these datasets, including their composition, training and testing sets, and gallery and probe subsets, are provided first. Subsequently, a comparative analysis is performed between the outcomes of GaitSTAR and alternative techniques. Finally, the effectiveness of each component of GaitSTAR is verified through the results obtained from ablation studies. This section thus provides a comprehensive overview of the experiments conducted, covering the dataset, experimental conditions, and obtained results.
4.1. Dataset
The CASIA-B dataset encompasses 124 subjects captured from 11 viewpoints spanning 0° to 180°, categorized into three walking scenarios: normal (NM), bag-carrying (BG), and coat-wearing (CL), as depicted in Figure 5. The first 62 subjects constitute the training set, while the remaining 62 subjects are designated for testing. Specifically, the first four regular walking sequences of each test participant are earmarked as the gallery set, while the remaining sequences are designated for the probe set. Similarly, the CASIA-C dataset [32], illustrated in Figure 5, comprises data from 153 individuals engaging in various walking scenarios [46], including normal walking (NW), fast walking (FW), slow walking (SW), and walking with a bag (BW), as shown in Figure 6 and Figure 7. Each subject contributes four NW sequences, two FW sequences, two SW sequences, and two BW sequences. The training set comprises the initial 26, 64, or 100 subjects, while the testing set encompasses the remaining 52 subjects. The NW sequences serve as the gallery set for each subject, with the remaining sequences allocated to the probe set.
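For clarity, the CASIA-B protocol described above can be expressed as a small splitting routine; the sequence records below are hypothetical metadata tuples, not files shipped with the dataset.

```python
# Sketch of the CASIA-B split described above: subjects 1-62 for training,
# subjects 63-124 for testing, NM#1-4 as the gallery, and the rest as probes.
def split_casia_b(sequences):
    """sequences: iterable of (subject_id, condition, seq_no, view) tuples."""
    train, gallery, probe = [], [], []
    for subject_id, condition, seq_no, view in sequences:
        if subject_id <= 62:                     # first 62 subjects -> training set
            train.append((subject_id, condition, seq_no, view))
        elif condition == "nm" and seq_no <= 4:  # NM#1-4 of test subjects -> gallery
            gallery.append((subject_id, condition, seq_no, view))
        else:                                    # everything else -> probe
            probe.append((subject_id, condition, seq_no, view))
    return train, gallery, probe

# Tiny usage example with made-up records.
records = [(1, "nm", 1, 90), (80, "nm", 2, 90), (80, "bg", 1, 54), (80, "nm", 5, 126)]
train, gallery, probe = split_casia_b(records)
print(len(train), len(gallery), len(probe))      # 1 1 2
```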
A substantial gait dataset dubbed Gait3D was collected in a supermarket. Recorded at 25 FPS with a resolution of 1080 × 1920 pixels, it encompasses 1090 h of footage. The division between the gallery and probe sets, as well as the training and testing splits, followed the standardized protocol for this dataset. The principal evaluation metric used was rank-1 accuracy.
4.2. Network Configuration
In our proposed work, we present the configuration of the deep-learning network tailored for GaitSTAR, a human gait recognition network. The network begins with an input layer that receives image sequences. These sequences are processed through multiple convolutional blocks with filter sizes such as 5 × 5, 3 × 3, and 2 × 2, each followed by a ReLU activation, with pooling layers placed between convolutional blocks where appropriate. The resulting features are then passed to the feature set transformation (FST) blocks, which transform the features extracted by the convolutional layers, and subsequently to the dynamic feature reweighting (DFR) blocks, which further process the features by reweighting them. Fully connected layers follow these blocks, and the output layer produces the final result Y of the network.
Training the network involves 100 epochs with a learning rate of 0.0001, using the Adam optimizer for efficient gradient descent. Each training batch consists of 32 samples, and we employ the categorical cross-entropy loss function to measure the disparity between predicted and actual gait classes. Rectified linear unit (ReLU) activation functions are applied to the hidden layers, including the transformer channels, while softmax activation is employed in the output layer to compute class probabilities. To prevent overfitting, we incorporate dropout regularization with a rate of 0.5. The network expects input images of size 128 × 128 pixels and is designed to classify gait into three distinct classes. For optimal performance, the training process requires access to a high-performance computing environment equipped with a GPU (graphics processing unit) with at least 8 GB of VRAM, 16 GB of RAM, and a multicore CPU. Additionally, efficient data handling and processing capabilities are necessary to manage the large volume of image data required for training. This configuration is crafted to balance computational efficiency and model complexity, ensuring robust performance in gait recognition tasks.
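A condensed PyTorch sketch of this training configuration is shown below. Only the hyperparameters stated above (128 × 128 inputs, three classes, dropout of 0.5, Adam with a learning rate of 0.0001, categorical cross-entropy, batches of 32, 100 epochs) are taken from the text; the layer counts, channel widths, and single-channel silhouette input are simplifying assumptions, and the FST/DFR blocks are omitted.

```python
# Condensed sketch of the stated training configuration; the FST/DFR blocks are
# omitted and the convolutional stack is a simplified placeholder.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 128, kernel_size=2), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Dropout(p=0.5),
    nn.Linear(128, 3),             # three gait classes; softmax is applied inside the loss
)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()  # categorical cross-entropy over class logits

def train(loader, epochs=100):
    model.train()
    for _ in range(epochs):
        for frames, labels in loader:   # frames: (32, 1, 128, 128) mini-batches
            optimizer.zero_grad()
            loss = criterion(model(frames), labels)
            loss.backward()
            optimizer.step()
```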
4.3. GaitSTAR Performance Comparison
Table 1 presents a performance comparison on the CASIA-B dataset across various gait recognition models, including Gaitset [47], MT3D [35], GaitGL [27], GaitPart [48], and MSGG [49], for three different training data sizes under different walking scenarios: normal walking (NM#5-6), bag-carrying (BG#1-2), and coat-wearing (CL#1-2). The compared models include SPAE, MGAN, Gaitset, Gait-D, GaitPart, GaitGL, GaitNet, GaitGraph, and GaitSTAR. GaitSTAR consistently outperformed the other models across all scenarios, achieving the highest accuracy scores under the NM, BG, and CL conditions. Specifically, GaitSTAR achieved an accuracy of 97.4% in NM#5-6, 86.7% in BG#1-2, and 68.3% in CL#1-2. In comparison, other models exhibited varying degrees of performance, with GaitPart and GaitGL showing competitive results, particularly in the NM and BG scenarios. However, GaitSTAR maintained a significant performance advantage across all scenarios, demonstrating its robustness and effectiveness in gait recognition tasks.
Table 2 illustrates that GaitSTAR achieved the highest rank-1 accuracy across a wide range of viewing angles. Initially, an analysis of the various scenarios detailed in Table 2 revealed that the progression from normal (NM) to bag-carrying (BG) and clothing (CL) situations escalated the complexity of gait recognition. The presence of bags or coats led to a gradual degradation in the accuracy of all techniques. Specifically, under the NM, BG, and CL conditions, the cutting-edge GaitGL approach achieved recognition accuracies of 97.2%, 94.5%, and 83.6% (utilizing the LT setting).
Correspondingly, our findings for GaitSTAR exhibited a parallel trend, with identification accuracies in these scenarios reaching 98.5%, 98.0%, and 92.7%, respectively. Scrutinizing the testing outcomes under the LT setting revealed that GaitSTAR outperformed GaitGL by approximately 1.1% and 3.5% in NM and BG, respectively, and by approximately 9.1% in CL. This underscores the superior performance of GaitSTAR, particularly in the BG and CL scenarios, reflecting its more discriminative representation compared to other state-of-the-art techniques, as shown in Figure 8. Given that the primary objective of GaitSTAR is to integrate human posture features, which offer a more informative gait description, particularly under the BG and CL circumstances, these results were anticipated. Regardless of the type of attire or bag worn by an individual, their skeletal structure remains discernible. Even in the small-sample (ST) and medium-sample (MT) training settings, GaitSTAR's rank-1 accuracy yields comparable results. Furthermore, considering that gait data may be captured from various viewing angles and environmental conditions, our performance demonstrates reliability across a spectrum of external factors.
In the LT scenario, the average rank-1 accuracy of GaitSTAR stands at 96.4%, surpassing MT3D and GaitGL by 6.0% and 4.6%, respectively. An examination of experimental findings using different training set sizes (i.e., the ST, MT, and LT settings of the CASIA-B dataset) reveals that, under NM conditions, MT3D achieved recognition accuracies of 82.8%, 94.4%, and 96.7%, respectively. Conversely, the proposed GaitSTAR approach achieved identification accuracies of 87.0%, 96.7%, and 98.5%, representing increments of 4.2%, 2.3%, and 1.8%, respectively. Thus, GaitSTAR exhibited greater advancements in the context of small-scale datasets. Additionally, a comparison with another multi-modality approach, MSGG [49], which incorporates posture characteristics in its framework, revealed that GaitSTAR consistently achieved the highest rank-1 accuracy in the BG and CL scenarios across almost all angles.
Moreover, the efficacy of GaitSTAR was evaluated using various training set sizes with the CASIA-C dataset. As depicted in Table 3, GaitSTAR achieved an average accuracy of 54.1% when trained on a cohort of 24 participants, escalating to 58.4% and 67.3% with training sets comprising 62 and 100 subjects, respectively. This underscores the positive correlation between the size of the training set and recognition accuracy, as larger sets facilitate a better delineation of boundary features and mitigate overfitting tendencies.
Additionally, the subpar quality of the gait sequences in CASIA-C introduced outlier data during optical flow extraction, resulting in overall performance inferior to that observed with CASIA-B. Nonetheless, GaitSTAR exhibited an average accuracy improvement of approximately 1.8%, and a notable 5% enhancement under fast-walking (FW) conditions, compared to the PSN method proposed in [54]. This advancement can be attributed to the spatial information provided by the silhouette's edge in GaitSTAR, which proves instrumental in refining recognition accuracy, particularly in FW scenarios where motion information may be susceptible to outliers, thereby yielding superior outcomes.
On average, GaitSTAR enhances accuracy by approximately 2.2% (CH), with an increase of roughly 6% (CH) under fast-walking (FW) conditions compared to the PSN method proposed in [35]. This improvement may be attributed to the silhouette edge in the DSFR, which furnishes valuable spatial information that enhances recognition accuracy, particularly under FW conditions where motion information may include outliers.
4.4. Evaluation on Gait3D
Table 4 displays the outcomes of model-free methodologies, model-based approaches, and our proposed GaitSTAR. A conspicuous disparity between laboratory-centric research and real-world deployment is evident, notably in the inferior overall performance of model-free techniques on the real-world Gait3D [33] dataset compared to controlled laboratory datasets such as CASIA-B. Notably, model-free methodologies treating frames as an unordered ensemble, exemplified by GaitSet [47], exhibited superior performance compared to those considering the sequential arrangement of frames, such as GaitPart [48], GLN [37], GaitGL [27], and CSTL [55]. This discrepancy likely stems from the inherent difficulty of capturing temporal dynamics in unbounded environments, where individuals may interrupt and resume walking along diverse trajectories and at varying velocities.
Of particular note is the subpar performance demonstrated by the GEI-based approaches, including GEINet [56], indicating that GEIs overlook crucial gait cues. Conversely, model-free techniques surpassed model-based methods on the Gait3D dataset, a finding that can potentially be attributed to the sparsity of the input provided by human-body joints, which lacks essential gait characteristics such as body morphology and appearance. Additionally, the unpredictability of walking pace and trajectory in real-world contexts poses challenges for modeling temporal dynamics. Moreover, in terms of rank-1 accuracy, GaitSTAR outperformed the cutting-edge SMPLGait [33] on the Gait3D dataset by a margin of 10.1%. For metrics such as mean average precision (mAP) and rank-5 accuracy, GaitSTAR likewise demonstrated superior performance, indicating its potential for gait-recognition applications in practical scenarios.
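The retrieval metrics reported here (rank-1, rank-5, and mAP) follow the standard formulation computed from a probe-to-gallery distance matrix; the sketch below illustrates that convention and is not code released with GaitSTAR.

```python
# Standard rank-k accuracy and mean average precision (mAP) from a
# probe-to-gallery distance matrix (conventional formulation, illustrative only).
import numpy as np

def rank_k_and_map(dist, probe_ids, gallery_ids, k=5):
    order = np.argsort(dist, axis=1)              # nearest gallery items first
    matches = gallery_ids[order] == probe_ids[:, None]

    rank1 = matches[:, 0].mean()
    rankk = matches[:, :k].any(axis=1).mean()

    aps = []                                      # average precision per probe
    for row in matches:
        hits = np.flatnonzero(row)                # ranks (0-based) of correct matches
        if hits.size == 0:
            aps.append(0.0)
            continue
        precision_at_hits = np.arange(1, hits.size + 1) / (hits + 1)
        aps.append(precision_at_hits.mean())
    return rank1, rankk, float(np.mean(aps))

# Toy usage with random embeddings standing in for gait features.
rng = np.random.default_rng(0)
probe, gallery = rng.normal(size=(10, 64)), rng.normal(size=(40, 64))
dist = np.linalg.norm(probe[:, None] - gallery[None], axis=-1)
print(rank_k_and_map(dist, rng.integers(0, 8, 10), rng.integers(0, 8, 40)))
```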
4.5. Ablation Studies
The outcomes of the ablation experiments carried out on the CASIA-B dataset are shown in Table 5, evaluating the effectiveness of the various GaitSTAR components. The baseline method, which directly utilizes GEI [56] for classification, is presented in the first row. The second row shows the results obtained by incorporating FST (feature set transformation) and DFR (dynamic feature reweighting) with GEI as the input. Notably, there was an improvement of approximately 21% under the BG and NM conditions, and under the CL conditions, the improvement reached approximately 37%. The third row demonstrates the utility of DSFR compared to GEI, with DSFR yielding an increase in accuracy of 3.1% in NM, 0.7% in BG, and 4.1% in CL conditions. Table 5 also reports the outcomes of the various GaitSTAR modules, emphasizing the effectiveness of FST and DFR. Utilizing FST as the permutation-invariant function yields the best accuracy compared to other popular functions such as the mean, max, median, and attention. The ablation studies further indicated that DFR contributes significantly to accuracy, with an increase of approximately 2–3% observed under all conditions when DFR is used.
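The baseline permutation-invariant functions compared in this ablation can be written as simple poolings over frame-level features; the following sketch shows generic mean, max, median, and attention poolings and is not the FST implementation itself.

```python
# Generic permutation-invariant poolings over the frame-level features of one
# gait sequence (baselines only; not the FST implementation).
import torch

def set_pool(frame_feats: torch.Tensor, mode: str = "mean") -> torch.Tensor:
    # frame_feats: (num_frames, feat_dim) image-level features of one sequence.
    if mode == "mean":
        return frame_feats.mean(dim=0)
    if mode == "max":
        return frame_feats.max(dim=0).values
    if mode == "median":
        return frame_feats.median(dim=0).values
    if mode == "attention":
        # Softmax-weighted sum; frames are scored against the mean feature
        # (a deliberately simple scoring choice for illustration).
        scores = frame_feats @ frame_feats.mean(dim=0)
        weights = torch.softmax(scores, dim=0)
        return (weights[:, None] * frame_feats).sum(dim=0)
    raise ValueError(f"unknown mode: {mode}")

feats = torch.randn(30, 256)   # 30 frames, 256-D features (assumed)
for m in ("mean", "max", "median", "attention"):
    print(m, set_pool(feats, m).shape)
```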
We explored the efficacy of the spatial–temporal attention mechanism in Table 6. Our experimentation leveraged the CASIA-B dataset, employing the LT parameters. Table 6 shows that employing average pooling in the spatial dynamic (SD) layer yielded accuracies of 96.9%, 92.2%, and 83.5%, while utilizing the spatial–temporal (ST) layer resulted in accuracies of 98.7%, 97.1%, and 83.8%, respectively. Hence, integrating the ST operation, rather than solely relying on SD, enhances performance. Moreover, it is evident that attention reweighting (AR) with ST + AR significantly boosts accuracy in the CL scenario, by 8.5% compared to ST alone. This substantial improvement is attributed to ST + AR's ability to extract both human positional and temporal information from gait sequences, thereby enhancing adaptability to changes in the external environment. Consequently, we opted for the "ST + AR" combination as the ultimate representation for multi-view gait recognition.
5. Analysis and Discussion
The GaitSTAR framework advances human gait recognition through its innovative architecture and effective handling of key challenges in the field. The framework’s design integrates three core components: dynamic stride flow representation (DSFR), feature set transformation (FST), and dynamic feature reweighting (DFR), each addressing critical issues in gait analysis.
DSFR enhances the video frame quality by combining silhouette-edge information with motion intensity and direction, mitigating silhouette distortion due to factors such as carrying conditions and clothing variations. This module captures both shape and motion characteristics, improving feature extraction accuracy. FST transforms image-level features into set-level representations using the Discriminant Common Vector (DCV) methodology and temporal attention mechanisms. This approach enhances feature richness by capturing contextual linkages and adapting to various viewing angles. DFR dynamically reweights features based on their discriminative power, refining feature interactions and improving recognition accuracy across diverse conditions.
Gait recognition faces challenges, including silhouette distortion, the need to capture short-term motion properties, cross-view recognition difficulties, and real-world deployment complexities. GaitSTAR addresses these challenges effectively. DSFR reduces silhouette distortion, enhancing frame clarity. FST and DFR capture and adapt to both instantaneous and dynamic gait features, improving recognition robustness across different viewing angles and conditions. The integration of advanced methodologies, such as histogram equalization for contrast enhancement and the Lucas–Kanade algorithm for motion estimation, supports GaitSTAR’s ability to handle real-world variations in walking conditions and environments.
GaitSTAR's effectiveness was quantitatively validated through several performance metrics. With the CASIA-B dataset, GaitSTAR achieved an average accuracy of 84.13%, outperforming models such as SPAE (40.2%) and MGAN (51.4%) and remaining competitive with advanced models such as GaitPart (89.1%) and GaitGL (88.7%). In evaluations using the Gait3D dataset, GaitSTAR achieved a rank-1 accuracy of 54.21%, a rank-5 accuracy of 72.76%, a mean average precision (mAP) of 44.15%, and a mean inverse negative penalty (mINP) of 27.07%. These results surpass those of models such as GEINet (rank-1: 7.00%; mAP: 6.05%), Gaitset (rank-1: 42.60%; mAP: 33.69%), GaitPart (rank-1: 29.90%; mAP: 23.34%), and GaitGL (rank-1: 23.50%; mAP: 16.40%).
Overall, GaitSTAR’s architecture, which combines DSFR, FST, and DFR, offers a robust solution to gait-recognition challenges. The superior performance metrics affirm its effectiveness in practical scenarios, marking it as a leading framework in the field.