1. Introduction
Gait recognition is a biometric authentication method that has garnered significant interest in fields such as security, surveillance, healthcare, and human–computer interaction. Unlike traditional biometric methods such as fingerprints or iris scans, gait recognition leverages unique walking patterns for identification, offering benefits such as remote identification, resistance to spoofing, and applicability in scenarios where other biometrics are impractical [
1].
The challenge in gait recognition lies in accurately identifying individuals based on their walking patterns, which involves extracting and analyzing gait features from video sequences. The initial research involved manual extraction and basic classification algorithms [
1], but advancements in computer vision, machine learning, and sensor technologies have allowed gait recognition to evolve into a sophisticated biometric method. Specifically, deep learning techniques, such as convolutional neural networks (CNNs), have revolutionized gait detection by enabling the automatic extraction of distinctive characteristics from gait sequences [
2]. The current research on deep learning for gait recognition explores various architectures, including CNNs, multi-layer perceptrons (MLPs), self-organizing maps (SOMs), and transfer learning models such as EfficientNet [
3]. Each architecture offers distinct advantages in terms of accuracy, computational efficiency, and interpretability. However, selecting the optimal model remains challenging because of factors such as dataset properties, feature representation, and training methodologies. Furthermore, the lack of standardized evaluation protocols complicates model comparison, making it difficult for researchers and practitioners to identify the most suitable approach for specific applications [
4,
5,
6]. In security, gait recognition can be used for surveillance systems that identify individuals from a distance without the need for active participation, offering a non-intrusive alternative to traditional biometrics such as face or fingerprint recognition. In healthcare, it has the potential to assist in monitoring patients with mobility impairments or neurological disorders, providing valuable insights into movement patterns for diagnosis and treatment [
1]. Additionally, advancements in gait recognition can enhance human–computer interaction, enabling more intuitive and natural interfaces for devices and systems. The methods explored in this study are highly relevant in advancing these applications by improving the accuracy and efficiency of gait recognition systems in real-world settings. This study addresses the problem of identifying the most effective deep learning model for gait recognition by evaluating the CNN, MLP, SOM, and EfficientNet architectures via the CASIA B dataset [
6]. The authors aimed to compare these models based on their performance, strengths, and limitations to provide a clearer understanding of their applicability in gait identification tasks. Additionally, this research highlights the need for more refined models and optimization techniques to increase the robustness and accuracy of gait recognition systems.
This work bridges this research gap by offering a detailed comparative study of deep learning models, thus advancing the understanding of their effectiveness in gait recognition and providing directions for future improvements. While the primary focus of this study is gait recognition for biometric applications, the findings also have potential implications for the neuroscience community. Gait analysis plays a crucial role in understanding motor control, neurological disorders such as Parkinson’s disease, and rehabilitation processes. The techniques developed in this study could aid in assessing and monitoring gait abnormalities, providing valuable insights into motor function and neurological health.
The research on human gait as a biometric identifier has been ongoing for several decades, originally including the manual extraction of gait data and basic classification algorithms [
1]. Advancements in computer vision, machine learning, and sensor technologies have since developed gait recognition into a sophisticated and effective biometric method. Deep learning methods, namely, convolutional neural networks (CNNs), have transformed gait recognition by enabling the automatic extraction of distinctive characteristics from gait sequences recorded by surveillance cameras or sensors.
Researchers have investigated several deep learning architectures for gait recognition tasks because of the growing need for robust and effective biometric identification solutions. These designs range from basic multi-layer perceptrons (MLPs) to more intricate models such as self-organizing maps (SOMs) and EfficientNet. Each model offers distinct benefits in terms of accuracy, computational efficiency, and interpretability, making them appropriate choices for different gait identification scenarios.
The specific research gap addressed in this paper involves the need for a more efficient and accurate deep learning model for human gait recognition, especially in challenging real-world scenarios. Prior studies focused on limited models, often overlooking the impact of environmental factors such as varying view angles, clothing styles, and carrying conditions, which reduce the reliability of gait recognition systems. This paper addresses this gap by evaluating four deep learning models—CNNs, MLP, SOMs, and EfficientNet—on the CASIA-B dataset to determine their effectiveness in overcoming these real-world challenges. This is important because existing models lack the robustness required for practical applications, and the study’s comparative analysis fills this gap by providing insights into the strengths and limitations of each model, contributing to the optimization of gait recognition systems for real-world use.
The innovative approach used in this research involves the comprehensive evaluation of multiple deep learning models, particularly focusing on how convolutional layers in CNNs and feature extraction techniques in EfficientNet can improve gait recognition performance. The use of a custom CNN architecture with additional convolutional layers allows better feature extraction, while EfficientNet introduces a novel approach to transfer learning in gait recognition. Furthermore, the integration of regularization techniques such as dropout layers and the fine-tuning of hyperparameters into the model enhances its robustness. These techniques differ from those of previous studies by optimizing both model accuracy and computational efficiency, aiming to balance real-time performance with high accuracy.
The results of this study could have practical implications for security, surveillance, and biometric systems, where the accurate and real-time identification of individuals based on their gait is crucial. The enhanced CNN model developed in this research could be highly suited to real-world applications, where conditions such as changing view angles, clothing variations, and environmental noise can degrade system performance. By optimizing model accuracy and computational efficiency, this research offers a scalable solution that can be implemented in real-time security systems, law enforcement tools, and healthcare monitoring, making it a valuable contribution to the study of the practical deployment of gait recognition technology.
Problem Statement: This research addresses several challenges in gait recognition, including recognizing gaits under varying conditions, such as changing view angles and walking styles, and carrying items such as a coat or bag, as well as the following:
Addressing the issue of distinct yet sometimes similar gaits among different subjects can lead to misclassification and reduced system performance.
Tackling the occasional failure of the two-step process of subject detection followed by classification because of incorrect subject detection and increased computational time.
Dealing with variations in clothing styles, making it difficult to extract the rich features necessary for accurate classification.
Managing irrelevant feature information extracted from original frames, which impacts the system accuracy and increases the computational time.
Contributions: The significant contributions of this study are as follows:
A comparative analysis of four deep learning models for gait recognition;
Insights into the strengths and weaknesses of each model;
Recommendations for optimizing deep learning architectures for gait identification;
An evaluation of model performance on a benchmark dataset to guide future research in selecting appropriate models for practical applications.
This method stands out from prior methods, such as the single-stream deep learning approach highlighted in [
2], as it not only enhances the accuracy but also improves the system’s robustness across various environmental conditions. Moreover, it addresses the computational inefficiencies found in traditional silhouette-based approaches by optimizing the computational load without sacrificing accuracy. This makes the framework more suitable for real-time applications, representing a significant advancement in the field of human gait recognition. The purpose of this study was to evaluate and compare various deep learning algorithms for human gait recognition (HGR) using the CASIA-B dataset. While we propose a modified CNN architecture, the main contribution lies in determining the most effective algorithm for HGR under different conditions, providing valuable insights into model selection for practical applications. The novelty of this work lies in its systematic evaluation of several models, including CNNs, MLP, SOMs, and EfficientNet, to determine the most effective algorithm for gait recognition under varying conditions (viewpoint variations, clothing variation, carrying conditions, occlusions, walking speed changes, etc.), offering significant improvements in accuracy and computational efficiency.
Structurally, this work is organized as follows.
Section 2 reviews the relevant literature on human gait recognition (HGR), highlighting the evolution of biometric techniques and the challenges associated with gait analysis.
Section 3 details the methodology employed, including the design and development of the deep learning models and the preparation of the CASIA-B dataset.
Section 4 presents the experimental setup and results, offering a comparative analysis of the performance of the CNN, MLP, SOM, and transfer learning models.
Section 5 discusses the implications of these findings, their impact on the field of biometric security, and potential applications and concludes this research by summarizing key insights, addressing limitations, and suggesting directions for future work.
2. Recent Advancements
This paper conducts a thorough investigation evaluating CNNs [
2], EfficientNet [
3], MLP [
4], and SOM [
5] models for gait identification in the face of the challenges outlined above. To analyze the performance, strengths, and limitations of each model, the CASIA-B dataset [
6], a commonly utilized benchmark dataset in gait recognition, was used. Our goal was to enhance the understanding of deep-learning-based gait recognition systems and guide the selection of suitable model architectures for practical use by comparing these models in different experimental settings.
Claudio Filipi Gonçalves dos et al. [
7] investigated the application of deep learning in gait identification, a technique used to identify individuals by analyzing their walking patterns. This text emphasizes the advantages of deep learning over traditional biometric methods such as fingerprints and iris scans. It also discusses how deep learning may enhance feature extraction and accuracy. This research presents a historical analysis of biometric identification methods, contrasts gait recognition with traditional methods, and delineates the strengths and weaknesses of nine frequently utilized gait recognition datasets.
The text examines current studies that utilize deep learning for gait identification, focusing on its advantages and limitations. The research argues for the use of more varied and authentic datasets that consider elements such as perspective, environment, and attention, and it promotes additional research on addressing the constraints of existing deep learning models in practical environments. This study is a valuable resource for academics and developers interested in deep-learning-based gait identification, as it emphasizes both the promise and the constraints of this technology and calls for further work to guarantee strong performance in real-world scenarios.
A. Saboor et al. [
8] focused on gait analysis via wearable sensors and machine learning (ML). The analysis of human movement provides valuable information in several sectors, such as healthcare, security, sports, and fitness. This study emphasizes the growth of machine learning research enabling the precise extraction of gait features. It also highlights problems such as small sample sizes, limited computational resources, and energy-efficiency concerns. The report recommends that researchers select suitable algorithms, sensors, and sensor placements on the basis of individual requirements and states that adequate sample sizes help minimize bias and lead to more robust findings. Looking ahead, the emphasis is placed on tackling unresolved issues such as generalizability and ethical concerns regarding data privacy.
B. Jawed et al. [
9] explored the application of gait analysis for reidentification, with a specific focus on the last ten years. The benefits of gait analysis include it serving as a strong identification cue and being suitable for surveillance applications that do not require subject cooperation. The review recognizes the difficulties posed by current technology, including changes in perspective and fluctuations in illumination; it classifies existing methods, evaluates the obtained results, and outlines future research paths. Possible applications include surveillance, security, forensics, investigation, and medical and fitness monitoring. This research suggests that additional breakthroughs in this sector can be achieved by investigating new sensor technologies and overcoming existing restrictions. Human gait recognition (HGR) is a biometric identification technique that uses video recordings of individuals walking to recognize them. Unlike fingerprint or voice recognition, HGR does not rely on active participation, making it appealing for security and surveillance purposes. HGR has key advantages, such as being nonintrusive, distance-friendly, and disguise-resistant. Challenges arise from fluctuations in accuracy caused by variables such as clothing, perspective, and walking pace, as well as ethical concerns related to involuntary surveillance. Advancements in deep learning methods and 3D sensors have the potential to enhance precision and resilience. HGR provides a distinctive method for identification that has potential uses in security, forensics, and other fields. It is essential that privacy concerns are addressed and algorithms are enhanced to ensure ethical and successful deployment.
Zhang et al. [
10] introduced a new approach to cross-view gait recognition via the Koopman operator theory, which captures the dynamic nature of gait patterns. The method uses a convolutional variational autoencoder to extract features from various viewpoints and approximate the Koopman operator, which represents overall gait dynamics. This method offers solid physical interpretability and improved performance, as demonstrated in experiments on the OU-MVLP dataset. Further research is needed to investigate its robustness to clothing variations and environmental changes, compare its performance with other dynamical system theories, and explore its applications beyond gait recognition.
Mutlag et al. [
11] analyzed the significance of feature extraction in image processing activities such as diagnosis, classification, and detection. This work offers a detailed discussion of several feature extraction approaches, such as geometric, statistical, texture, and color characteristics. The authors analyze the performance of various features in geometry and texture tasks by conducting a comparative study utilizing face and plant image datasets. The investigation revealed that the most effective features may differ depending on the type of image. This research highlights the importance of feature extraction in image processing, possible enhancements in feature selection, and the relevance of taking image-specific factors into account in decision-making. The article is a valuable resource for understanding and choosing feature extraction strategies.
Pandey et al. [
12] explored a technique for recognizing individuals by analyzing their walking pattern, utilizing sparse representation. The approach, which uses sparse combinations of established gait patterns, was evaluated on the CASIA-B database across five distinct viewing angles. The findings indicated an average identification rate of 96.5% across all angles, with a peak rate of 98.1% at a particular angle. The approach is invariant to view changes, resilient, and simple to execute, making it beneficial for surveillance purposes. The study could be strengthened by specifying the sparse representation method employed, indicating the size of the training dataset, and providing comparisons with alternative gait recognition systems.
Pansambal et al. [
13] analyzed a model-free method for human gait recognition (HGR), a biometric system that recognizes individuals by their walking style. The focus is on three main aspects: motion-free image representation, dimensionality reduction, and classification techniques. This report examines publicly accessible gait datasets for research and development, outlines existing research obstacles in model-free HGR, and suggests future paths for enhancing this field of biometrics. The intended audience comprises academics and developers who are interested in model-free gait recognition methods. Possible enhancements include discussing well-known model-free feature extraction techniques, emphasizing particular dimensionality reduction approaches, and sharing instances of popular classification algorithms.
Wu et al. [
14] presented an in-depth analysis of human identification via cross-view gait recognition. It leverages deep convolutional neural networks (CNNs) to address the challenges associated with varying viewpoints. This study explores multiple CNN architectures and training strategies to increase recognition accuracy. The authors introduce a novel gait representation that captures both spatial and temporal features. Extensive experiments demonstrate significant improvements over traditional methods, validating the effectiveness of deep learning approaches in handling cross-view variations. This paper also discusses the impact of different network parameters and the importance of large-scale gait datasets. This research contributes to advancing the robustness and applicability of gait-based human identification systems across diverse real-world scenarios.
Chao et al. [
15] introduced a novel approach to cross-view gait recognition by treating gait sequences as sets. This method leverages the set-based perspective to address the variations in viewing angles. The GaitSet model uses a deep learning framework to extract comprehensive gait features, which are then aggregated to improve recognition performance. The authors conducted extensive experiments on large-scale datasets, demonstrating that GaitSet significantly outperforms existing methods. The paper also highlights the robustness of the approach in handling diverse scenarios and complex backgrounds. The proposed model is noted for its efficiency and effectiveness in real-world applications, contributing to the advancement of gait recognition technology. This work represents a substantial step forward in the field, emphasizing the importance of innovative feature representation techniques.
Mehmood et al. [
16] presented an innovative approach to human gait recognition via pre-trained convolutional neural networks (CNNs). The proposed system is an end-to-end solution that leverages pre-trained CNN models to extract and select relevant gait features. This method enhances recognition accuracy by utilizing the powerful feature extraction capabilities inherent in pre-trained models. The authors conducted extensive experiments to validate their approach, which demonstrated significant improvements in recognition performance compared with traditional methods. Their system is efficient, showing robustness across different datasets and varying conditions. This study highlights the importance of feature selection in improving the overall effectiveness of gait recognition systems. This work contributes to the field by providing a scalable and accurate solution for gait-based human identification.
Several recent studies have explored deep learning models for gait recognition using the CASIA-B dataset. For example, Pandey et al. [
12] achieved an accuracy of 96.50% using a multiview sparse representation approach. Mehmood et al. [
16] reported 94.26% accuracy with pre-trained CNNs. In comparison, the proposed CNN model achieved an accuracy of 97.12%, demonstrating improved performance due to the optimized architecture and hyperparameter tuning. This enhancement addresses challenges such as varying viewing angles and clothing conditions more effectively, contributing to the robustness and scalability of gait recognition systems. While prior research on the CASIA-B dataset has explored different architectures, such as CNNs and transfer learning models, few studies have performed direct comparisons of these models under varying environmental conditions. This study addresses this gap by evaluating four distinct deep learning architectures, optimizing them for computational efficiency and generalization across diverse gait patterns.
Currently, the research on human gait recognition is making significant strides, especially with the advent of deep learning techniques. However, notable gaps and limitations remain in this domain. Most existing methods, such as those utilizing single-stream deep learning models, often struggle with the variability introduced by different covariate factors, such as clothing changes, carrying conditions, and multiview environments. For example, Saleem F. et al. [
17] focused primarily on feature fusion but did not fully address the challenge of recognizing gait across multiple views and varying conditions. Moreover, Asif M. et al. [
18] attempted to address these issues, but they often relied on complex preprocessing steps, and their networks were computationally intensive, making them less suitable for real-time applications. These approaches also tend to perform sub-optimally when dealing with large, diverse datasets, limiting their scalability and practical utility.
Deng, M. et al. [
19] introduced a novel frontal-view gait recognition method that enhances human gait recognition accuracy by leveraging gait dynamics and deep learning. Unlike most previous works, which rely on lateral-view parameters, the authors propose the use of frontal-view gait features—kinematic, spatial ratio, and area features. To further improve recognition, temporal dynamics in human walking are captured, and deep learning techniques are incorporated to optimize performance. The method calculates similarities between the test and training gait dynamics and uses an error-based feature fusion scheme for robustness against walking variations. However, a gap exists in the paper’s focus on improving frontal-view recognition, as challenges related to environmental factors in real-world applications, such as occlusions or lighting variations, are not extensively explored.
Hou, S. et al. [
20] critically evaluated silhouette-based gait recognition systems, arguing that while they demonstrate high performance in academic settings, there are significant gaps when they are applied to real-world environments. The authors identify issues such as oversimplified evaluation protocols and sensitivity to noise from factors such as occlusions and rotations. They also introduce the Multi-Height Gait (MHG) dataset, which features walking data for different camera heights, clothing variations, and carried objects. This highlights the gap between academic performance and practical application, emphasizing the need for more complex, realistic evaluations. This work addresses the limitations of existing datasets but lacks comprehensive solutions for noise sensitivity and dataset augmentation.
Qin, L et al. [
21] proposed a method utilizing two-stream convolutional neural networks (two-stream CNNs) for gait recognition based on multisensor data, such as inertial and pressure sensors. The innovative aspect of this work is the progressive feature fusion (PFF) module, which optimally integrates multistage features, and the feature enhancement channel attention (FECA) module, which highlights crucial data. The attention-based feature fusion (ABFF) module further enhances recognition by allowing flexible feature fusion across different sensor modalities. The results show superior recognition accuracy compared to previous approaches. A gap in this study is its limited discussion on how this multisensor approach could be generalized to uncontrolled, large-scale real-world environments, where sensor placement and variance could affect performance.
Sepas-Moghaddam, A. et al. [
22] provided a comprehensive survey of recent developments in gait recognition, focusing on how deep learning has transformed this field since 2015. The authors propose a new taxonomy to categorize state-of-the-art methods based on body representation, temporal representation, feature representation, and neural architecture. They discuss current gait datasets, evaluation protocols, and the challenges in applying deep learning to gait recognition. The review highlights the performance improvements made using these methods but also emphasizes the challenges that remain, such as dataset limitations, scalability issues, and real-world applicability. Future research directions should be aimed toward more robust and adaptable solutions, although the study leaves open questions on how to address these challenges in practice. Previous studies have highlighted challenges such as sensitivity to viewpoint changes, occlusions, and clothing variations. The proposed model addresses these challenges by leveraging deeper convolutional layers and optimized feature extraction techniques, which enhance robustness and improve accuracy under diverse conditions. A comparative analysis of gait recognition is presented in
Table 1.
Recent studies using the CASIA dataset, such as those by Vasudevan et al. [
23] and Alotaibi et al. [
24], have explored deep learning models for gait recognition, achieving promising results. However, these works have focused primarily on a single architecture or feature fusion methods. In contrast, our proposed method compares multiple deep learning models (CNNs, MLP, SOMs, and EfficientNet) and introduces a more comprehensive evaluation framework. Mehmood et al. [
25] proposed a feature selection framework, while Gul et al. [
26] incorporated spatio-temporal features in multiview gait recognition. Our approach builds on these methods by evaluating the performance of multiple models under varying conditions to provide a more generalized understanding of model robustness. Additionally, the work by Hasan et al. [
27] on evaluating CNN models for gait recognition using the CASIA-B dataset aligns closely with our study, but our approach improves upon their methodology by optimizing hyperparameters and testing in diverse conditions.
This research seeks to fill these gaps by providing a systematic comparative deep learning framework that not only improves recognition accuracy across diverse conditions but also optimizes computational efficiency, making it more applicable to real-time scenarios. By refining the feature extraction pipeline and optimizing hyperparameters across the evaluated architectures, this study advances the state of the art in human gait recognition, offering a more robust and scalable solution. This study focuses on gait recognition using visual data from the CASIA-B dataset. While radar technology has shown promise in other gait recognition applications due to its ability to capture movement data in various environmental conditions [
29], the current work does not explore this modality. Future research could investigate the integration of radar technology into deep learning models to improve the robustness and accuracy of gait recognition under challenging conditions, such as poor lighting or occlusions.
3. Materials and Methods
3.1. Dataset Details
This study aimed to identify the most suitable deep learning algorithm for HGR by evaluating and comparing four different models—CNNs, MLP, SOMs, and EfficientNet—using the CASIA-B dataset. The focus was not only on proposing an improved network but also on understanding the strengths and limitations of each architecture to guide future model development. The CASIA-B dataset is a well-known benchmark commonly used in gait recognition studies. It contains sequences from 124 subjects, each recorded under variations in viewing angle, clothing, and carrying condition, with each subject captured from 11 unique viewpoints, providing a wide array of walking patterns for examination. The 11 perspectives represent different viewing angles (0°, 18°, 36°, …, 180°). The common angles for the CASIA-B dataset are as follows:
At 0°, the subject walks directly toward the camera (front view); at 90°, the subject walks perpendicular to the camera (side view); at 180°, the subject walks directly away from the camera (back view); and the intermediate angles (18°, 36°, 54°, 72°, 108°, 126°, 144°, and 162°) provide oblique views between these extremes.
Each angle provides unique insights into gait patterns, allowing for robust evaluation across diverse viewpoints. The ten classes correspond to individual subjects, ensuring comprehensive representation in the training and testing phases.
The AA subset of the CASIA-B dataset, which includes gait sequences from 10 individuals, was used for our research. A total of 92,596 images were obtained from this subset, and each image was reduced to 64 × 64 pixels for consistent data representation. For the self-organizing map (SOM) model, histogram of oriented gradient (HOG) features were extracted from the images; these features capture gradient details and preserve spatial arrangements and variations in intensity. The HOG features were computed via the scikit-image package in Python.
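For illustration, the following minimal sketch shows how such HOG descriptors could be computed with scikit-image; the specific parameter values (9 orientations, 8 × 8 cells, 2 × 2 blocks) are common defaults assumed here and are not settings reported in this study.

```python
import numpy as np
from skimage.feature import hog
from skimage.transform import resize

def extract_hog(gray_frame, size=(64, 64)):
    """Resize a grayscale gait frame and compute its HOG descriptor."""
    frame = resize(gray_frame, size, anti_aliasing=True)
    return hog(frame,
               orientations=9,            # assumed default
               pixels_per_cell=(8, 8),    # assumed default
               cells_per_block=(2, 2),    # assumed default
               feature_vector=True)

# Stack descriptors for a list of silhouette frames into one feature matrix
# frames: iterable of 2D grayscale arrays
# hog_features = np.vstack([extract_hog(f) for f in frames])
```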
The training set consisted of 74,081 images, whereas the remaining 18,515 images were used for testing. The meticulous dataset selection approach guaranteed that our work was carried out on a sample of gait sequences that accurately represent the population, enabling a strong assessment and comparison of several deep learning models for gait detection.
The CASIA-B dataset was selected for this study due to its established role as a benchmark in human gait recognition (HGR) research. It offers a comprehensive range of conditions, including multiple viewing angles and variations, making it ideal for evaluating and comparing model performance. Its widespread use allows for consistent comparisons with existing studies, facilitating the assessment of the performance and generalizability of the proposed models. This dataset’s diversity ensures a thorough evaluation, simulating real-world challenges and enhancing the reliability of the gait recognition system. Additionally, the CASIA-B dataset offers controlled and consistent data collection, ensuring high-quality, noise-free gait sequences, which are crucial to accurate deep learning model evaluation. Its multi-angle structure helps assess model performance across varying perspectives, a common challenge in real-world gait recognition systems. The dataset’s availability and detailed documentation make it accessible for research and replication, promoting transparency and collaboration within the scientific community. Furthermore, CASIA-B includes diverse subject demographics, which enhances the model’s ability to generalize across different populations and physical characteristics.
The dataset was split into training and testing sets with a stratified approach to ensure balanced representation from each of the 10 individuals. Specifically, 80% of the images from each class (individual) were randomly allocated to the training set, while the remaining 20% were reserved for testing. This method ensures that each class contributes equally to both the training and testing phases, preventing class imbalance and promoting fair evaluation across all subjects. By maintaining proportional representation from each subject, the model learns from a diverse set of gait patterns, improving generalizability and reducing the risk of bias toward specific classes. This approach also minimizes overfitting and ensures consistent performance assessment across different viewing angles and conditions. The data distribution table is shown in
Table 2.
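As a minimal sketch of this stratified 80/20 split, assuming the frames and their subject labels have already been loaded into arrays X and y, scikit-learn's train_test_split could be used as follows:

```python
from sklearn.model_selection import train_test_split

# X: array of gait images, y: integer subject IDs (10 classes)
# stratify=y keeps the 80/20 proportion within every subject class,
# mirroring the balanced split described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.20,
    stratify=y,
    random_state=42,  # seed assumed for reproducibility
)
```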
3.2. Convolutional Neural Networks (CNNs)
In this study, fundamental deep learning models such as CNNs, MLP, SOMs, and EfficientNet were selected. These models were chosen to establish baseline performance metrics and provide a systematic evaluation of their effectiveness in gait recognition tasks. By starting with these foundational architectures, we aimed to understand their strengths and limitations, creating a benchmark for future comparisons with more advanced neural networks. Convolutional neural networks (CNNs) have transformed several areas of computer vision, such as gait detection. CNNs excel at extracting spatial features from images, making them well suited for identifying patterns in gait sequences. The ability of these methods to detect edges, shapes, and textures ensures effective gait recognition. CNNs are adept at capturing complex patterns and information from photos, which makes them ideal for identifying tiny variations in gait shapes. By using convolutional layers, CNNs can automatically learn hierarchical representations of gait data, incorporating spatial and temporal aspects. The networks have shown exceptional performance in gait detection tests, with high accuracy rates and resilience to fluctuations in illumination, perspective, and background clutter. This work uses a customized CNN architecture designed for gait recognition, which consists of many convolutional layers and dense layers for classification. Our goal is to utilize CNNs to recognize individuals precisely by analyzing their walking patterns.
Figure 1 presents the overall architecture of the proposed methodology.
3.2.1. Convolutional Layers
Several Conv2D layers with increasing filter counts (32, 64, 128, 128) were employed to capture spatial characteristics from the input images. ReLU activation functions add nonlinearity to the model, helping capture intricate patterns in the data. MaxPooling layers were added to lower the spatial resolution and computational load while maintaining important characteristics. Initially, the selected dataset frames were utilized as inputs to the convolutional layer. A convolutional layer applies a set of $M$ filters, each spanning $T$ channels with spatial size $A \times B$, to a mini-batch of $X$ images with $T$ channels and size $\text{Height} \times \text{Width}$. The filter elements are denoted by $E_{w,x,y,z}$, and the image elements are denoted by $F_{g,h,i,j}$. The convolutional operation is defined as follows:
$$O_{g,w,i,j} = \sum_{h=1}^{T} \sum_{x=1}^{A} \sum_{y=1}^{B} E_{w,x,y,h} \, F_{g,h,\,i+x-1,\,j+y-1}$$
The output of a whole image/filter combination can be written as follows:
$$O_{g,w} = \sum_{h=1}^{T} F_{g,h} * E_{w,\cdot,\cdot,h}$$
where $*$ represents the two-dimensional correlation.
Figure 2 presents the convolutional layers arrangement in the proposed convolutional neural network.
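The following Keras sketch illustrates one way to arrange the convolutional base described above, using the stated filter counts (32, 64, 128, 128), 3 × 3 kernels, ReLU activations, and 2 × 2 max pooling; the exact placement of the pooling layers and the grayscale input shape are assumptions.

```python
from tensorflow.keras import layers, models

def build_conv_base(input_shape=(64, 64, 1)):
    """Convolutional feature extractor with ascending filter counts."""
    model = models.Sequential()
    model.add(layers.Conv2D(32, (3, 3), activation="relu",
                            input_shape=input_shape))
    model.add(layers.MaxPooling2D((2, 2)))   # downsample while keeping salient features
    model.add(layers.Conv2D(64, (3, 3), activation="relu"))
    model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Conv2D(128, (3, 3), activation="relu"))
    model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Conv2D(128, (3, 3), activation="relu"))
    return model
```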
3.2.2. Dense Layers
Following the convolutional layers, dense layers with decreasing units (512, 256, 128) were used to learn more complex representations of the retrieved data. ReLU activation functions were utilized for nonlinearity, and dropout layers with a rate of 0.5 were added for regularization to prevent overfitting.
3.2.3. Fully Connected Layer
The CNN-extracted features were concatenated in the fully connected layer, $f_{\text{full}} = f_{\text{CNN}}$. The classification result was subsequently produced via a SoftMax operation, as described by the following formulation:
$$y_p = \mathrm{SoftMax}\!\left(w_p \, f_{\text{full}} + b_p\right)$$
where $p$ indexes the output sample, the weight matrix of the output layer is denoted by $w_p$, and the output bias is denoted by $b_p$. The fully connected (FC) layer operates on a flattened input, where all neurons are linked to each input.
3.2.4. Output Layer
The final dense layer contains a number of units equal to the number of classes in the dataset (num_classes). The SoftMax activation function was used to calculate class probabilities for multiclass classification, allowing the model to predict outcomes for all recognized classes. The model employs four convolutional layers, which are critical in feature extraction. Each layer captures increasingly complex patterns within the input data, starting from basic edges in the early layers to more sophisticated features in the deeper layers. The choice of four layers reflects a balance between model complexity and computational efficiency, allowing adequate feature learning without overfitting.
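A sketch of the corresponding classification head, combining the dense units (512, 256, 128), the 0.5 dropout rate, and the SoftMax output described above, is given below; the helper name and exact layer ordering are illustrative rather than a verbatim reproduction of the implemented network.

```python
from tensorflow.keras import layers, models

def add_classification_head(conv_base, num_classes=10):
    """Append flatten, dense, dropout, and SoftMax layers to a conv base."""
    model = models.Sequential([conv_base, layers.Flatten()])
    for units in (512, 256, 128):
        model.add(layers.Dense(units, activation="relu"))
        model.add(layers.Dropout(0.5))            # regularization rate of 0.5
    # One unit per subject class; SoftMax yields class probabilities
    model.add(layers.Dense(num_classes, activation="softmax"))
    return model
```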
Table 3 represents the configuration parameters of the modified convolutional neural network model.
The model starts with 32 filters in the first two convolutional layers, which are doubled to 64 and then 128 in the subsequent layers. This progressive increase in the number of filters allows the network to capture more detailed and abstract features as the depth of the network increases. This strategy helps the model learn a wide range of features, which is essential in accurately distinguishing between different classes in the dataset.
A kernel size of 3 × 3 is used for all the convolutional layers, which is a common choice in convolutional neural networks (CNNs). This kernel size is effective in detecting small patterns in the input data, providing a good balance between spatial resolution and computational efficiency. This allows the model to capture fine details in gait patterns, which is critical in gait recognition.
MaxPooling layers with a 2 × 2 window are used after certain convolutional layers to reduce the spatial dimensions of the feature maps. This operation helps in downsampling the data and reducing the number of parameters and computational load, while also making the model more robust to spatial variations in the input data. The 2 × 2 pooling size is standard and helps maintain important features while discarding less critical information.
The network includes a single fully connected (dense) layer before the output layer. This layer combines the features extracted by the convolutional layers and maps them to the final output classes. The use of one fully connected layer indicates a straightforward approach, focusing on maintaining a simple model structure that is less prone to overfitting while still being capable of learning the necessary feature combinations for accurate classification. The CNN model underwent extensive hyperparameter tuning, including adjustments to the learning rate, batch size, and regularization techniques, to optimize performance and prevent overfitting.
The ReLU activation function is applied to the convolutional layers. The ReLU is widely used because of its ability to introduce nonlinearity to the model while avoiding issues such as the vanishing gradient problem. This allows the network to learn complex patterns effectively. For the output layer, the sigmoid activation function is used, which is appropriate for binary or multilabel classification tasks. It outputs a probability value between 0 and 1 for each class, making it suitable for the final classification decision. A learning rate of 0.05 is used in the training process. This value dictates how much the model’s parameters are adjusted with respect to the loss gradient.
A learning rate of 0.05 is relatively moderate, with the aim of finding a balance between learning too rapidly (which could lead to instability) and learning too slowly (which could make the training process inefficient). The model uses a mini-batch size of 64, which means that 64 samples are processed before the model’s parameters are updated. This size is a common choice, offering a good trade-off between computational efficiency and the stability of gradient estimates. This helps achieve faster convergence while maintaining the stochastic nature of gradient descent.
Stochastic gradient descent (SGD) is employed as the learning method. SGD updates the model parameters via small batches of data, which introduces some noise into the training process but often leads to better generalization and faster convergence. Despite being simpler and requiring less memory than more advanced optimizers such as Adam, SGD can be highly effective when paired with an appropriate learning rate schedule and momentum.
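Under the stated configuration (SGD, learning rate 0.05, mini-batch size 64), compiling and training such a model might look like the sketch below, reusing the build_conv_base and add_classification_head helpers sketched earlier; the momentum value, epoch count, and loss function are assumptions not specified in the text.

```python
from tensorflow.keras.optimizers import SGD

# X_train/X_test: image tensors of shape (N, 64, 64, 1); y_*: integer labels
model = add_classification_head(build_conv_base(), num_classes=10)
model.compile(
    optimizer=SGD(learning_rate=0.05, momentum=0.9),  # momentum assumed
    loss="sparse_categorical_crossentropy",           # integer subject labels
    metrics=["accuracy"],
)
history = model.fit(X_train, y_train,
                    validation_data=(X_test, y_test),
                    epochs=50,        # epoch count assumed
                    batch_size=64)
```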
The chosen configuration strikes a balance between model complexity, computational efficiency, and generalization ability. The four-layer convolutional network with progressively increasing filters and standard kernel sizes is well suited for capturing intricate features in gait data. ReLU activation in the hidden layers and the sigmoid function in the output layer ensure nonlinear learning and appropriate probability outputs for classification. This configuration is particularly tailored to the given task and dataset, aiming to achieve high accuracy in gait recognition. Algorithm 1 clarifies the detailed sequence of the working method of the modified convolutional neural network model.
Algorithm 1: The algorithm of the modified CNN model
1. Initialize an image data generator with rescaling and a validation split.
2. Create a training data generator: load images from the data directory, resize them to the target size, apply data augmentation techniques, and split the data into training and validation subsets.
3. Create a validation data generator with similar settings.
4. Extract features with the CNN.
5. Flatten the output from the convolutional layers.
6. Add an output layer with a SoftMax activation function for multiclass classification.
7. Find the dominant class of the input.
8. Return the result.
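A minimal Keras sketch of steps 1–3 of Algorithm 1 is shown below; the directory path, target size, and validation fraction are placeholders, and augmentation options beyond rescaling are omitted for brevity.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Step 1: image data generator with rescaling and a validation split
datagen = ImageDataGenerator(rescale=1.0 / 255, validation_split=0.2)

# Steps 2-3: training and validation generators from the same directory
train_generator = datagen.flow_from_directory(
    "data/casia_b_frames",        # placeholder path
    target_size=(64, 64),
    color_mode="grayscale",
    class_mode="categorical",
    batch_size=64,
    subset="training",
)
val_generator = datagen.flow_from_directory(
    "data/casia_b_frames",        # placeholder path
    target_size=(64, 64),
    color_mode="grayscale",
    class_mode="categorical",
    batch_size=64,
    subset="validation",
)
```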
Multi-layer perceptrons (MLPs) are a traditional yet efficient method for gait identification problems. MLPs process flattened feature vectors instead of raw image input, which enhances computing efficiency and simplifies implementation compared with CNNs. Although simple, MLPs are capable of learning intricate nonlinear connections between gait characteristics and individual identities. Our work uses an MLP architecture with many dense layers and dropout regularization to avoid overfitting. We input flattened gait silhouettes into the MLP to identify key characteristics and patterns that represent distinct gait signatures. We intend to investigate the effectiveness of MLPs in gait identification and evaluate their performance in comparison with more intricate convolutional architectures.
3.2.5. Flatten Layer
The flatten layer, similar to the CNN design, converts 3D picture input into a 1D vector for processing by the following dense layers.
3.2.6. Dense Layers
Successive dense layers with decreasing unit counts (512, 256, 128) progressively learn and refine abstract representations of the input. ReLU activation functions were used to introduce nonlinearity, and dropout layers with a rate of 0.5 were included for regularization.
3.2.7. Output Layer
The output layer of the MLP model was similar to that of the CNN model, utilizing a dense layer with units equal to the number of classes in the dataset. The SoftMax activation function enables multiclass classification.
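The MLP described in Sections 3.2.5–3.2.7 could be sketched as follows; the grayscale input shape and helper name are illustrative assumptions.

```python
from tensorflow.keras import layers, models

def build_mlp(input_shape=(64, 64, 1), num_classes=10):
    """Flatten the silhouette and pass it through shrinking dense layers."""
    model = models.Sequential([layers.Flatten(input_shape=input_shape)])
    for units in (512, 256, 128):
        model.add(layers.Dense(units, activation="relu"))
        model.add(layers.Dropout(0.5))   # regularization, as in the CNN head
    model.add(layers.Dense(num_classes, activation="softmax"))
    return model
```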
3.3. Self-Organizing Map (SOM)
The self-organizing map (SOM) provides a unique approach to gait identification through the use of unsupervised learning techniques. SOMs are adept at grouping and displaying complex, high-dimensional datasets, making them suitable for analyzing the structural patterns within gait silhouette data. By arranging comparable gait patterns into topological maps, SOMs can offer insights into the inherent properties of gait signatures and aid in exploratory research. Our research employs a SOM model that is initialized as a grid of neurons and trained on gait silhouette data, searching for hidden patterns and clusters that could provide new insights into gait dynamics and variability.
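As an illustration, a SOM of this kind could be trained on the HOG feature vectors with the MiniSom package; the map size, neighborhood width, learning rate, and iteration count below are assumptions rather than the settings used in this study.

```python
from minisom import MiniSom

# hog_features: array of shape (n_samples, n_hog_dims), e.g., from the HOG sketch above
grid_w, grid_h = 10, 10                       # assumed map dimensions
som = MiniSom(grid_w, grid_h, hog_features.shape[1],
              sigma=1.0, learning_rate=0.5)   # assumed hyperparameters
som.random_weights_init(hog_features)
som.train_random(hog_features, num_iteration=10000)

# Each sample maps to its best-matching unit (BMU); clusters of BMUs
# correspond to groups of similar gait patterns on the topological map.
winners = [som.winner(x) for x in hog_features]
```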
3.4. EfficientNet Transfer Learning
The use of pretrained models such as EfficientNet for transfer learning has become a valuable method in gait recognition problems. Transfer learning allows gait recognition models to be trained effectively with limited labeled data by utilizing representations acquired from extensive image datasets. EfficientNet provides a scalable and computationally efficient architecture that can adapt to different gait recognition challenges. In our work, we utilize EfficientNet as a feature extractor through transfer learning and fine-tune the model’s classification head for gait identification. We aim to leverage the pretrained features generated by EfficientNet to obtain enhanced performance and generalization in gait recognition tasks.
3.4.1. Pretrained Model
The pretrained EfficientNet model, which was initially trained on the ImageNet dataset, was utilized as a feature extractor. By freezing its weights, the model retained the robust and generalized feature representations that it had previously learned. This approach allows the model to leverage its comprehensive understanding of diverse visual patterns, applying this knowledge to human gait recognition without retraining the entire network from scratch. This expedited the training process and helped mitigate the risk of overfitting, which is particularly beneficial given the relatively small size of the CASIA-B dataset compared to datasets typically used for training such large networks.
3.4.2. Classification Head
Building upon the robust features extracted by the frozen layers of EfficientNet, a custom classification head was designed to tailor the model specifically to the gait recognition task. This new head consists of a sequence of layers starting with a flattened layer, which converts the multidimensional feature maps into a one-dimensional vector. This was followed by a dense layer with 512 units, which utilized the ReLU activation function to introduce nonlinearity and ensure that the model could capture complex relationships within the data. To enhance generalization and reduce the risk of overfitting, a dropout layer was included with a regularization rate of 0.5, randomly deactivating half of the units during training. This configuration aimed to optimize the model’s ability to differentiate between individuals based on their gait patterns while maintaining the computational efficiency and compactness characteristics of EfficientNet.
3.4.3. Output Layer
The EfficientNet model’s output layer is similar to the CNN and MLP designs, consisting of a dense layer with units corresponding to the number of classes in the dataset. The SoftMax activation function facilitated multiclass categorization.
Figure 3 shows the interior block models of EfficientNet.
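A minimal transfer-learning sketch consistent with Sections 3.4.1–3.4.3 is given below; EfficientNetB0 is assumed as the variant, and because the Keras EfficientNet models expect three-channel input, grayscale gait frames would need to be replicated across channels before being fed to the network.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import EfficientNetB0

def build_efficientnet_classifier(num_classes=10, input_shape=(64, 64, 3)):
    """Frozen EfficientNet feature extractor with a custom classification head."""
    base = EfficientNetB0(include_top=False, weights="imagenet",
                          input_shape=input_shape)
    base.trainable = False                     # freeze pretrained ImageNet weights

    return models.Sequential([
        base,
        layers.Flatten(),
        layers.Dense(512, activation="relu"),  # dense layer from Section 3.4.2
        layers.Dropout(0.5),                   # regularization rate of 0.5
        layers.Dense(num_classes, activation="softmax"),
    ])
```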
5. Conclusions and Future Research Directions
This research analyzed the effectiveness of four deep-learning models in gait identification by utilizing the CASIA-B dataset. The CNN model outperformed all the other models, obtaining the highest accuracy of 97.12%. The MLP model achieved an accuracy of 59.23%, surpassing the SOM and EfficientNet models but not reaching the performance level of the CNN model. The SOM model demonstrated poor performance, with an accuracy of just 24.58%, indicating its inability to capture intricate nonlinear correlations in gait data. Transfer learning with the EfficientNet technique yielded unsatisfactory results, with an accuracy of only 11.76%. Compared to recent works [
12,
13,
14,
15,
16,
17,
18,
19,
20,
21,
22,
23,
24,
25,
26,
27,
28] utilizing the CASIA-B dataset, the proposed approach demonstrated superior accuracy and robustness. By addressing key challenges such as varying viewpoints and environmental conditions that are covered in this standard dataset, the model not only improves upon existing performance benchmarks but also offers a more reliable solution for real-world gait recognition applications.
While the study shows promising results, future work will focus on evaluating the generalizability of the CNN classifier using unseen subject folds. This step is essential in ensuring the robustness of the model in real-world gait recognition applications. While this study focuses on visual data for gait recognition, there has been increasing interest in alternative technologies, such as Wi-Fi signal-based and RFID-based systems. Wi-Fi signals can be leveraged for non-intrusive, contactless gait analysis by analyzing signal variations caused by the movement of individuals [
30]. Similarly, RFID technology can be used for tracking and identifying subjects, offering an additional layer of information for gait recognition systems [
31]. Although not considered in this study, these technologies show promise in enhancing gait recognition accuracy, particularly in environments where visual data may be limited or compromised.
This study not only provides a thorough evaluation of deep learning models for gait recognition using the CASIA-B dataset but also contributes to ongoing research by comparing multiple architectures and addressing the limitations noted in recent studies by Mehmood et al. [
25] and Gul et al. [
26]. Our findings offer valuable recommendations for selecting the most effective model for practical gait recognition applications, ensuring robustness and accuracy under diverse conditions. In addition to visual-based methods, several other technologies have been widely applied to Human Activity Recognition (HAR). For instance, wearable sensors such as accelerometers and gyroscopes, vision headsets, and smart watches are commonly used to capture motion and orientation data for activity classification. Future work will explore additional sensor technologies to further enhance the accuracy and robustness of the system. The study suggests prioritizing CNNs when high accuracy is required, considering simpler architectures where ease of implementation is a priority, and exploring alternative training strategies for EfficientNet. This work significantly contributes to the advancement of gait recognition technology, offering insights that are critical in enhancing its real-world applicability. By addressing the challenges associated with different viewing angles and exploring the capabilities of deep learning models such as EfficientNet, this research lays the groundwork for more accurate, reliable, and scalable biometric systems. These improvements can lead to enhanced security, better law enforcement tools, and expanded use in healthcare, making this research highly impactful in various practical domains. Future research will focus on developing specialized deep learning architectures for gait recognition, studying advanced feature extraction methods, enhancing model generalizability for real-world variability, and addressing ethical and privacy issues related to gait recognition technologies. By further investigating deep learning for gait identification, we can fully realize its promise in many applications in security, surveillance, healthcare, and other fields.