Article

MFCF-Gait: Small Silhouette-Sensitive Gait Recognition Algorithm Based on Multi-Scale Feature Cross-Fusion

Chenyang Song, Lijun Yun and Ruoyu Li
1 College of Information, Yunnan Normal University, Kunming 650500, China
2 Engineering Research Center of Computer Vision and Intelligent Control Technology, Department of Education of Yunnan Province, Kunming 650500, China
* Author to whom correspondence should be addressed.
Sensors 2024, 24(17), 5500; https://doi.org/10.3390/s24175500
Submission received: 11 July 2024 / Revised: 21 August 2024 / Accepted: 23 August 2024 / Published: 24 August 2024
(This article belongs to the Special Issue Artificial Intelligence and Sensor-Based Gait Recognition)

Abstract

Gait recognition based on gait silhouette profiles is currently a major approach in the field of gait recognition. In previous studies, models typically used gait silhouette images sized at 64 × 64 pixels as input data. However, in practical applications, cases may arise where silhouette images are smaller than 64 × 64, leading to a loss in detail information and significantly affecting model accuracy. To address these challenges, we propose a gait recognition system named Multi-scale Feature Cross-Fusion Gait (MFCF-Gait). At the input stage of the model, we employ super-resolution algorithms to preprocess the data. During this process, we observed that different super-resolution algorithms applied to larger silhouette images also affect training outcomes. Improved super-resolution algorithms contribute to enhancing model performance. In terms of model architecture, we introduce a multi-scale feature cross-fusion network model. By integrating low-level feature information from higher-resolution images with high-level feature information from lower-resolution images, the model emphasizes smaller-scale details, thereby improving recognition accuracy for smaller silhouette images. The experimental results on the CASIA-B dataset demonstrate significant improvements. On 64 × 64 silhouette images, the accuracies for NM, BG, and CL states reached 96.49%, 91.42%, and 78.24%, respectively. On 32 × 32 silhouette images, the accuracies were 94.23%, 87.68%, and 71.57%, respectively, showing notable enhancements.

1. Introduction

Gait recognition refers to the technique of identifying individuals or analyzing behaviors by analyzing their walking or movement patterns. It utilizes unique motion patterns and biological characteristics generated by the human body during walking, converting them into identifiable features [1]. Due to its non-invasive and difficult-to-disguise nature, gait recognition holds significant application potential in forensic and security domains. However, it faces numerous challenges in practical applications due to the complexity of real-world scenarios.
From a data perspective, current research in gait recognition can be categorized into model-based and silhouette-based approaches [2]. Model-based methods construct models from relationships such as the positions of human body keypoints, using 2D, 3D, or SMPL models to extract gait features through data learning [3]. Silhouette-based methods focus on extracting gait features from gait silhouette images. With the rise of deep learning, gait recognition research has also integrated deep learning techniques. Among model-based methods, PoseGait [4] utilizes 3D body pose modeling, GaitGraph [5] employs graph convolutional networks on 2D skeletal models, and HMRGait [6] achieves end-to-end gait recognition using pre-trained human body mesh models based on the SMPL model [7]. Model-based methods are robust to noise from different clothing, but their accuracy degrades significantly with low-resolution footage or distant subjects. In contrast, silhouette-based methods are widely applicable, simple to implement, and computationally inexpensive. GaitSet [8] treats sequences of gait silhouettes as sets, using convolutional neural networks to extract gait sequence features and compressing temporal information through max pooling to approximate mathematical set features; this approach is influential due to its simplicity and efficiency. GaitPart [9] focuses on the local details of gait silhouettes, combining spatial segmentation and temporal micro-motion capture modules to enhance feature extraction. GaitGL [10] combines global and local convolution operations so that they complement each other, addressing the drawbacks of purely global or purely local approaches. The GaitBase model from the OpenGait [11] project was meticulously designed through extensive experimentation, selecting several streamlined, widely used, and validated modules that are integrated into a concise yet robust baseline model.
In studies on silhouette-based gait recognition, it has been a long-standing practice to preprocess gait silhouettes with the Takemura method [12] to a size of 64 × 64 for training and testing. Some models also use larger 128 × 128 silhouettes to achieve better recognition rates [13]. However, this practice overlooks the possibility that models in real-world applications may encounter smaller silhouettes, such as those with a resolution of 32 × 32. Compared with the larger sizes, 32 × 32 silhouettes lose more image detail and information, leading to a drop in recognition accuracy, and mainstream silhouette-based models such as GaitSet, GaitPart, and GaitGL suffer significant accuracy losses on such small silhouettes. In complex real-world scenarios, for example when subjects are far from the camera, original silhouette resolutions below 64 × 64 are common; this occurs with the standard high-definition surveillance cameras in common use today, especially at distances exceeding 30 m. Maintaining high accuracy at these smaller resolutions would substantially extend the effective detection range of cameras, reducing deployment costs and broadening practical applications, which makes it highly valuable in real-world settings.
This paper proposes two methods to address the issues caused by small silhouettes. First, a super-resolution interpolation algorithm is applied at the input stage to standardize input silhouettes to 64 × 64, attempting to compensate for the geometric detail lost in small silhouettes while normalizing the data. The experimental results show improved accuracy across silhouettes of different resolutions. In addition, improving the super-resolution interpolation algorithm used in the Takemura preprocessing stage also benefits recognition rates for 64 × 64 silhouettes. Before deep learning became popular, several studies already improved gait recognition accuracy through super-resolution, such as Makihara’s work [14], which constructed high-frame-rate sequences from multiple low-frame-rate gait sequences based on temporal cycles, and Zhang Jun’s approach [15], which recovered lost high-frequency information by combining neighborhood embedding with interpolation methods. However, these methods were tailored to Gait Energy Images (GEI) [16] and did not incorporate deep learning, leaving room for super-resolution algorithms designed for deep learning-based gait recognition.
Furthermore, this paper introduces a multi-scale feature cross-fusion module into the model, allowing the main network to operate continuously while preserving low-resolution features that contain more semantic information and merging them with high-resolution features at multiple stages. This compensates for the image detail lost after repeated convolution and pooling operations and makes the model more sensitive to small-scale silhouette information. Experimental results on the CASIA-B dataset demonstrate that the improvements at both the input and model levels enhance recognition accuracy, surpassing models such as GaitSet and GaitPart on 32 × 32 gait silhouettes and highlighting the method’s unique advantages. This study contributes to advancing gait recognition technology, particularly in handling the challenges posed by smaller-resolution gait silhouettes in practical applications.

2. Related Works

Our work focuses on two aspects of silhouette-based gait recognition: dataset preprocessing and model design. The mainstream preprocessing approach for silhouette-based gait recognition is currently the Takemura method, which segments the original gait silhouette images into appropriate sizes while removing invalid information and unqualified images as far as possible; the processed images are then used for training or testing. In addition to the Takemura method, our approach adds a preprocessing step that uses super-resolution interpolation algorithms to standardize the input data to a single resolution, with the aim of enhancing the detail information of smaller silhouette images. Regarding model design, we employ set pooling as the primary feature aggregation method, an approach whose effectiveness has been validated in numerous studies.

2.1. Takemura Method

The Takemura method, introduced in 2018 by Noriko Takemura alongside the OU-ISIR MVLP dataset, is a procedure for normalizing gait silhouette images [12]. The method starts by identifying the top and bottom of the silhouette to remove excess background. It then determines the horizontal center by statistically analyzing pixel values along the horizontal axis, adjusts the image height to 64 pixels while proportionally scaling the width, and finally, following symmetry principles, crops the silhouette into a square. Normalization concludes by resizing the silhouette image to a 64 × 64 or 128 × 128 resolution, making it suitable for deep learning applications. Figure 1 describes the process. This normalization inevitably involves a super-resolution (resampling) algorithm; however, the original Takemura method paper does not specify which algorithm was used. This lack of specification may have led to inconsistent data processing across different models, resulting in varying results when the same model is reproduced in different studies. In the experimental section of this paper, we compare and discuss the Takemura method’s performance when different super-resolution techniques are used.
Furthermore, when using the Takemura method for processing gait silhouette images in gait recognition, images with smaller silhouettes are often filtered out for various reasons. Figure 2 presents examples of potential elements that may be removed during the preprocessing stage. Specifically, during the filtering process, the cumulative value of the entire image is computed. Given that gait datasets are typically binary images, this cumulative value effectively reflects the size of the silhouette within the image. Images with a cumulative value below a certain threshold are removed, and the remaining images are then processed by the Takemura method to be resized to 64 × 64 pixels. Although the Takemura method inherently loses details during scaling, in practical applications, there is a possibility that the distance between the person and the camera may be too great, resulting in silhouette images significantly smaller than 64 × 64 pixels. This further loss of detail can severely impact the accuracy of the model.
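To make the preprocessing pipeline concrete, the following is a minimal sketch of Takemura-style normalization combined with the cumulative-value filter described above; it assumes binary (0/255) silhouettes and OpenCV, and the threshold value is illustrative rather than the one used in the original method.

```python
import cv2
import numpy as np

MIN_FOREGROUND_SUM = 10_000 * 255  # hypothetical cumulative-value threshold

def normalize_silhouette(img, out_size=64):
    """Takemura-style normalization sketch: filter, trim, resize, center-crop."""
    # Filter out frames whose silhouette is too small (cumulative pixel value).
    if img.sum() < MIN_FOREGROUND_SUM:
        return None

    # 1. Trim empty rows above and below the silhouette.
    rows = np.where(img.sum(axis=1) > 0)[0]
    img = img[rows.min():rows.max() + 1, :]

    # 2. Resize the height to out_size while keeping the aspect ratio.
    h, w = img.shape
    new_w = max(1, round(w * out_size / h))
    img = cv2.resize(img, (new_w, out_size), interpolation=cv2.INTER_CUBIC)

    # 3. Locate the horizontal center of mass of the silhouette.
    col_sums = img.sum(axis=0).astype(np.float64)
    center = int(round((col_sums * np.arange(img.shape[1])).sum() / col_sums.sum()))

    # 4. Pad and crop symmetrically to an out_size x out_size square.
    half = out_size // 2
    padded = np.pad(img, ((0, 0), (half, half)))
    return padded[:, center:center + out_size]
```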

2.2. Super-Resolution

For deep learning models, the input data are crucial. On one hand, the dataset must comprehensively reflect the problem to enable effective training. On the other hand, strict adherence to data formatting requirements is essential: for silhouette-based gait recognition models, gait silhouettes of different resolutions must be standardized to the same size before being fed into the model, and a mismatch between the input format and the format the model expects can lead to training failures or errors [17]. Super-resolution algorithms effectively address this issue. A further advantage of using them to standardize silhouette sizes is that the model does not need to be modified for silhouettes of varying sizes; it only needs to be trained once to handle gait silhouettes of all sizes, resulting in a more robust model.
Super-resolution algorithms are image processing techniques aimed at recovering high-resolution details from low-resolution images or videos. Their primary objective is to infer missing high-frequency details from limited data by leveraging inherent image information, structure, statistical regularities, and possible prior knowledge to enhance image quality. However, mathematically, super-resolution algorithms deal with ill-posed problems. The irreversible loss of high-frequency detail in original low-resolution images or videos means any algorithm can only attempt to compensate for this loss, leading to uncertainties in restoring high-resolution images. It is impossible to completely and accurately reconstruct the original high-resolution image from low-resolution data. Consequently, the performance of different super-resolution algorithms varies [18].
This study observes that the reproducibility of results for the same model often varies across gait recognition papers [11]. A comparative analysis shows that, besides factors such as the training environment and hyperparameter adjustments, the choice of super-resolution algorithm significantly influences training outcomes. In certain deep learning models, simply substituting a superior super-resolution algorithm has been shown to improve accuracy over the original study. We attribute this phenomenon to the heightened importance of the geometric details supplemented by super-resolution when silhouette resolutions are notably low.
The most common types of super-resolution algorithms are various interpolation methods, such as nearest neighbor interpolation [19], bilinear interpolation [20], and bicubic interpolation [21]. Interpolation algorithms, compared to other types of super-resolution methods, offer greater flexibility and can be applied at each stage of gait recognition data processing, effectively enhancing model performance. Consequently, this paper primarily discusses the impact of various interpolation-based super-resolution algorithms on gait recognition.
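Before examining each method in detail, the following minimal sketch (assuming OpenCV is used for resizing, which the original preprocessing pipeline does not specify) shows how a silhouette would be standardized to 64 × 64 with each of the three interpolation modes discussed below; the file path is illustrative.

```python
import cv2

sil = cv2.imread("silhouette.png", cv2.IMREAD_GRAYSCALE)  # hypothetical input path

resized = {
    "nearest":  cv2.resize(sil, (64, 64), interpolation=cv2.INTER_NEAREST),
    "bilinear": cv2.resize(sil, (64, 64), interpolation=cv2.INTER_LINEAR),
    "bicubic":  cv2.resize(sil, (64, 64), interpolation=cv2.INTER_CUBIC),
}
```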
Nearest neighbor interpolation is a relatively straightforward super-resolution algorithm that operates by duplicating each pixel from a low-resolution image to its corresponding position in a high-resolution image using the value of the nearest neighbor pixel for filling. This method is simple and intuitive with fast computation speed, but it lacks numerical smoothing and may result in images with jagged edges. The formula is as follows:
$$G(i + u,\; j + v) = G(i,\; j)$$
where $u, v$ are decimal fractions in the range $[0, 1)$ and $G(i, j)$ represents the value at point $(i, j)$ in the low-resolution image.
Bilinear interpolation, in contrast to nearest neighbor interpolation, considers the weighted relationships among the four adjacent pixels. Consequently, when enlarging an image it typically yields smoother, more natural, and more accurate results. The formula is as follows:
$$G(i + u,\; j + v) = (1 - u)(1 - v)\,G(i, j) + (1 - u)\,v\,G(i, j + 1) + u\,(1 - v)\,G(i + 1, j) + u\,v\,G(i + 1, j + 1)$$
where $u, v$ are decimal fractions in the interval $[0, 1)$ and $G(i, j)$ denotes the value at the low-resolution image point $(i, j)$.
Bicubic interpolation, which we ultimately adopted, is a more advanced interpolation method than bilinear interpolation. It further improves on bilinear interpolation by involving more neighboring pixels in the calculation, thereby enhancing the quality and detail of the result and producing smoother, clearer enlarged images. It uses a piecewise cubic weighting kernel $S(w)$, defined as follows:
$$S(w) = \begin{cases} |w|^3 - 2|w|^2 + 1, & 0 \le |w| \le 1 \\ -|w|^3 + 5|w|^2 - 8|w| + 4, & 1 < |w| \le 2 \\ 0, & |w| > 2 \end{cases}$$
The interpolation formula is as follows:
$$G(i + u,\; j + v) = A \times B \times C$$
where $A$, $B$, and $C$ are defined as follows:
$$A = \begin{bmatrix} S(1 + u) & S(u) & S(1 - u) & S(2 - u) \end{bmatrix}$$
$$B = \begin{bmatrix}
G(i-1, j-1) & G(i-1, j) & G(i-1, j+1) & G(i-1, j+2) \\
G(i, j-1)   & G(i, j)   & G(i, j+1)   & G(i, j+2) \\
G(i+1, j-1) & G(i+1, j) & G(i+1, j+1) & G(i+1, j+2) \\
G(i+2, j-1) & G(i+2, j) & G(i+2, j+1) & G(i+2, j+2)
\end{bmatrix}$$
$$C = \begin{bmatrix} S(1 + v) & S(v) & S(1 - v) & S(2 - v) \end{bmatrix}^{\mathsf{T}}$$
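As a worked illustration of the formula above, the following NumPy sketch evaluates a single bicubic-interpolated value from the 4 × 4 neighborhood of $(i, j)$; it assumes the neighborhood lies fully inside the image and is not the implementation used in our pipeline.

```python
import numpy as np

def S(w):
    # Piecewise cubic bicubic kernel defined above.
    w = abs(w)
    if w <= 1:
        return w**3 - 2 * w**2 + 1
    if w <= 2:
        return -w**3 + 5 * w**2 - 8 * w + 4
    return 0.0

def bicubic_at(G, i, j, u, v):
    # G(i + u, j + v) = A x B x C over the 4 x 4 neighborhood around (i, j).
    A = np.array([S(1 + u), S(u), S(1 - u), S(2 - u)])
    B = G[i - 1:i + 3, j - 1:j + 3].astype(np.float64)
    C = np.array([S(1 + v), S(v), S(1 - v), S(2 - v)])
    return A @ B @ C
```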

2.3. Set Pooling

The concept of set pooling was first introduced in the GaitSet, a classic model for gait recognition. This method takes into account that the number of gait silhouette images for a person can vary arbitrarily, sometimes resulting in significantly different lengths of gait sequences. Therefore, using the maximum pooling function along the temporal dimension to extract gait features can capture more generalized information. In recent years, this approach has been validated for its effectiveness across various studies. The rationale for choosing set pooling lies in several unique advantages: It treats gait sequences as sets, thus eliminating the strict temporal order requirement and reducing computational burden on hardware. Furthermore, with the increasing richness of gait recognition datasets, particularly in outdoor environments, set pooling has exceeded expectations, demonstrating robustness against visual noise and varying lengths of gait sequences [22]. It has outperformed many subsequent methods in complex real-world scenarios. The specific formula is as follows:
$$SP(\mathrm{Feature}_{n,c,h,w}) = \mathrm{Max}(\mathrm{Feature}_{n,c,h,w},\; 0)$$
where $\mathrm{Feature}$ denotes the feature matrix and $n, c, h, w$, respectively, represent the number of feature maps, channels, height, and width of the feature maps. $\mathrm{Max}(\mathrm{Feature}_{n,c,h,w},\; 0)$ signifies aggregating the feature matrix along the feature-map dimension $n$ by retaining the maximum values, yielding the aggregated feature $\mathrm{Feature}_{1,c,h,w}$.
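In PyTorch terms, set pooling amounts to a maximum over the frame (set) dimension of the per-frame feature maps; the tensor sizes below are illustrative only.

```python
import torch

features = torch.randn(30, 64, 16, 11)           # (n, c, h, w): 30 frames of feature maps
set_feature = torch.max(features, dim=0).values  # aggregate over the set -> (64, 16, 11)
```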

3. Materials and Methods

3.1. Dataset Detail

3.1.1. Dataset

The CASIA-B dataset [23] is one of the most widely used datasets for gait analysis. It includes RGB and silhouette multi-view gait data from 124 participants, captured from 11 angles ranging from 0° to 180° at intervals of 18°. The dataset covers three walking conditions: normal walking (NM), walking with a bag (BG), and walking in a coat (CL). For each participant at each viewpoint there are 6 normal walking sequences, 2 sequences with a bag, and 2 sequences with a coat, giving a total of 110 gait sequences per person across the 11 viewpoints.

3.1.2. Dataset Splitting and Rank-1

The most commonly used testing protocol for the CASIA-B dataset is the subject-independent protocol, where individuals in the training set do not overlap with those in the testing set. A typical approach involves assigning data from the first 74 participants to the training set, leaving the data from the remaining 50 participants for testing. Within the testing set, gait sequences are categorized into probe set and gallery set [24].
In gait recognition systems, the “Probe” refers to the probe (verification) set, while the “Gallery” can be understood as the gallery (enrollment) set. For each known individual there is a corresponding set of gait sequences, which together form the gallery set. All gallery sequences are passed through the trained gait model, and the resulting vector representations are stored. When a new gait sequence needs to be identified, it is fed into the trained model to produce a new vector representation, and the system computes the Euclidean distance between this vector and all vectors in the gallery set. If the nearest gallery vector belongs to the same individual as the probe sequence, the identification is counted as successful; otherwise, it is counted as a failure. The proportion of successful matches gives the Rank-1 identification rate, which is used to assess the performance of the gait model. The computational details are illustrated in Figure 3.
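A minimal sketch of this Rank-1 computation, with illustrative variable names, might look as follows:

```python
import numpy as np

def rank1_accuracy(probe_feats, probe_ids, gallery_feats, gallery_ids):
    # probe_feats, gallery_feats: (N, D) embedding arrays; *_ids: (N,) identity labels.
    # Pairwise Euclidean distances between probe and gallery embeddings,
    # shape (num_probe, num_gallery).
    dists = np.linalg.norm(probe_feats[:, None, :] - gallery_feats[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)                     # index of the closest gallery sample
    return np.mean(gallery_ids[nearest] == probe_ids)  # fraction of correct identity matches
```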
This testing protocol was adopted because it is impractical in real-world applications to pre-train on the identities of the individuals to be recognized. It allows researchers to better simulate how gait recognition technology is applied in the real world and thus to more reasonably assess the performance of gait recognition algorithms on untrained individuals.

3.2. Methods

3.2.1. Model Overview

As shown in Figure 4, the model in this paper was primarily divided into four parts: super-resolution processing, backbone network, multi-scale feature cross-fusion module (MFCF), and single-scale horizontal pyramid mapping (SHPM).
The role of super-resolution processing is twofold: first, it compensates for lost image details in small silhouette images to enhance recognition rates [15]; second, deep learning imposes strict formatting requirements on input data to avoid computational errors, so the inputs must be standardized. In addition, our experiments found that selecting a superior super-resolution algorithm also improved recognition rates for 64 × 64 silhouette images. The backbone network is built on the classic VGGNet [25], using alternating convolutional and pooling layers to extract gait features, followed by set pooling (SP) to aggregate them. The MFCF module integrates shallow and deep features from the backbone network: shallow features contain rich geometric detail but little semantic information, whereas deep features carry more semantic information but fewer geometric details, so fusing information from different layers allows the two to compensate for each other’s deficiencies [26]. Compared with other multi-scale feature fusion methods, MFCF ensures sufficient information exchange through iterative cross-scale fusion, minimizing the information loss caused by pooling layers and making the model more sensitive to small silhouette images.
The SHPM includes the crucial component SHPP (single-scale horizontal pyramid pooling), adapted from Horizontal Pyramid Pooling (HPP) [27], which was initially used in pedestrian re-identification and later adopted in gait recognition. HPP segments the feature maps from each module into strip-like regions at four different scales, encouraging the neural network to focus on features of various sizes. The three-dimensional strip features are then compressed into one-dimensional features through global max pooling and global average pooling, which are summed, and the pooled features are finally mapped and classified via fully connected layers. However, our study found that the multi-scale segmentation employed by HPP did not significantly benefit our model; based on the experimental results, a single scale proved sufficient. We therefore simplified HPP into the single-scale SHPP, reducing the number of model parameters while preserving functionality. The specific structure of SHPP is shown in Figure 5.
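The following PyTorch sketch illustrates the single-scale strip pooling idea behind SHPP; the strip count, channel sizes, and the use of one linear mapping per strip are assumptions for illustration rather than the exact configuration of our model.

```python
import torch
import torch.nn as nn

class SHPP(nn.Module):
    def __init__(self, in_channels=128, out_dim=256, num_strips=16):
        super().__init__()
        self.num_strips = num_strips
        # One separate linear mapping per horizontal strip, as in HPP-style heads.
        self.fc = nn.ModuleList(nn.Linear(in_channels, out_dim) for _ in range(num_strips))

    def forward(self, x):                                  # x: (batch, c, h, w)
        n, c, h, w = x.shape
        strips = x.view(n, c, self.num_strips, -1)         # split the height into strips
        pooled = strips.max(dim=-1).values + strips.mean(dim=-1)  # global max + average pooling
        out = [fc(pooled[:, :, k]) for k, fc in enumerate(self.fc)]
        return torch.stack(out, dim=1)                     # (batch, num_strips, out_dim)
```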

3.2.2. Multi-Scale Feature Cross-Fusion Module

In convolutional neural networks (CNNs), pooling layers are a common operation known to benefit many aspects of model performance, such as feature extraction, enlarging receptive fields, enhancing translational invariance, and reducing computational complexity [28]. However, pooling also causes an irreversible loss of geometric detail in the feature maps, so the high-level feature maps closest to the discriminative modules lack geometric detail, which hurts accuracy on smaller targets. A common remedy is to fuse lower-level feature maps, which are rich in geometric detail, with higher-level feature maps. Traditional multi-scale feature fusion methods either place pooling layers inside the fusion modules, again losing geometric detail, or fail to exchange information sufficiently across scales and neglect further exploration of the fused features, which limits the model’s ability to learn from small silhouette images [29].
With these challenges in mind, this paper introduces the multi-scale feature cross-fusion (MFCF) module. The two branches of this module extract feature maps of specific sizes from different layers of the model; after a series of processing steps, these feature maps are combined with the output of the backbone network and fed into the SHPM. The key characteristic of these specific-layer feature maps is that, after pooling, they resemble lower-resolution images that have lost some geometric detail. After branch-wise inference and fusion, the feature maps are combined with the backbone network results and passed to the discrimination stage. This increases the weight of low-resolution image features in the final result, guiding the model to focus on learning the finer details of small silhouettes and ultimately improving recognition accuracy for small-sized silhouette images. The module extracts feature maps from different levels of the model without internal pooling operations, preserving their original resolutions, and continues feature computation with convolutional kernels to enrich the semantic information of the fused features [30]. The inter-branch operations perform cross-fusion over multiple stages with various convolutional kernels, using upsampling and downsampling to ensure thorough information exchange. For high-resolution image features, this yields additional gait-relevant semantic information that supports more accurate discrimination during classification; conversely, for low-resolution image features, it recovers more of the lost geometric detail, contributing to improved accuracy at the discrimination stage.
Consequently, MFCF retained and explored detailed information while minimizing computational overhead, transmitting operation results alongside backbone network outcomes to discriminative modules to enhance the significance of multi-scale feature maps, thereby improving recognition rates for low-resolution gait silhouettes.
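The sketch below outlines one cross-fusion stage between a high-resolution and a low-resolution branch in the spirit of MFCF; the two-branch setup, channel sizes, and kernel choices are illustrative assumptions and not the exact configuration of our module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossFusionStage(nn.Module):
    def __init__(self, hi_ch=64, lo_ch=128):
        super().__init__()
        self.hi_conv = nn.Conv2d(hi_ch, hi_ch, 3, padding=1)   # no pooling: keep resolution
        self.lo_conv = nn.Conv2d(lo_ch, lo_ch, 3, padding=1)
        self.hi_to_lo = nn.Conv2d(hi_ch, lo_ch, 3, stride=2, padding=1)  # downsample path
        self.lo_to_hi = nn.Conv2d(lo_ch, hi_ch, 1)                       # channel match before upsampling

    def forward(self, hi, lo):
        hi, lo = self.hi_conv(hi), self.lo_conv(lo)
        # Cross-scale exchange: upsample low-res features into the high-res branch,
        # downsample high-res features into the low-res branch.
        hi_out = hi + F.interpolate(self.lo_to_hi(lo), size=hi.shape[-2:],
                                    mode="bilinear", align_corners=False)
        lo_out = lo + self.hi_to_lo(hi)
        return hi_out, lo_out
```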

3.2.3. Loss

This paper employed joint training of the model using Triplet Loss [31] and Cross Entropy Loss [32]. Triplet Loss is a commonly used loss function in gait recognition models, particularly advantageous for learning fine-grained features, as evidenced by numerous studies in the field. However, Triplet Loss may suffer from imbalanced sample quantities between classes during training, where some classes have fewer samples than others. This imbalance can lead to insufficient attention to samples from certain classes during training, resulting in unstable model training and slow convergence. Combining Triplet Loss with other loss functions in joint training can constrain these issues and compensate for these drawbacks. The formulation of the joint loss function is as follows:
$$\mathrm{Loss} = a \times \mathrm{Loss}_{tri} + b \times \mathrm{Loss}_{ce}$$
The coefficients a and b correspond to the respective weighting factors for the two loss functions.
The objective of Triplet Loss is to learn to map samples from the same class into a compact feature space while differentiating samples from different classes into separate regions. In Triplet Loss, each training sample consists of three samples: an anchor sample, a positive sample, and a negative sample. Here, the anchor sample and the positive sample belong to the same class, whereas the negative sample belongs to a different class. The goal of the loss function is to minimize the distance between the anchor and the positive sample while maximizing the distance between the anchor and the negative sample. By adjusting the feature representations of these samples, the model can effectively distinguish between different classes. The computation of Triplet Loss is defined as follows:
$$\mathrm{Loss}_{tri} = \max\bigl(0,\; d(a, p) - d(a, n) + \mathrm{margin}\bigr)$$
where $a$ denotes the feature representation of the anchor sample, $p$ the feature representation of a positive sample from the same class as the anchor, and $n$ the feature representation of a negative sample from a different class. The function $d(x, y)$ denotes the Euclidean distance or cosine similarity between features $x$ and $y$, and $\mathrm{margin}$ is a hyperparameter controlling the margin boundary, which requires the anchor–positive distance to be smaller than the anchor–negative distance by at least the margin.
The objective of Cross Entropy Loss is to minimize the discrepancy between the predicted values of the model and the actual labels, ultimately reducing the gap between predicted results and true class categories. The computation of Cross Entropy Loss is as follows:
$$\mathrm{Loss}_{ce} = -\frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{n} p_{ij} \ln q_{ij}$$
where $m$ represents the total number of samples, $n$ denotes the total number of classes, $p_{ij}$ equals 1 if sample $i$ belongs to class $j$ and 0 otherwise, and $q_{ij}$ represents the predicted probability that sample $i$ belongs to class $j$.
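A minimal sketch of this joint loss, assuming PyTorch’s built-in triplet and cross-entropy losses and illustrative weights $a$ and $b$, is shown below:

```python
import torch.nn as nn

triplet = nn.TripletMarginLoss(margin=0.2)  # margin value is illustrative
cross_entropy = nn.CrossEntropyLoss()
a, b = 1.0, 0.1                             # hypothetical weighting factors

def joint_loss(anchor, positive, negative, logits, labels):
    # Triplet term: pull same-identity embeddings together, push different identities apart.
    # Cross-entropy term: supervise the classification head with the identity labels.
    return a * triplet(anchor, positive, negative) + b * cross_entropy(logits, labels)
```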

4. Experimental Results

4.1. Implementation Details

The experimental setup and hyperparameters are crucial in deep learning. The experimental setup determines the platform on which the model runs, while appropriate hyperparameter settings greatly assist in model convergence. The experimental setup of this study is shown in Table 1, and the experimental hyperparameters are listed in Table 2.

4.2. Super-Resolution Algorithm Comparison Experiments

In silhouette-based gait recognition models, it is necessary to standardize gait silhouette images of varying sizes to a uniform resolution using super-resolution algorithms before processing. The most common practice is to adjust the resolution to 64 × 64 pixels. This study investigates the impact of different super-resolution algorithms on model accuracy during training. For gait silhouette images with larger original sizes and finer details in the CASIA-B dataset, this study directly compresses them to a resolution of 64 × 64 pixels using various super-resolution algorithms.
To address the case where gait silhouette images are smaller than 64 × 64 in practical applications, this paper uses the simplest interpolation method, nearest neighbor interpolation, to compress the gait silhouette images from the CASIA-B dataset to 32 × 32, simulating the loss of image detail caused by excessive distance in real scenarios. For these silhouettes, the experiment then applies the various super-resolution algorithms to upscale them back to 64 × 64 in order to investigate how well each algorithm recovers image details. During training, the models uniformly use gait silhouette images compressed directly to 64 × 64. During testing, each trained model is evaluated both on the directly compressed data and on the data that were first compressed to 32 × 32 and then upscaled, yielding the respective accuracy results. The experimental results are shown in Table 3.
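A minimal sketch of this degradation-and-recovery protocol (assuming OpenCV; the file path is illustrative) is given below:

```python
import cv2

sil_64 = cv2.imread("silhouette_64.png", cv2.IMREAD_GRAYSCALE)  # hypothetical 64 x 64 input
# Simulate a distant subject: downscale to 32 x 32 with nearest neighbor interpolation.
sil_32 = cv2.resize(sil_64, (32, 32), interpolation=cv2.INTER_NEAREST)

# Recover a 64 x 64 input with each candidate super-resolution interpolation method.
recovered = {
    "nearest":  cv2.resize(sil_32, (64, 64), interpolation=cv2.INTER_NEAREST),
    "bilinear": cv2.resize(sil_32, (64, 64), interpolation=cv2.INTER_LINEAR),
    "bicubic":  cv2.resize(sil_32, (64, 64), interpolation=cv2.INTER_CUBIC),
}
```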
Based on the data analysis from the experimental results table, the following conclusions can be drawn: After compressing larger silhouette images to a resolution of 64 × 64, the highest accuracy is achieved using bicubic interpolation, followed by bilinear interpolation, with nearest neighbor interpolation yielding the lowest accuracy. Regarding silhouette images at 32 × 32 resolution, bicubic interpolation exhibited the best accuracy in this study. The reason bicubic interpolation performs best in model recognition, as suggested by this paper, lies in its ability to fill in more local details, enabling the model to comprehensively capture feature information crucial for recognition. This approach allows the model to more accurately differentiate and identify differences in various gait features. Additionally, silhouette images processed through bicubic interpolation exhibit smooth grayscale transitions, providing probabilistic information beyond simple binary black-and-white images. The gradient transitions between high brightness areas and key points of the human body represent a confidence distribution, allocating more weight to regions with higher likelihoods. This enhances the reliability and reference capability of gait recognition models. The gait silhouettes processed by different super-resolution algorithms are illustrated in Figure 6. This achievement offers valuable guidance and insights for future gait recognition research and practical applications. Further exploration and enhancement of super-resolution algorithms, coupled with the fusion of local details and probabilistic information, hold promise for improving the performance and accuracy of gait recognition.

4.3. Feature Fusion Module Ablation Experiments

To validate the effectiveness of the proposed multi-scale feature cross-fusion module, particularly its accuracy on 32 × 32 silhouette images, experiments were conducted in which the base model of this paper was modified to exclude this module and integrate other multi-scale feature fusion modules for comparison. The experimental setup involved four approaches: using a unified super-resolution algorithm with no additional module, integrating the MGB module [8] from the original GaitSet model, incorporating the FPN module, and employing the method proposed in this paper. The MGB module, derived from GaitSet, collects and fuses features from different depths to enhance the representation. FPN [33] (Feature Pyramid Network) is a classic feature extraction structure that fuses feature maps across levels through top-down and bottom-up pathways, effectively extracting representations with rich semantics and multi-scale information. The experimental data for the different modules are presented in Table 4.
Based on the data presented above, it is evident that our proposed method not only outperforms others in accuracy on 64 × 64 silhouette images but also significantly surpasses alternative modules in performance on 32 × 32 silhouette images, making it particularly suitable for recognizing small-scale silhouettes.

4.4. Comparison with the Existing Methods

In comparison with current mainstream silhouette-based gait recognition models, our model demonstrates superior performance, particularly when handling small-scale silhouettes. Under identical hardware configurations and utilizing a unified super-resolution algorithm, our model is compared with GaitSet, GaitPart, and GaitGL on 64 × 64 and 32 × 32 silhouette images. The experimental results are presented in Table 5.
Based on the tabulated experimental results, it is evident that the model proposed in this paper exhibits outstanding performance across different image resolutions. Specifically, on 64 × 64 silhouette images, our model performs exceptionally well, surpassing the GaitSet model and achieving accuracy comparable to that of the GaitPart model. However, the accuracy does not surpass that of the GaitGL and GaitBase models. This may be due to the fact that in the MFCF module, a larger number of low-resolution image features are processed and combined with the backbone network features before entering the subsequent module. While this approach increases the discrimination weight of low-resolution images within the branch, it also dilutes the weight of high-resolution image features processed by the backbone network. Additionally, the simplicity of the backbone network used in this study might not have effectively captured the information from high-resolution images, thereby reducing the accuracy of the 64 × 64 silhouette images. On 32 × 32 silhouette images, our model demonstrates significant superiority over these three models. These findings indicate that the model modifications and the choice of super-resolution algorithm in this study are highly effective. On one hand, by modifying and optimizing the model structure while retaining the advantages of the original model, this study successfully enhances the overall model performance. On the other hand, the study identifies and to some extent addresses the often overlooked issue of small-scale silhouette images in current mainstream research. This issue holds significant practical importance, as small-scale silhouettes are more prevalent in certain real-world scenarios.

5. Conclusions

This paper addresses the issue in practical applications of silhouette-based gait recognition, where the accuracy of models is affected by excessively small silhouette images. To tackle this challenge, this study thoroughly investigates the impact of super-resolution algorithms at the input stage on gait recognition accuracy. The research reveals that employing super-resolution algorithms capable of enhancing image details with smooth transitional properties significantly improves model accuracy, particularly noticeable in the recognition of small-scale silhouette images. Furthermore, the paper proposes a novel multi-scale fusion module in the model architecture. This module is designed to better leverage multi-scale information, focusing on learning from small-scale details, thereby further enhancing the accuracy and robustness of the model towards small-scale silhouette images. Through these investigations, the effectiveness of model modifications and super-resolution algorithms is validated, demonstrating their crucial role in enhancing model performance and robustness. This advancement contributes to bridging the gap between gait recognition research and practical applications.

Author Contributions

Conceptualization, C.S. and L.Y.; methodology, C.S. and L.Y.; software, C.S. and L.Y.; validation, C.S. and L.Y.; formal analysis, C.S. and L.Y.; investigation, C.S. and R.L.; resources, C.S. and L.Y.; data curation, C.S. and R.L.; writing—original draft preparation, C.S. and L.Y.; writing—review and editing, C.S. and R.L.; visualization, C.S. and L.Y.; supervision, C.S.; project administration, C.S. and L.Y.; funding acquisition, C.S. and L.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

This study conducted experiments using the CASIA-B dataset. This dataset is provided by the Center for Biometrics and Security Research at the Chinese Academy of Sciences.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wan, C.; Wang, L.; Phoha, V.V. A survey on gait recognition. ACM Comput. Surv. (CSUR) 2018, 51, 1–35. [Google Scholar] [CrossRef]
  2. Mogan, J.N.; Lee, C.P.; Lim, K.M. Advances in vision-based gait recognition: From handcrafted to deep learning. Sensors 2022, 22, 5682. [Google Scholar] [CrossRef] [PubMed]
  3. Sepas-Moghaddam, A.; Etemad, A. Deep gait recognition: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 264–284. [Google Scholar] [CrossRef] [PubMed]
  4. Liao, R.; Yu, S.; An, W.; Huang, Y. A model-based gait recognition method with body pose and human prior knowledge. Pattern Recognit. 2020, 98, 107069. [Google Scholar] [CrossRef]
  5. Teepe, T.; Khan, A.; Gilg, J.; Herzog, F.; Hörmann, S.; Rigoll, G. Gaitgraph: Graph convolutional network for skeleton-based gait recognition. In Proceedings of the 2021 IEEE International Conference on Image Processing (ICIP), Anchorage, AK, USA, 19–22 September 2021; pp. 2314–2318. [Google Scholar]
  6. Li, X.; Makihara, Y.; Xu, C.; Yagi, Y.; Yu, S.; Ren, M. End-to-end model-based gait recognition. In Proceedings of the Asian Conference on Computer Vision, Kyoto, Japan, 30 November–4 December 2020. [Google Scholar]
  7. Kanazawa, A.; Black, M.J.; Jacobs, D.W.; Malik, J. End-to-end recovery of human shape and pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7122–7131. [Google Scholar]
  8. Chao, H.; He, Y.; Zhang, J.; Feng, J. Gaitset: Regarding gait as a set for cross-view gait recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 8126–8133. [Google Scholar]
  9. Fan, C.; Peng, Y.; Cao, C.; Liu, X.; Hou, S.; Chi, J.; Huang, Y.; Li, Q.; He, Z. Gaitpart: Temporal part-based model for gait recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 14225–14233. [Google Scholar]
  10. Lin, B.; Zhang, S.; Yu, X. Gait recognition via effective global-local feature representation and local temporal aggregation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 14648–14656. [Google Scholar]
  11. Fan, C.; Liang, J.; Shen, C.; Hou, S.; Huang, Y.; Yu, S. Opengait: Revisiting gait recognition towards better practicality. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
  12. Takemura, N.; Makihara, Y.; Muramatsu, D.; Echigo, T.; Yagi, Y. Multi-view large population gait dataset and its performance evaluation for cross-view gait recognition. IPSJ Trans. Comput. Vis. Appl. 2018, 10, 4. [Google Scholar] [CrossRef]
  13. Hou, S.; Cao, C.; Liu, X.; Huang, Y. Gait lateral network: Learning discriminative and compact representations for gait recognition. In European Conference on Computer Vision; Springer International Publishing: Cham, Switzerland, 2020. [Google Scholar]
  14. Makihara, Y.; Mori, A.; Yagi, Y. Temporal super resolution from a single quasi-periodic image sequence based on phase registration. In Proceedings of the Computer Vision–ACCV 2010: 10th Asian Conference on Computer Vision, Queenstown, New Zealand, 8–12 November 2010; Revised Selected Papers, Part I 10. Springer: Berlin/Heidelberg, Germany, 2011. [Google Scholar]
  15. Zhang, J.; Cheng, Y.; Chen, C. Low resolution gait recognition with high frequency super resolution. In Proceedings of the PRICAI 2008: Trends in Artificial Intelligence: 10th Pacific Rim International Conference on Artificial Intelligence, Hanoi, Vietnam, 15–19 December 2008; Proceedings 10. Springer: Berlin/Heidelberg, Germany, 2008. [Google Scholar]
  16. Han, J.; Bhanu, B. Individual recognition using gait energy image. IEEE Trans. Pattern Anal. Mach. Intell. 2006, 28, 316–322. [Google Scholar] [CrossRef] [PubMed]
  17. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
  18. Nguyen, K.; Fookes, C.; Sridharan, S.; Tistarelli, M.; Nixon, M. Super-resolution for biometrics: A comprehensive survey. Pattern Recognit. 2018, 78, 23–42. [Google Scholar] [CrossRef]
  19. Blu, T.; Thévenaz, P.; Unser, M. Linear interpolation revitalized. IEEE Trans. Image Process. 2004, 13, 710–719. [Google Scholar] [CrossRef] [PubMed]
  20. Keys, R. Cubic convolution interpolation for digital image processing. IEEE Trans. Acoust. Speech Signal Process. 1981, 29, 1153–1160. [Google Scholar] [CrossRef]
  21. Chung, M.; Jung, M.; Kim, Y. Enhancing Remote Sensing Image Super-Resolution Guided by Bicubic-Downsampled Low-Resolution Image. Remote Sens. 2023, 15, 3309. [Google Scholar] [CrossRef]
  22. Song, C.; Huang, Y.; Wang, W.; Wang, L. CASIA-E: A large comprehensive dataset for gait recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 2801–2815. [Google Scholar] [CrossRef] [PubMed]
  23. Yu, S.; Tan, D.; Tan, T. A framework for evaluating the effect of view angle, clothing and carrying condition on gait recognition. In Proceedings of the 18th International Conference on Pattern Recognition (ICPR’06), Hong Kong, China, 20–24 August 2006; Volume 4, pp. 441–444. [Google Scholar]
  24. Gao, W.; Cao, B.; Shan, S.; Chen, X.; Zhou, D.; Zhang, X.; Zhao, D. The CAS-PEAL large-scale Chinese face database and baseline evaluations. IEEE Trans. Syst. Man Cybern.-Part A Syst. Hum. 2007, 38, 149–161. [Google Scholar]
  25. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  26. Han, X.; Wang, L.; Wang, X.; Zhang, P.; Xu, H. A multi-scale recursive attention feature fusion network for image super-resolution reconstruction algorithm. Sensors 2023, 23, 9458. [Google Scholar] [CrossRef] [PubMed]
  27. Fu, Y.; Wei, Y.; Zhou, Y.; Shi, H.; Huang, G.; Wang, X.; Yao, Z.; Huang, T. Horizontal pyramid matching for person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 8295–8302. [Google Scholar]
  28. Yu, D.; Wang, H.; Chen, P.; Wei, Z. Mixed pooling for convolutional neural networks. In Rough Sets and Knowledge Technology: Proceedings of the 9th International Conference, RSKT 2014, Shanghai, China, 24–26 October 2014, Proceedings 9; Springer International Publishing: Berlin/Heidelberg, Germany, 2014; pp. 364–375. [Google Scholar]
  29. Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X.; et al. Deep high-resolution representation learning for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 3349–3364. [Google Scholar] [CrossRef] [PubMed]
  30. Liu, Z.; Gao, G.; Sun, L.; Fang, Z. HRDNet: High-resolution detection network for small objects. In Proceedings of the 2021 IEEE International Conference on Multimedia and Expo (ICME), Shenzhen, China, 5–9 July 2021. [Google Scholar]
  31. Schroff, F.; Kalenichenko, D.; Philbin, J. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 815–823. [Google Scholar]
  32. Mao, A.; Mohri, M.; Zhong, Y. Cross-entropy loss functions: Theoretical analysis and applications. In Proceedings of the International Conference on Machine Learning, ICML 2023, Honolulu, HI, USA, 23–29 July 2023; pp. 23803–23828. [Google Scholar]
  33. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on COMPUTER vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
Figure 1. The Takemura method workflow: (a) begin with the original image, (b) trim excess background from the top and bottom, (c) resize height to 64 pixels and find the horizontal center of the contour image, and (d) crop the image width to 64 pixels.
Figure 2. Some images removed during the preprocessing of the CASIA-B dataset.
Figure 3. The method for calculating recognition rates in gait recognition algorithms.
Figure 4. The overall architecture of the MFCF-Gait network.
Figure 5. Detailed structure of SHPP.
Figure 6. The gait silhouette images processed by different super-resolution algorithms. The first row shows images at a resolution of 64 × 64 pixels, while the second row displays images at a resolution of 32 × 32 pixels. (a) Nearest neighbor interpolation, (b) bilinear interpolation, (c) bicubic interpolation.
Table 1. Experimental configuration.
Environment Name         Configuration Parameters
Operating system         Windows 10
GPU                      NVIDIA RTX 3060
CUDA                     11.8
Programming language     Python 3.9
Open-source framework    PyTorch 1.12
Table 2. Training hyperparameters.
Hyperparameter            Value
Iterations                40,000
Optimizer                 SGD
Learning rate             0.1
Momentum                  0.9
Weight decay              0.0005
Learning rate schedule    MultiStep
Milestones                10,000, 20,000, 30,000
Gamma                     0.1
Table 3. Super-resolution algorithms comparison experiments.
Resolution    Interpolation Method    NM/%    BG/%    CL/%
64 × 64       Nearest neighbor        95.0    87.7    71.4
64 × 64       Bilinear                96.1    90.7    76.0
64 × 64       Bicubic                 96.5    91.4    78.2
32 × 32       Nearest neighbor        86.5    75.4    53.7
32 × 32       Bilinear                87.0    77.8    57.1
32 × 32       Bicubic                 94.2    87.7    71.6
Table 4. Feature fusion module ablation experiments.
Resolution    Fusion Module      NM/%    BG/%    CL/%
64 × 64       None (baseline)    95.9    90.2    76.8
64 × 64       MGB                96.4    91.6    75.5
64 × 64       FPN                96.0    90.4    76.4
64 × 64       MFCF (ours)        96.5    91.4    78.2
32 × 32       None (baseline)    83.8    75.8    54.6
32 × 32       MGB                85.9    77.5    54.9
32 × 32       FPN                90.3    82.6    63.4
32 × 32       MFCF (ours)        94.2    87.7    71.6
Table 5. Model comparison experiments.
Rank-1 accuracy (%); the gallery contains all views from 0° to 180° and the columns give the probe view.
Resolution   Type   Model               0°     18°    36°    54°    72°    90°    108°   126°   144°   162°   180°   Mean
64 × 64      NM     GaitSet             90.8   97.9   99.4   96.9   93.6   91.7   95.0   97.8   98.9   96.8   85.8   95.0
64 × 64      NM     GaitPart            94.1   98.6   99.3   98.5   94.0   92.3   95.9   98.4   99.2   97.8   90.4   96.2
64 × 64      NM     GaitGL              96.0   98.3   99.0   97.9   96.9   95.4   97.0   98.9   99.3   98.8   94.0   97.4
64 × 64      NM     GaitBase            95.6   99.2   100    99.0   97.6   95.4   97.5   99.4   100    99.1   94.2   97.9
64 × 64      NM     MFCF-Gait (Ours)    93.8   99.2   99.6   98.1   94.9   93.3   96.6   98.7   98.6   98.5   90.1   96.5
64 × 64      BG     GaitSet             83.8   91.2   91.8   88.8   83.3   81.0   84.1   90.0   92.2   94.4   79.0   87.2
64 × 64      BG     GaitPart            89.1   94.8   96.7   95.1   88.3   84.9   89.0   93.5   96.1   93.8   85.8   91.5
64 × 64      BG     GaitGL              92.6   96.6   96.8   95.5   93.5   89.3   92.2   96.5   98.2   96.9   91.5   94.5
64 × 64      BG     GaitBase            92.5   95.7   96.1   95.7   92.1   90.2   92.2   95.3   97.1   95.4   89.9   93.8
64 × 64      BG     MFCF-Gait (Ours)    89.8   94.8   95.0   94.0   89.4   85.5   88.8   93.2   94.9   94.7   85.5   91.4
64 × 64      CL     GaitSet             61.4   75.4   80.7   77.3   72.1   70.1   71.5   73.5   73.5   68.4   50.0   70.4
64 × 64      CL     GaitPart            70.7   85.5   86.9   83.3   77.1   72.5   76.9   82.2   83.8   80.2   66.5   78.7
64 × 64      CL     GaitGL              76.6   90.0   90.3   87.1   84.5   79.0   84.1   87.0   87.3   84.4   69.5   83.6
64 × 64      CL     GaitBase            68.0   79.9   82.7   81.5   77.4   75.5   76.8   79.9   80.6   78.7   67.5   77.1
64 × 64      CL     MFCF-Gait (Ours)    72.3   84.0   85.2   81.5   77.3   74.0   76.8   79.8   81.4   81.2   67.1   78.2
32 × 32      NM     GaitSet             64.4   79.6   88.1   88.9   80.0   75.8   81.1   86.3   86.8   79.8   60.7   79.2
32 × 32      NM     GaitPart            70.6   87.3   92.9   91.0   84.3   79.8   86.7   89.8   92.6   85.0   67.1   84.3
32 × 32      NM     GaitGL              73.5   90.6   93.2   91.5   86.9   83.9   86.8   91.6   94.3   90.5   67.9   86.4
32 × 32      NM     GaitBase            84.1   96.5   98.5   95.4   90.1   88.7   92.0   96.1   97.7   94.4   80.2   92.2
32 × 32      NM     MFCF-Gait (Ours)    86.5   95.8   98.9   97.6   94.0   91.8   95.4   98.1   97.5   95.3   85.5   94.2
32 × 32      BG     GaitSet             55.7   70.2   76.4   79.9   69.4   64.2   70.0   81.3   77.6   74.8   51.9   70.1
32 × 32      BG     GaitPart            62.0   77.8   83.1   81.6   74.5   68.6   77.5   81.6   84.2   74.6   54.6   74.6
32 × 32      BG     GaitGL              68.1   86.0   88.5   86.2   81.5   77.3   80.2   86.6   88.6   86.9   60.3   80.9
32 × 32      BG     GaitBase            74.2   86.6   91.5   88.9   86.1   82.8   86.2   89.4   91.5   86.7   80.2   85.2
32 × 32      BG     MFCF-Gait (Ours)    80.1   89.8   93.5   93.1   88.1   83.0   87.0   92.8   92.0   87.9   77.2   87.7
32 × 32      CL     GaitSet             35.0   48.4   58.5   57.4   55.3   50.9   55.9   56.9   52.8   46.8   32.8   50.0
32 × 32      CL     GaitPart            38.3   60.2   64.8   66.3   62.1   56.9   61.7   67.7   61.1   49.4   36.1   56.8
32 × 32      CL     GaitGL              45.0   71.8   77.5   72.6   69.9   63.1   64.1   69.3   67.8   62.3   39.4   63.9
32 × 32      CL     GaitBase            45.0   64.3   70.1   69.6   67.1   61.6   64.4   67.5   63.7   61.2   43.1   61.6
32 × 32      CL     MFCF-Gait (Ours)    58.7   73.9   79.8   75.7   74.3   70.6   73.2   76.7   76.6   72.8   55.0   71.6
