1. Introduction
Gait is a biometric feature that describes the walking pattern of an individual. Compared with other biometric features such as the face, iris, fingerprint, and ear, gait has several unique properties. Gait can be captured from a greater distance than the face or iris, which also means that the person does not have to interact with the sensor, i.e., a camera. In addition, gait is difficult to change, making it a reliable biometric feature. The gait of an individual can also be extracted from low-resolution sensors, such as those found in most current surveillance cameras. The range of applications of gait biometrics is wide, e.g., surveillance scenarios, access control, and identification of individuals for crime investigation purposes.
Gait biometrics nevertheless has several limitations when applied in the real world. First, factors such as illumination changes, shadows, and occlusions can significantly alter the appearance of an individual's gait. Second, the cameras that capture an individual's gait often have different viewing angles, resulting in drastically different appearances of the gait, even though the individual's gait signature is the same. Third, carrying and clothing conditions are also common, such as an individual wearing a bag, coat, hat, or another accessory, which visually changes the individual's gait from an appearance perspective.
In the literature, there are two general approaches for tackling the task of gait recognition. The first compresses the silhouettes of a single gait cycle of an individual into a single image, which serves as a gait feature representation [1,2]. Han et al. [1] propose compressing an individual's binary silhouettes of one gait cycle, extracted from video frames by background subtraction, into one compact gait representation called the Gait Energy Image (GEI). The second approach treats the gait as a sequence of silhouettes of an individual, which are used individually as input to a feature extractor [3,4,5]. In both approaches, the current state-of-the-art methods rely exclusively on deep learning. Since their emergence in 2012 [6], CNNs have dominated the field of image-based deep learning and have naturally become the standard backbone network in approaches tackling gait recognition [3,4,7,8,9]. Wu et al. [7] extracted gait features using deep CNNs via similarity learning. GaitSet [3] proposed treating the silhouettes of an individual as a set, using a custom CNN with triplet loss for gait representation learning. Rijun et al. [5] extracted gait features using information about an individual's pose throughout the video frames, using a CNN to predict the body pose from an image. Bai et al. [10] addressed the problem of radar-based gait recognition with a dual-channel CNN. Chen et al. [11] used a CNN for gait classification based on multistatic micro-Doppler signatures. A detailed description of a typical gait identification pipeline can be found in [3,7,12].
However, in recent years, a new architecture has emerged as a direct competitor to CNNs in the field of image classification: vision transformers (ViTs). The ViT architecture was proposed by Dosovitskiy et al. [13], applying the standard transformer encoder from the field of natural language processing to computer vision, i.e., the image classification task. ViTs have shown excellent results on many image classification benchmarks [13,14,15], demonstrating their strong generalization capability. Compared with CNNs, ViTs demand fewer computational resources to train and have stronger modeling capability. Still, their application in the domain of gait recognition has not yet been explored.
All of the previously mentioned methods use a supervised approach to address the gait recognition problem. Supervised deep learning requires annotated samples for both training and test data, which can be expensive to obtain. Moreover, many state-of-the-art methods use complex model architectures to extract useful gait features [3,4,8,16]. Complex deep learning models often result in long training times, slow convergence, and a large number of model parameters that need to be tuned.
In this manuscript, we propose a new architecture and learning approach for gait recognition. Since labeling all samples in a dataset is an expensive and time-consuming process, we propose using a self-supervised approach for learning useful gait features from the input data. Self-supervised learning has emerged in recent years [17,18,19,20] and has been successfully applied to a number of problems [21,22]. Its main goal is to learn useful data representations from unlabeled data by creating a pretext task, for example, predicting an occluded portion of an input image from the rest of the image. In this manuscript, we opted to use the DINO [20] approach. DINO showed excellent results on the ImageNet image classification dataset and outperformed previous self-supervised approaches based on CNNs at a significantly lower computational cost. It uses a ViT model as a backbone, which has an interesting property compared with CNNs trained in the same way: it has been shown that self-supervised training of ViTs results in models that learn to separate the desired object from the background without explicit guidance [20]. GEI images, each representing a single gait cycle of an individual, are used as input data for the DINO model.
Our approach uses the general ViT architecture as a backbone model, in conjunction with the DINO self-supervised learning method, to learn useful gait features from GEI images of individuals; these features are then used as input to a simple fully connected neural network (FCNN) classifier to classify individuals.
3. The Proposed Approach
In this section, we describe our proposed approach, along with a detailed explanation of its key components. The overall processing pipeline is depicted in Figure 1. The first part of our proposed approach uses the DINO self-supervised model to learn gait features from unlabeled training data, as shown in Figure 1a. Next, a simple FCNN is used as a classifier for the features obtained by the DINO feature extractor model; it is trained on gallery samples and tested on query samples, as shown in Figure 1b. Labeled samples are only needed for training the FCNN classifier, as only the classifier is trained in a supervised manner.
3.1. Preprocessing
The first step in our proposed approach is data preparation. In general, assuming the input data are in the form of raw RGB image sequences taken from a camera, the typical gait data preprocessing steps [12,27] are applied. First, noise is filtered from the images. Second, the silhouette of every subject is extracted in binary form using, e.g., a background subtraction method. Third, the images are normalized so that all silhouettes have the same height and are horizontally aligned. Then, gait cycle estimation is performed in order to construct the final gait representation. In this manuscript, image-based gait features are used in the form of the GEI [1]. The GEI preserves the static information of a gait sequence, such as the shape of the subject's body, as well as the subject's dynamic information, such as the variation of frequency and phase during the subject's locomotion. The GEI representation G for a given gait cycle is calculated as
\( G(x, y) = \frac{1}{N} \sum_{t=1}^{N} B_t(x, y), \)
where N is the number of silhouette frames in the gait cycle, t is the frame number in the gait cycle at a moment in time, and \( B_t(x, y) \) is the original silhouette image with \( (x, y) \) values in the 2D image coordinates.
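As a concrete illustration, the following is a minimal sketch of the GEI computation for one gait cycle of aligned binary silhouettes; the array layout and the function name are assumptions made for the example, not part of the original pipeline.

```python
import numpy as np

def gait_energy_image(silhouettes: np.ndarray) -> np.ndarray:
    """Compute the GEI of one gait cycle.

    silhouettes: array of shape (N, H, W) holding N aligned binary
    silhouette frames B_t(x, y) with values in {0, 1}.
    Returns G(x, y) = (1/N) * sum_t B_t(x, y), with values in [0, 1].
    """
    silhouettes = silhouettes.astype(np.float32)
    return silhouettes.mean(axis=0)

# Example: 30 frames of 64 x 44 silhouettes -> one 64 x 44 GEI.
cycle = np.random.randint(0, 2, size=(30, 64, 44))
gei = gait_energy_image(cycle)
```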
3.2. Learning Discriminative Gait Features
The second step in our proposed approach is training the feature extractor. In this manuscript, we propose using the self-supervised learning paradigm in order to tackle the problem of learning discriminative gait features. We use the recently proposed method called DINO [20], which showed promising results in various computer vision tasks such as image classification and image retrieval. The DINO architecture is depicted in Figure 2.
Originally, DINO constructs a set of eight local views (96 × 96 crops, passed only through the student network) and two global views (224 × 224 crops, passed through both the student and the teacher networks). In this work, to adapt to gait-specific data, we keep the eight local views and two global views but reduce the crop sizes to match the dimensions of our gait training images, while retaining ratios of global to local crops similar to those in the original manuscript. Moreover, since DINO was originally trained on ImageNet, we change the augmentations used during training: we remove most of the original image augmentations (color jitter, Gaussian blur, solarization, random horizontal flip) and use only random erasing, since the removed augmentations do not bring a performance gain on gait-specific data.
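To make the adapted augmentation pipeline concrete, the sketch below builds DINO-style global and local crop transforms with random erasing as the only augmentation; the crop sizes, crop scales, and normalization statistics are illustrative placeholders, not the values used in the paper.

```python
from torchvision import transforms

# Illustrative crop sizes only; the paper adapts them to 64 x 44 GEIs.
GLOBAL_SIZE = (64, 44)
LOCAL_SIZE = (32, 22)

base = [
    transforms.Grayscale(num_output_channels=3),  # GEI replicated to 3 channels
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5] * 3, std=[0.5] * 3),  # placeholder stats
    transforms.RandomErasing(p=0.5),              # the only augmentation kept
]

global_transform = transforms.Compose(
    [transforms.RandomResizedCrop(GLOBAL_SIZE, scale=(0.4, 1.0))] + base)
local_transform = transforms.Compose(
    [transforms.RandomResizedCrop(LOCAL_SIZE, scale=(0.05, 0.4))] + base)

def multi_crop(gei_image, n_local=8):
    """Return 2 global + n_local local views of one GEI (a PIL image)."""
    return ([global_transform(gei_image) for _ in range(2)]
            + [local_transform(gei_image) for _ in range(n_local)])
```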
The DINO method exhibits the ability to segment the foreground objects in an image, i.e., to find object boundaries, in a self-supervised manner. In natural images, such as those in ImageNet, foreground object segmentation is a difficult problem, considering the many possible variations of the foreground object and the background. In a gait recognition scenario, where images are presented in the form of, e.g., a GEI, the foreground object, i.e., the subject, is clearly outlined against the background, which could lead to the model focusing its attention on the most significant parts of the image, such as the dynamic features represented by pixels with intermediate intensity values.
Since gait datasets lack the large amount of data needed to train a ViT model from scratch [13], a fine-tuning strategy is used in this work: the DINO model is first trained on the ImageNet dataset and then fine-tuned on gait data.
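As a starting point, the pretrained backbones published with DINO can be loaded through torch.hub; a minimal sketch of obtaining the ImageNet-pretrained ViT-S/16 before fine-tuning is shown below. The input size is illustrative, and the GEI is assumed to be replicated to three channels.

```python
import torch

# ImageNet-pretrained DINO ViT-S/16 from the official repository;
# 'dino_vits8' gives the patch-8 variant.
backbone = torch.hub.load('facebookresearch/dino:main', 'dino_vits16')
backbone.eval()

with torch.no_grad():
    dummy = torch.randn(1, 3, 64, 48)   # illustrative GEI-like input (3 channels)
    cls_token = backbone(dummy)         # CLS embedding of dimension 384
print(cls_token.shape)                  # torch.Size([1, 384])
```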
We propose using the DINO method as a feature extractor to produce discriminative features of input images to be used later for classification.
3.3. Vision Transformers
DINO uses the vision transformer model [13] as its backbone network, although CNNs also work without modifying the general DINO architecture. The ViT input consists of patches of resolution \( p \times p \) that represent non-overlapping sections of the input image. For an image \( I \in \mathbb{R}^{H \times W \times C} \), where H is the height of the image, W is its width, and C is the number of channels, the resulting image patches are \( x_p \in \mathbb{R}^{N \times (p^2 \cdot C)} \), where \( N = HW / p^2 \) is the number of patches and p is the patch resolution.
Patches are linearly projected into an embedding, and a CLS token is added, which serves as a class token, i.e., representation of the entire input image, and is used for the actual classification. Furthermore, at this step, the positional embeddings are added to help the model retain the positional information of input patches. Then, patch embeddings, positional embeddings, and CLS token are passed through the standard Transformer Encoder, which consists of self-attention and feed-forward layers, with skip connections. Finally, the output CLS token of the Transformer Encoder is sent to a Multilayer Perceptron (MLP) model for classification.
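A small sketch of this patching step is given below; it only illustrates how the patch count N = HW/p² and the flattened patch dimension p²·C arise, using example sizes, and is not the actual embedding code.

```python
import torch

def image_to_patches(img: torch.Tensor, p: int) -> torch.Tensor:
    """Split an image of shape (C, H, W) into N = (H*W)/p^2 flattened
    patches of dimension p*p*C, as in the ViT input pipeline."""
    C, H, W = img.shape
    assert H % p == 0 and W % p == 0, "H and W must be divisible by p"
    patches = img.unfold(1, p, p).unfold(2, p, p)   # (C, H/p, W/p, p, p)
    patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, C * p * p)
    return patches                                   # (N, p*p*C)

# Illustrative example: a 3-channel 64 x 48 image with patch size 16.
x = torch.randn(3, 64, 48)
print(image_to_patches(x, 16).shape)   # torch.Size([12, 768])
```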
We use the small ViT model, as defined by Touvron et al. [37]. Furthermore, we train models with patch sizes of 16 and 8 to investigate the influence of patch size on model performance.
3.4. Classifier
After the DINO feature extractor is trained, the gait features for gallery and query images can be extracted and used for classification. In order to classify the features, we propose using a simple FCNN classifier. Accordingly, we cast the gait recognition problem as a gait classification problem, where the gallery acts as training data for the FCNN classifier and the query acts as test data. For example, if a gallery contains 100 subjects, we consider it a classification problem with 100 classes. We design a simple FCNN, depicted in Figure 3, that consists of two linear layers together with batch normalization, the ReLU activation function, and dropout. The hyperparameters of the proposed FCNN are determined empirically. Additionally, we use the center loss [38] to further facilitate learning a more diverse feature representation. The main loss is the cross-entropy loss, and its combination with the center loss is given by
\( L = L_{CE} + \lambda L_{C}, \)
where L is the final loss value; \( L_{CE} \) and \( L_{C} \) are the values of the cross-entropy loss and the center loss functions, respectively; and \( \lambda \) is a scalar that balances the influence of the center loss on the overall loss value.
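A minimal sketch of this combined objective is shown below; the CenterLoss module is a simplified implementation in the spirit of Wen et al. [38], and the number of classes, the feature dimensionality, and the value of λ are illustrative, not the values tuned in the paper.

```python
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    """Minimal center loss: pulls features toward a learnable per-class center."""
    def __init__(self, num_classes: int, feat_dim: int):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, features: torch.Tensor, labels: torch.Tensor):
        return ((features - self.centers[labels]) ** 2).sum(dim=1).mean()

num_classes, feat_dim, lam = 100, 4608, 0.01   # lambda value is illustrative
ce_loss = nn.CrossEntropyLoss()
center_loss = CenterLoss(num_classes, feat_dim)

def total_loss(logits, features, labels):
    # L = L_CE + lambda * L_C, as in the loss formulation above.
    return ce_loss(logits, labels) + lam * center_loss(features, labels)
```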
As in the feature extractor training, the images are normalized according to the custom dataset's normalization values, and random erasing is used as a data augmentation technique. Furthermore, in order to boost representation learning, we concatenate the CLS tokens from all 12 blocks of the DINO model into the final input image representation that serves as input to the FCNN classifier. The dimensionality of the CLS token for the small ViT model is 384; thus, the input dimensionality of the FCNN classifier is 4608.
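The sketch below illustrates this feature construction and the classifier: the CLS tokens of all 12 blocks are gathered via get_intermediate_layers (provided by the official DINO ViT implementation) and passed to a two-layer FCNN. The hidden width and dropout rate are assumptions, not the empirically tuned values from the paper.

```python
import torch
import torch.nn as nn

class GaitFCNN(nn.Module):
    """Two linear layers with batch norm, ReLU, and dropout; the hidden
    width (1024) and dropout rate (0.5) are illustrative choices."""
    def __init__(self, in_dim=4608, hidden=1024, num_classes=100, p_drop=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.BatchNorm1d(hidden),
            nn.ReLU(inplace=True),
            nn.Dropout(p_drop),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, x):
        return self.net(x)

@torch.no_grad()
def extract_features(backbone, images):
    # get_intermediate_layers() is exposed by the DINO ViT implementation;
    # taking the CLS token ([:, 0]) of all 12 blocks gives 12 * 384 = 4608 dims.
    outs = backbone.get_intermediate_layers(images, n=12)
    return torch.cat([o[:, 0] for o in outs], dim=-1)
```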
4. Experimental Setup
To validate the proposed approach, we conducted experiments to assess the performance of the proposed DINO feature extractor model and the performance of the FCNN classifier trained on features extracted with the feature extractor model. Experiments were conducted in a way that allows for easy comparison with current state-of-the-art models used in gait recognition, following the same dataset splits and comparison metrics. The experimental setup is described next; then, the results are presented and analyzed.
4.1. Datasets
In this manuscript, we conducted experiments on two widely used gait recognition datasets: CASIA-B [39] and OU-MVLP [40]. CASIA-B is a smaller but widely used dataset, while OU-MVLP is one of the largest gait datasets to date. This allows analyzing the performance of the proposed approach on both a smaller and a larger dataset, to see whether the amount of data is critical for training a successful DINO feature extractor.
The CASIA-B dataset [39] is one of the most popular gait datasets in the literature. It consists of 124 subjects, three different walking conditions, and 11 different views (0°–180° with an increment of 18°). The walking conditions are normal (NM) with six sequences per subject, walking with a bag (BG) with two sequences per subject, and walking with a coat or a jacket (CL), also with two sequences per subject. In total, 110 sequences are available for each subject in the dataset. Since in this manuscript we use GEI images, this translates to almost 13,600 images in total, with an average of 110 images per subject. We conduct experiments on three partition settings for training and testing, commonly used in the literature. First, the ST (small-sample) setting uses the first 24 subjects for training and the rest (100 subjects) for testing. Second, the MT (medium-sample) setting uses the first 62 subjects for training and the rest (62 subjects) for testing. Third, the LT (large-sample) setting uses the first 74 subjects for training and the rest (50 subjects) for testing. In all three partition settings, the first 4 sequences of the NM modality are used as the gallery, while the remaining 2 sequences of the NM modality are used in the query along with the 2 sequences each of the BG and CL modalities.
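For reference, these partitions and the gallery/query assignment can be written down as plain data, as in the sketch below; the sequence naming (nm-01, bg-01, ...) follows the common CASIA-B file convention and is an assumption about the data layout.

```python
# Subject-ID partitions for CASIA-B (subject IDs are 1-124).
SPLITS = {
    "ST": {"train": range(1, 25), "test": range(25, 125)},
    "MT": {"train": range(1, 63), "test": range(63, 125)},
    "LT": {"train": range(1, 75), "test": range(75, 125)},
}

# Gallery / query sequence assignment used at test time.
GALLERY_SEQS = ["nm-01", "nm-02", "nm-03", "nm-04"]
QUERY_SEQS = {
    "NM": ["nm-05", "nm-06"],
    "BG": ["bg-01", "bg-02"],
    "CL": ["cl-01", "cl-02"],
}
```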
The OU-MVLP dataset [40] is one of the largest public gait datasets available today. It consists of 10,307 subjects and 14 different views (0°–90° and 180°–270°, in increments of 15°) per subject. For every view, there are two sequences (indices 00–01). For training, 5153 subjects are used, while the remaining 5154 subjects are used for testing. In the test set, sequences with index 01 are used as the gallery, while those with index 00 are used as the query. In total, there are over 267,000 GEI images, with approximately 26 GEI images per subject.
Additionally, we resize all images from both datasets to 64 × 44 pixels, as performed in [3,36], to ensure comparison compatibility as well as to lower the computing requirements for training the DINO model. Furthermore, when training the DINO model, the training data are normalized using the mean and standard deviation calculated from the training data itself.
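A minimal sketch of computing these normalization statistics over the GEI training set is given below; the dataset object and batch size are placeholders.

```python
from torch.utils.data import DataLoader

def dataset_stats(dataset):
    """Per-pixel mean and std over a dataset yielding (gei_tensor, label)
    pairs with tensors shaped (C, H, W)."""
    loader = DataLoader(dataset, batch_size=256, num_workers=4)
    total, total_sq, count = 0.0, 0.0, 0
    for images, _ in loader:
        total += images.sum().item()
        total_sq += (images ** 2).sum().item()
        count += images.numel()
    mean = total / count
    std = (total_sq / count - mean ** 2) ** 0.5
    return mean, std

# Usage (placeholder dataset object):
# mean, std = dataset_stats(train_gei_dataset)
# normalize = torchvision.transforms.Normalize(mean=[mean], std=[std])
```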
4.2. Experiments
In order to evaluate the performance of our proposed approach, we constructed GEI image representations for each subject in each dataset. Then, we trained DINO feature extraction models on two aforementioned datasets, CASIA-B and OU-MVLP. For each dataset, two models were trained: one with a patch size of 16 and one with a patch size of 8. Next, a simple FCNN classifier was trained on gallery samples, to construct the final model for gait classification. Finally, the trained FCNN classifier was evaluated using the query samples.
4.3. DINO Implementation Details
For the implementation of the DINO method, the official GitHub repository was used [41], with slight modifications, as explained in Section 3.2, to account for the different data distribution of gait data compared with the natural images of the ImageNet dataset, namely the adjusted global and local crop sizes and the different training data augmentations. In order to fine-tune both the student and the teacher networks, the full ImageNet-pretrained DINO model checkpoint was used. In our experiments, we used only small ViT models, which roughly correspond to the standard Resnet-50 [42] architecture in the number of network parameters. We trained models with patch sizes 16 and 8 to study the effect of patch size on the model's accuracy. The remaining DINO model parameters, such as the momentum teacher value, the teacher temperature, and the global and local crop scales, are the same as in the original manuscript [20].
4.4. Training Details
We trained the DINO models for 1000 epochs with a batch size of 32 for all experiments on the CASIA-B and OU-MVLP datasets. The optimizer used was AdamW [43] with a learning rate of 0.0005. The training was performed on a single Nvidia 2080Ti 11 GB GPU.
The FCNN classifier was trained for 100 epochs with a batch size of 128. The Adam optimizer was used for the FCNN classifier with a learning rate of 0.0005; similarly, Adam was used as the center loss optimizer with a learning rate of 0.1.
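The sketch below illustrates this two-optimizer setup (one Adam optimizer for the classifier parameters, one for the center-loss centers) on a single dummy batch; the classifier, the center-loss implementation, and the λ value are simplified stand-ins for the modules described in Section 3.4.

```python
import torch
import torch.nn as nn

# Minimal stand-ins so the snippet runs on its own.
classifier = nn.Linear(4608, 100)
centers = nn.Parameter(torch.randn(100, 4608))

classifier_optimizer = torch.optim.Adam(classifier.parameters(), lr=5e-4)
center_optimizer = torch.optim.Adam([centers], lr=0.1)
ce = nn.CrossEntropyLoss()
lam = 0.01                                     # illustrative lambda

features = torch.randn(128, 4608)              # one dummy batch of DINO features
labels = torch.randint(0, 100, (128,))

logits = classifier(features)
loss = ce(logits, labels) + lam * ((features - centers[labels]) ** 2).sum(1).mean()

classifier_optimizer.zero_grad()
center_optimizer.zero_grad()
loss.backward()
classifier_optimizer.step()
center_optimizer.step()
```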
For both the DINO models and the FCNN classifier, the learning rates were determined empirically: they were searched within the range of 0.1 to 0.000001 using grid search. The number of epochs for training the DINO model was set to 1000, as the accuracy did not improve when training the model for longer; similarly, the number of epochs for training the FCNN classifier was set to 100. The batch size for both models was set by finding the optimal value between 8 and 128, in steps of powers of 2.
4.5. Evaluation Protocol
For the evaluation of our experimental results, we use rank-1 accuracy, i.e., the percentage of predictions in which the top prediction matches the ground-truth identity. Additionally, the identical-view cases are excluded for comparability with other state-of-the-art methods.
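The sketch below illustrates rank-1 evaluation with identical-view exclusion using a simple nearest-neighbor matcher on extracted features; with the FCNN classifier, rank-1 corresponds to top-1 classification accuracy computed under the same exclusion, and the array formats used here are assumptions.

```python
import numpy as np

def rank1_accuracy(probe_feats, probe_ids, probe_views,
                   gallery_feats, gallery_ids, gallery_views):
    """Rank-1 identification accuracy with identical-view gallery entries
    excluded. Features are (n, d) arrays; IDs and views are (n,) arrays."""
    correct, total = 0, 0
    for feat, pid, view in zip(probe_feats, probe_ids, probe_views):
        mask = gallery_views != view                 # drop identical-view cases
        if not mask.any():
            continue
        dists = np.linalg.norm(gallery_feats[mask] - feat, axis=1)
        predicted = gallery_ids[mask][np.argmin(dists)]
        correct += int(predicted == pid)
        total += 1
    return correct / max(total, 1)
```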
5. Results
In this section, the results of the conducted experiments are presented. It is worth noting that, except for SelfGait [36], which uses self-supervised learning, every other compared method uses a supervised learning approach. Furthermore, the state-of-the-art methods mentioned in this section use silhouettes as input data, as well as features extracted directly from frames of a walking subject, while the method proposed by Liao et al. [27] uses GEIs, the same as our method.
5.1. CASIA-B
For the ST setting, the results are presented in Table 1. Compared with the other state-of-the-art methods, our method achieves the highest accuracy in the NM and BG modalities, although its CL modality accuracy is the lowest among the compared methods.
In the MT setting (Table 2), the overall accuracy of our method in the NM modality again outperforms the rest of the methods, while its BG modality accuracy falls below the rest of the methods. Furthermore, the CL modality showed significantly lower results.
Finally, in the LT setting (Table 3), our method again achieved the best accuracy in the NM modality, while the BG modality is comparable with, although lower in accuracy than, the rest of the methods. The CL modality showed poor accuracy in this setting.
Overall, our approach performs best on the NM modality, regardless of the CASIA-B dataset setting. The BG modality performs best in the ST setting, and in the other settings, it is comparable with other methods. The CL modality showed the lowest accuracy in all settings. The reason could be that our model focused its attention primarily on the NM modality, which has the most training data and is the easiest to discriminate, without any other covariate condition. The BG modality considers the subject carrying a bag, which alters the subject's appearance slightly; thus, the results for the BG modality are overall comparable with those of other state-of-the-art methods. The CL modality considers the subject wearing a coat, which alters the subject's appearance significantly; as a result, it is the hardest modality in the dataset, and the one on which our method achieved low accuracy. As such, our proposed method may not be the best choice for the CL modality in practical applications compared with other methods, and further research into boosting the proposed method's accuracy in this modality will be performed. Considering the presented results, our approach showed the ability to perform well across different modalities, excluding the CL modality. Furthermore, our method discriminates well across the different angles at which the subjects are recorded, although the accuracy varies with the viewing angle.
Both the model with patch size 16 and the model with patch size 8 performed similarly in the NM modality, without significant differences in accuracy across all dataset settings. Significant differences in accuracy arise in the BG and CL modalities, where the model with patch size 8 showed a significant improvement compared with the model with patch size 16. This effect could be due to the ability of the model with patch size 8 to focus its attention on smaller parts of the image, hence building a model that is more robust to the effect of covariate factors such as a bag or a coat.
5.2. OU-MVLP
In Table 4, the results for the OU-MVLP dataset are presented. They show that our approach achieves results comparable with the other state-of-the-art methods and performs well across all angles, although the accuracy at some views is noticeably lower than at others. The SelfGait method [36] also uses a self-supervised learning approach, but with a specialized backbone network that enhances the spatio-temporal modeling ability, and it achieves the state-of-the-art result on this dataset. In contrast, our approach uses a standard, unmodified ViT network with a simple FCNN as a classifier and achieves comparable accuracy. As the OU-MVLP dataset contains many images, the DINO model was able to learn discriminative features and achieve results comparable with the state-of-the-art. Compared with SelfGait, the advantage of our approach is that it uses a simple, general ViT architecture, as opposed to the gait-specific network used in SelfGait. In addition, our method does not explicitly infer temporal features from the data, unlike SelfGait, which uses MTB to learn temporal features from silhouettes; this makes our method more straightforward in terms of learning, since only appearance features are learned.
The model with patch size 16 performed slightly better on this dataset than the model with patch size 8. As there are no covariate conditions such as a bag or coat in this dataset, the model with patch size 8 does not bring the performance improvement observed on the CASIA-B dataset.
5.3. Self-Attention Visualization
In order to assess the features learned by the DINO model, we visualize the different attention heads in the last multihead self-attention block. A random image from each of the datasets is chosen, for which the attention is displayed. The model used was the small ViT model, which has six heads per self-attention block.
In Figure 4 and Figure 5, random images from the CASIA-B and OU-MVLP datasets are shown, respectively. As depicted in Figure 4a and Figure 5a, each head learns different features from the data, as its attention is focused on different parts of the image. Some attention heads focus on the subject's head, while others focus on the legs or on the left or right part of the subject in the image. Figure 4b and Figure 5b show the average of the attentions across all heads. This observation is consistent with those from the original DINO manuscript, where it is noted that the DINO method successfully segments objects of interest inside the image. In GEI images, the most important area is the outline of the subject, which our proposed approach successfully detects and uses for the classification of subjects, producing good results, as shown in Section 5.
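The per-head maps can be extracted from the last self-attention block with the get_last_selfattention call exposed by the official DINO ViT implementation; the sketch below shows how the CLS-token attention is reshaped into a spatial map, using an illustrative input size.

```python
import torch

# DINO ViT-S/16 loaded as in Section 3.2; get_last_selfattention() is
# provided by the official DINO implementation.
backbone = torch.hub.load('facebookresearch/dino:main', 'dino_vits16')
backbone.eval()

img = torch.randn(1, 3, 64, 48)                   # illustrative GEI-like input
with torch.no_grad():
    attn = backbone.get_last_selfattention(img)   # (1, heads, tokens, tokens)

n_heads = attn.shape[1]
h_feat, w_feat = 64 // 16, 48 // 16               # patch grid (rows, cols)
# Attention of the CLS token (index 0) over all patch tokens, per head:
cls_attn = attn[0, :, 0, 1:].reshape(n_heads, h_feat, w_feat)
mean_attn = cls_attn.mean(dim=0)                  # average over all heads
```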
5.4. Ablation Experiments
In this section, the effectiveness of the vision transformer backbone network and the proposed classifier is studied.
To evaluate the effectiveness of the vision transformer network, we trained the DINO model with Resnet-50 as the backbone network for comparison. Resnet-50 is chosen because its number of parameters is similar to that of the small ViT network, with 23 million and 21 million parameters, respectively. Both models were trained on the CASIA-B dataset's LT setting for 1000 epochs, with a patch size of 16 for the ViT model. The hyperparameters of the small ViT model were determined as described in Section 4.4, and the same methodology was used to set the hyperparameters of the Resnet-50 model. In both cases, the full ImageNet-pretrained DINO model checkpoint was used for fine-tuning. For evaluation, the FCNN network proposed in Section 3.4 was used. In Table 5, a comparison of the accuracy of the Resnet-50 and the small ViT model is shown. The small ViT model significantly outperforms the Resnet-50 model across all modalities, demonstrating the effectiveness of the ViT model for the problem of gait recognition.
In Table 6, a comparison of different classifiers is shown. To study the effectiveness of the proposed FCNN classifier, we evaluated the trained ViT feature extractor using the standard weighted nearest neighbors classifier (k-NN), as in [46]. The feature extractor used was the small ViT model with a patch size of 16, and the FCNN classifier is the same as proposed in Section 3.4. The evaluation is performed on the CASIA-B dataset using the LT setting. The results show that the proposed FCNN classifier significantly outperforms the k-NN classifier across all modalities and angles, especially in the BG modality.
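For reference, a cosine-similarity weighted k-NN baseline in the spirit of the DINO evaluation protocol [46] can be sketched as follows; the values of k, the temperature, and the random example data are illustrative.

```python
import torch

def weighted_knn_predict(train_feats, train_labels, test_feats,
                         k=20, temperature=0.07, num_classes=50):
    """Weighted k-NN on L2-normalized features: each of the k nearest
    gallery neighbors votes with weight exp(cosine_similarity / T)."""
    train_feats = torch.nn.functional.normalize(train_feats, dim=1)
    test_feats = torch.nn.functional.normalize(test_feats, dim=1)
    sims = test_feats @ train_feats.T                 # cosine similarities
    topk_sims, topk_idx = sims.topk(k, dim=1)
    weights = (topk_sims / temperature).exp()
    votes = torch.zeros(test_feats.size(0), num_classes)
    votes.scatter_add_(1, train_labels[topk_idx], weights)
    return votes.argmax(dim=1)

# Example with random features: 200 gallery and 50 query samples, 384-dim.
g = torch.randn(200, 384)
gl = torch.randint(0, 50, (200,))
q = torch.randn(50, 384)
preds = weighted_knn_predict(g, gl, q)
```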
6. Conclusions
In this manuscript, we propose a novel approach that uses self-supervised learning for the gait recognition task. Using the DINO self-supervised method, useful gait features are learned from training samples without any annotations. The obtained model is used as a feature extractor for gallery and query images; a simple FCNN classifier is then trained on the features extracted from gallery images and evaluated on query images. Experiments conducted on two widely used gait recognition datasets, CASIA-B and OU-MVLP, showed that our proposed approach achieves good results, outperforming supervised approaches in some cases. Moreover, the self-supervised feature extractor focused its attention on the outlines of the individuals in the GEI images, treating the outline as the most meaningful information in the image. Taking into account covariate factors, such as different camera viewpoints and different carrying modalities, our method also produced good results comparable with those of other state-of-the-art methods, considering both supervised and self-supervised approaches. We also note that our approach is one of the first to employ ViTs in the domain of gait recognition. In future work, we will investigate the effect of training the feature extractor on specific parts of an image, such as the legs, torso, or head, on recognition accuracy. Furthermore, additional work will be conducted to further reduce the gap between the poorer BG and CL modality results and those of the NM modality on the CASIA-B dataset. Newly proposed variants of vision transformers will also be tested in conjunction with DINO to further boost recognition accuracy.