In the Internet era, the face [1], iris [2], fingerprint [3], and other biometric features have become people's digital identification. Biometric systems automatically detect, capture, process, analyze, and identify these physiological or behavioral signals, which is a typical and complex pattern recognition problem at the forefront of artificial intelligence research. Finger vein recognition is a biometric technique that identifies individuals from images of the veins in their fingers. A charge-coupled device (CCD) camera captures an individual's finger vein distribution map by irradiating the finger with near-infrared light; the resulting image is then filtered and binarized, digital image features are extracted, and these features are compared against the finger vein feature values stored in the host using a matching algorithm, thereby realizing personal identification. Compared with other biometrics, finger vein recognition is regarded as a second-generation biometric technology because of its advantages in security, precision, stability, and ease of use.
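The binarization and matching steps described above can be made concrete with a minimal sketch. This is an illustrative toy under simplifying assumptions, not the actual algorithm used by recognition systems: `binarize_vein_image` and `match_score` are hypothetical names, the mean-based threshold stands in for advanced filtering, and pixel agreement stands in for the complex matching algorithm.

```python
import numpy as np

def binarize_vein_image(img, threshold=None):
    """Binarize a grayscale near-infrared finger image.

    Veins absorb near-infrared light and appear darker than the
    surrounding tissue, so pixels below the threshold are marked
    as vein (1). The threshold defaults to the image mean.
    """
    if threshold is None:
        threshold = img.mean()
    return (img < threshold).astype(np.uint8)

def match_score(template, probe):
    """Similarity of two binary vein maps as the fraction of
    agreeing pixels (1.0 = identical, 0.0 = fully different)."""
    return float((template == probe).mean())

# Toy 6x6 "images": a dark diagonal band stands in for a vein.
base = np.full((6, 6), 200, dtype=np.uint8)
np.fill_diagonal(base, 40)                       # dark vein pixels
enrolled = binarize_vein_image(base)
probe_same = binarize_vein_image(base)
probe_other = binarize_vein_image(np.fliplr(base))  # different pattern

print(match_score(enrolled, probe_same))   # 1.0 for identical maps
print(match_score(enrolled, probe_other))  # lower score for a different pattern
```

A real system would compare the probe against many stored templates and accept the identity only when the best score exceeds a calibrated threshold.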
Early finger vein recognition was mainly based on feature engineering: distinguishable features are extracted from pre-processed finger vein images, such as local texture features [4], vein pattern features [5], and minutiae features [6], and recognition is achieved by measuring the similarity between the features of the image to be compared and the extracted templates. However, finger vein feature extraction is strongly affected by ambient temperature [7]: when the finger is cold, its veins shrink and thin, making vein information scarce. In addition, strong ambient light (such as sunlight) interferes with the near-infrared image to varying degrees, lowering the authentication rate of vein recognition. Finger vein acquisition devices also impose requirements on acquisition posture, so posture variation degrades image quality. These defects limit traditional feature extraction algorithms and have a negative impact on the performance of finger vein recognition.

With the rapid development of deep learning, self-learned features based on deep architectures have made great progress in image recognition in recent years. In contrast to traditional algorithms, deep learning aims to learn features directly: the problem of manually designing feature points is solved by extracting feature information at each layer of the network. By learning from massive data under the constraints of deep learning theory, the parameters of a multi-layer network are adjusted to establish an optimal nonlinear mapping between input and output nodes, so that the samples mapped by the deep network approach the real data distribution as closely as possible; training the deep model then yields the maximum-probability class assignment for a target sample.

Finger vein recognition [8] can be cast as an image classification task, and many deep-learning-based finger vein recognition methods have been proposed in recent years with satisfactory results. Das et al. [9] proposed a CNN-based finger vein recognition model and verified its effectiveness on four public finger vein image datasets. Wang et al. [10] proposed an HGAN-based data expansion strategy for a CNN finger vein recognition model and compressed the model using filter pruning and low-rank decomposition. Lu et al. [11] proposed a CNN-based local descriptor named CNN-Competitive Order (CNN-CO) for a deep convolutional neural network (DCNN) finger vein recognition model. However, CNNs do not consider the spatial relationships between potential target features and perform poorly at exploiting them. Moreover, the pooling layers of a CNN discard a great deal of valuable information, which limits further improvement in finger vein recognition. Hinton et al. [12] presented the capsule network, which represents features more reasonably than a CNN by preserving translation and rotation information. Dilara Gumusbas et al. [13] applied capsule networks to recognition with a limited number of samples on four finger vein datasets. Although the capsule network overcomes some drawbacks of CNNs, it cannot selectively focus on the important information in an image, so its effective receptive field during actual processing is much smaller than its theoretical receptive field. In fact, once key points, object boundaries, and the other basic units of visual elements have been detected, high-level visual semantics concern how these elements relate to one another to form a target object and how the spatial relationships between target objects constitute a scene. Models such as the capsule network do not achieve the desired effect when handling the relationships between these elements. Dosovitskiy et al. [14] proposed a transformer model for computer vision that achieved good results on multiple image recognition benchmarks. Unlike CNN models, the vision transformer uses a self-attention mechanism to integrate information across the entire image. Even at the lowest layers, the vision transformer captures global contextual information by using self-attention to establish long-range dependencies on targets and extract more powerful features.
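The global interaction described above can be illustrated with a minimal single-head scaled dot-product self-attention sketch over patch embeddings. All names and shapes here are illustrative assumptions, not the actual ViT implementation; in a real vision transformer the projection matrices are learned and multiple heads are used.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence of
    patch embeddings x of shape (n_patches, d).

    Every output row is a weighted mix of *all* input rows, which
    is how the vision transformer captures global context even in
    its first layer.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])          # (n, n) pairwise affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over all patches
    return weights @ v

rng = np.random.default_rng(0)
n, d = 4, 8                       # 4 image patches, 8-dim embeddings
x = rng.standard_normal((n, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)                  # (4, 8): one updated embedding per patch
```

Because the softmax is taken over every patch, even the first attention layer lets each patch embedding depend on all others, in contrast to a convolution whose receptive field grows only with depth.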
When the capsule network is used for image classification, it lacks the ability to encode long-range dependencies in the image and cannot selectively attend to important image feature information. To solve this problem, we combine the strengths of the capsule network in processing low-level visual information with the strengths of the transformer in modeling the relationships between visual elements and target objects, and propose a new vision-transformer-based capsule network model (ViT-Cap) for finger vein recognition. The model encodes the dependencies between image features and thereby improves image classification, especially multi-label classification. Experimental results show that the proposed model achieves better recognition performance than existing methods.
The main contributions of this article are as follows: a new vision-transformer-based capsule network model (ViT-Cap) is proposed for finger vein recognition.