In this paper, the system uses a short static finger vein video as the data source for liveness detection. Blood flows within the superficial subcutaneous veins, and the vessels slightly expand and contract, leading to small changes in the absorption rate. During angiographic imaging, when near-infrared (NIR) light penetrates a vein, the vein area therefore shows a slight grayscale change that can be detected with video image processing methods [28,29]; when no fluid flows through the 'blood vessel' of a prosthesis, no such grayscale change appears. However, this slight grayscale change is submerged in the speckle noise of NIR angiography. To solve this problem, this paper proposes a method that combines artificial shallow feature extraction with a deep learning model for finger liveness detection. The flow chart of the system is shown in Figure 1. The first step is to obtain a short static finger vein video. The second step is vein area segmentation, which belongs to the preprocessing stage. The third step is to select and cut out small blocks along the vein edges. The fourth step is to construct the MSTmap from these sorted small blocks. The fifth step is to train the proposed Light-ViT model with the MSTmap. The last step is to output the liveness detection result.
When a user places their finger between the light source of the device and the viewing window, the image grabber starts capturing the finger vein image. During this process, the controller constantly monitors and analyzes the quality of the captured image and quickly adjusts the intensity of the light source according to changes in brightness, contrast, and other image parameters, so that the brightness and clarity of the captured image always remain within the preset range. This intelligent control adapts quickly to the actual acquisition conditions and ensures that the captured images have consistent quality, thereby improving the recognition accuracy and stability of the device.
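As an illustration of this closed-loop control, the sketch below adjusts the NIR light intensity from the mean brightness of the latest frame. The target brightness, tolerance, and proportional gain are assumed values for illustration only, and the function name is a hypothetical placeholder rather than the system's actual controller.

```python
import numpy as np

TARGET_BRIGHTNESS = 128   # desired mean gray level (assumed)
TOLERANCE = 10            # acceptable deviation before adjusting (assumed)
GAIN = 0.2                # proportional gain of the controller (assumed)

def adjust_light_intensity(frame: np.ndarray, intensity: float) -> float:
    """Return an updated light-source intensity (0..1) for the next frame."""
    brightness = float(frame.mean())
    error = TARGET_BRIGHTNESS - brightness
    if abs(error) > TOLERANCE:
        # Proportional correction: brighten when the image is too dark,
        # dim when it is too bright, clipped to the valid drive range.
        intensity = float(np.clip(intensity + GAIN * error / 255.0, 0.0, 1.0))
    return intensity
```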
3.2. Preprocessing of Video Frames
To meet the real-time requirements of vein liveness detection and facilitate use, when collecting the finger vein video we used the three-frame difference method [30] to extract the frames in which the finger remained stationary during the short-duration video. The method efficiently detects moving targets and captures their subtle movement changes, eliminating the need for later multi-frame pixel-level registration of the finger vein and thus avoiding an excessive computational load. The user's finger only needs to stay in the acquisition device for 1.5 s. A high-speed camera is used, capturing 120 frames per second at a resolution of 640 × 480 pixels. The camera collects three-channel RGB images; although a near-infrared (NIR) filter is mounted in front of the lens, the corresponding NIR contrast images can still be collected. Because the noise in each channel differs, the channels can be combined to increase the signal-to-noise ratio of the small grayscale changes in the vein area.
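The following sketch illustrates the three-frame difference idea used to decide whether the finger is stationary in a given frame; the difference threshold and the stillness criterion are assumed values for illustration, not the parameters of Reference [30].

```python
import numpy as np

def three_frame_difference(prev: np.ndarray, curr: np.ndarray, nxt: np.ndarray,
                           diff_thresh: int = 15) -> np.ndarray:
    """Return a binary motion mask from three consecutive grayscale frames."""
    d1 = np.abs(curr.astype(np.int16) - prev.astype(np.int16)) > diff_thresh
    d2 = np.abs(nxt.astype(np.int16) - curr.astype(np.int16)) > diff_thresh
    return np.logical_and(d1, d2)      # pixels moving in both adjacent differences

def is_stationary(prev, curr, nxt, motion_ratio: float = 0.002) -> bool:
    """A frame is kept when almost no pixels are flagged as moving."""
    mask = three_frame_difference(prev, curr, nxt)
    return mask.mean() < motion_ratio
```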
The multi-scale and multi-direction Gabor filtering method is used to segment the vein region. This paper presents a fast Gabor filter design method to remove the DC component, which is equivalent to the method in Reference [31]. The real part of the traditional Gabor filter is designed as follows:

$$g(x, y; \lambda, \theta, \psi, \sigma, \gamma) = \exp\!\left(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\right)\cos\!\left(2\pi\frac{x'}{\lambda} + \psi\right) \tag{1}$$

$$x' = x\cos\theta + y\sin\theta \tag{2}$$

$$y' = -x\sin\theta + y\cos\theta \tag{3}$$

Here, $\gamma$ represents the spatial aspect ratio, which determines the ellipticity of the Gaussian kernel; when $\gamma = 1$, the shape is a circle. $\sigma$ is the standard deviation of the Gaussian function, whose value cannot be set directly but is related to the bandwidth. $\lambda$ is the wavelength of the sinusoidal factor, $\psi$ is its phase offset, and $\theta$ is the rotation angle. Equations (2) and (3) illustrate that the Gabor function can be elongated in the plane in any direction determined by $\theta$.
In order to quickly remove the DC component of the filter template, this paper proposes to directly compute the mean value of the Gabor filter template and subtract it as the DC component. For an $M \times N$ template, the formula can be expressed as follows:

$$\hat{g}(x, y) = g(x, y) - \frac{1}{MN}\sum_{u=1}^{M}\sum_{v=1}^{N} g(u, v)$$
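As an illustration, the sketch below constructs the real Gabor kernel of Equations (1)–(3) and removes its DC component by subtracting the template mean; the kernel size and parameter values are arbitrary examples, not the settings used in this paper.

```python
import numpy as np

def gabor_real_kernel(size: int, sigma: float, theta: float,
                      lam: float, psi: float, gamma: float) -> np.ndarray:
    """Real part of a Gabor filter, following Equations (1)-(3)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(np.float64)
    x_rot = x * np.cos(theta) + y * np.sin(theta)        # Eq. (2)
    y_rot = -x * np.sin(theta) + y * np.cos(theta)       # Eq. (3)
    g = np.exp(-(x_rot**2 + (gamma * y_rot)**2) / (2 * sigma**2)) \
        * np.cos(2 * np.pi * x_rot / lam + psi)          # Eq. (1)
    return g

# Remove the DC component by subtracting the mean of the template.
kernel = gabor_real_kernel(size=21, sigma=4.0, theta=np.pi / 4,
                           lam=10.0, psi=0.0, gamma=0.5)
kernel_dc_free = kernel - kernel.mean()                  # zero-mean filter template
```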
3.5. Building the Light-ViT Model
The MSTmap of the finger vein that we construct contains the varying grayscale information while preserving the vein features, which is difficult for a prosthetic to reproduce. Because the static finger vein video is transformed into the MSTmap, the network must be able to model long-range pixel relationships effectively. At the same time, the conversion of vein edge position features requires dedicated attention to local characteristics. Convolutional neural networks (CNNs) have consistently exhibited exceptional proficiency in feature extraction [32]. However, when dealing with multi-scale spatial temporal maps that encompass both global and local features, CNNs still exhibit limitations in comprehensive global feature extraction. In contrast, the ViT network leverages its multi-head attention mechanism, complemented by positional encoding, to excel at long-range pixel-to-pixel feature extraction. The ViT network has been empirically proven to deliver superior performance, yet it grapples with challenges such as large parameter scales, training complexity, and suboptimal performance on smaller datasets. Furthermore, finger vein presentation attack detection (PAD) serves as a pivotal technology in biometric identification, where precision and real-time responsiveness are fundamental requirements. Considering that biometric identification devices are typically compact and operate within the limited computational resources offered by their chips, the system architecture must remain compact. Consequently, we introduce the Light-ViT model, which not only adeptly captures both global and local data features but also integrates seamlessly into our system, achieving high-precision finger vein counterfeit detection at a significantly reduced cost.
The fundamental concept of the Light-ViT network involves the creation of L-ViT blocks to replace the conventional convolution used in MobileNet. The L-ViT backbone constitutes the core of Light-ViT. This network is composed of multiple MobileNet blocks (MN blocks) and L-ViT blocks stacked alternately. Specifically, the MN block employs depth-wise separable convolution operations with the aim of learning local image features while controlling the network’s parameter count, enabling better adaptation to large-scale datasets. On the other hand, the L-ViT block adopts a Transformer structure to capture global image features and integrates them with locally extracted features obtained through convolution.
The MN block is a convolution module within the network used for learning local features and the inductive biases of images. For input features $X_{in} \in \mathbb{R}^{H \times W \times C}$, a 1 × 1 convolution layer first maps the features to a higher dimension via pointwise convolution. The result then passes through a batch normalization (BN) layer and the SiLU activation function to obtain $X_1 \in \mathbb{R}^{H \times W \times tC}$, where the expansion factor $t > 1$ can be adjusted based on network requirements. Following this, $X_1$ undergoes group convolution, followed by another BN layer and the SiLU activation function, to obtain $X_2 \in \mathbb{R}^{\frac{H}{T} \times \frac{W}{T} \times tC}$. Here, T represents the stride, and adjusting this parameter governs the dimensions of the resulting tensor. After merging with the input features through an inverted residual structure, the dimensions of $X_2$ are mapped to the output channel number $C'$ using pointwise convolution (PW), followed by a BN layer, to yield the output $X_{out}$. The structure of the MN block is shown in Figure 5.
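A minimal PyTorch sketch of an MN block along these lines is given below; the expansion factor, channel counts, and kernel size are illustrative assumptions rather than the paper's exact configuration, and the depth-wise (group) convolution stands in for the group convolution described above.

```python
import torch
import torch.nn as nn

class MNBlock(nn.Module):
    """Inverted-residual style MN block: PW expand -> depth-wise conv -> PW project."""
    def __init__(self, c_in: int, c_out: int, stride: int = 1, expand: int = 4):
        super().__init__()
        hidden = c_in * expand
        self.use_residual = (stride == 1 and c_in == c_out)
        self.block = nn.Sequential(
            nn.Conv2d(c_in, hidden, kernel_size=1, bias=False),      # 1x1 pointwise expansion
            nn.BatchNorm2d(hidden),
            nn.SiLU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, stride=stride,  # depth-wise (group) conv
                      padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.SiLU(),
            nn.Conv2d(hidden, c_out, kernel_size=1, bias=False),     # 1x1 pointwise projection
            nn.BatchNorm2d(c_out),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.block(x)
        return x + out if self.use_residual else out                 # inverted residual merge

# Example: a stride-2 MN block halving spatial size and changing channels 32 -> 64.
x = torch.randn(1, 32, 56, 56)
y = MNBlock(32, 64, stride=2)(x)   # -> shape (1, 64, 28, 28)
```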
This method has the advantage of significantly reducing the number of parameters and computations required for convolution while maintaining the same receptive field of the convolution kernel, as previously discussed [33]. Additionally, to equip the CNN with the capacity to learn global features, we have introduced an L-ViT module, as depicted in Figure 6. Given an input $X \in \mathbb{R}^{H \times W \times C}$, we first encode the local spatial information through a 3 × 3 standard convolutional layer. Subsequently, we utilize a 1 × 1 convolutional layer to project the feature dimensions to a higher space, resulting in $X_L \in \mathbb{R}^{H \times W \times d}$. To enable Light-ViT to acquire global representations with spatial inductive bias, we unfold $X_L$ into $N$ patches and aggregate them to yield $X_U \in \mathbb{R}^{P \times N \times d}$, where $P = wh$, with $w$ and $h$ denoting the width and height of a patch, and $N = HW/P$.
Subsequent spatial coding of $X_U$ produces $X_G \in \mathbb{R}^{P \times N \times d}$, preserving both the ordering information among the patches and the spatial details of the pixels within each patch. Then, $X_G$ undergoes a 1 × 1 convolutional projection to return to a lower-dimensional space. It is then concatenated with the input $X$ through a residual structure, followed by fusion of the features using a standard convolution operation.
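For concreteness, a quick numerical check of the unfold dimensions (with assumed values H = W = 28, d = 96, and 2 × 2 patches) is shown below.

```python
H, W, d = 28, 28, 96      # feature-map size after the 1x1 projection (assumed values)
w, h = 2, 2               # patch width and height (assumed)

P = w * h                 # pixels per patch: 4
N = (H * W) // P          # number of patches: 196

# X_L of shape (H, W, d) is rearranged into X_U of shape (P, N, d),
# so each of the P pixel positions attends across all N patches.
print(P, N)               # -> 4 196
```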
The L-ViT block is primarily used to learn global features of the feature map. We employ an Unfold operation to process the input tensor, introducing positional information to the feature map while retaining the Encoder part, which provides the attention mechanism. However, we replace the Decoder part with convolution operations to adjust the size of the output feature map. The enhanced Light-ViT further strengthens the network's understanding of images and delivers more powerful performance through the comprehensive extraction and fusion of local and global features. The structure is depicted in Figure 7.
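The sketch below outlines one plausible PyTorch realization of an L-ViT block following the steps described above (local convolutional encoding, unfold, Transformer encoder, fold, 1 × 1 projection, and convolutional fusion with the residual input); the patch size, embedding dimension, and encoder depth are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torch.nn as nn

class LViTBlock(nn.Module):
    """L-ViT block sketch: local conv encoding -> unfold -> Transformer encoder -> fold -> fusion."""
    def __init__(self, channels: int, dim: int = 96, patch: int = 2, depth: int = 2, heads: int = 4):
        super().__init__()
        self.patch = patch
        self.local_rep = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),  # 3x3 local spatial encoding
            nn.BatchNorm2d(channels),
            nn.SiLU(),
            nn.Conv2d(channels, dim, 1, bias=False),                  # 1x1 projection to dimension d
        )
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                                   dim_feedforward=2 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.proj = nn.Conv2d(dim, channels, 1, bias=False)           # back to C channels
        self.fuse = nn.Conv2d(2 * channels, channels, 3, padding=1, bias=False)  # fuse with residual

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, C, H, W = x.shape
        p = self.patch
        y = self.local_rep(x)                                         # (B, d, H, W)
        # Unfold into N patches of P = p*p pixels: (B, d, H, W) -> (B*P, N, d)
        y = y.reshape(B, -1, H // p, p, W // p, p)
        y = y.permute(0, 3, 5, 2, 4, 1).reshape(B * p * p, (H // p) * (W // p), -1)
        y = self.encoder(y)                                           # global relations across patches
        # Fold back to the spatial feature map: (B*P, N, d) -> (B, d, H, W)
        y = y.reshape(B, p, p, H // p, W // p, -1).permute(0, 5, 3, 1, 4, 2)
        y = y.reshape(B, -1, H, W)
        y = self.proj(y)                                              # 1x1 conv back to C channels
        return self.fuse(torch.cat([x, y], dim=1))                    # residual concat + conv fusion

# Example: global feature learning on a 64-channel, 28 x 28 feature map.
out = LViTBlock(64)(torch.randn(1, 64, 28, 28))   # -> shape (1, 64, 28, 28)
```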
The input image first passes through a standard convolutional layer, which mainly serves to generate shallow feature maps and adjust the input dimensions. The feature maps are then fed into the Light-ViT backbone, which consists of multiple MN blocks and L-ViT blocks stacked alternately, allowing the network to learn new high-level feature maps that integrate both local and global features. These features capture image details and global information more accurately, thereby providing richer and more comprehensive information for the subsequent classification task. Next, a 1 × 1 convolutional layer maps the feature maps into a high-dimensional space. Finally, the network output is obtained through global pooling and a linear layer: global pooling performs statistical operations over the entire feature map to acquire more comprehensive and rich information, and the linear layer converts the pooled features into a vector with the corresponding class probability distribution.
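Putting these pieces together, a compact sketch of the overall Light-ViT forward path (stem convolution, alternating MN and L-ViT blocks, 1 × 1 expansion, global pooling, and a linear classifier) could look as follows; it reuses the MNBlock and LViTBlock sketches above, and the channel widths and number of stages are illustrative assumptions rather than the paper's actual architecture.

```python
import torch
import torch.nn as nn

class LightViT(nn.Module):
    """Sketch of the overall network: stem conv -> alternating MN / L-ViT blocks -> head."""
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.stem = nn.Sequential(                      # standard conv: shallow features, resize input
            nn.Conv2d(3, 16, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(16), nn.SiLU())
        self.backbone = nn.Sequential(                  # MN blocks and L-ViT blocks stacked alternately
            MNBlock(16, 32, stride=2),
            LViTBlock(32),
            MNBlock(32, 64, stride=2),
            LViTBlock(64),
            MNBlock(64, 96, stride=2),
            LViTBlock(96))
        self.expand = nn.Sequential(                    # 1x1 conv to a higher-dimensional space
            nn.Conv2d(96, 384, 1, bias=False),
            nn.BatchNorm2d(384), nn.SiLU())
        self.pool = nn.AdaptiveAvgPool2d(1)             # global pooling over the feature map
        self.classifier = nn.Linear(384, num_classes)   # linear layer -> class logits

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.expand(self.backbone(self.stem(x)))
        x = self.pool(x).flatten(1)
        return self.classifier(x)

# Example: liveness logits for a batch of MSTmap images (live vs. prosthesis).
logits = LightViT()(torch.randn(2, 3, 224, 224))   # -> shape (2, 2)
```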
Our proposed Light-ViT significantly reduces the demand for computational resources and greatly enhances the network’s feature learning capabilities by introducing MN blocks and improving L-ViT blocks. When integrated into the system, this lightweight network exhibits high recognition accuracy.