Presentation attacks can be detected in a variety of ways. In this paper, we focus on two types of face presentation attack detection methods: handcrafted and deep learning-based methods. In this section, we review previous work on face presentation attack detection, focusing on studies that are thematically closest to our goals and contributions.
2.1. Handcrafted Feature-Based Techniques
Texture features, which describe the content and details of a specific region in an image, are important low-level features in face presentation attack detection methods. The analysis of image texture information is therefore used in many techniques, such as compressed sensing, which preserves texture information while denoising [8,9]. Techniques based on handcrafted features provide accurate descriptors that increase the detection rate of a presentation attack detection system. Smith et al. [10] proposed a method for countering attacks on face recognition systems by using the colors displayed on a mobile device screen and reflected from the user's face; the presence or absence of these reflections can be used to establish whether the images were captured in real time. Such algorithms use simple RGB images to detect presentation attacks and can be classified into two categories: static approaches, which operate on a single image, and dynamic approaches, which operate on video.
The majority of approaches for distinguishing between real and synthetic faces are based on texture analysis. Arashloo et al. [11] combined two spatial–temporal descriptors, multiscale binarized statistical image features on three orthogonal planes (MBSIF-TOP) and multiscale local phase quantization on three orthogonal planes (MLPQ-TOP), using kernel discriminant analysis fusion. To distinguish between real and fake individuals, Pereira et al. [12] also experimented with a dynamic texture based on the local binary pattern on three orthogonal planes (LBP-TOP). The good results of LBP-TOP are due to the fact that temporal information is crucial in face presentation attack detection. Tirunagari et al. [13] used local binary patterns (LBP) for dynamic patterns and dynamic mode decomposition (DMD) for visual dynamics. Wen et al. [14] proposed an image distortion analysis (IDA)-based method. Four different features were used to represent the face images: blurriness, color diversity, specular reflection, and chromatic moments; these features can detect differences between a real image and a fake one without capturing any information about the user's identity. Patel et al. [15] investigated the impact of different color channels (R, G, B, and grayscale) and different facial regions on the performance of LBP- and dense scale-invariant feature transform (DSIFT)-based algorithms. Their investigation revealed that extracting the texture from the red channel produces the best results. Boulkenafet et al. [16] proposed a color texture analysis-based face presentation attack detection approach: after converting the RGB images into two color spaces, HSV and YCbCr, they employed the LBP descriptor to extract texture features from each channel and then concatenated these features to distinguish between real and fake faces.
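As an illustration of this color-texture idea, the following sketch extracts uniform LBP histograms from the HSV and YCbCr channels and concatenates them into a single descriptor; the function name, LBP parameters (P, R), and bin count are illustrative choices rather than the exact settings of [16].

import cv2
import numpy as np
from skimage.feature import local_binary_pattern

def color_lbp_descriptor(bgr_image, P=8, R=1):
    """Concatenate uniform-LBP histograms over the HSV and YCbCr channels."""
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    ycrcb = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2YCrCb)
    histograms = []
    for space in (hsv, ycrcb):
        for channel in range(3):
            lbp = local_binary_pattern(space[:, :, channel], P, R, method="uniform")
            hist, _ = np.histogram(lbp, bins=P + 2, range=(0, P + 2), density=True)
            histograms.append(hist)
    return np.concatenate(histograms)  # 6 channels x (P + 2) bins

The resulting descriptors would typically be fed to a binary classifier, such as an SVM, trained on real and attack face images.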
Some methods, such as [17], have recently used user-specific information to improve the performance of texture-based FAS techniques. Garcia et al. [18] proposed face presentation attack detection by looking for Moiré patterns caused by digital grid overlap, detecting them through frequency-domain peak detection. For classification, they used a support vector machine (SVM) with a radial basis function (RBF) kernel and ran their tests on the Replay Attack Corpus and Moiré databases. Other face presentation attack detection solutions are based on textures on 3D models, such as those used in [19]. Because the attacker in 3D models utilizes a mask to spoof the system, the presence of wrinkles can be extremely helpful in detecting the attack. The work presented in [19] examines the viability of performing low-cost attacks on 2.5D and 3D face recognition systems using self-manufactured three-dimensional (3D) printed models.
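For concreteness, the Moiré-pattern cue described above can be sketched as follows: strong peaks in the high-frequency part of the 2D spectrum are collected as features and classified with an RBF-kernel SVM. The band-pass radius and number of peaks are illustrative assumptions, not the settings of [18].

import cv2
import numpy as np
from sklearn.svm import SVC

def spectral_peak_features(bgr_image, low_freq_radius=20, n_peaks=16):
    """Magnitude of the strongest high-frequency peaks in the 2D spectrum."""
    gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY).astype(np.float32)
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(gray)))
    h, w = spectrum.shape
    yy, xx = np.ogrid[:h, :w]
    # Suppress the low-frequency region around the DC component
    spectrum[(yy - h // 2) ** 2 + (xx - w // 2) ** 2 < low_freq_radius ** 2] = 0
    # The strongest remaining peaks (log-scaled) form the feature vector
    return np.log1p(np.sort(spectrum.ravel())[-n_peaks:])

# An RBF-kernel SVM would then be trained on these features, e.g.,
# clf = SVC(kernel="rbf").fit(train_features, train_labels)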
2.2. Deep Learning-Based Techniques
Deep learning is used in a variety of systems and applications for biometric authentication [20], where a deep network can be trained on a large number of patterns. After learning the dataset's distinctive features, the network can be used to identify similar patterns. Deep learning approaches have mostly been used to learn face presentation attack detection features. Moreover, deep learning is effective at both classification (supervised learning) and clustering (unsupervised learning) tasks: in a classification task, the system assigns class labels to the input instances, whereas in clustering, the instances are grouped by similarity without the use of class labels.
To train models with significant discriminative ability, Yang et al. [21] used a deep CNN rather than manually constructing features from scratch. Quan et al. proposed a semi-supervised learning-based architecture to counter face presentation attacks using only a small amount of labeled data, rather than depending on time-consuming data annotation. They assess the reliability of the pseudo labels of selected data using a temporal consistency requirement, which substantially facilitates network training. Moreover, by progressively increasing the contribution of unlabeled target-domain data to the training data, an adaptive transfer mechanism can be implemented to eliminate domain bias. The authors in [22] use a type of ground truth (GT) termed appr-GT, together with the identity information of the spoof image, to generate a genuine image of the corresponding subject in the training set. A metric learning module constrains the genuine images generated from spoof images to be close to the appr-GT and far from the input images, which reduces the effect of changes in the imaging environment on the appr-GT and GT of a spoof image.
Jia et al. [23] proposed a unified unsupervised and semi-supervised domain adaptation network (USDAN) for cross-scenario face presentation attack detection, with the purpose of reducing the distribution mismatch between the source and target domains. The marginal distribution alignment module (MDA) and the conditional distribution alignment module (CDA) are two modules that use adversarial learning to find a domain-invariant feature space and to make features of the same class more compact.
Feng et al. [24] trained a neural network on raw optical flow data from both the cropped face region and the complete scene. Their motion-based presentation attack detection does not require a scene model or motion assumptions to generalize. They present a liveness detection framework in which image quality-based and motion-based cues are fused by a hierarchical neural network.
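A minimal sketch of the optical flow inputs to such a motion-based approach is given below, assuming OpenCV's Farneback dense optical flow and an external face detector that provides the face bounding box; the hierarchical fusion network itself is omitted.

import cv2

def dense_flow(prev_gray, curr_gray):
    """Farneback dense optical flow: returns an H x W x 2 (dx, dy) field."""
    return cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)

def flow_inputs(prev_frame, curr_frame, face_box):
    x, y, w, h = face_box  # (x, y, width, height) from a face detector
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_frame, cv2.COLOR_BGR2GRAY)
    scene_flow = dense_flow(prev_gray, curr_gray)
    face_flow = dense_flow(prev_gray[y:y + h, x:x + w],
                           curr_gray[y:y + h, x:x + w])
    # Both flow maps (whole scene and cropped face) feed the liveness network.
    return face_flow, scene_flow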
In their work [25], Liu et al. proposed a deep tree network (DTN) that learns features in a hierarchical form and can detect unanticipated presentation attack instruments based on the learned features.
Yu et al. [26] introduced two new convolution and pooling operators for encoding fine-grained invariant information: central difference convolution (CDC) and central difference pooling (CDP). CDC outperforms vanilla convolution at extracting intrinsic spoofing patterns in a number of situations.
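For concreteness, CDC is commonly implemented as a vanilla convolution minus a θ-weighted term computed with the spatially summed kernel, roughly as in the PyTorch sketch below; the class name and the default θ = 0.7 are illustrative assumptions.

import torch.nn as nn
import torch.nn.functional as F

class CentralDifferenceConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1, theta=0.7):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size,
                              padding=padding, bias=False)
        self.theta = theta  # theta = 0 reduces to vanilla convolution

    def forward(self, x):
        out_vanilla = self.conv(x)
        if self.theta == 0:
            return out_vanilla
        # The central-difference term equals a 1x1 convolution of x with the
        # spatially summed kernel, so no explicit neighbor subtraction is needed.
        kernel_sum = self.conv.weight.sum(dim=(2, 3), keepdim=True)
        out_diff = F.conv2d(x, kernel_sum, stride=self.conv.stride)
        return out_vanilla - self.theta * out_diff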
Qin et al. [27] proposed Adaptive Inner-update Meta Face Anti-Spoofing (AIM-FAS), a meta-learning approach in which a meta-learner is trained on zero- and few-shot FAS tasks using an adaptive inner-update (AIU) strategy.
According to Yu et al. [28], a multi-level feature refinement module (MFRM) and material-based multi-head supervision can help increase the performance of BCN. In the first, local neighborhood weights are reassembled to create multi-scale features, while in the second, the network is forced to learn robust shared features in order to serve multiple supervision heads.
The authors in [29] developed CDC-based frame-level FAS approaches in which spoofing patterns are captured by aggregating intensity and gradient information. Compared with a vanilla convolutional network, the central difference convolutional network (CDCN) built with CDC has a more robust modeling capability. CDCN++ is an improved version of CDCN that combines a searched backbone network with a multiscale attention fusion module (MAFM) for aggregating multi-level CDC features effectively.
The spatiotemporal anti-spoofing network (STASN), proposed by Yang et al. [30], introduces a new attention mechanism that combines global temporal and local spatial information, allowing the authors to examine the model's interpretable behavior.
To improve CNN generalization, Liu et al. [31] proposed to supervise CNN training with novel auxiliary information. They also proposed a new CNN-RNN architecture for learning the facial depth map and the rPPG signal end to end.
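As an illustration of this kind of auxiliary supervision, the overall training objective can be composed as a weighted sum of a depth-map regression loss (CNN branch) and an rPPG regression loss (RNN branch); the MSE choice and the λ weight below are assumptions for illustration and not the exact losses of [31].

import torch.nn.functional as F

def auxiliary_supervision_loss(pred_depth, gt_depth, pred_rppg, gt_rppg,
                               lambda_rppg=0.5):
    depth_loss = F.mse_loss(pred_depth, gt_depth)  # CNN branch: facial depth map
    rppg_loss = F.mse_loss(pred_rppg, gt_rppg)     # RNN branch: rPPG signal
    return depth_loss + lambda_rppg * rppg_loss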
Wang et al. [32] proposed a depth-supervised architecture that can efficiently encode spatiotemporal information for presentation attack detection and developed a new approach for estimating depth information from several RGB frames. Short-term extraction is accomplished through two dedicated modules: the optical flow-guided feature block (OFFB) and convolutional gated recurrent units (ConvGRU). Jourabloo et al. [33] proposed a new CNN architecture for face presentation attack detection, with appropriate constraints and supplementary supervision, to discern between live and fake faces as well as long-term motion. To detect presentation attacks effectively and efficiently, Kim et al. [34] introduced the bipartite auxiliary supervision network (BASN), an architecture that learns to extract and aggregate auxiliary information.
Huszár et al. [35] proposed a deep learning (DL) approach to detect presentation attack instruments in video. The approach was tested on a new database made up of several videos of users juggling a football, and their algorithm is capable of running in real time in parallel with human activity recognition (HAR). Roy et al. [36] proposed an approach based on the bi-directional feature pyramid network (BiFPN) to detect presentation attacks, since approaches relying on high-level information alone demonstrate negligible improvements. Ali et al. [37] detect presentation attack instruments by stimulating eye movements with visual stimuli that follow randomized trajectories. Ali et al. [38] combine two methods, a head-detection algorithm and deep neural network-based classifiers; their tests involved various face presentation attacks in the thermal infrared domain under various conditions.
It appears that most existing handcrafted and deep learning-based features may not be optimal for the FAS task because of their limited capacity to represent intrinsic spoofing cues. In order to learn features that are more robust to domain shift as well as more discriminative patterns for liveness detection, we propose a deep background subtraction and majority vote algorithm that takes both dynamic and static information into account.
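To make the overall idea concrete, the sketch below illustrates the general pattern of combining per-frame background subtraction with a majority vote over frame-level predictions; the MOG2 subtractor, the classify_frame interface, and the voting threshold are illustrative assumptions and do not reproduce the exact pipeline developed in the following sections.

import cv2
import numpy as np

def video_decision(frames, classify_frame):
    """Majority vote over per-frame liveness predictions (1 = real, 0 = attack)."""
    subtractor = cv2.createBackgroundSubtractorMOG2()
    votes = []
    for frame in frames:
        motion_mask = subtractor.apply(frame)        # dynamic information
        moving_parts = cv2.bitwise_and(frame, frame, mask=motion_mask)
        votes.append(classify_frame(frame, moving_parts))
    return int(np.sum(votes) > len(votes) / 2)       # majority vote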