3.1. Criterion for Determining Light Makeup
To date, no studies have introduced a quantitative method for establishing criteria for light makeup. Instead, assessments have been based predominantly on participants' subjective evaluations of particular facial features. Such subjective, experience-based judgments are easily affected by personal biases and therefore lack precision, repeatability, and general applicability.
This section proposes a quantitative criterion for determining light makeup so that the degree of makeup can be measured objectively. Drawing on the research by Chen et al. [7], bare-faced and makeup face images of the same identity are input into the Face++ [26] face comparison model. The model outputs confidence scores, which are illustrated in Figure 1 along with the input images. A higher confidence score indicates a greater probability that the model considers the two face images to belong to the same identity. This confidence score serves as the first criterion for determining light makeup, denoted as $S_1$, as shown in Equation (1):

$$S_1^{i} = \mathrm{Conf}\left(x_m, x_b^{i}\right), \quad i \in \{1, 2\} \tag{1}$$

where $i$ denotes the index of the bare-faced image, $x_m$ represents the makeup face image, $x_b^{i}$ represents the $i$-th bare-faced image, and $\mathrm{Conf}(\cdot,\cdot)$ is the confidence score output by the face comparison model, indicating the likelihood that the two face images belong to the same identity. A higher $S_1^{i}$ value suggests lighter makeup.
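In practice, $S_1$ can be obtained directly from the Face++ Compare API. The following is a minimal sketch assuming the public v3 compare endpoint and its `confidence` field (0 to 100); the function and variable names are ours, and the endpoint details should be checked against the current Face++ documentation:

```python
import requests

# Face++ "Compare" endpoint (v3); verify against the current Face++ docs.
COMPARE_URL = "https://api-us.faceplusplus.com/facepp/v3/compare"

def compare_confidence(image_path_1: str, image_path_2: str,
                       api_key: str, api_secret: str) -> float:
    """Return the confidence (0-100) that two face images share one identity."""
    with open(image_path_1, "rb") as f1, open(image_path_2, "rb") as f2:
        resp = requests.post(
            COMPARE_URL,
            data={"api_key": api_key, "api_secret": api_secret},
            files={"image_file1": f1, "image_file2": f2},
            timeout=30,
        )
    resp.raise_for_status()
    return resp.json()["confidence"]

def s1_scores(makeup_path: str, bare_paths: list[str],
              api_key: str, api_secret: str) -> list[float]:
    """S1^i = Conf(x_m, x_b^i) for each bare-faced image of one subject."""
    return [compare_confidence(makeup_path, bare, api_key, api_secret)
            for bare in bare_paths]
```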
The developers of the Face++ face comparison model provide a threshold for determining whether two face images are captured from the same identity. However, this threshold is intended to judge identity in general, accounting for factors such as environmental lighting, face angle, and facial expression in addition to makeup. Consequently, using this threshold directly as a criterion for light makeup is inaccurate. To address this limitation and improve the generalizability of the light makeup determination criterion, our laboratory collected an image dataset for makeup assessment from the Internet and conducted a statistical analysis of the confidence scores output for makeup and bare-faced images. The dataset, primarily sourced from Little Red Book [27], includes 50 subjects, each with two bare-faced images captured under different environments, one image subjectively judged as light makeup, and one image subjectively judged as heavy makeup.
To determine the $S_1$ threshold, we conducted a statistical analysis of the $S_1$ values for light makeup and heavy makeup faces. Specifically, we input pairs of the light makeup image $x_{lm}$ and each bare-faced image $x_b^{i}$, as well as pairs of the heavy makeup image $x_{hm}$ and each $x_b^{i}$, from the same subject into the face comparison model. The confidence scores were then plotted as scatter plots, as illustrated in Figure 2. In these plots, the x-axis represents the subject's identity, while the y-axis denotes the confidence score output by the model. The yellow dashed lines in each plot indicate the maximum, mean, and minimum confidence scores. By averaging the $S_1$ values for all image pairs, we obtained a mean value of 78.425. Although enhancing user experience and avoiding inconvenience to innocent users are important considerations, the primary objective of face anti-spoofing is to resist spoofing attacks, so a stricter criterion is preferable. To simplify the value while keeping the criterion on the strict side, we rounded 78.425 up to 79. Consequently, the first criterion for light makeup is established as $S_1^{i} \geq 79$ for both comparisons of the makeup image with the two bare-faced images (i.e., for $i = 1, 2$).
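For concreteness, the threshold derivation amounts to taking a mean and rounding up. A toy example with hypothetical scores (chosen only so that they reproduce the reported mean of 78.425) is:

```python
import math
from statistics import mean

# Hypothetical S1 values, for illustration only; the actual analysis averages
# the scores of all image pairs in the makeup assessment dataset.
s1_values = [81.2, 74.9, 79.1, 78.5]

mean_s1 = mean(s1_values)            # 78.425, matching the reported mean
threshold_s1 = math.ceil(mean_s1)    # round up -> 79, the stricter choice
print(mean_s1, threshold_s1)
```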
In order to mitigate the effects of environmental lighting, face angle, and facial expression, this section introduces a second criterion for determining light makeup, as represented by Equation (2):

$$S_2 = \mathrm{Conf}\left(x_b^{1}, x_b^{2}\right) - \frac{1}{2}\sum_{i=1}^{2} \mathrm{Conf}\left(x_m, x_b^{i}\right) \tag{2}$$

where the term $\mathrm{Conf}(x_b^{1}, x_b^{2})$ within $S_2$ quantifies the confidence score changes attributable to factors other than makeup, thereby reducing the influence of environmental lighting, face angle, and facial expression. A smaller $S_2$ value indicates lighter makeup.
To establish a reasonable threshold for $S_2$, scatter plots of the $S_2$ values for light makeup and heavy makeup images from the makeup assessment dataset were generated, as depicted in Figure 3. The y-axis of the scatter plot represents the $S_2$ values. Similar to $S_1$, the average of all image pairs' $S_2$ values was calculated and, since a smaller $S_2$ indicates lighter makeup, rounded down to 11 to keep the criterion strict. Consequently, the second criterion for light makeup is set as $S_2 \leq 11$. A makeup face image is determined to be light makeup if it satisfies both $S_1^{i} \geq 79$ (for $i = 1, 2$) and $S_2 \leq 11$. Examples of this judgment are illustrated in Figure 4, where the first two images in each group are bare-faced images and the third image is the makeup face to be evaluated.
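Putting the two criteria together, the light makeup decision can be sketched as follows; `s2_score` implements Equation (2) as reconstructed above, and all names here are ours rather than the authors':

```python
def s2_score(conf_mb1: float, conf_mb2: float, conf_b1b2: float) -> float:
    """S2 per Equation (2): bare-vs-bare confidence (non-makeup variation)
    minus the mean makeup-vs-bare confidence."""
    return conf_b1b2 - (conf_mb1 + conf_mb2) / 2.0

def is_light_makeup(conf_mb1: float, conf_mb2: float, conf_b1b2: float,
                    t1: float = 79.0, t2: float = 11.0) -> bool:
    """Light makeup iff S1^i >= 79 for both bare-faced images and S2 <= 11."""
    s1_ok = conf_mb1 >= t1 and conf_mb2 >= t1
    s2_ok = s2_score(conf_mb1, conf_mb2, conf_b1b2) <= t2
    return s1_ok and s2_ok
```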
Drawing on the approach proposed by Dantcheva et al. [28], this paper generates makeup face images through virtual makeup. The process involves three steps. First, makeup transfer is performed on the real, bare-faced images from an existing face anti-spoofing database to create makeup face images. Next, light makeup face images are screened from the generated makeup face images using the light makeup determination criteria defined above. Finally, these light makeup face images are combined with the original database to form a face anti-spoofing database that includes light makeup faces. The advantages of this approach are twofold:
(1) It minimizes the differences between bare-faced and makeup faces caused by pose, lighting, and expression, thereby allowing for a focused analysis of the effects of makeup.
(2) By leveraging an existing face anti-spoofing database, which already contains real, bare-faced images and spoofed face images, the makeup transfer on real, bare-faced images facilitates the construction of a face anti-spoofing database that includes light makeup faces.
3.2. Collection of Data for Makeup Transfer
Makeup transfer requires the collection of real, bare-faced images and reference makeup face images. The specific sources are as follows: Real, bare-faced images were primarily obtained from commonly used face anti-spoofing databases that contain real face images, namely MSU-MFSD [29], Replay-Attack [30], CASIA-FASD [31], and OULU-NPU [32] (hereafter referred to as the M, I, C, and O databases, respectively, with Replay-Attack abbreviated as I by common convention). Reference makeup face images were mainly collected manually from the Internet, with Little Red Book [27] as the primary source. To ensure high quality, only images uploaded by the subjects themselves or by makeup artists, taken with the original camera and not post-processed, were collected; this minimizes quality loss and interference from transmission and post-processing. Additionally, the collected images must have a face region occupying at least half of the image, and the facial skin texture must be clearly visible to provide sufficient information and detail for subsequent makeup transfer.
Ultimately, a total of 496 reference makeup face images were collected from over 300 individuals. The reference makeup face image dataset was divided according to the ratio of real video counts in the training, validation, and test sets of the four selected face anti-spoofing databases. Example reference makeup face images are shown in Figure 5.
3.3. Generating Light Makeup Faces with Makeup Transfer Algorithms
To enrich the makeup effects, this paper employs two makeup transfer algorithms: SpMT (Semi-Parametric Makeup Transfer via Semantic-Aware Correspondence) [33] and EleGANt (Exquisite and Locally Editable GAN for Makeup Transfer) [34]. The SpMT algorithm generates more subtle makeup effects, while the EleGANt algorithm handles fine details such as eye makeup better, producing more noticeable makeup effects.
This section uses the makeup transfer models released by the original authors of SpMT and EleGANt. The algorithms are applied to the real, bare-faced images from the M, I, C, and O databases, using the reference makeup face images as references. For each video in the training, validation, and test sets, a reference makeup face image is randomly selected from the corresponding reference makeup face image set for makeup transfer. To avoid confusion and facilitate subsequent applications, the makeup face images generated using different makeup transfer methods and from different original videos are stored in separate video folders. Next, the real, bare-faced images and the generated makeup face images of the same identity are input into the face comparison model. If a generated makeup face image satisfies the light makeup determination criteria (both $S_1^{i} \geq 79$ for $i = 1, 2$ and $S_2 \leq 11$), it is considered a light makeup face image and is retained. Figure 6 shows an example triplet of the original real, bare-faced image, the reference makeup face image, and the generated light makeup face image.
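The screening step can then be expressed as a simple loop over generated images. This sketch assumes a same-identity confidence function such as the Face++ wrapper sketched in Section 3.1 and uses our own names throughout:

```python
from typing import Callable, Iterable, Tuple

def filter_light_makeup(
    triplets: Iterable[Tuple[str, str, str]],   # (makeup, bare_1, bare_2) paths
    conf: Callable[[str, str], float],          # same-identity confidence, 0-100
    t1: float = 79.0,
    t2: float = 11.0,
) -> list[str]:
    """Retain generated makeup images that satisfy both light makeup criteria."""
    kept = []
    for makeup, bare1, bare2 in triplets:
        s1a, s1b = conf(makeup, bare1), conf(makeup, bare2)
        s2 = conf(bare1, bare2) - (s1a + s1b) / 2.0   # Equation (2), reconstructed
        if s1a >= t1 and s1b >= t1 and s2 <= t2:
            kept.append(makeup)
    return kept
```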
To evaluate the performance degradation of algorithms when real faces in the target domain transition from bare-faced to light makeup, this section replaces the real, bare-faced images in the original face anti-spoofing databases with their corresponding light makeup face images. Additionally, to ensure that the final constructed database covers the shooting environments and identities of all original real, bare-faced videos and contains light makeup videos generated by both makeup transfer methods, light makeup videos generated by the two methods are used alternately. If an original real video has no corresponding light makeup video, the original real video is used directly.
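A minimal sketch of this alternation, with video IDs as dictionary keys and all names hypothetical:

```python
def mix_light_makeup_videos(originals: dict, spmt: dict, elegant: dict) -> dict:
    """Alternate between SpMT- and EleGANt-generated light makeup videos;
    fall back to the original real video when no light makeup version exists."""
    mixed = {}
    for idx, vid in enumerate(sorted(originals)):
        preferred = spmt if idx % 2 == 0 else elegant
        mixed[vid] = preferred.get(vid, originals[vid])
    return mixed
```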
Ultimately, a face anti-spoofing database was constructed that includes the original bare-faced real videos and light makeup videos generated by both makeup transfer methods (Makeup_Mix, hereafter referred to as Mkx). The specific distribution of the videos is shown in Table 1.
3.4. Assessment of Face Anti-Spoofing Algorithms in Light Makeup Scenarios
This section evaluates the performance of representative existing face anti-spoofing algorithms, selected from the work reviewed in Section 2, in light makeup scenarios, and thereby also validates the constructed database.
We evaluate the models on the I, C, M, and O datasets together with the Mkx dataset, which specifically contains spoofing detection data with light makeup. The Mkx dataset is divided into subsets according to the source dataset whose real faces received makeup transfer, labeled Mk(I), Mk(C), Mk(M), and Mk(O), respectively.
The models are initially trained on a source domain comprising faces without light makeup and subsequently tested on two distinct target domains: one in which the real faces mainly wear light makeup, and one in which they are bare-faced. This setup enables a comparative analysis of model performance in scenarios with and without light makeup. A leave-one-out testing strategy, prevalent in the face anti-spoofing field, is employed: for example, the models are trained on the I, C, and M datasets and tested on the O dataset, denoted ICM_O. The additional strategies are OCI_M, OCI_Mk(M), OIM_C, OIM_Mk(C), OCM_I, OCM_Mk(I), and ICM_Mk(O), giving eight testing strategies in total, enumerated in the sketch below.
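For reference, the eight training/testing combinations can be enumerated programmatically:

```python
# Train on three source datasets, test on the held-out target, which is either
# the bare-faced dataset or its light makeup (Mk) counterpart.
PROTOCOLS = [
    ("OCI", "M"), ("OCI", "Mk(M)"),
    ("OIM", "C"), ("OIM", "Mk(C)"),
    ("OCM", "I"), ("OCM", "Mk(I)"),
    ("ICM", "O"), ("ICM", "Mk(O)"),
]

for train, test in PROTOCOLS:
    print(f"{train}_{test}")   # OCI_M, OCI_Mk(M), ..., ICM_Mk(O)
```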
The models' performance is assessed using two widely recognized metrics in the face anti-spoofing domain: Area Under the Curve (AUC) and Half Total Error Rate (HTER). Because HTER calculation involves threshold selection, this paper follows the experimental settings of the SA-FAS and SSAN papers to ensure fair evaluation. Specifically, the threshold is chosen at the point on the ROC curve where (TPR - FPR) is maximized, and HTER is calculated at this threshold; the model with the highest (AUC - HTER) value is identified as the best-performing model. To guarantee the accuracy and reliability of the experimental results, all parameter settings for the evaluated algorithms are taken from their respective original papers. The experimental results are shown in Table 2, Table 3, Table 4 and Table 5.
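Under these settings, AUC and HTER can be computed from per-sample scores as sketched below. The sketch assumes labels with 1 = real and 0 = spoof, scores where higher means more likely real, and the threshold at the maximal TPR - FPR point, as described above:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def auc_hter(labels: np.ndarray, scores: np.ndarray) -> tuple:
    """AUC and HTER with the threshold chosen where TPR - FPR is maximal."""
    auc = roc_auc_score(labels, scores)
    fpr, tpr, _ = roc_curve(labels, scores)
    j = np.argmax(tpr - fpr)        # operating point on the ROC curve
    far = fpr[j]                    # spoof faces accepted as real
    frr = 1.0 - tpr[j]              # real faces rejected as spoof
    return auc, (far + frr) / 2.0   # HTER = (FAR + FRR) / 2
```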
The results presented in Table 3 exhibit several noteworthy characteristics:
(1) Impact of Light Makeup: The transition of real faces in the target domain from bare to light makeup results in a performance decline for most algorithms. This observation underscores the inadequacy of current face anti-spoofing methods in handling scenarios involving light makeup.
(2) Performance in Zero-shot Scenario: In the zero-shot scenario, cross-domain methods exhibit superior performance, followed by large model-based methods. In contrast, binary supervision-based methods demonstrate relatively poor performance.
(3) Performance Differences Across Testing Strategies: A detailed analysis of the results from various testing strategies indicates that all algorithms experience a significant performance drop in ICM_Mk(O) compared to ICM_O. Conversely, some algorithms show a smaller performance drop in OCI_Mk(M), OIM_Mk(C), and OCM_Mk(I) compared to OCI_M, OIM_C, and OCM_I, with a few algorithms even showing performance improvements.
These differences are primarily attributed to the domain variations within the O, C, M, and I datasets, as well as to the reference makeup images used for makeup transfer, as depicted in Figure 7. The O dataset was primarily collected from domestic identities with minimal variations in lighting conditions. The C dataset, also collected from domestic identities, exhibits larger variations in lighting conditions. The I and M datasets are predominantly composed of foreign identities, characterized by significant differences in population distribution and greater variations in lighting conditions. For makeup transfer, the high-quality reference makeup images are mainly sourced from makeup display photos of domestic women, as shown in Figure 5, and therefore exhibit smaller variations in population distribution and lighting conditions. Makeup transfer aims to preserve the lighting conditions and identity features of the bare-faced images, but the process inherently modifies skin tone and facial features. Consequently, makeup transfer applies minor adjustments to the bare-faced images based on the reference makeup images, which reduces the domain variations in population distribution and lighting conditions relative to the original datasets, while at the same time introducing a new domain variation in makeup. Algorithms generalize differently to these two kinds of variation (the reduced population and lighting variation on the one hand, and the introduced makeup variation on the other), which results in differing performance across the various testing strategies.