Article

Study on the Generation and Comparative Analysis of Ethnically Diverse Faces for Developing a Multiracial Face Recognition Model

by Yeongje Park 1,†, Junho Baek 1,†, Seunghyun Kim 1, Seung-Min Jeong 1, Hyunsoo Seo 1 and Eui Chul Lee 2,*
1 Department of AI & Informatics, Graduate School, Sangmyung University, Seoul 03016, Republic of Korea
2 Department of Human-Centered Artificial Intelligence, Sangmyung University, Seoul 03016, Republic of Korea
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Electronics 2024, 13(18), 3627; https://doi.org/10.3390/electronics13183627
Submission received: 1 August 2024 / Revised: 5 September 2024 / Accepted: 10 September 2024 / Published: 12 September 2024

Abstract

Despite major breakthroughs in facial recognition technology, problems with bias and a lack of diversity still plague face recognition systems today. To address these issues, we created synthetic face data using a diffusion-based generative model and fine-tuned already-high-performing models. To achieve a more balanced overall performance across various races, the synthetic dataset was created by following the dual-condition face generator (DCFace) generation procedure and using race-varied data from BUPT-BalancedFace as well as FairFace. To verify the proposed method, we fine-tuned a pre-trained improved residual network (IResNet)-100 model with additive angular margin (ArcFace) loss using the synthetic dataset. The results show that the racial gap in performance is reduced from 0.0107 to 0.0098 in standard deviation terms, while the overall accuracy increases from 96.125% to 96.1625%. The improved racial balance and diversity in the synthetic dataset led to an improvement in model fairness, demonstrating that this resource could facilitate more equitable face recognition systems. This method provides a low-cost way to address data diversity challenges and to make face recognition more accurate across different demographic groups. The results highlight that more advanced synthetic datasets created with diffusion-based models can increase facial recognition accuracy while improving fairness, and that such resources should not be overlooked by developers building artificial intelligence (AI) systems.

1. Introduction

Face recognition technology has rapidly advanced in recent years, being widely used in various fields such as security, surveillance, and user authentication. Particularly, with the development of deep learning technology, the accuracy of face recognition has significantly improved, increasing its utility in real-life applications. However, a persistent issue with face recognition technology is its relatively low accuracy in multiracial environments. This is primarily due to the lack of diversity in datasets and inherent biases toward certain races, leading to higher error rates for specific racial groups.
To overcome these issues, many previous studies have been conducted, and various multiracial datasets have been established, facilitating research on face recognition across diverse racial environments. For instance, the BUPT-BalancedFace dataset was developed to enhance the fairness of face recognition models by balancing the representation of various races [1]. However, the process of constructing such datasets typically involves web crawling or sampling from existing datasets by race, making it challenging to obtain a sufficient number of images for each subject. Additionally, this collection and filtering process requires manual intervention, leading to high costs. Moreover, collecting datasets for individual races poses an even more challenging problem. In contrast, by utilizing generative model-based datasets, it is possible to generate a wide variety of images for each subject, which contributes to training more robust face recognition models. Furthermore, the use of AI-based automated methods allows for the easier achievement of racial balance with fewer image samples, thereby significantly reducing costs compared with traditional methods.
Therefore, attempts are being made to increase the accuracy of multiracial face recognition by utilizing synthesized data. The problem of data shortage can be effectively addressed by using a generative adversarial network (GAN) to generate realistic images of various races. This approach not only reduces the cost of data construction but also provides datasets that better reflect racial diversity, thus contributing to the development of more robust face recognition systems. Integrating synthetic data into the training pipeline of a face recognition system has the potential to significantly improve performance and diversity across different demographic groups.
Additionally, this study adopts a method of utilizing existing high-performance models to improve multiracial face recognition tasks, rather than building a new model from scratch. This approach involves fine-tuning existing models to address performance imbalances related to racial diversity in face recognition tasks. After training the face recognition model based on this approach, it is experimentally validated. This exploration aims to enhance the robustness of face recognition technology and provide higher performance across various racial groups. The main contributions of this study are as follows:
  • Enhanced Data Diversity and Quality: This study improved the data diversity and quality by generating synthetic data using a diffusion-based generative model, as opposed to the conventional GAN-based approaches. This advancement contributed to the enhancement of face recognition model performance.
  • Improved Data Quality through Image Filtering: This study proactively identified potential issues in the synthetic data generation process and addressed them through image filtering. The comparative analysis before and after filtering underscored the importance of selecting high-quality base images, which substantially improved the reliability and usability of the synthetic data.
  • Optimized Model Performance with ArcFace Loss: The study utilized synthetic data to train and fine-tune a pre-trained face recognition model, employing the additive angular margin (ArcFace) [2] loss function. This approach enhanced the overall model accuracy and improved the performance balance across different racial groups, thereby contributing to greater fairness in face recognition.
The paper is organized as follows. Section 2 reviews related work. Section 3 explains the proposed synthetic dataset construction method and the face recognition model that utilizes these data. The experimental results are presented in Section 4. The discussion and conclusions are provided in Section 5 and Section 6, respectively.

2. Related Works

2.1. Face Recognition

Face recognition is one of the most prominent and long-studied problems in computer vision [3,4,5]. Therefore, there is a wide variety of datasets [6,7,8,9,10,11,12,13,14,15] available.
Kim et al. proposed DiscFace, a method to address the problem of process discrepancy in softmax-based face recognition models [16]. The approach introduces minimum discrepancy learning to align the orientation of sample features within a class using a single learnable criterion. DiscFace performed well on a variety of benchmarks, achieving a true acceptance rate (TAR) of 93.37% at a false acceptance rate (FAR) of 10⁻⁵ on cross-pose LFW (CPLFW) [17], 88.83% on IARPA Janus Benchmark-B (IJB-B) [18], 93.71% on IARPA Janus Benchmark-C (IJB-C) [19], and 97.44% on MegaFace [10]. The model was trained on the CASIA-WebFace [7] and MS1MV2 [2] datasets, using QMUL-SurvFace [20] to evaluate low-resolution surveillance face recognition. DiscFace effectively mitigates the mismatch between the training and evaluation phases to improve the accuracy of face recognition tasks.
Chrysos et al. [21] proposed deep polynomial neural networks (DPNNs), a neural network architecture that incorporates polynomial expansions to capture complex patterns in data. The method extends traditional linear transformations to polynomial transformations to improve the expressiveness of the network. DPNNs have shown significant performance gains on various benchmarks, including the Canadian Institute for Advanced Research, 10 classes (CIFAR-10) dataset [22] (IS 8.49, FID 16.79), ImageNet [23] (top-1 error 22.827%, top-5 error 6.431%), and face verification tasks.
Kim et al. [24] introduced the adaptive margin function (AdaFace), a face recognition model that uses quality-adaptive margins to improve robustness and accuracy. This approach adjusts the margin of the loss function based on the input image quality to account for variations in resolution, illumination, occlusion, and so on. AdaFace achieved state-of-the-art results on benchmarks such as labeled faces in the wild (LFW) [6] (accuracy 99.78%), MegaFace [10] (rank-1 identification accuracy 98.1%, TAR 95.2% at FAR 10⁻⁶), and IJB-C [19] (TAR 93.7% at FAR 10⁻⁵), demonstrating significant improvements in accuracy and robustness.
Alansari et al. [25] introduced GhostFaceNets, a set of lightweight face recognition models that utilize ghost modules to reduce the computational cost while maintaining performance. Ghost modules decompose traditional convolutional operations into intrinsic feature generation and cheap linear transformations to efficiently generate more feature maps. The models were evaluated on a variety of face recognition benchmarks and demonstrated competitive accuracy while significantly reducing model size and floating-point operations (FLOPs). GhostFaceNets are designed to be both efficient and effective, making them suitable for deployment in resource-constrained environments. However, most face recognition research has been conducted on Western-centric data, and accuracy for other groups, such as Asian and African populations, remains lower [26,27,28]. To overcome this, attempts are being made to create increasingly racially diverse datasets [1,19,29,30].
However, creating new datasets with equal proportions of each race is time-consuming and expensive. Moreover, augmenting existing images to reach the required data volume offers limited performance improvement because it does not actually create people with diverse identities. To overcome this problem, this study does not collect new data but instead generates data of different races from existing data.

2.2. Synthetic Face Datasets

Recent advancements in facial recognition technology have been largely attributed to the utilization of large-scale datasets, leading to the development of various synthetic datasets.
DigiFace-1M [31] is a dataset comprising 1.22 million synthetic face images generated through a computer graphics pipeline based on three-dimensional (3D) face scan data. This dataset creates unique identities by randomly combining facial geometry, texture, and hairstyles, and renders realistic face images under various environments.
SynFace [32] is a dataset generated using DiscoFaceGAN [33] containing 500,000 synthetic face images with 10,000 diverse identities. This dataset controls various attributes such as facial posture, expression, and illumination, and employs identity mixup (IM) and domain mixup (DM) [34] techniques to reduce the performance gap between synthetic and real data, demonstrating high-accuracy facial recognition using only synthetic data.
The Flickr-Faces-HQ dataset (FFHQ) [15] is a high-quality face image dataset released by NVIDIA, consisting of approximately 70,000 images at a resolution of 1024 × 1024 and covering diverse ages, genders, ethnicities, backgrounds, and accessories. FFHQ serves as the training data for StyleGAN, which is used to generate synthetic face images.
SegTex [35] is a method for generating synthetic face data by converting segmentation maps to textured images. This approach creates segmentation maps representing key facial regions from the CelebAHQ-Mask [36] dataset and applies textures extracted from the CelebAMask-HQ [36] dataset to these segmentation maps. SegTex enhances the quality and diversity of synthetic images using state-of-the-art techniques such as adaptive instance normalization (AdaIN) [37] and spatially-adaptive de-normalization (SPADE) [38].
These synthetic datasets address ethical and privacy issues. Additionally, they provide images that include a variety of races, ages, and genders, thereby reducing racial bias and ensuring diversity in datasets, which significantly contributes to advancements in facial recognition technology. However, existing generated data need to be filtered because they contain a significant amount of corrupted images as well as uncanny-valley artifacts that make them unsuitable for direct use. To overcome this problem, we generated our own clean data and analyzed the problems that arise when unclean data are used.

2.3. Face Data Generation Model

Recent synthetic face data generation techniques use advanced diffusion models to produce high-quality and diverse face images, maintaining identity consistency. These methods significantly improve face recognition performance on standard datasets.
The dual-condition face generator (DCFace) proposed by Kim et al. [39] is a synthetic face generation model that utilizes a dual-condition diffusion model to generate high-quality diverse face images. The dataset undergoes a two-stage generation process of conditional sampling and mixing. In the conditional sampling stage, high-quality identity (ID) images are generated and style images are selected from a real dataset. In the subsequent mixing stage, a diffusion-based generator is used to combine these two images, maintaining both inter-class and intra-class variability in the synthesized face. Additionally, DCFace employs a patch-wise style extractor to capture style features from image patches without ID information and uses time-step-dependent ID loss to ensure consistent ID representation throughout the generation process. The DCFace dataset constructed through this approach significantly reduced the performance gap with the real data, CASIA-WebFace [7], for face recognition tasks on the LFW [6], AgeDB [40], and CFP-CP [41] datasets compared with existing datasets.
The identity denoising diffusion probabilistic model (ID3PM) proposed by Kansy et al. [42] addresses the challenge of generating high-quality identity-preserving face images in a black-box setting without full access to the face recognition model. ID3PM uses a denoising diffusion probabilistic process to iteratively transform random noise into realistic face images while preserving identity-specific features. This model extracts ID vectors using a pre-trained face recognition network, which conditions the diffusion model to maintain ID consistency. ID3PM is suitable for scenarios with restricted model access, as it does not require gradients from the face recognition model during inference and includes mechanisms to control attributes such as pose and lighting, allowing for the intuitive manipulation of the generated images. The ID3PM dataset demonstrated performance comparable to models trained on real datasets for face recognition tasks on the LFW [6], AgeDB [40], and CFP-CP [41] datasets, and it showed superior performance compared with other face generation datasets.
We investigated the applicability of the two well-performing generative models described above and used them in this study to generate clean multiracial synthetic data and improve face recognition performance.

3. Proposed Method

In this study, we propose a method to generate synthetic face image data using a diffusion-based generative model and employ these data to fine-tune a model for improved face recognition performance. The method is detailed in two main sections. Section 3.1 describes the data synthesis process, in which we generate a synthetic dataset using the DCFace framework [39]. The dataset is created from a combination of ID and style images, allowing diverse facial images to be generated under varying conditions. Section 3.2 outlines the face recognition model fine-tuning process, in which the synthetic dataset is used to fine-tune a pre-trained improved residual network (IResNet)-100 model [43] with ArcFace loss [2].

3.1. Data Synthesis

In this study, we propose generating synthetic data for training a face recognition model and using these data to train the model. First, DCFace was used to generate the synthetic training data [39]. DCFace is a diffusion-based generative model trained for the purpose of producing synthetic datasets. When the performance of a face recognition model trained with its generated images was tested, an average performance increase of 6.11% was observed in 4 out of 5 datasets compared with previous work. DCFace creates the face dataset in two stages and generates faces from a subject's ID and a style image, which means that, depending on the style, face images under various conditions can be generated even for the same ID. Inspired by its high performance across various datasets and its ability to set diverse generation conditions according to style, DCFace was adopted as the synthetic face dataset generation model in this study, and the generation process followed the one proposed by DCFace.
First, the dataset used for the style images that form the basis of the synthetic face images is BUPT-BalancedFace [1]. This dataset includes 7000 subjects per race and consists of a total of 1.3 M images. The races are Caucasian, Indian, Asian, and African. In this study, to ensure demographic fairness, 2500 subjects per race were extracted, and a total of 10,000 subjects were used to generate the synthetic data. Fifty synthetic images were generated per subject, yielding a final total of 500,000 images.
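As a rough illustration of this sampling and generation procedure, the sketch below (not the authors' released code) balances the four race groups, draws 2500 subjects per group, and pairs each ID image with 50 style images. The directory layout and the dcface_generate wrapper around the pre-trained DCFace model are assumptions for illustration only.

```python
# Illustrative sketch only: balanced subject sampling and per-subject generation.
# `dcface_generate(id_img, style_img)` is a hypothetical wrapper around the
# pre-trained DCFace model that returns a PIL image; the folder layout is assumed.
import random
from pathlib import Path

RACES = ["Caucasian", "Indian", "Asian", "African"]
SUBJECTS_PER_RACE = 2500      # 4 x 2500 = 10,000 synthetic identities
IMAGES_PER_SUBJECT = 50       # 10,000 x 50 = 500,000 images in total

def build_synthetic_dataset(id_root: Path, style_root: Path, out_root: Path, dcface_generate):
    for race in RACES:
        id_candidates = sorted((id_root / race).glob("*.jpg"))
        id_images = random.sample(id_candidates, SUBJECTS_PER_RACE)
        style_pool = sorted((style_root / race).rglob("*.jpg"))
        for subject_idx, id_img in enumerate(id_images):
            subject_dir = out_root / race / f"{subject_idx:05d}"
            subject_dir.mkdir(parents=True, exist_ok=True)
            for image_idx, style_img in enumerate(random.sample(style_pool, IMAGES_PER_SUBJECT)):
                synthetic = dcface_generate(id_img, style_img)  # assumed interface
                synthetic.save(subject_dir / f"{image_idx:02d}.jpg")
```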
The dataset used as the ID image corresponding to the eye, nose, and mouth of the synthetic face image is FairFace [29]. This dataset consists of seven races: White, Black, Indian, East Asian, Southeast Asian, Middle Eastern, and Latino Hispanic. The number of images for each race is shown in Table 1. The ratio for each race is also shown in Figure 1.
The ID images cover 7 races, whereas the style images cover 4, so the 7 races were reclassified into 4 by referring to FairFace [29]. Southeast Asian and East Asian were classified as Asian, while White, Middle Eastern, and Latino Hispanic were classified as Caucasian; Latino Hispanic individuals are often recognized as White, which is why they were grouped with Caucasian. Indian was kept as Indian, and Black was classified as African. In addition, since the ID images should consist of 10,000 subjects, the subject counts for the 7 races were balanced accordingly. The final per-race subject counts are shown in Table 2.
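The 7-to-4 remapping described above can be expressed as a simple lookup; the label strings below follow the FairFace release, and the mapping mirrors the one stated in the text.

```python
# FairFace (7 labels) to BUPT-BalancedFace (4 groups) remapping used for ID images.
FAIRFACE_TO_BUPT = {
    "East Asian": "Asian",
    "Southeast Asian": "Asian",
    "White": "Caucasian",
    "Middle Eastern": "Caucasian",
    "Latino_Hispanic": "Caucasian",
    "Indian": "Indian",
    "Black": "African",
}

def remap_race(fairface_label: str) -> str:
    """Map a FairFace race label to one of the four BUPT-BalancedFace groups."""
    return FAIRFACE_TO_BUPT[fairface_label]
```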
Additionally, this study addressed issues that could potentially arise during the synthetic data creation process. Among the candidate ID images, some had a resolution so low that the face could not be properly identified, or the eyes, nose, and mouth could not be fully synthesized because the subject's face was not facing forward. Images in which the face is obscured by a shadow or excessively covered by a hat or similar object were also judged to be problematic for synthesis. Examples of such problematic images are shown in Figure 2. Therefore, data filtering was performed before data generation, and the difference between synthetic data generated before and after filtering is shown in Figure 3.
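The exact filtering criteria are not enumerated in the text, so the sketch below is only an assumed approximation: it rejects ID candidates that are very low resolution, that contain no confidently detected face, or whose landmarks suggest a strongly non-frontal pose. The detect callable and the thresholds are hypothetical.

```python
# Hypothetical ID-image filter: resolution, detection confidence, rough frontal check.
from PIL import Image

MIN_SIDE = 112     # assumed minimum image side length
MIN_CONF = 0.95    # assumed minimum face-detection confidence

def keep_id_image(path, detect) -> bool:
    """`detect(img)` is assumed to return (confidence, bounding_box, landmarks),
    with landmarks[0]/[1]/[2] being the left eye, right eye, and nose tip."""
    img = Image.open(path).convert("RGB")
    if min(img.size) < MIN_SIDE:
        return False                       # too low resolution to identify the face
    confidence, box, landmarks = detect(img)
    if confidence is None or confidence < MIN_CONF:
        return False                       # occluded, shadowed, or no visible face
    left_eye_x, right_eye_x, nose_x = landmarks[0][0], landmarks[1][0], landmarks[2][0]
    return left_eye_x < nose_x < right_eye_x  # crude frontal-pose check
```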
Ultimately, the synthetic dataset used to train the face recognition model contained 500 K images spanning various ages, genders, and races from a demographic perspective. An example of the generated images is shown in Figure 4. The example shows a Caucasian subject with 18 style images and the 18 corresponding synthetic images, confirming that the face from the ID image was naturally synthesized with each style image.

3.2. Face Recognition Fine Tuning Using Synthetic Dataset

In order to evaluate the usefulness of the newly generated dataset, we fine-tuned a pre-trained face recognition model on the generated data and compared the performance before and after fine-tuning. Through this, we aimed to confirm how effectively the newly generated dataset can improve face recognition performance compared with the existing dataset. For pre-training, we used the widely used CASIA-WebFace dataset [7], which provides 494,414 face images of 10,575 individuals and is considered an important resource for face recognition research and experiments. This dataset includes face images captured in various environments and conditions, contributing to the generalization performance of the face recognition model.
The model used for training is based on the IResNet-100 [43] architecture, with a simple fully connected classification head attached. This model has high expressiveness through its deep structure and can learn features suitable for face recognition. ArcFace loss [2] was used for efficient training. ArcFace loss optimizes the angular distance between classes to improve face recognition performance and provides a high recognition rate and good generalization, allowing the model to learn more precise and discriminative facial features. The mathematical definition of ArcFace loss is given in Equation (1). ArcFace loss is based on cross-entropy loss, with an angular margin term added to optimize the angular distance between classes. It induces learning that focuses only on angles, without embedding bias, thereby increasing intra-class compactness and inter-class separation in classification tasks that contain only faces.
L = -\log \dfrac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)} + \sum_{j=1,\, j \neq y_i}^{N} e^{s\cos\theta_j}}    (1)
  • L: Loss value; the penalty that the model seeks to minimize during training.
  • y_i: Ground-truth class label for the i-th sample, indicating the correct class.
  • s: Scale parameter; adjusts the scale of the feature vectors.
  • θ_{y_i}: Angle between the feature vector and the correct class vector.
  • m: Margin parameter, added to the angle to enhance class separation.
  • N: Total number of classes.
  • θ_j: Angle between the feature vector and the other class vectors.
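A minimal PyTorch sketch of this loss is shown below; it follows Equation (1) directly, while the scale s = 64 and margin m = 0.5 are common ArcFace defaults rather than values reported in this study.

```python
# Minimal ArcFace head sketch (PyTorch) implementing Equation (1).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceHead(nn.Module):
    """Computes the ArcFace loss of Equation (1) from embeddings and labels."""
    def __init__(self, embedding_dim: int, num_classes: int, s: float = 64.0, m: float = 0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_classes, embedding_dim))
        nn.init.xavier_uniform_(self.weight)
        self.s, self.m = s, m

    def forward(self, embeddings: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # cos(theta_j): cosine between L2-normalized embeddings and class weight vectors
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1.0 + 1e-7, 1.0 - 1e-7))
        # add the angular margin m only to the ground-truth class angle theta_{y_i}
        one_hot = F.one_hot(labels, num_classes=cosine.size(1)).bool()
        logits = torch.where(one_hot, torch.cos(theta + self.m), cosine)
        # scaled softmax cross-entropy realizes the negative log term in Equation (1)
        return F.cross_entropy(self.s * logits, labels)
```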
The overall flow of the learning process is shown in Figure 5, which illustrates the training pipeline, how each step is connected, and how the data are transformed.
It is very important to prevent overfitting during model training. If overfitting occurs, the model becomes too biased toward the training data, resulting in poor generalization to new data. In the study of [44], dropout was used between LSTM layers to prevent the model from over-adapting to specific data patterns. In this study, we added a dropout layer to the IResNet-100 model to effectively prevent overfitting. In addition, the model structure includes multiple batch normalization layers. Batch normalization normalizes the output values of each layer to reduce instability during training and increase the training speed, making the model more stable, helping it converge faster, and reducing the risk of overfitting. Data augmentation techniques were also actively utilized during training. Augmentations such as horizontal flip, random resized crop, and color jitter artificially increase the diversity of the training data, helping the model perform well in various situations without relying on specific patterns and enabling it to generalize better to new data. In addition, we controlled the complexity of the model by setting the weight decay hyperparameter to 5 × 10⁻⁴. Weight decay keeps the model parameters from becoming too large, which prevents the model from becoming overly complex; as a result, the model is not specialized to the training data but generalizes well to new data. By combining dropout, batch normalization, data augmentation, and weight decay, we prevented overfitting during training and secured more generalized performance. This design ensures that the model is not limited to the training data but performs stably on a wider range of data in real environments.
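The augmentation pipeline described above can be written with torchvision as follows; the crop scale, jitter strengths, image size, and normalization statistics are assumptions rather than reported settings.

```python
# Sketch of the training augmentations mentioned above (values are assumptions).
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(112, scale=(0.8, 1.0)),   # random resized crop
    transforms.RandomHorizontalFlip(p=0.5),                # horizontal flip
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),  # color jitter
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])
```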

4. Results

4.1. Generation of Synthetic Dataset

The pre-trained DCFace model was used to generate the synthetic dataset; the DCFace model itself was trained on the FFHQ dataset. DCFace takes two face images as input: an ID image and a style image. The ID image provides the main features of the face, such as the eyes, nose, and mouth, so the identity of the generated synthetic image follows the ID image. In this study, the refined FairFace dataset was used as the ID image dataset. The style image provides the conditions onto which the characteristic parts of the ID image are transferred and does not determine the identity of the generated image. Therefore, images generated for the same subject are derived from the same ID image but from different style images. In this study, the BUPT-BalancedFace dataset was used as the style image dataset.
This study aimed to verify the effect that refining the ID images and balancing the races has on the quality of the generated images. Figure 3 compares synthetic images generated when the FairFace dataset used for the ID images was refined and when it was not. The comparison confirms that using refined, racially balanced ID images results in better quality than using unrefined ones.

4.2. Verification of Generated Image Data through Fine Tuning

In this paper, we performed fine-tuning of the pre-trained face recognition model using the generated data to evaluate the usability of the generated face dataset. In the pre-training stage, the IResNet-100 model was trained on CASIA-WebFace. The stochastic gradient descent (SGD) optimizer was used for optimization, with the initial learning rate set to 1 × 10⁻³. As training progressed, the learning rate was reduced by a factor of 0.1 at epochs 5, 10, and 20. The weight decay was set to 5 × 10⁻⁴ to prevent overfitting. Pre-training was performed for a total of 30 epochs.
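A sketch of this pre-training setup is given below, using the reported learning rate, weight decay, step schedule, and epoch count; the backbone and arcface_head objects, the data loader, and the momentum value are assumptions.

```python
# Pre-training loop sketch: SGD (lr 1e-3, weight decay 5e-4), LR x0.1 at epochs 5/10/20, 30 epochs.
import torch

def pretrain(backbone, arcface_head, train_loader, device="cuda"):
    params = list(backbone.parameters()) + list(arcface_head.parameters())
    optimizer = torch.optim.SGD(params, lr=1e-3, momentum=0.9, weight_decay=5e-4)  # momentum assumed
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[5, 10, 20], gamma=0.1)
    for epoch in range(30):
        for images, labels in train_loader:          # one full pass over CASIA-WebFace
            images, labels = images.to(device), labels.to(device)
            loss = arcface_head(backbone(images), labels)   # ArcFace head returns the loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
```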
In the fine-tuning stage, the weights of all layers except the head and the lowest convolution layer of the pre-trained model were frozen to preserve the low-level features and essential representations learned during pre-training. This approach is crucial for retaining the generalization capability acquired in the pre-training phase while adapting to the new data. The generated synthetic facial images were used for fine-tuning. The learning rates for the ArcFace parameters and the IResNet parameters were set differently: 1 × 10⁻³ for the ArcFace parameters and 1 × 10⁻⁵ for the IResNet parameters. This differentiation allows precise tuning of the classification task while ensuring that the backbone undergoes only subtle adjustments, avoiding performance degradation. To prevent overfitting, especially given the smaller size of the fine-tuning dataset, we monitored the model's performance on a validation set throughout the process. The other settings, such as the optimizer, weight decay, and batch size, were consistent with those used in pre-training. Fine-tuning was carried out for a total of 5 epochs, which was sufficient to achieve convergence without overfitting, leveraging the stability provided by the pre-trained weights. The hyperparameters used in pre-training and fine-tuning are summarized in Table 3, and a comparison of the pre-training and fine-tuning results is given in Table 4. The fine-tuned model shows improved performance in terms of both overall accuracy and the standard deviation of accuracy across races.
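The freezing and per-group learning rates can be set up as in the sketch below; the attribute name conv1 for the lowest convolution layer is an assumption about the IResNet implementation, and the momentum value is likewise assumed.

```python
# Fine-tuning sketch: freeze all backbone layers except the lowest convolution,
# and train the ArcFace head at 1e-3 and the unfrozen backbone layer at 1e-5.
import torch

def build_finetune_optimizer(backbone, arcface_head):
    for p in backbone.parameters():
        p.requires_grad = False
    for p in backbone.conv1.parameters():      # lowest convolution layer (name assumed)
        p.requires_grad = True
    return torch.optim.SGD(
        [
            {"params": arcface_head.parameters(), "lr": 1e-3},
            {"params": backbone.conv1.parameters(), "lr": 1e-5},
        ],
        momentum=0.9,          # assumed, matching the pre-training sketch
        weight_decay=5e-4,
    )
```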
Beyond comparing accuracy and standard deviation before and after tuning, this paper presents per-race confusion matrices to verify whether fine-tuning on the generated dataset improved fairness between races. Figure 6 and Figure 7 show confusion matrices, before and after tuning, for the four races in the BUPT-BalancedFace test data: Caucasian, African, Asian, and Indian. A pair of face images was labeled positive if both images show the same subject and negative otherwise. The results confirm that performance for the Asian group, which was classified less accurately than the other races before fine-tuning, improved after fine-tuning with the dataset generated by the proposed method.
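The confusion matrices can be built from verification pairs as sketched below: a pair is positive when both images show the same subject, and the prediction is "same" when the cosine similarity between embeddings exceeds a threshold. The embed function and the threshold value are assumptions.

```python
# Sketch: verification confusion matrix from image pairs and an embedding function.
import numpy as np

def confusion_matrix(pairs, labels, embed, threshold=0.35):
    """pairs: list of (img_a, img_b); labels: 1 = same subject, 0 = different subject."""
    tp = fp = tn = fn = 0
    for (img_a, img_b), same in zip(pairs, labels):
        ea, eb = embed(img_a), embed(img_b)
        sim = float(np.dot(ea, eb) / (np.linalg.norm(ea) * np.linalg.norm(eb)))
        pred_same = sim > threshold
        if same and pred_same:
            tp += 1
        elif same and not pred_same:
            fn += 1
        elif not same and pred_same:
            fp += 1
        else:
            tn += 1
    return np.array([[tp, fn], [fp, tn]])   # rows: actual same/different; cols: predicted
```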

5. Discussion

In this study, we investigated the effect of refining the ID images on the quality of the synthetic dataset. When the FairFace dataset was refined before being used as ID images, the quality of the resulting synthetic images was higher than when it was used without refinement. The synthetic dataset built from refined ID images had good racial balance and reflected the main characteristics of each race more clearly. These results show that the refined ID images contributed to the overall quality improvement of the synthetic dataset, which ultimately had a positive effect on the performance of the face recognition model.
In addition, we fine-tuned the pre-trained face recognition model on the synthetic data and compared the performance changes. The model pre-trained on the CASIA-WebFace dataset, which contains diverse face images of 10,575 people, had good generalization performance. When the BUPT-BalancedFace dataset was used as a test set to evaluate racial balance, the performance of the model noticeably improved after fine-tuning with the synthetic dataset. In particular, the accuracy increased slightly through optimization with the pre-trained IResNet-100 model and ArcFace loss, and racial balance also improved: accuracy increased from 96.125% to 96.1625%, and the standard deviation decreased from 0.0107 to 0.0098. The reduction in standard deviation is particularly noteworthy, as it indicates not only a decrease in the performance disparity across racial groups but also an improvement in the overall fairness of the model. A lower standard deviation suggests that the model performs more consistently across demographic groups, reducing the likelihood of bias toward any particular race. This improvement in fairness is crucial for the ethical application of facial recognition technology, as it enhances the model's reliability and trustworthiness across diverse populations. The decreased standard deviation therefore reflects a significant step toward a more equitable and balanced face recognition system, which is essential for deployment in real-world scenarios where fairness is paramount.
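These figures follow directly from the per-race accuracies in Table 4; the snippet below reproduces the mean and (population) standard deviation, with small differences due to rounding of the per-race values.

```python
# Reproducing the fairness metric from Table 4 (mean and std across per-race accuracies).
import numpy as np

before = np.array([0.961, 0.944, 0.968, 0.972])  # African, Asian, Caucasian, Indian (before tuning)
after = np.array([0.959, 0.947, 0.967, 0.973])   # after tuning

print(before.mean(), before.std())   # ~0.96125, ~0.0107
print(after.mean(), after.std())     # ~0.9615 (reported 0.961625 from unrounded values), ~0.0098
```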
Moreover, the results of the confusion matrix analysis by race provide further insights into the positive impact of the fine-tuning process. Specifically, in the Asian 100 group, the reduction in false negatives (from 2.4% to 2.1%) and the increase in true negatives (from 47.6% to 47.9%) indicate a notable improvement in the model’s recognition accuracy for this group. This suggests that the fine-tuning process has led to better representation and performance for the Asian demographic, contributing to an overall enhancement in fairness. In the Caucasian 100 group, the increase in true positives (from 47.2% to 47.5%) and the decrease in false positives (from 2.8% to 2.5%) highlight a refined precision in the model’s predictions for this group. These improvements indicate that the fine-tuning process not only enhanced the accuracy but also reduced the error rates, thereby increasing the model’s reliability for this demographic. Additionally, it is important to note that the Indian 100 and African 100 groups maintained similar performance levels before and after fine-tuning, which suggests that the model was able to preserve its accuracy across these demographics without introducing new biases. This consistency across different racial groups reinforces the effectiveness of the synthetic dataset in promoting balanced performance across diverse populations.
In conclusion, this study suggests that the synthetic dataset can be an important resource for improving the performance of the face recognition model. We were able to confirm that the racial balance and diversity of the synthetic dataset contributed to the overall performance improvement of the model, and the refinement of the ID images played an important role in simultaneously improving the accuracy and balance of the facial recognition model by further improving the quality of the synthetic dataset.

6. Conclusions

In this study, we proposed a method for generating and utilizing synthetic datasets to improve the performance of face recognition models. In particular, we investigated the quality improvement effect of synthetic datasets generated by refining ID images. When the FairFace dataset was refined and used as ID images, the quality of the synthetic images was improved compared with the unrefined case. The synthetic dataset generated using the refined ID images was well-balanced across races and reflected the main characteristics of each race more clearly. These results show that the refined ID images contributed to the overall quality improvement of the synthetic dataset, which also had a positive effect on the performance of the face recognition model.
In addition, this study used the synthetic data to fine-tune a pre-trained face recognition model and compared the performance changes. In the pre-training stage, the CASIA-WebFace dataset was used to train the IResNet-100 model. When the BUPT-BalancedFace dataset was used as a test set to evaluate racial balance, the performance of the model improved after fine-tuning with the synthetic dataset. In particular, accuracy increased slightly and racial balance improved through optimization with the pre-trained IResNet-100 model and the ArcFace loss function. The per-race confusion matrix analysis also showed that the model fine-tuned on the synthetic dataset improved recognition performance for specific races. For example, the recognition rate for Asians, which was relatively low in the original model, improved slightly after fine-tuning with the synthetic dataset. This shows that the newly generated synthetic dataset, built from refined ID images, can learn the distribution of the real dataset and, furthermore, can mitigate the racial imbalance present in real datasets.
However, this study evaluated the face recognition performance using only a single model and loss function. In future research, a more comprehensive performance evaluation will be conducted by applying various models and loss functions. Additionally, while this study addressed racial balance in the face recognition task, it did not examine recognition performance across different age groups. Future studies will include performance evaluations using data that span various age groups for each individual.
In conclusion, this study suggests that the synthetic dataset can be an important resource to improve the performance of the face recognition model. It was confirmed that the racial balance and diversity of the synthetic dataset contributed to the overall performance improvement of the model, and it shows that the refinement of the ID image played an important role in further improving the quality of the synthetic dataset, thereby simultaneously improving the accuracy and balance of the face recognition model.

Author Contributions

Conceptualization, Y.P., J.B., S.K., S.-M.J. and H.S.; methodology, Y.P., J.B., S.K., S.-M.J. and H.S.; software, Y.P.; validation, J.B.; formal analysis, Y.P.; investigation, Y.P.; resources, J.B.; data curation, J.B.; writing—original draft preparation, Y.P., J.B., S.K., S.-M.J. and H.S.; writing—review and editing, Y.P., J.B., S.K., S.-M.J. and H.S.; visualization, Y.P. and J.B.; supervision, E.C.L.; project administration, E.C.L.; funding acquisition, E.C.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the NRF (National Research Foundation) of Korea, funded by the Korean government (Ministry of Science and ICT) (RS-2024-00340935).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

The FairFace dataset employed in this study comprises publicly available images sourced from the YFCC100M Flickr dataset. These images are shared under the Creative Commons Attribution 4.0 International License (CC BY 4.0), which permits their use with proper attribution. The dataset was curated with the intent to provide a balanced representation of individuals across different races, genders, and ages, thereby supporting research in bias measurement and mitigation. While explicit informed consent from individuals depicted in the images was not obtained, the public availability and licensing of the data ensure compliance with ethical standards for research.

Data Availability Statement

The BUPT-Balancedface dataset used in this study was provided by the Beijing University of Posts and Telecommunications. The dataset is available for research purposes under specific licensing conditions. For more details, see http://www.whdeng.cn/RFW, accessed on 30 July 2024. The FairFace dataset used in this study is licensed under the MIT License. The dataset is freely available for research and commercial use. For more information, see https://github.com/joojs/fairface, accessed on 30 July 2024. Since the data used in this study comprise a public open dataset, they can be used by contacting the data holder.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wang, M.; Deng, W. Mitigating bias in face recognition using skewness-aware reinforcement learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 9322–9331. [Google Scholar]
  2. Deng, J.; Guo, J.; Yang, J.; Xue, N.; Kotsia, I.; Zafeiriou, S. ArcFace: Additive Angular Margin Loss for Deep Face Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 5962–5979. [Google Scholar] [CrossRef] [PubMed]
  3. Kortli, Y.; Jridi, M.; Al Falou, A.; Atri, M. Face recognition systems: A survey. Sensors 2020, 20, 342. [Google Scholar] [CrossRef] [PubMed]
  4. Modi, P.; Patel, S. A state-of-the-art survey on face recognition methods. Int. J. Comput. Vis. Image Process. (IJCVIP) 2022, 12, 1–19. [Google Scholar] [CrossRef]
  5. Adjabi, I.; Ouahabi, A.; Benzaoui, A.; Taleb-Ahmed, A. Past, present, and future of face recognition: A review. Electronics 2020, 9, 1188. [Google Scholar] [CrossRef]
  6. Huang, G.B.; Ramesh, M.; Berg, T.; Learned-Miller, E. Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments; Technical Report 07-49; University of Massachusetts: Amherst, MA, USA, 2007. [Google Scholar]
  7. Banerjee, S.; Scheirer, W.; Bowyer, K.; Flynn, P. On hallucinating context and background pixels from a face mask using multi-scale gans. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 300–309. [Google Scholar]
  8. Cao, Q.; Shen, L.; Xie, W.; Parkhi, O.M.; Zisserman, A. VGGFace2: A Dataset for Recognising Faces across Pose and Age. In Proceedings of the International Conference on Automatic Face and Gesture Recognition, Xi’an, China, 15–19 May 2018. [Google Scholar]
  9. Guo, Y.; Zhang, L.; Hu, Y.; He, X.; Gao, J. MS-Celeb-1M: A Dataset and Benchmark for Large-Scale Face Recognition. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 87–102. [Google Scholar]
  10. Kemelmacher-Shlizerman, I.; Seitz, S.M.; Miller, D.; Brossard, E. The MegaFace benchmark: 1 million faces for recognition at scale. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 4873–4882. [Google Scholar]
  11. Klare, B.F.; Pawar, S.; Relan, D.; Hoffman, N.; Taborsky, E.; Ricanek, K.; Li, J.; Jain, A.K. Pushing the frontiers of unconstrained face detection and recognition: IARPA Janus Benchmark A. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1931–1939. [Google Scholar]
  12. Liu, Z.; Luo, P.; Wang, X.; Tang, X. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015. [Google Scholar]
  13. Rothe, R.; Timofte, R.; Gool, L.V. Deep expectation of real and apparent age from a single image without facial landmarks. Int. J. Comput. Vis. (IJCV) 2018, 126, 144–157. [Google Scholar] [CrossRef]
  14. Yang, S.; Luo, P.; Loy, C.C.; Tang, X. WIDER FACE: A Face Detection Benchmark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  15. Karras, T.; Laine, S.; Aila, T. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4401–4410. [Google Scholar]
  16. Kim, I.; Han, S.; Park, S.J.; Baek, J.W.; Shin, J.; Han, J.J.; Choi, C. Discface: Minimum discrepancy learning for deep face recognition. In Proceedings of the Asian Conference on Computer Vision, Kyoto, Japan, 30 November–4 December 2020. [Google Scholar]
  17. Zheng, T.; Deng, W. Cross-Pose LFW: A Database for Studying Cross-Pose Face Recognition in Unconstrained Environments; Technical Report 18-01; Beijing University of Posts and Telecommunications: Beijing, China, 2018. [Google Scholar]
  18. Whitelam, C.; Taborsky, E.; Blanton, A.; Maze, B.; Adams, J.; Miller, T.; Kalka, N.; Jain, A.K.; Duncan, J.A.; Allen, K.; et al. IARPA Janus Benchmark-B Face Dataset. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, USA, 21–26 July 2017; pp. 592–600. [Google Scholar] [CrossRef]
  19. Maze, B.; Adams, J.; Duncan, J.A.; Kalka, N.; Miller, T.; Otto, C.; Jain, A.K.; Niggel, W.T.; Anderson, J.; Cheney, J.; et al. Iarpa janus benchmark-c: Face dataset and protocol. In Proceedings of the 2018 International Conference on Biometrics (ICB), Gold Coast, Australia, 20–23 February 2018; IEEE: New York, NY, USA, 2018; pp. 158–165. [Google Scholar]
  20. Cheng, Z.; Zhu, X.; Gong, S. Surveillance Face Recognition Challenge. arXiv 2018, arXiv:1804.09691. [Google Scholar]
  21. Chrysos, G.G.; Moschoglou, S.; Bouritsas, G.; Deng, J.; Panagakis, Y.; Zafeiriou, S. Deep polynomial neural networks. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 4021–4034. [Google Scholar] [CrossRef] [PubMed]
  22. Krizhevsky, A. Learning Multiple Layers of Features from Tiny Images; Technical Report; University of Toronto: Toronto, ON, Canada, 2009. [Google Scholar]
  23. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. (IJCV) 2015, 115, 211–252. [Google Scholar] [CrossRef]
  24. Kim, M.; Jain, A.K.; Liu, X. Adaface: Quality adaptive margin for face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 18750–18759. [Google Scholar]
  25. Alansari, M.; Hay, O.A.; Javed, S.; Shoufan, A.; Zweiri, Y.; Werghi, N. Ghostfacenets: Lightweight face recognition model from cheap operations. IEEE Access 2023, 11, 35429–35446. [Google Scholar] [CrossRef]
  26. Yucer, S.; Tektas, F.; Moubayed, N.A.; Breckon, T.P. Racial bias within face recognition: A survey. arXiv 2023, arXiv:2305.00817. [Google Scholar]
  27. Yucer, S.; Tektas, F.; Al Moubayed, N.; Breckon, T.P. Measuring hidden bias within face recognition via racial phenotypes. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 995–1004. [Google Scholar]
  28. Wu, H.; Albiero, V.; Krishnapriya, K.S.; King, M.C.; Bowyer, K.W. Face Recognition Accuracy Across Demographics: Shining a Light Into the Problem. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Vancouver, BC, Canada, 17–24 June 2023; pp. 1041–1050. [Google Scholar] [CrossRef]
  29. Karkkainen, K.; Joo, J. Fairface: Face attribute dataset for balanced race, gender, and age for bias measurement and mitigation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 1548–1558. [Google Scholar]
  30. Wang, M.; Deng, W.; Hu, J.; Tao, X.; Huang, Y. Racial faces in the wild: Reducing racial bias by information maximization adaptation network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 692–702. [Google Scholar]
  31. Bae, G.; de La Gorce, M.; Baltrušaitis, T.; Hewitt, C.; Chen, D.; Valentin, J.; Cipolla, R.; Shen, J. Digiface-1m: 1 million digital face images for face recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–7 January 2023; pp. 3526–3535. [Google Scholar]
  32. Qiu, H.; Yu, B.; Gong, D.; Li, Z.; Liu, W.; Tao, D. Synface: Face recognition with synthetic data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10880–10890. [Google Scholar]
  33. Deng, Y.; Yang, J.; Chen, D.; Wen, F.; Tong, X. Disentangled and controllable face image generation via 3d imitative-contrastive learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 5154–5163. [Google Scholar]
  34. Xu, M.; Zhang, J.; Ni, B.; Li, T.; Wang, C.; Tian, Q.; Zhang, W. Adversarial domain adaptation with domain mixup. In Proceedings of the AAAI conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 6502–6509. [Google Scholar]
  35. Ambardi, L.; Hong, S.; Park, I.K. SegTex: A Large Scale Synthetic Face Dataset for Face Recognition. IEEE Access 2023, 11, 131939–131949. [Google Scholar] [CrossRef]
  36. Lee, C.H.; Liu, Z.; Wu, L.; Luo, P. Maskgan: Towards diverse and interactive facial image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 5549–5558. [Google Scholar]
  37. Huang, X.; Belongie, S. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1501–1510. [Google Scholar]
  38. Park, T.; Liu, M.Y.; Wang, T.C.; Zhu, J.Y. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2337–2346. [Google Scholar]
  39. Kim, M.; Liu, F.; Jain, A.; Liu, X. Dcface: Synthetic face generation with dual condition diffusion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 12715–12725. [Google Scholar]
  40. Moschoglou, S.; Papaioannou, A.; Sagonas, C.; Deng, J.; Kotsia, I.; Zafeiriou, S. Agedb: The first manually collected, in-the-wild age database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshop, Honolulu, HI, USA, 21–26 July 2017; Volume 2, p. 5. [Google Scholar]
  41. Sengupta, S.; Chen, J.C.; Castillo, C.; Patel, V.M.; Chellappa, R.; Jacobs, D.W. Frontal to Profile Face Verification in the Wild. In Proceedings of the IEEE Conference on Applications of Computer Vision, Rome, Italy, 27–29 February 2016. [Google Scholar]
  42. Kansy, M.; Raël, A.; Mignone, G.; Naruniec, J.; Schroers, C.; Gross, M.; Weber, R.M. Controllable Inversion of Black-Box Face Recognition Models via Diffusion. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Paris, France, 2–6 October 2023; pp. 3159–3169. [Google Scholar]
  43. Duta, I.C.; Liu, L.; Zhu, F.; Shao, L. Improved residual networks for image and video recognition. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; IEEE: New York, NY, USA, 2021; pp. 9415–9422. [Google Scholar]
  44. Haq, M.A. DBoTPM: A deep neural network-based botnet prediction model. Electronics 2023, 12, 1159. [Google Scholar] [CrossRef]
Figure 1. FairFace dataset percentage distribution by race.
Figure 2. Examples of ID images that can be problematic when creating synthetic data.
Figure 3. Example of synthetic image data generated before and after data filtering: (a) before filtering; (b) after filtering.
Figure 4. Example of final generated Caucasian race synthetic image data.
Figure 5. Face recognition model learning process: (a) model training process; (b) model inference process. IResNet-100 was used as a feature extractor with a simple classification head attached; ArcFace loss was used during training.
Figure 6. Confusion matrix by race: original results before tuning.
Figure 7. Confusion matrix by race: results after tuning.
Table 1. Number of images for each race in FairFace dataset.

Race               Images
Black              26,022
East Asian         26,124
Indian             26,154
Latino Hispanic    28,357
Middle Eastern     19,641
Southeast Asian    23,005
White              35,139
Table 2. Number of subjects for each race in the final configured FairFace dataset.

Race (Combined)    Race                Subjects
Asian              Southeast Asian     1450
                   East Asian          1450
Caucasian          White               850
                   Latino Hispanic     825
                   Middle Eastern      825
Indian             Indian              2500
African            Black               2500
Table 3. Hyperparameters for pre-training and fine-tuning.

Hyperparameter              Pre-Training                      Fine-Tuning
Model                       IResNet-100                       IResNet-100
Dataset                     CASIA-WebFace                     Generated synthetic face images
Optimizer                   SGD                               SGD
Initial Learning Rate       1 × 10⁻³                          1 × 10⁻³ / 1 × 10⁻⁵
Learning Rate Scheduling    0.1 decay at epochs 5, 10, 20     None
Weight Decay                5 × 10⁻⁴                          5 × 10⁻⁴
Number of Epochs            30                                5
Layer Unfreezing            All layers                        Head and lowest conv. layer only
Table 4. Comparison of results before and after fine-tuning using generated images.

Model (IResNet-100)   Params   GFLOPs   Acc        std      Acc Details (Africa / Asia / Caucasian / India)
Original              65.1 M   12.13    0.96125    0.0107   0.961 / 0.944 / 0.968 / 0.972
Tuned                                   0.961625   0.0098   0.959 / 0.947 / 0.967 / 0.973
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

