Review

Surveying Racial Bias in Facial Recognition: Balancing Datasets and Algorithmic Enhancements

Electrical and Computer Engineering Department, Brigham Young University, Provo, UT 84602, USA
* Author to whom correspondence should be addressed.
Electronics 2024, 13(12), 2317; https://doi.org/10.3390/electronics13122317
Submission received: 10 May 2024 / Revised: 5 June 2024 / Accepted: 10 June 2024 / Published: 13 June 2024
(This article belongs to the Special Issue Applications of Computer Vision, 2nd Edition)

Abstract

Facial recognition systems frequently exhibit high accuracies when evaluated on standard test datasets. However, their performance tends to degrade significantly when confronted with more challenging tests, particularly involving specific racial categories. To measure this inconsistency, many have created racially aware datasets to evaluate facial recognition algorithms. This paper analyzes facial recognition datasets, categorizing them as racially balanced or unbalanced, where a racially balanced dataset requires each race to be represented within five percentage points of every other represented race. We investigate methods that address concerns about racial bias due to uneven datasets by using generative adversarial networks and latent diffusion models to balance the data, and we assess the impact of these techniques. In an effort to mitigate accuracy discrepancies across different racial groups, we also investigate a range of network enhancements that improve facial recognition performance across races. These enhancements encompass architectural improvements, loss functions, training methods, data modifications, and the incorporation of additional data. Additionally, we discuss the interrelation of racial and gender bias. Lastly, we outline avenues for future research in this domain.

1. Introduction

Although current facial recognition systems achieve high average accuracy, numerous recent improvements are directed towards addressing the disproportionate accuracies across different categories, including race [1,2,3,4]. Many attribute the majority of these issues to unbalanced datasets [5,6,7,8]. Suresh and Guttag [9] from MIT explore the factors contributing to data bias. They categorize biases into five subcategories: historical, representational, measurement, evaluation, and aggregation. These biases frequently arise unintentionally, often remaining unnoticed by researchers and consequently impacting results [9].
In an attempt to quantify and mitigate varying accuracy across race in facial recognition systems, many researchers have encouraged the use of racially balanced datasets over the use of those that are racially unbalanced [10,11,12,13]. When a racially balanced dataset is not available for training, a racially balanced evaluation is encouraged [14,15,16,17]. The use of racially balanced facial recognition systems is critical to applications that require high security, regardless of race. As such, we examine publicly available datasets and provide information on whether they are racially balanced or unbalanced.
Recently, large improvements have been made in the field of image generation [18]. The effects of these generated images on facial recognition tasks are now emerging. We provide a discussion on how image generation has impacted deep learning systems corresponding to the face in cross-race transformation. Many use this cross-race transformation to turn originally racially unbalanced datasets into racially balanced ones, either by generating new datasets or by balancing existing ones through augmentation. This is accomplished by transforming the race of individuals in the images through the use of generative adversarial networks (GANs) or latent diffusion models.
Balancing datasets involves various tradeoffs, pointing to the need for improvements beyond balancing alone. In response to this need, various network improvements were developed that reduce variability in performance across different races. These improvements vary from loss functions and architectural changes to data modification and the use of additional data. We compile and analyze these contributions to demonstrate the field’s current state of overcoming racial bias in facial recognition.
In summary, our contributions are as follows:
1. An analysis of both racially balanced and unbalanced datasets reveals significant imbalance across race in numerous widely used facial recognition datasets. Additionally, a list of the datasets that are racially balanced is included.
2. An examination of dataset balancing techniques through data generation, accompanied by an exploration of the implications of these methods.
3. Discussion on various network enhancements that effectively narrow the error gaps between different racial groups in facial recognition tasks.
Additionally, as gender bias is greatly interrelated with racial bias, we discuss gender bias in the context of these three contributions.
The rest of this paper is structured as follows: In Section 2, we present our methods. In Section 3, we discuss the differences between various facial recognition datasets comparing those that are racially unbalanced with those that are racially balanced. In Section 4, we discuss balancing datasets through image generation and the impact of data generation. In Section 5, we discuss various network improvements to decrease the skew of accuracies of facial recognition across race. We recommend future research directions in Section 6 and conclude in Section 7.

2. Methods

This section describes the approach of our survey: the dataset comparison in Section 2.1, balancing datasets through data generation in Section 2.2, network improvements across the human race in Section 2.3, and our decision on terminology in Section 2.4. We consulted the PRISMA review methodology [19] and incorporated the majority of its checklist items to improve the organization and clearly demonstrate the contributions of this work.

2.1. Datasets Methods

In this work, we selected 22 datasets for comparison in Section 3 by searching Google Scholar for facial recognition datasets and choosing the most popular ones as defined by Serna et al. [20]. We analyzed the 22 datasets by reviewing the title, the abstract, and the dataset collection description. Each paper was initially reviewed by a single researcher. If inconsistencies or questions arose, a second researcher conducted an additional review. To maintain consistency and minimize bias, the primary researcher performed all initial reviews. Although this does not eliminate all potential biases, it ensures consistent biases throughout the analysis. Any papers that contained grayscale image datasets, such as that from the work of Samaria and Harter [21], were removed from the analysis.
After the datasets were selected, we compiled the racial and gender distributions for each dataset using the original publication of the dataset. When these distributions were not provided by the original publication, we referenced Serna et al. [20] to report the available distributions. When neither source provided the required data, we extracted the racial and gender labels directly from the dataset to obtain the distributions. These three methods allowed us to identify each dataset’s racial and gender distributions without the need for hand labeling. We acknowledge that the varying methods may result in inconsistencies in race classification.
We then determined which racially balanced datasets contain overlap with racially unbalanced datasets, as discussed in Section 3.3. We define a dataset as racially balanced when the proportionate difference between the most and least represented races in the dataset does not exceed 5 percentage points, as described by the following equation:

$$\frac{n_{\max} - n_{\min}}{N} \leq 0.05$$

where $n_{\max}$ is the number of samples of the most represented race, $n_{\min}$ is the number of samples of the least represented race, and $N$ is the total number of samples in the dataset. This definition is based on the principle of an alpha value of 0.05 in statistics. Our decision on the 0.05 alpha value is discussed in more depth in Section 3. The papers for the racially balanced datasets were further analyzed to determine the details of their creation, benefits, tradeoffs, and overlap with other popular datasets.
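This criterion is straightforward to check in practice. The following minimal sketch, written by us for illustration (the function name and input format are our own assumptions, not part of any referenced work), applies the inequality above to a dictionary of per-race sample counts:

```python
# Sketch of the racial-balance criterion defined above: a dataset is
# balanced when (n_max - n_min) / N <= 0.05. Illustrative only.
def is_racially_balanced(counts, alpha=0.05):
    """counts: mapping from race label to number of samples."""
    total = sum(counts.values())
    return (max(counts.values()) - min(counts.values())) / total <= alpha

# Example: four racial subsets of 10K images each (RFW-style) are balanced.
print(is_racially_balanced(
    {"African": 10000, "Asian": 10000, "Caucasian": 10000, "Indian": 10000}))
```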

2.2. Dataset Balancing through Generation Methods

One of the first papers to propose generating a balanced dataset through generation was the work by Yucer et al. [22] as discussed more fully in Section 4.2. For this work, we examined all papers on Google Scholar that cited the work of Yucer et al. [22] and additionally searched for works that balance datasets through generation for facial recognition. The titles and abstracts were used to identify papers of interest. Papers were removed that did not focus on generating faces for creating a racially balanced dataset for facial recognition. Each paper’s contributions were retrieved and reported in this work.
As the impact of generating images is a growing research field with the growth of additional image generation platforms and networks, we chose to include a discussion on the impacts of data generation. We acknowledge that entire survey papers are devoted to discussing the impact of data generation. This paper provides a small sample from the literature to highlight the use and benefits of data generation to reduce racial bias. Omitting this small sample of data generation impacts would do the reader a disservice by failing to demonstrate the tradeoffs of the given approach.

2.3. Network Improvements across Human Race Methods

Upon reviewing various network improvements using Google Scholar, we chose to focus our review of network improvements across the human race on loss/training, architecture, dataset modification, and the use of additional data. To compare works that made network improvements, we focus on works that were trained on the BUPT-Balancedface dataset while using the ArcFace loss and evaluated on the RFW dataset. The only exceptions are when the improvement is an adjustment to the loss function. This limitation allows for accurate comparison on the RFW dataset.
As we reviewed the articles surveyed in this paper, we discovered that many approaches benefit from race classification. As such, we discuss the field of racial classification with articles found through Google Scholar. These articles were analyzed and split between the two-class and multiple-class race classification categories.

2.4. Terminology

Across the facial recognition literature, the terms “race” and “ethnicity” are used interchangeably. For example, the work by Wang et al. [15] uses the term “race” while the work by Li et al. [16] uses the term “ethnicity” to define the same concept—varying physical appearance that comes from ancestral heritage. In our work, we use the similar, but distinct definitions of race and ethnicity provided by the Merriam-Webster Dictionary. Race is defined as “any one of the groups that humans are often divided into based on physical traits regarded as common among people of shared ancestry” [23], and ethnicity is defined as “ethnic quality or affiliation” [24] with ethnic being defined as “of or relating to large groups of people classed according to common racial, national, tribal, religious, linguistic, or cultural origin or background” [25]. Thus, we refer to “race” as the physical characteristics of the face and “ethnicity” as the cultural aspects of the individual. This clarification is provided to shed light on the use of these terms throughout the paper, acknowledging that others may use different definitions for these terms.
Notably, this survey paper’s referenced works refer to 2–7 racial classifications. We acknowledge that there are many more racial classifications than these 2–7 categories and that many individuals identify as multicultural. As evidenced by the publication of a statistical policy directive by the Office of Information and Regulatory Affairs and the Office of Management and Budget defining what classifications and definitions of race ought to be used throughout the United States, racial classification best practices are still evolving [26]. This lengthy report discusses the various intricacies of racial identification and includes multiple recent changes that demonstrate the complexity of classifying racial groups.
While we acknowledge the complexities of racial classification, in this work, we maintain the 2–7 categories used in previous racial facial recognition scholarship for consistency’s sake. In line with one of the first racially balanced datasets, RFW, which created the standard across the field of racially aware facial recognition tasks [15], we primarily use the following ordering and groupings for racial classification: Caucasian, African, Indian, and Asian. In addition, throughout this paper, various approaches refer to “White” and “Non-White” classifications. We acknowledge that this definition may vary depending on where the authors of the various approaches reside. However, the use in this paper is primarily to emphasize the differing accuracies between different races.
Overall, our method is centered around using Google Scholar’s search to find relevant papers. Titles and abstracts were manually checked to identify articles of interest. Further analysis of the articles was then performed to remove any articles that did not meet the focus of this work. Finally, for a paper to be included, a complete analysis of the paper was performed to identify contributions, results, and any pertinent background information. We used the PRISMA checklist [19] as a guide in the process of the creation of this paper with many of the checklist items included throughout the content of this paper.

3. Facial Recognition Dataset Comparison

Early facial recognition datasets often implemented constraints on what variation was acceptable in lighting, head pose, background, and facial expression. One early dataset, the AT&T Database of Faces (originally known as the ORL Database of Faces), incorporated several of these constraints [21]. Although this was a novel task in 1994, after only two years researchers shifted towards performing facial recognition in an unconstrained fashion, seeking a model to increase generalization beyond the specified constraints [27]. In 2007, the Labeled Faces in the Wild (LFW) dataset was released with images containing a variety of head poses, lightings, camera parameters, and resolutions, moving toward generalization across more scenarios [28]. However, in the last 5 years, the racial imbalance of these web-scraped datasets and the overfitting to the distribution of these datasets were demonstrated [15]. To overcome these datasets’ inaccurate distributions, recent datasets synthetically generate images to enforce the prior of proper data distributions [29].
A recent dataset, Digi-Face 1M, released in 2023, is generated with synthetic images with a focus placed on previously overlooked biases such as race, lighting, and even make-up [29]. Although synthetic image generation has advanced significantly, authentic images remain essential in certain aspects of deep learning training pipelines [30]. Although approaches have identified the need for proper racial distributions across training datasets [15,29], there is no standard criteria to measure racial equality in deep learning systems [31].
To address the absence of criteria for measuring racially balanced datasets, we define a racially balanced dataset as a dataset with a 5 (or less) percentage point difference in the proportion of images from the highest to the lowest represented racial categories. This 5 percentage point threshold follows the statistical convention that a p-value of less than 0.05 is considered statistically significant for an alpha of 0.05. The 0.05 alpha value was originally proposed by Fisher [32] and was adopted as the default alpha value across many statistical approaches [33]. Following this trend, we set our alpha value to 0.05, resulting in the 5 percentage point maximum difference between race distributions for a dataset to be considered racially balanced. A graph demonstrating the racial distribution across prevalent datasets is given in Figure 1.

3.1. Unbalanced Datasets

The pursuit to improve facial recognition led to the creation of many datasets. To create the largest datasets possible, web scraping Yahoo, Flickr, and other internet platforms is a popular method for compiling datasets. Some of the prevalent datasets include LFW [28], BioSecure [34], PubFig [35], YouTube Faces [36], CASIA [37], CelebA [38], VGGFace [39], MegaFace [40], MSCeleb1M [41], UTKFace [42], VGGFace2 [43], IJB-C [44], and FRVT2018 [45]. Each of these datasets was collected using different methods. However, each was collected with a focus on obtaining the most images in the most cost-effective and efficient manner available. Convenience sampling is defined as a method of focusing on ease of access when selecting participants from a target population [46]. When the population of interest is the entire human race, these datasets can be considered a convenience sample. The racial distribution of each of these datasets is provided in Table 1. We define a racially unbalanced dataset as a dataset that contains a larger than five percentage point difference from the most represented race to the least represented. As such, we define each of these datasets as racially unbalanced.
Despite being unbalanced, some racially unbalanced datasets remain valuable for racially focused facial recognition systems. These racially aware datasets employ targeted data collection methodologies. One of these datasets is the BUPT-Globalface dataset. Rather than basing the racial distribution on maintaining equal proportions across the dataset, they base their distribution on the worldwide distribution of race [10].
In contrast to the majority of racially aware datasets that collected their images by subsampling existing facial recognition datasets, the FairFace dataset [13] collected new images for facial recognition. FairFace used a large public dataset not originally designed for facial recognition purposes, Yahoo YFCC100M [47], and detected faces in the images. They note that another dataset, Diversity in Faces (DiF) [48], was also created in this manner. However, DiF does not focus on racial distribution. Additionally, FairFace varies from other racially aware datasets in that it uses seven race classifications based on the accepted race classification from the U.S. Census Bureau: Black, East Asian, Indian, Latino, Middle-East, South East Asian, and White. While we acknowledge that the U.S. Census Bureau’s definition of race is not based on entirely visual aspects, the FairFace dataset is based on visual aspects, as the race, gender, and age groups were manually labeled using Amazon’s Mechanical Turk. The racial distribution proportions for the FairFace dataset found in Table 1 were derived from the dataset labels, as they were not specified in the original work. Despite the FairFace dataset’s clear emphasis on racial equality, it is classified as unbalanced according to our criteria due to a maximum racial disparity of 8.3 percentage points, which exceeds the limit of 5 percentage points.

3.2. Balanced Datasets

While there are many scenarios in which convenience sampling is sufficient to represent a population of interest, strata sampling is a more structured approach to ensure proportionate representation. Strata sampling is defined as taking proportionately equal samples from each category or stratum from the population of interest [49]. Many assert that strata sampling is optimal for facial recognition datasets, as it employs proportional sampling from each race [10,11,12,13,14,15,48].
One of the first datasets presented with relatively proportionally equal distributions across race was the Racial Faces in the Wild (RFW) dataset. The RFW dataset was created by subsampling the MS-Celeb-1M dataset [50] with an equal proportion of images across the Caucasian, African, Indian, and Asian races. To identify the race, the researchers used the nationality attribute in FreeBase celebrities [51] to select individuals of Asian or Indian race. Then, they used the Face++ API [52] to estimate race for Caucasians and Africans. Afterward, the dataset was manually and thoroughly cleaned. This resulted in a dataset of four racial subsets (African, Asian, Caucasian, and Indian), each with 10K images of 3K individuals [15]. We acknowledge that their race labels came from the “nationality” label, and nationality is defined as “a people having a common origin, tradition, and language and capable of forming or actually constituting a nation-state” [53]. The nationality label is not equivalent to a race label but is more in line with the ethnicity of an individual. While the nationality label is not directly correlated to race, we acknowledge their manual cleaning to ensure the racial labels’ accuracy. The racial distribution of the RFW dataset is compared with other unbalanced and balanced datasets in Table 1.
Released in the same year as the RFW dataset, the DemogPairs dataset followed a similar pattern of taking a subsample of unbalanced datasets [14]. However, instead of taking the samples all from one dataset, samples were taken from the CWF, VGGFace, and VGGFace2 datasets. The DemogPairs dataset is designed to be a validation dataset that can be used to evaluate model performance across the full spectrum of human racial diversity. If a model is trained on CWF, VGGFace, or VGGFace2, and validated on DemogPairs, the overlapping images must be removed from the training dataset to prevent polluting the test set. The DemogPairs dataset contains 0.13% of VGGFace2, 0.02% of VGGFace, and 1.27% of CWF [14].
In a trend to increase the size of racially balanced datasets, DiveFace [11] was assembled as a subset of the MegaFace MF2 dataset [11]. This sampling brought variation in head pose, lighting, age, facial expression, and quality from the Flickr-scraped images. However, they limit their racial categories to three races, combining the Indian and African categories. In Table 1, we have placed the African and Indian categories into the African category, as African falls before Indian in alphabetical order.
Among the largest racially balanced datasets to date are the BUPT-Globalface and BUPT-Balancedface datasets. BUPT-Balancedface follows our racially balanced definition, while BUPT-Globalface represents each race according to the global proportions of that race and is racially unbalanced by our definition [10]. Both are subsets of the MS-Celeb-1M dataset extended using the one-million FreeBase celebrity list [50], and both used the nationality labels in the FreeBase celebrities [51] along with the Face++ API [52], similar to RFW [15]. In addition, they removed any overlapping images with RFW, allowing models that are trained on BUPT-Balancedface and BUPT-Globalface to be evaluated on the RFW dataset.
Another racially balanced dataset is the CASIA-SURF Cross-ethnicity Face Anti-spoofing (CeFA) dataset. The CeFA dataset was designed as an anti-spoofing dataset with 3 races, 1607 subjects, 3 modalities, and 2D plus 3D spoofing attack types [16]. This dataset was designed to counteract spoofing attacks by providing balanced races and depth information collected by infrared sensors.
The FaceARG dataset was collected by scraping the web for pictures of celebrities and resulted in over 175,000 facial images. The individuals’ race was assigned labels corresponding to four main races, following the RFW split [12]. This dataset is referred to as an “in the wild” dataset due to the variety of head poses and orientations of the individuals’ faces. More information about the racial distribution of the FaceARG dataset is given in Table 1.
The Balanced Faces in the Wild (BFW) dataset [17] was released in 2023 as a subset of the VGGFace2 dataset [43]. The BFW dataset was balanced across identity, gender, and race. The BFW dataset demonstrates the ongoing research on racially aware facial recognition systems along with the need to balance across multiple attributes in addition to race. However, to achieve the desired balance, the BFW dataset is less than 1% the size of the VGGFace2 dataset. Balancing datasets across race often requires significantly reducing the size of large datasets, resulting in a severe tradeoff.
In this work, we focus on datasets with images of individuals taken from monocular and stationary cameras. It is important to acknowledge other datasets exist with different emphases, such as those stated in past review articles [54]. Since the introduction of the RFW and DemogPairs datasets in 2019, there has been an ongoing effort to create racially balanced datasets. We emphasize that although using an authentic racially balanced dataset for training is ideal, the tradeoff tends to decrease the overall size of the dataset. To overcome this tradeoff, dataset balancing through data generation and various network improvements have been developed; these are discussed in Section 4 and Section 5.

3.3. Evaluating Networks Trained on Unbalanced Datasets

The impact of training on a racially unbalanced dataset is best seen when evaluated on a racially balanced dataset. To properly evaluate network performance on facial recognition across race, it is crucial to have no overlap between the training and testing datasets. As many of the racially balanced datasets contain overlap with frequently used training datasets, we analyze which data should be partially excluded to ensure valid evaluation, as detailed for each of the racially conscious datasets in Table 2.
In addition to publishing the RFW dataset, Wang et al. [15] also evaluated many state-of-the-art facial recognition methods on the RFW dataset. They demonstrated that certain races had lower individual accuracies, as seen in Table 3. For example, African individuals had nearly twice the average error rate of their Caucasian counterparts. To simplify comparison, we also report the skewed error ratio (SER), defined as

$$SER = \frac{\max_r \mathrm{Error}_r}{\min_r \mathrm{Error}_r}$$

with $r$ being the races: Caucasian, Indian, Asian, and African.
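For concreteness, the SER computation can be sketched in a few lines; the error values below are illustrative placeholders rather than figures from Table 3:

```python
# Sketch: skewed error ratio (SER) across races, per the definition above.
def skewed_error_ratio(errors):
    """errors: mapping from race label to verification error rate."""
    return max(errors.values()) / min(errors.values())

# Illustrative error rates only (not the measured values from Table 3).
errors = {"Caucasian": 0.05, "Indian": 0.08, "Asian": 0.09, "African": 0.10}
print(f"SER = {skewed_error_ratio(errors):.2f}")  # 2.00: worst error is twice the best
```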

3.4. Intersection of Racially Balanced Datasets and Gender Bias

As our survey focuses primarily on racial bias, we do not conduct an in-depth analysis of gender bias across the datasets. However, as the intersection of gender and race bias is well known [62], we provide an analysis of the gender bias that remains in multiple racially balanced and racially unbalanced datasets.
In Table 4, we provide the gender distribution across race for both racially balanced and racially unbalanced datasets. We note that the BUPT-Balancedface and BUPT-Globalface datasets are not presented in the table, as neither the initial work nor subsequent works have published gender labels for these two datasets. In addition, the CeFA dataset only provides the distribution across gender and not the gender distribution across the different races. We obtained the gender distributions for the FaceARG, DiveFace, and FairFace datasets by manually analyzing the datasets’ provided gender labels. The RFW dataset distribution was obtained from Sarridis et al. [63]. We note that the BFW dataset is the most gender-balanced dataset that is also racially balanced. In contrast, the RFW dataset’s gender distribution contains the greatest imbalance across gender for racially balanced datasets. In the discussion of racially balanced datasets, it is critical to understand which datasets are also balanced across gender.

3.5. Discussion on Racially Balanced Datasets

As the field of facial recognition continues to grow, certain biases, such as bias across race, are becoming better understood. Many point to the imbalance of training and validation datasets as the cause of this bias. We provide a definition for racially balanced and racially unbalanced datasets based on statistical principles. We define racially balanced datasets as those that contain less than a 5 percentage point difference between the least represented race and the most represented race. In contrast, a racially unbalanced dataset contains a greater than five percentage point difference.
The first racially balanced dataset was released in 2019, with continual releases of additional racially balanced datasets since. We provide a list of racially balanced datasets in Table 1 and describe the overlap between racially balanced and racially unbalanced datasets in Table 2.
As racially balanced datasets are becoming more available, the intersection of race and gender is highlighted. We identify that although many racially balanced datasets are created to decrease the racial bias, the gender bias often remains. The most recently released racially balanced dataset, BFW, is not only balanced across race but also gender.

4. Balancing Datasets through Data Generation

While there are various datasets that are balanced [10,11,12,14,15,16,17], many of the largest datasets remain unbalanced [39,40,41,43,45]. One method for obtaining a racially balanced dataset from a racially unbalanced one is to augment the dataset through race transformations. The concept centers on taking all images of people of one race and transforming them into every other race, creating an equally balanced dataset. Facial recognition work focused on racially balancing datasets through race transformation frequently uses generative adversarial networks (GANs) [64] and diffusion models [65,66] to perform this augmentation.
Since the original publication of the GAN by Goodfellow et al. [64] in 2014, GANs have continued to increase in complexity and ability. Some of the top research papers on GANs cover topics including adding details to segmentations or outlines of items such as handbags or cars [67], style-based networks that generate human faces [68], and even creating anime characters simply from a list of attributes given in text [69]. In this paper, we focus on the effect that GANs have on increasing and decreasing racial equality as well as on cross-race transformation.
Although diffusion models were introduced in 2015 [65], they grew significantly in popularity and performance in 2022 with the release of stable diffusion in the latent space [66]. Latent stable diffusion is currently the leading method for image generation in various applications [70]. A recent variant, the diffusion transformer [71], has shown promising results in further enhancing this technology’s capabilities.

4.1. Understanding the Impact of Data Generation

As GANs and latent stable diffusion continuously improve, many users remain unaware of the unintended consequences of data generation. One team of researchers demonstrated these unintended consequences by using headshots of engineers to train a deep convolutional GAN [72], resulting in the majority of GAN-produced images having white skin tones and masculine facial features [73]. They point out that GANs can amplify data biases that are already in the dataset if GANs are used to generate facial data.
Before the publication of the work by Jain et al. [73], another team used a GAN to attempt to lower racial inequality, focusing on quantifying attractiveness to simulate equality of opportunity [74]. To accomplish this, they trained a network to take a subset of images, such as one gender, and distinguish whether the subjects are attractive or not. They then compared across the different subsets using publicly available datasets: CelebA [38], Soccer (many analysts, one dataset) [75], and Quick, Draw! [76]. They designed their generator and discriminator network structures based on previous works [77,78,79] and used conditional batch normalization [80]. Overall, they improved on the simulated equality of opportunity with their GAN-based approach [74].
A larger example of a GAN demonstrating racial inequality is seen with a team that created a face-generating GAN [81]. The GAN was used to generate faces that would match multiple individuals when compared using top-performing face authentication systems such as DLib [82], FaceNet [83], and SphereFace [60]. They termed these faces “master faces” [81]. To generate face images, they use DLib’s embedding representation as the input to their network. Their network has a linear layer and four de-convolutional layers, based on the DC-GAN architecture [72]. They trained on DLib’s embedding representations of the FFHQ dataset [68]. Upon inspection of the generated faces displayed in their paper, the highest proportion of any race among the generated images is Caucasian [81]. Examples of generated images are shown in Figure 2.
While there are various examples of GANs having large racial biases, there are also GANs that display the ability to decrease racial biases. One such work used a classical GAN-based approach to lower the racial bias in machine learning. Their approach consisted of training a network to identify race using California Census data, followed by an adversarial network trained against the identifying network. Their method increased race identification accuracy from 39% to 76% on their dataset [84]. This demonstrates that GANs can greatly improve classification networks, particularly for race. It also demonstrates that improvements continue to be made with GANs to decrease racial inequality.

4.2. Cross-Race Transformation

While using a racially balanced dataset to train a facial recognition model is beneficial, it often results in a tradeoff with decreasing the size of the datasets. To continue the use of large-scale facial recognition datasets, a beneficial method is race transformation. This method augments datasets to be racially balanced by transforming the race of images to other races.
Yucer et al. [22] overcome the issues of racially unbalanced datasets using a racial transformation data augmentation approach. They used a four-race dataset (African, Asian, Caucasian, and Indian) to compare more easily with the RFW (Racial Faces in the Wild) dataset. For the racial transformation, they trained six CycleGAN models [85] on the BUPT-Transferface dataset [15]. For facial recognition, they trained a common DCNN, ResNet [86], and used a ResNet100 [61] to obtain the final 512-D feature space representation. They experimented with various loss functions: Softmax [60], CosFace [87], and ArcFace [61]. They trained on a subset of 1200 images from VGGFace2 [43], with 300 images for each of their four races. Each of these 300 images was then transferred to the other 3 races using the CycleGANs for the augmented dataset. Their results not only improve accuracy on the RFW dataset but also the overall accuracy on the LFW (Labeled Faces in the Wild) dataset [22]. Example images of the transferred race images are available in Figure 3.
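The augmentation loop behind this kind of approach can be sketched as follows. This is our own illustrative pseudocode, not the authors’ released implementation: `generators[(src, dst)]` stands in for a trained CycleGAN generator mapping faces of race `src` to race `dst`, and the dataset format is assumed.

```python
# Minimal sketch of race-transfer augmentation in the spirit of
# Yucer et al. [22]: every image is kept and additionally transferred
# into the other three races, keeping the identity label unchanged.
RACES = ["African", "Asian", "Caucasian", "Indian"]

def augment_balanced(dataset, generators):
    """dataset: iterable of (image, identity, race); generators:
    mapping (src_race, dst_race) -> callable image transformer."""
    augmented = []
    for image, identity, race in dataset:
        augmented.append((image, identity))  # keep the original image
        for target in RACES:
            if target != race:
                transferred = generators[(race, target)](image)
                augmented.append((transferred, identity))  # same identity
    return augmented
```

Whether such transferred images should count as positives for the original identity remains an open question, as discussed later in this section [94].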
A similar approach was taken by Ge et al. [88], who use a Fan-Shaped GAN (FGAN) to transfer the race of an individual to another race. Their network structure is a combination of previous architectures [85,89] with the addition of spectral normalization [78]. For the dataset, they note that RFW is collected from MS-Celeb-1M, so they collect their images from MS-Celeb-1M with 5000 images in each racial subset. They compare their network against the StarGAN [89] and CycleGAN [85]. In doing so, they showed that their approach achieved better FID scores in all races and a better inception score on all races besides African [88].
One recent work in racial image generation transforms gender and age in the same system [90]. Their pipeline takes the input image, detects the faces and eyes, crops, normalizes, generates the face image by GRA-GAN, and estimates gender/race/age using a CNN. This pipeline creates a method to control for extra variation that is outside the scope of the GRA-GAN. They demonstrate that their GRA-GAN outperforms or has comparable results to top image-to-image GANs such as Pix2Pix [67], CycleGAN [85], Modified CycleGAN [91], and Enhanced CycleGAN [92]. They state that their network could be used to augment data to prevent training from overfitting to certain genders, races, or ages [90].
While many methods for balancing datasets through data generation have focused on training a new model for transferring race, the work by Jain et al. [93] uses an existing StyleGAN to perform the generation. Where previous methods identified the need to scale up with additional subjects and computational power, the work by Jain et al. [93] provides a method for transferring style to an image to transfer the race. Using this method, Jain et al. [93] pre-trained a facial recognition network on the augmented dataset, which outperforms corresponding models not pre-trained with their augmented dataset. Their approach demonstrates a solution to previous methods’ need to scale up the transformation approaches.
While racial transformation sounds promising for balancing a dataset, many problems remain. These include whether images of a person that have been racially transformed to a new race should be treated as positive pairs, negative pairs, or excluded entirely with respect to the original image or other images of that same person [94]. In contrast, the approach of limiting the dataset to only use an equal number of images from each race is a more defined approach to the solution. However, it limits the size of the dataset instead of increasing it. The benefits and the problems of each of these approaches are an open research question.
One of the most recent image generation methods has generated an entire dataset for the purpose of overcoming the skew across race in facial recognition. DigiFace 1M provides 1 million synthetically generated faces that are balanced across race and adds various accessories to make the dataset encompass a larger spread of data. Using only their dataset, they achieved 95.82% accuracy on the LFW dataset without using any of the LFW training images. As opposed to other methods using GANs or stable diffusion, DigiFace 1M was created using a computer graphics pipeline [29]. Training on this racially balanced dataset resulted in high accuracy on the LFW dataset, demonstrating promise for racially aware facial recognition systems. An example from the DigiFace 1M dataset is provided in Figure 4.

4.3. Intersection of Racial Dataset Balancing and Gender

As we discuss racially balancing datasets through data generation, we note that similar approaches have also been proposed for balancing across gender. Previously, we discussed the GRA-GAN approach that uses the same network to generate gender, race, and age [90]. While other works had transferred across race and age [92], the GRA-GAN was the first approach to also transfer across gender. Previous related approaches classified gender for tasks such as face aging [95].
The results of the GRA-GAN demonstrate the improvement over multiple biases when training to overcome the biases in the same network. Their work shows the large interrelation between racial and gender bias.

4.4. Discussion on Balancing Datasets through Data Generation

Balancing datasets through data generation to minimize racial bias is relatively new to the field of image generation. Preliminary methods published in 2020 demonstrated success on samples of larger datasets [22], while recent publications have scaled to full datasets by using pre-existing networks [93]. In addition, training a network to transfer not only across race but also across gender and age demonstrates an improvement over transferring across race alone [90].
The field of racially balancing datasets through data generation has followed the progression of GANs, using both CycleGANs and Fan-Shaped GANs. Other image networks mitigating racial bias in related tasks [96] have used diffusion to generate images, while dataset balancing through data generation for facial recognition has turned to computer graphics pipelines [29]. One of the next steps for balancing datasets through data generation for facial recognition is to adopt latent diffusion models, as various other fields of image generation have.

5. Network Improvements across Human Race

Efforts to mitigate the performance disparities across different races in facial recognition systems continue to evolve with various innovative approaches. This work focuses on approaches that include refining loss functions, optimizing training methodologies, modifying datasets, and novel network architectures. In addition, as various methods rely on race recognition within the overarching solution, we include a discussion on identifying race. As gender bias is interrelated with racial bias, we discuss network improvements that decrease both racial and gender bias. We provide a brief discussion on related tasks’ network improvements and an overall discussion on network improvements in facial recognition across human races.

5.1. Network Improvements on the RFW Dataset

This section focuses on improvements measured on the RFW dataset [15], a popular racially balanced test dataset. Table 5 compares various improvements discussed in this section. The SER (as defined in Section 3) is also reported in this table to facilitate straightforward comparison to methods trained on unbalanced datasets shown in Table 3.
Accompanying the RFW dataset, Wang et al. [15] provide an initial baseline method for the dataset. This baseline, the Information Maximization Adaptation Network (IMAN), has two main stages: pseudo-adaptation and a custom mutual information loss. The pseudo-adaptation focuses on pre-clustering to obtain initial improvement in the target domain. Then, the mutual information loss uses the distribution of the target classifier to obtain larger decision margins.
The CosFace solution presents a loss function that incorporates a cosine margin into the distances between intra-class and inter-class face representations [87]. This margin increases inter-class pair distances and decreases intra-class pair distances. Although CosFace was released prior to the RFW dataset, results are included from Gong et al. [100] in Table 5 that evaluated the CosFace loss with a ResNet34 architecture.
A pivotal improvement on the RFW task and facial recognition in general is the ArcFace (Additive Angular Margin) loss [61]. Instead of relying on direct distance measurements, the ArcFace loss uses the angle of the arc between the two facial representations. This places the facial representations along the exterior of the arc in multidimensional space. Imposing this defined embedding prior allows for a more robust facial representation. ArcFace has become widely adopted across facial recognition, with the majority of the following methods within this section using the ArcFace loss, as illustrated by the Loss-ArcFace column in Table 5. It is important to note that ArcFace is now a popular loss function used in various applications [105,106,107,108].
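To make the margin mechanism concrete, the following is a minimal PyTorch-style sketch of an additive angular margin, written by us for illustration; the scale s=64 and margin m=0.5 are common default values and an assumption here, not settings prescribed by this survey’s referenced works.

```python
import torch
import torch.nn.functional as F

def arcface_logits(embeddings, class_weights, labels, s=64.0, m=0.5):
    """Sketch of the ArcFace additive angular margin [61]: cosine
    similarities come from L2-normalized embeddings and class weights,
    the margin m is added to the target-class angle, and the result is
    scaled by s before the usual cross-entropy."""
    cos = F.linear(F.normalize(embeddings), F.normalize(class_weights))
    theta = torch.acos(cos.clamp(-1.0 + 1e-7, 1.0 - 1e-7))
    target = F.one_hot(labels, num_classes=class_weights.size(0)).bool()
    return s * torch.cos(torch.where(target, theta + m, theta))

# Usage: loss = F.cross_entropy(arcface_logits(emb, W, y), y)
```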
Along with updating loss functions, there is an additional focus on incorporating conventional methods with deep learning techniques, specifically hand-crafted features. Some of these hand-crafted features are crowd counting, camera angle, camera height, appearance, and scale of the individuals. The ACNN method applies these hand-crafted features to the neural network facial recognition approach [97]. This extra information outperforms a similar-sized CNN without the hand-crafted features. This approach was also released before the RFW dataset; Gong et al. [100] evaluated the ACNN method on the RFW dataset, and we include these results in Table 5.
The use of conventional methods to overcome racial bias in facial recognition shifted towards using probabilistic approaches. The Probabilistic Face Embeddings (PFE) approach learns a Gaussian latent space representation of the face [98]. While most facial recognition approaches use neural networks to separate the multi-dimensional representation of the face, the PFE uses a variational auto-encoder, similar to image generation methods such as latent diffusion models [66], to encode the face into a latent space representation with a mean and a variance. The latent space Gaussian representations are compared to determine whether the facial representations correspond to the same individual. Although this approach is not widely adopted, it remains competitive with other racially aware methods, as seen in Table 5.
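As a sketch of how such Gaussian embeddings are compared, a PFE-style mutual likelihood score can be written as below (constant terms dropped); this is our illustrative rendering of the standard score under the assumption of diagonal covariances, not code from the PFE authors.

```python
import torch

def mutual_likelihood_score(mu1, var1, mu2, var2):
    """Sketch of a PFE-style mutual likelihood score [98]: higher values
    mean the two Gaussian face embeddings (per-dimension means mu* and
    variances var*, shape (batch, dim)) more likely share an identity."""
    var_sum = var1 + var2
    return -0.5 * ((mu1 - mu2).pow(2) / var_sum + torch.log(var_sum)).sum(dim=-1)
```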
Architectural advancements have been developed to further improve facial recognition across race. One such improvement, the debiasing adversarial network (DebFace), uses an image-to-feature encoder, four attribute classifiers, a distribution classifier, and a feature aggregation network [100]. The four classifiers—gender, age, race, and identity—turn potential biases into informed features, thus improving performance across underrepresented groups.
Continuing in architectural improvements, the Group Adaptive Classifier (GAC) method uses an adaptive classification within the network structure to obtain higher accuracies [100]. They use a group of adaptive classifiers that assign the individual to a demographic group. A demographic-specific kernel then performs the final classification. These demographic-specific kernels learn unique racial characteristics, thus improving accuracy.
Recent studies have explored significant enhancements in training methodologies. One notable approach, the Reinforcement Learning-based Race Balance Network (RL-RBN) [10], uses reinforcement learning as a novel method for training facial recognition. They deliberately pick the action, reward, and objective functions to ensure balanced learning across race. The action function permits three outcomes: keeping the margin between classes the same, shifting the margin to a larger value, and shifting the margin to a smaller value. The reward model aims to standardize distances and mitigate skew across racial groups. To achieve this, the reward is calculated using the distance between the Caucasians and the race of the individual [10]. The objective function is a deep Q-Learning function [109,110], which learns the optimal policy for the agent [10]. While reinforcement learning is not frequently used in facial recognition tasks, this study underscores the importance of enhancing training methods beyond merely refining datasets, losses, and architectures.
Wang et al. [101] proposes another significant improvement in training methodology with the introduction of the Meta Balanced Network (MBN), a meta-learning method that improves fairness across skin tones. The MBN uses two loss functions: training loss and meta loss. The meta loss weights the training loss to ensure equal weight adjustment across all skin tones during backpropagation. While other studies emphasize the necessity of balanced training data, this approach is able to use skewed datasets without skewing the final result.
Coe and Atay [1] investigate the optimal network architecture to minimize racial bias in facial recognition systems. To demonstrate that each model learns different races in different proportions, they compare AlexNet [111], VGG16 [112], and ResNet50 [113]. Their results demonstrate that VGG16 outperforms both AlexNet and ResNet50 in terms of accuracy across different races [1]. This indicates that the performance across race varies based on network architecture, despite being trained on the same data. As their approach did not report results on the RFW dataset, the results are not included in Table 5.
An unexpected approach comes from researchers claiming that training only on one race is not inherently disadvantageous [102]. They demonstrate that by training only on African faces, they achieved less skew across race than by training with a balanced dataset. They also found that they obtained higher accuracy across race by using additional images per individual instead of adding more identities. To reach their findings, they performed their testing on four recent top facial recognition models: VGGFace2 [43], CenterLoss [59], SphereFace [60], and ArcFace [61]. They note that training on one race decreases the overall size of the dataset and emphasize that having a larger dataset would result in even higher metrics [102]. Their experiments and findings result in a new and unique method for obtaining higher accuracies across race.
Serna et al. [20] proposed Sensitive Loss, a modification of the triplet loss function [114] paired with a sensitive triplet generator. Their approach emphasized how transfer learning a racially skewed facial recognition model can decrease the skew across race. To demonstrate the performance of the sensitive loss, they evaluated their loss function across three baseline models: VGG-Face [39], ResNet-50 [113], and ArcFace [61]. Their loss greatly improved the Equal Error Rate and decreased the standard deviation across the DiveFace dataset, the RFW dataset, and the BUPT-Balancedface dataset.
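For reference, the underlying triplet loss is shown below in a short PyTorch-style sketch; the demographic-aware mining that makes the triplets “sensitive” is only hinted at in the docstring, as our paraphrase of the general idea rather than the authors’ exact procedure.

```python
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss [114]: pull same-identity pairs together
    and push different-identity pairs apart by at least `margin`.
    Sensitive Loss [20] builds on this by generating triplets with a
    demographic-aware selection rule (our paraphrase of the idea)."""
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()
```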
An additional method returns to architectural improvements with the Adversarial Information Network (AIN) approach. Here, they minimize the target classifier and maximize the feature extractor. Then, they use a graph neural network to find the likelihood of the target data [115]. As the AIN approach reported its RFW results in a non-conventional format, we do not include the AIN results in Table 5.
Research has shown that networks tend to focus on varying facial regions based on the individual’s race. To address this issue, Li et al. [103] introduced the Progressive Cross Transformer (PCT). This innovative approach employs a dual transformer architecture in the recognition process.
Introduced in 2023, the Gradient Attention Balance Network (GABN) is a facial recognition model designed to analyze the same facial regions regardless of the race of the individual [104]. They use attention masks to enforce structured regions of interest, effectively minimizing racial disparities, as illustrated by the results in Table 5.

5.2. Racial Classification

Various facial recognition approaches require racial classification methods, either directly [10,20,99,101,102,104,115] or indirectly [100,103]. Thus, discussing racial classification methods and their improvements is crucial. In 2014, a review article [54] cited the highest accuracy for identifying race as nearly 99% [116], misleading some to consider race classification as a solved problem. However, this impressive accuracy used 3D imaging techniques to identify race [117], whereas this work focuses exclusively on race identification with monocular images—a significantly more challenging task.

5.2.1. Two-Class Race Classification

Instead of focusing on classifying across many races, initial research focused on distinguishing between two races. One such group focused on distinguishing Black and White individuals in one network and distinguishing Chinese and non-Chinese in another [118]. Their network was based on the CIFAR-10 network [119]. For their datasets, they collected images from various sources: from the MORPH-II dataset [120], they used 43,130 face images of Black and White individuals; from the Casia-Webface dataset [37], they used 101,771 images of Black and White individuals; from the CASIA-PEAL dataset [121], they chose 5429 images of Asian individuals; and from the FERET dataset [122], they used 3407 images of Black and White individuals [118]. The reported overall accuracy was 100% for identifying Black individuals and 99.4% for White individuals [118]. On the Chinese and non-Chinese classification task, they achieved 91.6% accuracy for identifying Chinese individuals and 93.5% for non-Chinese individuals. These high accuracies demonstrate a strong ability to identify race in the narrow scope of Chinese and non-Chinese individuals or Black and White individuals.
Some may argue that identifying just a few races is oversimplifying a complex task. Within one race or demographic category, there may be significant racial variation. To explore intra-racial performance, one group of researchers developed a network to identify a face as either from northern India or from southern India. Upon collecting and labeling a custom dataset, the method achieved human-level accuracy but with different error patterns [123]. Their findings demonstrate the ability of neural networks to classify race at a more precise level, even within a specific race.
Vo et al. [124] based their racial recognition model on the VGG architecture [112]. They scraped individuals’ profile pictures and race labels from Facebook to obtain images with different poses, lighting, accessories, and imaging conditions. They then used a Haar Cascade classifier [125] to crop and label 6100 faces, with 2892 Vietnamese and 3208 images of other races, creating their VNFaces dataset. They obtained an 88.87% overall accuracy using this model [124]. This dataset contains increased variation across lighting, accessories, and image conditions, resulting in a more difficult classification task.

5.2.2. Multiple-Class Race Classification

While various methods focus on differentiating between two races, many facial recognition models require multiple races for their approach. One method that identifies race across the four labeled races in the UTKFace dataset [42] based its network architecture on ResNet V1 [126], followed by L2 normalization and fully connected layers, and achieved an average race classification accuracy of 90.1% [127].
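A minimal sketch of this classifier pattern (backbone features, L2 normalization, then a fully connected head over four races) is given below; the backbone choice, feature dimension, and layer sizes are illustrative assumptions on our part.

```python
import torch.nn as nn
import torch.nn.functional as F

class RaceClassifier(nn.Module):
    """Sketch of the pattern described in [127]: backbone features are
    L2-normalized, then mapped to four race logits. Sizes are assumed."""
    def __init__(self, backbone, feat_dim=512, num_races=4):
        super().__init__()
        self.backbone = backbone          # e.g., a ResNet feature extractor
        self.head = nn.Linear(feat_dim, num_races)

    def forward(self, x):
        feats = F.normalize(self.backbone(x), dim=-1)  # L2 normalization
        return self.head(feats)  # train with cross-entropy over race labels
```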
One set of researchers focused their neural network on identifying whether individuals are Chinese, Japanese, or Korean [128]. To collect a training dataset, they scraped Twitter, specifically the followers of Asian celebrities, and used the username and the primary language of the profile description to label the race. Their approach then predicted the correct race for 78.21% of Chinese, 72.80% of Japanese, and 73.80% of Korean individuals [128], demonstrating great success in race classification while also demonstrating that race classification remains unsolved.
As researchers published the FaceARG dataset, they also trained and compared four unique SOTA convolutional neural networks on the task of recognizing race [12]. They chose to include VGG19 [112], Inception ResNet v2 [126], SE-ResNet24 [129], and MobileNetV3 [130]. The results of each of these four networks are shown in Table 6. Their MobileNetV3 achieved their highest overall F1 score at 96.64 on the four-class human race classification [12]. To test the robustness of their system, they altered the images with experiments including eye blur, eye occlusion, nose blur, nose occlusion, mouth blur, mouth occlusion, grayscale, increasing brightness, decreasing brightness, and image blur. For overall accuracy in identifying race, all robustness experiments maintained over 90% accuracy, except for eye occlusion, which dropped to 80.74% accuracy [12]. All of this demonstrates the ability of CNNs to classify race at a fairly high accuracy, even with some important parts of the face occluded.

5.3. Related Tasks

While there are various network improvements being accomplished in facial recognition to decrease the variance across race, other tasks also benefit from focusing on race. One similar task is race, gender, and age classification. In the release of the FairFace dataset, Karkkainen and Joo [13] trained four different classification networks, all with the same setup: DLib’s face detector [131], a ResNet-34 [113], and the ADAM optimizer [132]. The difference between the four networks was the dataset they were trained on: UTKFace [42], LFWA+, CelebA [133], or their new racially aware dataset, FairFace. They then evaluated these four trained networks across datasets. The cross-dataset accuracy findings on the White race vs. Not White race are shown in Table 7a for race classification, Table 7b for gender classification, and Table 7c for age classification. They note that the model trained on the FairFace dataset outperforms all others. The results emphasize the benefit that comes from focusing on race while training not only facial recognition or race classification tasks but also gender and age classification. Their findings demonstrate the need for generalization across race on more tasks than facial recognition. We acknowledge that the White vs. Not White labels are used even though their definitions vary with the labeler’s country of residence and cultural background.
Another example of performance benefiting from race information is the InclusiveFaceNet solution [137], a model for detecting face attributes. It does this by using race and gender representations that are transfer-learned into the model. To transfer-learn race and gender, the authors train a FaceNet model [83] and extract features from its avgpool layer, which captures race and gender information. To further learn race, they obtained over 100,000 images of famous individuals [134,135,136]. After combining these transferred representations, they obtained higher face attribute detection results across race and gender than the network without them [137].
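A minimal sketch of this transfer pattern — freezing a pretrained backbone and reusing its pooled features for an attribute head — follows, with a torchvision ResNet standing in for FaceNet [83] (the layer names and sizes are our assumptions):

```python
import torch
import torch.nn as nn
from torchvision import models

# A torchvision ResNet stands in for FaceNet [83]; the pattern is the same:
# freeze a pretrained backbone and reuse its pooled features for a new head.
backbone = models.resnet18(weights=None)   # in practice, load trained weights
backbone.fc = nn.Identity()                # expose the 512-d avgpool features
for p in backbone.parameters():
    p.requires_grad = False                # transfer the representation as-is

attribute_head = nn.Linear(512, 40)        # e.g., 40 binary face attributes

def detect_attributes(images):
    with torch.no_grad():
        feats = backbone(images)           # features carrying race/gender cues
    return attribute_head(feats)
```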
Pastaltzidis et al. [138] explored the bias in the RWF-2000 dataset [139], a law enforcement project created for violent activity recognition. They note that certain demographics are over-represented in the training dataset. To counteract this bias, they implemented a unique data augmentation method that force-balances the dataset to be more representative of the population. They achieved this by modifying videos that over-represent minority groups, using body movement tracking to replace the individual with someone of a different race. This resulted in a more balanced dataset, and their results show promise for using synthetic generative models to balance various datasets. These examples across related tasks demonstrate the gains that come from focusing on race during training.
These three studies demonstrate that a racially balanced dataset, race classification pre-training, and force balancing a dataset through augmentation are beneficial to more than facial recognition alone. These improvements also transfer to other classification and detection tasks by decreasing the disproportionate accuracies between races, which is critical for the integration of such systems into real-world applications.
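When full synthetic replacement of the kind used by Pastaltzidis et al. [138] is unavailable, force balancing can be approximated at the sampling level; a minimal PyTorch sketch, assuming one integer race label per training sample:

```python
import torch
from torch.utils.data import WeightedRandomSampler

def balanced_race_sampler(race_labels):
    """Oversample under-represented races so each race is drawn equally often.

    race_labels: one integer race ID per training sample.
    """
    labels = torch.as_tensor(race_labels)
    counts = torch.bincount(labels).float()
    weights = 1.0 / counts[labels]           # rarer race -> larger draw weight
    return WeightedRandomSampler(weights, num_samples=len(labels),
                                 replacement=True)
```

The returned sampler can be passed to a PyTorch DataLoader through its sampler argument, so each race is drawn with roughly equal probability during training.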

5.4. Intersection of Network Improvements across Race and Gender

Many of the network improvements across human races could also be applied to reducing bias across gender. Our work separates network improvements across race into four categories: loss/training, architecture, dataset modification, and the use of additional data. Each of these four categories not only decreases racial bias but has also been seen to decrease gender bias. The improvement in loss/training can be seen with Conti et al. [140], who take a trained facial recognition model and train a shallow network on top of it with a fair von Mises–Fisher mixture loss. The improvement across architecture can be seen in the work of Gwyn and Roy [141], which analyzes different architectures to identify the one that minimizes gender bias; they find that VGG-16 is among the architectures with the least gender bias. Coe and Atay [1] reached the same conclusion for reducing racial bias. From these two studies, we see the VGG-16 network minimizing not only racial bias but also gender bias.
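For reference, the von Mises–Fisher distribution that underlies the fair loss of Conti et al. [140] is, in standard notation, a density over unit-norm embeddings (this is the textbook form, not the authors’ exact fair-loss formulation):

$$ f_p(\mathbf{x};\, \boldsymbol{\mu}, \kappa) = C_p(\kappa)\, \exp\!\big(\kappa\, \boldsymbol{\mu}^{\top}\mathbf{x}\big), \qquad C_p(\kappa) = \frac{\kappa^{p/2-1}}{(2\pi)^{p/2}\, I_{p/2-1}(\kappa)}, $$

where $\mathbf{x}$ and $\boldsymbol{\mu}$ lie on the unit sphere in $\mathbb{R}^p$, $\kappa \geq 0$ controls how tightly an identity’s embeddings concentrate around the mean direction $\boldsymbol{\mu}$, and $I_{\nu}$ is the modified Bessel function of the first kind. Broadly speaking, the fair variant adjusts these parameters across demographic groups rather than treating all groups identically.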
Just as dataset modifications decrease racial bias in facial recognition, Tian et al. [142] demonstrate that data augmentation affects gender bias as well. Similar results are seen with the use of additional data: the BFW dataset [17] demonstrates that additional data used to balance datasets benefits not only racial bias but also gender bias. As the discussion around minimizing racial bias through network improvements grows, the overlap with minimizing gender bias grows as well. The decreases in racial bias and gender bias are connected, allowing improvements in one to be adapted to benefit the other.

5.5. Discussion on Network Improvements

General trends emerge in the analysis of the varying network improvements for decreasing the skew across race in facial recognition. In Table 8, we classify each improvement into one or more of the following categories: loss/training improvements, architectural changes, dataset modification, and using additional data.
The loss and training improvements began with the CosFace loss [87] and were then built on by the IMAN approach with its custom mutual information loss [15]. However, the most commonly used loss for facial recognition, particularly for racially balanced facial recognition, is the ArcFace loss [61]. Other losses have been developed [104]; however, none have been as widely adopted as ArcFace [61]. Along with adopting new loss functions, there has also been a focus on improving training approaches, such as attempts to use reinforcement learning [10] or losses that force the network to focus on the same parts of the face regardless of race [104].
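Since the ArcFace loss [61] recurs throughout Table 5, a compact sketch of its additive angular margin may be useful; the scale s and margin m below are commonly published defaults, and the implementation is a minimal illustration rather than the reference code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceHead(nn.Module):
    """Additive angular margin classification head in the style of [61]."""
    def __init__(self, embed_dim, num_classes, s=64.0, m=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, embed_dim))
        self.s, self.m = s, m                    # scale and margin (radians)

    def forward(self, embeddings, labels):
        # Cosine similarity between L2-normalized embeddings and class centers.
        cos = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        # Add the angular margin m only to each sample's target-class angle.
        target = F.one_hot(labels, cos.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.m), cos)
        return F.cross_entropy(self.s * logits, labels)
```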
Most publications on mitigating the skew across race in facial recognition focus on improving the architecture of the neural network. Early methods first learned race and gender and then learned the facial recognition task [137]. These were followed by methods with multiple network stages that learn the race and the face separately [15] or within the same architecture [99]. Other methods learned a probabilistic latent-space representation of the face [98] or compared popular architectures on the task [1]. Most recently, contributions have focused on architectures that can compare skin tone [101] and approaches that keep the network’s focus on the same parts of the face, regardless of race [104,115].
There have also been improvements from modifying the dataset, with novel ideas such as training on only one race to improve cross-race results [102] and the more common idea of force balancing datasets [138]. Other approaches have incorporated earlier methodologies, such as using hand-picked features [97], or combined multiple tasks (gender, age, race, and identity) [99].
Overall, many of the methods are centered around inserting prior information within loss functions [15,61,87,104], modifying architectures to learn additional tasks [15,99,101,104,115], and modifying the dataset or incorporating additional data [97,99,102]. Each method contributes to mitigating the skew across race in facial recognition.

6. Future Work

Facial recognition focusing on substrata, such as human races, is a growing field, and there are many directions in which future work could build on existing methods. As discussed previously, much work has been carried out to create racially balanced datasets. This approach offers multiple benefits, primarily enhancing the ability of a system to achieve high accuracy in realistic, generalized scenarios. This area of research is still lacking, as the number and size of racially unbalanced datasets dwarf those of racially balanced datasets. Future work could focus on building larger datasets with equal sampling from each human-race stratum.
Another avenue for future work, discussed previously, is race transformation: taking an image of an individual of one race and transferring it to all other races of interest. Applying this transformation to all of an individual’s images would create a new synthetic identity that could be used as additional training samples. With high-quality race transformation, one could take any racially unbalanced dataset and augment it to an equal distribution across all racial groups. This could be accomplished by replacing GAN-based racial transformation networks with diffusion-based ones. While GANs were originally the SOTA generative deep learning approach, diffusion models have surpassed GANs in many categories and show promise for race transformation networks [143].
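As an illustration of the building blocks such a diffusion-based transformation network could start from, the sketch below runs a generic image-to-image latent diffusion pipeline with the Hugging Face diffusers library; the checkpoint name, prompt, and strength value are our assumptions, and a real race transformation system would additionally need conditioning on the target attributes and explicit identity preservation:

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

# Load a general-purpose latent diffusion img2img pipeline (requires a GPU;
# the checkpoint choice is ours, not a recommendation from this survey).
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")

source = Image.open("face.jpg").convert("RGB").resize((512, 512))
# 'strength' sets how far the output may drift from the source; lower values
# preserve more identity cues, which is the crux of race transformation.
result = pipe(prompt="a portrait photo of the same person",
              image=source, strength=0.4, guidance_scale=7.5).images[0]
result.save("transformed.jpg")
```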
To better support racially balanced performance across human races, race recognition using monocular images must also be improved. With certain strata outperforming others in classification, it is difficult to scrape images for races with lower classification accuracies. To create an optimal racially balanced dataset, the race classification of human faces must therefore improve as well.
As this work focuses on comparing racially unbalanced/balanced datasets, balancing datasets through data generation, and network improvements across human races using monocular cameras, there are limitations to our study. One limitation is that bias in datasets is not limited to race; it is also connected, or even interconnected, with other biases such as gender and age. We recognize that gender [62,144,145] and age [146] are not covered in great detail within this survey. Future survey papers could center on the intersection of race, gender, and age in facial recognition.

7. Conclusions

Understanding bias in deep learning is a widespread and evolving field, presenting complex challenges. We provide an analysis of racial imbalance across many popular facial recognition datasets and note the increasing trend for new facial recognition datasets to be racially balanced. We discuss promising attempts to balance datasets through data generation, accompanied by a discussion of the impact of data generation, as well as various network improvements that aim to reduce racial bias in facial recognition systems, including loss modifications, training methods, architecture improvements, data modification, and the incorporation of additional data. Finally, we provide a list of future work that researchers can pursue. Overall, the skew across race in facial recognition is decreasing but requires further research to be fully mitigated.

Author Contributions

Conceptualization, A.S. and D.-J.L.; methodology, A.S.; software, A.S.; validation, A.S. and S.T.; formal analysis, A.S.; investigation, A.S.; resources, D.-J.L.; data curation, A.S.; writing—original draft preparation, A.S.; writing—review and editing, A.S., D.-J.L., S.T. and Z.S.; visualization, A.S.; supervision, D.-J.L.; project administration, D.-J.L.; funding acquisition, D.-J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Acknowledgments

During the preparation of this work, the authors used Grammarly, Word Tune, and ChatGPT to rephrase specific key sentences for clarity and grammar. After using these tools/services, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Coe, J.; Atay, M. Evaluating impact of race in facial recognition across machine learning and deep learning algorithms. Computers 2021, 10, 113. [Google Scholar] [CrossRef]
  2. Cavazos, J.G.; Phillips, P.J.; Castillo, C.D.; O’Toole, A.J. Accuracy comparison across face recognition algorithms: Where are we on measuring race bias? IEEE Trans. Biom. Behav. Identity Sci. 2020, 3, 101–111. [Google Scholar] [CrossRef] [PubMed]
  3. Krishnapriya, K.; Albiero, V.; Vangara, K.; King, M.C.; Bowyer, K.W. Issues related to face recognition accuracy varying based on race and skin tone. IEEE Trans. Technol. Soc. 2020, 1, 8–20. [Google Scholar] [CrossRef]
  4. Pagano, T.P.; Loureiro, R.B.; Araujo, M.M.; Lisboa, F.V.N.; Peixoto, R.M.; Guimaraes, G.A.d.S.; Santos, L.L.d.; Cruz, G.O.R.; de Oliveira, E.L.S.; Cruz, M.; et al. Bias and unfairness in machine learning models: A systematic literature review. arXiv 2022, arXiv:2202.08176. [Google Scholar]
  5. Parikh, R.B.; Teeple, S.; Navathe, A.S. Addressing bias in artificial intelligence in health care. JAMA 2019, 322, 2377–2378. [Google Scholar] [CrossRef] [PubMed]
  6. Ntoutsi, E.; Fafalios, P.; Gadiraju, U.; Iosifidis, V.; Nejdl, W.; Vidal, M.E.; Ruggieri, S.; Turini, F.; Papadopoulos, S.; Krasanakis, E.; et al. Bias in data-driven artificial intelligence systems—An introductory survey. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2020, 10, e1356. [Google Scholar] [CrossRef]
  7. Fountain, J.E. The moon, the ghetto and artificial intelligence: Reducing systemic racism in computational algorithms. Gov. Inf. Q. 2022, 39, 101645. [Google Scholar] [CrossRef]
  8. Abdurrahim, S.H.; Samad, S.A.; Huddin, A.B. Review on the effects of age, gender, and race demographics on automatic face recognition. Vis. Comput. 2018, 34, 1617–1630. [Google Scholar] [CrossRef]
  9. Suresh, H.; Guttag, J.V. A framework for understanding unintended consequences of machine learning. arXiv 2019, arXiv:1901.10002. [Google Scholar]
  10. Wang, M.; Deng, W. Mitigating bias in face recognition using skewness-aware reinforcement learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 14–19 June 2020; pp. 9322–9331. [Google Scholar]
  11. Morales, A.; Fierrez, J.; Vera-Rodriguez, R.; Tolosana, R. Sensitivenets: Learning agnostic representations with application to face images. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 2158–2164. [Google Scholar] [CrossRef]
  12. Darabant, A.S.; Borza, D.; Danescu, R. Recognizing Human Races through Machine Learning—A Multi-Network, Multi-Features Study. Mathematics 2021, 9, 195. [Google Scholar] [CrossRef]
  13. Karkkainen, K.; Joo, J. Fairface: Face attribute dataset for balanced race, gender, and age for bias measurement and mitigation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 1548–1558. [Google Scholar]
  14. Hupont, I.; Fernández, C. Demogpairs: Quantifying the impact of demographic imbalance in deep face recognition. In Proceedings of the 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), Lille, France, 14–18 May 2019; pp. 1–7. [Google Scholar]
  15. Wang, M.; Deng, W.; Hu, J.; Tao, X.; Huang, Y. Racial faces in the wild: Reducing racial bias by information maximization adaptation network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 692–702. [Google Scholar]
  16. Li, A.; Tan, Z.; Li, X.; Wan, J.; Escalera, S.; Guo, G.; Li, S.Z. CASIA-SURF CeFA: A Benchmark for Multi-modal Cross-ethnicity Face Anti-spoofing. arXiv 2020, arXiv:2003.05136. [Google Scholar]
  17. Robinson, J.P.; Qin, C.; Henon, Y.; Timoner, S.; Fu, Y. Balancing biases and preserving privacy on balanced faces in the wild. IEEE Trans. Image Process. 2023, 32, 4365–4377. [Google Scholar] [CrossRef] [PubMed]
  18. Ning, X.; Nan, F.; Xu, S.; Yu, L.; Zhang, L. Multi-view frontal face image generation: A survey. Concurr. Comput. Pract. Exp. 2023, 35, e6147. [Google Scholar] [CrossRef]
  19. Page, M.J.; Moher, D.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. PRISMA 2020 explanation and elaboration: Updated guidance and exemplars for reporting systematic reviews. BMJ 2021, 372, n160. [Google Scholar] [CrossRef] [PubMed]
  20. Serna, I.; Morales, A.; Fierrez, J.; Obradovich, N. Sensitive loss: Improving accuracy and fairness of face representations with discrimination-aware deep learning. Artif. Intell. 2022, 305, 103682. [Google Scholar] [CrossRef]
  21. Samaria, F.S.; Harter, A.C. Parameterisation of a stochastic model for human face identification. In Proceedings of the 1994 IEEE Workshop on Applications of Computer Vision, Sarasota, FL, USA, 5–7 December 1994; pp. 138–142. [Google Scholar]
  22. Yucer, S.; Akçay, S.; Al-Moubayed, N.; Breckon, T.P. Exploring racial bias within face recognition via per-subject adversarially-enabled data augmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Virtual, 14–19 June 2020; pp. 18–19. [Google Scholar]
  23. Merriam-Webster. Race. Available online: https://www.merriam-webster.com/dictionary/race (accessed on 24 May 2024).
  24. Merriam-Webster. Ethnicity. Available online: https://www.merriam-webster.com/dictionary/ethnicity (accessed on 24 May 2024).
  25. Merriam-Webster. Ethnic. Available online: https://www.merriam-webster.com/dictionary/ethnic (accessed on 24 May 2024).
  26. Office of Management and Budget. Revisions to OMB’s Statistical Policy Directive No. 15: Standards for Maintaining, Collecting, and Presenting Federal Data on Race and Ethnicity; Office of Information and Regulatory Affairs, Office of Management and Budget, Executive Office of the President: Washington, DC, USA, 2024.
  27. Howell, A.J.; Buxton, H. Towards unconstrained face recognition from image sequences. In Proceedings of the Second International Conference on Automatic Face and Gesture Recognition, Killington, VT, USA, 14–16 October 1996; pp. 224–229. [Google Scholar]
  28. Huang, G.B.; Mattar, M.; Berg, T.; Learned-Miller, E. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. In Proceedings of the Workshop on faces in ‘Real-Life’ Images: Detection, Alignment, and Recognition, Marseille, France, 17–20 October 2008. [Google Scholar]
  29. Bae, G.; de La Gorce, M.; Baltrušaitis, T.; Hewitt, C.; Chen, D.; Valentin, J.; Cipolla, R.; Shen, J. Digiface-1m: 1 million digital face images for face recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–7 January 2023; pp. 3526–3535. [Google Scholar]
  30. Shorten, C.; Khoshgoftaar, T.M. A survey on image data augmentation for deep learning. J. Big Data 2019, 6, 60. [Google Scholar] [CrossRef]
  31. Mehrabi, N.; Morstatter, F.; Saxena, N.; Lerman, K.; Galstyan, A. A survey on bias and fairness in machine learning. ACM Comput. Surv. (CSUR) 2021, 54, 1–35. [Google Scholar] [CrossRef]
  32. Fisher, R.A. Statistical methods for research workers. In Breakthroughs in Statistics: Methodology and Distribution; Springer: Berlin/Heidelberg, Germany, 1970; pp. 66–70. [Google Scholar]
  33. Cowles, M.; Davis, C. On the origins of the .05 level of statistical significance. Am. Psychol. 1982, 37, 553. [Google Scholar] [CrossRef]
  34. Ortega-Garcia, J.; Fierrez, J.; Alonso-Fernandez, F.; Galbally, J.; Freire, M.R.; Gonzalez-Rodriguez, J.; Garcia-Mateo, C.; Alba-Castro, J.L.; Gonzalez-Agulla, E.; Otero-Muras, E.; et al. The multiscenario multienvironment biosecure multimodal database (bmdb). IEEE Trans. Pattern Anal. Mach. Intell. 2009, 32, 1097–1111. [Google Scholar] [CrossRef]
  35. Kumar, N.; Berg, A.; Belhumeur, P.N.; Nayar, S. Describable visual attributes for face verification and image search. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 33, 1962–1977. [Google Scholar] [CrossRef] [PubMed]
  36. Wolf, L.; Hassner, T.; Maoz, I. Face recognition in unconstrained videos with matched background similarity. In Proceedings of the CVPR 2011, Colorado Springs, CO, USA, 21–25 June 2011; pp. 529–534. [Google Scholar] [CrossRef]
  37. Yi, D.; Lei, Z.; Liao, S.; Li, S.Z. Learning face representation from scratch. arXiv 2014, arXiv:1411.7923. [Google Scholar]
  38. Yang, S.; Luo, P.; Loy, C.C.; Tang, X. From facial parts responses to face detection: A deep learning approach. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 13–16 December 2015; pp. 3676–3684. [Google Scholar]
  39. Parkhi, O.M.; Vedaldi, A.; Zisserman, A. Deep Face Recognition. In Proceedings of the BMVC 2015—British Machine Vision Conference 2015, British Machine Vision Association, Swansea, UK, 7–10 September 2015. [Google Scholar]
  40. Kemelmacher-Shlizerman, I.; Seitz, S.M.; Miller, D.; Brossard, E. The megaface benchmark: 1 million faces for recognition at scale. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 4873–4882. [Google Scholar]
  41. Guo, Y.; Zhang, L.; Hu, Y.; He, X.; Gao, J. Ms-celeb-1m: A dataset and benchmark for large-scale face recognition. In Proceedings of the European Conference on Computer Vision, 2016, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 87–102. [Google Scholar]
  42. Zhang, Z.; Song, Y.; Qi, H. Age progression/regression by conditional adversarial autoencoder. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5810–5818. [Google Scholar]
  43. Cao, Q.; Shen, L.; Xie, W.; Parkhi, O.M.; Zisserman, A. Vggface2: A dataset for recognising faces across pose and age. In Proceedings of the 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), Xi’an, China, 15–19 May 2018; pp. 67–74. [Google Scholar]
  44. Maze, B.; Adams, J.; Duncan, J.A.; Kalka, N.; Miller, T.; Otto, C.; Jain, A.K.; Niggel, W.T.; Anderson, J.; Cheney, J.; et al. Iarpa janus benchmark-c: Face dataset and protocol. In Proceedings of the 2018 International Conference on Biometrics (ICB), Gold Coast, QLD, Australia, 20–23 February 2018; pp. 158–165. [Google Scholar]
  45. Grother, P.; Ngan, M.; Hanaoka, K. Face Recognition Vendor Test (FVRT): Part 3, Demographic Effects; National Institute of Standards and Technology: Gaithersburg, MD, USA, 2019.
  46. Golzar, J.; Tajik, O.; Noor, S. Convenience Sampling. IJELS 2022, 1, 72–77. [Google Scholar] [CrossRef]
  47. Thomee, B.; Shamma, D.A.; Friedland, G.; Elizalde, B.; Ni, K.; Poland, D.; Borth, D.; Li, L.J. YFCC100M: The new data in multimedia research. Commun. ACM 2016, 59, 64–73. [Google Scholar] [CrossRef]
  48. Merler, M.; Ratha, N.; Feris, R.S.; Smith, J.R. Diversity in faces. arXiv 2019, arXiv:1901.10436. [Google Scholar]
  49. Birmajer, D.; Blount, B.; Boyd, S.; Einsohn, M.; Helmreich, J.; Kenyon, L.; Lee, S.; Taub, J. Introductory Statistics. 2021. Available online: https://touroscholar.touro.edu/oto/26/ (accessed on 5 April 2024).
  50. MS-Celeb-1M Challenge 3: Face Feature Test/Trillion Pairs. Available online: http://trillionpairs.deepglint.com/overview (accessed on 14 June 2020).
  51. Google. Freebase Data Dumps. Available online: https://developers.google.com/freebase/data (accessed on 24 May 2024).
  52. Face++ Research Toolkit. Available online: https://www.faceplusplus.com/ (accessed on 24 May 2024).
  53. Merriam-Webster. Nationality. Available online: https://www.merriam-webster.com/dictionary/nationality (accessed on 24 May 2024).
  54. Fu, S.; He, H.; Hou, Z.G. Learning race from face: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 2483–2509. [Google Scholar] [CrossRef] [PubMed]
  55. Microsoft Azure. Available online: https://www.azure.cn (accessed on 24 May 2024).
  56. Zhou, E.; Cao, Z.; Yin, Q. Naive-deep face recognition: Touching the limit of LFW benchmark or not? arXiv 2015, arXiv:1501.04690. [Google Scholar]
  57. Baidu Cloud Vision Api. Available online: http://ai.baidu.com/ (accessed on 24 May 2024).
  58. Amazon’s Rekognition Tool. Available online: https://aws.amazon.com/rekognition/ (accessed on 24 May 2024).
  59. Wen, Y.; Zhang, K.; Li, Z.; Qiao, Y. A discriminative feature learning approach for deep face recognition. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 499–515. [Google Scholar]
  60. Liu, W.; Wen, Y.; Yu, Z.; Li, M.; Raj, B.; Song, L. Sphereface: Deep hypersphere embedding for face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 212–220. [Google Scholar]
  61. Deng, J.; Guo, J.; Xue, N.; Zafeiriou, S. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 July 2019; pp. 4690–4699. [Google Scholar]
  62. Acien, A.; Morales, A.; Vera-Rodriguez, R.; Bartolome, I.; Fierrez, J. Measuring the gender and ethnicity bias in deep models for face recognition. In Proceedings of the Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications: 23rd Iberoamerican Congress, CIARP 2018, Madrid, Spain, 19–22 November 2018; Proceedings 23. Springer: Berlin/Heidelberg, Germany, 2019; pp. 584–593. [Google Scholar]
  63. Sarridis, I.; Koutlis, C.; Papadopoulos, S.; Diou, C. Towards Fair Face Verification: An In-depth Analysis of Demographic Biases. arXiv 2023, arXiv:2307.10011. [Google Scholar]
  64. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Proceedings of the Advances in Neural Information Processing Systems 27 (NIPS 2014), Montreal, QC, Canada, 8–13 December 2014. [Google Scholar]
  65. Sohl-Dickstein, J.; Weiss, E.; Maheswaranathan, N.; Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the International Conference on Machine Learning, PMLR, Lille, France, 7–9 July 2015; pp. 2256–2265. [Google Scholar]
  66. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 10684–10695. [Google Scholar]
  67. Isola, P.; Zhu, J.Y.; Zhou, T.; Efros, A.A. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1125–1134. [Google Scholar]
  68. Karras, T.; Laine, S.; Aila, T. A Style-Based Generator Architecture for Generative Adversarial Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019. [Google Scholar]
  69. Jin, Y.; Zhang, J.; Li, M.; Tian, Y.; Zhu, H.; Fang, Z. Towards the automatic anime characters creation with generative adversarial networks. arXiv 2017, arXiv:1708.05509. [Google Scholar]
  70. Croitoru, F.A.; Hondru, V.; Ionescu, R.T.; Shah, M. Diffusion models in vision: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 10850–10869. [Google Scholar] [CrossRef]
  71. Peebles, W.; Xie, S. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 4195–4205. [Google Scholar]
  72. Radford, A.; Metz, L.; Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv 2015, arXiv:1511.06434. [Google Scholar]
  73. Jain, N.; Manikonda, L.; Hernandez, A.O.; Sengupta, S.; Kambhampati, S. Imagining an engineer: On GAN-based data augmentation perpetuating biases. arXiv 2018, arXiv:1811.03751. [Google Scholar]
  74. Ting, D.S.W.; Cheung, C.Y.L.; Lim, G.; Tan, G.S.W.; Quang, N.D.; Gan, A.; Hamzah, H.; Garcia-Franco, R.; San Yeo, I.Y.; Lee, S.Y.; et al. Development and validation of a deep learning system for diabetic retinopathy and related eye diseases using retinal images from multiethnic populations with diabetes. JAMA 2017, 318, 2211–2223. [Google Scholar] [CrossRef] [PubMed]
  75. Silberzahn, R.; Uhlmann, E.L.; Martin, D.P.; Anselmi, P.; Aust, F.; Awtrey, E.; Bahník, Š.; Bai, F.; Bannard, C.; Bonnier, E.; et al. Many analysts, one data set: Making transparent how variations in analytic choices affect results. Adv. Methods Pract. Psychol. Sci. 2018, 1, 337–356. [Google Scholar] [CrossRef]
  76. Wheatley, G. Quick Draw; Mathematics Learning: Bethany Beach, DE, USA, 2007. [Google Scholar]
  77. Miyato, T.; Koyama, M. cGANs with projection discriminator. arXiv 2018, arXiv:1802.05637. [Google Scholar]
  78. Miyato, T.; Kataoka, T.; Koyama, M.; Yoshida, Y. Spectral normalization for generative adversarial networks. arXiv 2018, arXiv:1802.05957. [Google Scholar]
  79. Gulrajani, I.; Ahmed, F.; Arjovsky, M.; Dumoulin, V.; Courville, A.C. Improved training of Wasserstein GANs. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  80. Dumoulin, V.; Shlens, J.; Kudlur, M. A learned representation for artistic style. arXiv 2016, arXiv:1610.07629. [Google Scholar]
  81. Shmelkin, R.; Wolf, L.; Friedlander, T. Generating master faces for dictionary attacks with a network-assisted latent space evolution. In Proceedings of the 2021 16th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2021), Jodhpur, India, 15–18 December 2021; pp. 1–8. [Google Scholar]
  82. King, D.E. Dlib-ml: A machine learning toolkit. J. Mach. Learn. Res. 2009, 10, 1755–1758. [Google Scholar]
  83. Schroff, F.; Kalenichenko, D.; Philbin, J. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 815–823. [Google Scholar]
  84. Mandis, I.S. Reducing Racial and Gender Bias in Machine Learning and Natural Language Processing Tasks Using a GAN Approach. Int. J. High Sch. Res. 2021, 3, 17–24. [Google Scholar] [CrossRef]
  85. Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2223–2232. [Google Scholar]
  86. He, L.; Wang, Z.; Li, Y.; Wang, S. Softmax dissection: Towards understanding intra-and inter-class objective for embedding learning. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 10957–10964. [Google Scholar]
  87. Wang, H.; Wang, Y.; Zhou, Z.; Ji, X.; Gong, D.; Zhou, J.; Li, Z.; Liu, W. Cosface: Large margin cosine loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 5265–5274. [Google Scholar]
  88. Ge, J.; Deng, W.; Wang, M.; Hu, J. FGAN: Fan-Shaped GAN for Racial Transformation. In Proceedings of the 2020 IEEE International Joint Conference on Biometrics (IJCB), Houston, TX, USA, 28 September–1 October 2020; pp. 1–7. [Google Scholar] [CrossRef]
  89. Choi, Y.; Choi, M.; Kim, M.; Ha, J.W.; Kim, S.; Choo, J. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8789–8797. [Google Scholar]
  90. Kim, Y.H.; Nam, S.H.; Hong, S.B.; Park, K.R. GRA-GAN: Generative adversarial network for image style transfer of Gender, Race, and age. Expert Syst. Appl. 2022, 198, 116792. [Google Scholar] [CrossRef]
  91. Kim, Y.H.; Lee, M.B.; Nam, S.H.; Park, K.R. Enhancing the accuracies of age estimation with heterogeneous databases using modified CycleGAN. IEEE Access 2019, 7, 163461–163477. [Google Scholar] [CrossRef]
  92. Kim, Y.H.; Nam, S.H.; Park, K.R. Enhanced cycle generative adversarial network for generating face images of untrained races and ages for age estimation. IEEE Access 2020, 9, 6087–6112. [Google Scholar] [CrossRef]
  93. Jain, A.; Memon, N.; Togelius, J. Zero-shot racially balanced dataset generation using an existing biased stylegan2. In Proceedings of the 2023 IEEE International Joint Conference on Biometrics (IJCB), Ljubljana, Slovenia, 25–28 September 2023; pp. 1–18. [Google Scholar]
  94. Sumsion, A.; Torrie, S.; Sun, Z.; Lee, D.J. Overcoming deep learning subclass imbalances: Comparing the transfer of identity across a racial transformation. Electron. Imaging 2023, 35, 1–6. [Google Scholar] [CrossRef]
  95. Yang, C.; Lv, Z. Gender based face aging with cycle-consistent adversarial networks. Image Vis. Comput. 2020, 100, 103945. [Google Scholar] [CrossRef]
  96. Lui, N.; Chia, B.; Berrios, W.; Ross, C.; Kiela, D. Leveraging Diffusion Perturbations for Measuring Fairness in Computer Vision. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 14220–14228. [Google Scholar]
  97. Kang, D.; Dhar, D.; Chan, A. Incorporating side information by adaptive convolution. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  98. Shi, Y.; Jain, A.K. Probabilistic face embeddings. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6902–6911. [Google Scholar]
  99. Gong, S.; Liu, X.; Jain, A.K. Jointly de-biasing face recognition and demographic attribute estimation. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXIX 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 330–347. [Google Scholar]
  100. Gong, S.; Liu, X.; Jain, A.K. Mitigating face recognition bias via group adaptive classifier. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 3414–3424. [Google Scholar]
  101. Wang, M.; Zhang, Y.; Deng, W. Meta balanced network for fair face recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 8433–8448. [Google Scholar] [CrossRef] [PubMed]
  102. Gwilliam, M.; Hegde, S.; Tinubu, L.; Hanson, A. Rethinking common assumptions to mitigate racial bias in face recognition datasets. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 4123–4132. [Google Scholar]
  103. Li, Y.; Sun, Y.; Cui, Z.; Shan, S.; Yang, J. Learning fair face representation with progressive cross transformer. arXiv 2021, arXiv:2108.04983. [Google Scholar]
  104. Huang, L.; Wang, M.; Liang, J.; Deng, W.; Shi, H.; Wen, D.; Zhang, Y.; Zhao, J. Gradient attention balance network: Mitigating face recognition racial bias via gradient attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 38–47. [Google Scholar]
  105. Li, Z.; Liu, F.; Yang, W.; Peng, S.; Zhou, J. A survey of convolutional neural networks: Analysis, applications, and prospects. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 6999–7019. [Google Scholar] [CrossRef] [PubMed]
  106. Chan, E.R.; Lin, C.Z.; Chan, M.A.; Nagano, K.; Pan, B.; De Mello, S.; Gallo, O.; Guibas, L.J.; Tremblay, J.; Khamis, S.; et al. Efficient geometry-aware 3d generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 16123–16133. [Google Scholar]
  107. Chen, S.; Wang, C.; Chen, Z.; Wu, Y.; Liu, S.; Chen, Z.; Li, J.; Kanda, N.; Yoshioka, T.; Xiao, X.; et al. Wavlm: Large-scale self-supervised pre-training for full stack speech processing. IEEE J. Sel. Top. Signal Process. 2022, 16, 1505–1518. [Google Scholar] [CrossRef]
  108. Wang, M.; Deng, W. Deep face recognition: A survey. Neurocomputing 2021, 429, 215–244. [Google Scholar] [CrossRef]
  109. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef]
  110. Watkins, C.J.; Dayan, P. Q-learning. Mach. Learn. 1992, 8, 279–292. [Google Scholar] [CrossRef]
  111. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems 25 (NIPS 2012), Lake Tahoe, NV, USA, 3–6 December 2012. [Google Scholar]
  112. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  113. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar]
  114. Hoffer, E.; Ailon, N. Deep metric learning using triplet network. In Proceedings of the Similarity-Based Pattern Recognition: Third International Workshop, SIMBAD 2015, Copenhagen, Denmark, 12–14 October 2015; Proceedings 3. Springer: Berlin/Heidelberg, Germany, 2015; pp. 84–92. [Google Scholar]
  115. Wang, M.; Deng, W. Adaptive Face Recognition Using Adversarial Information Network. IEEE Trans. Image Process. 2022, 31, 4909–4921. [Google Scholar] [CrossRef]
  116. Manesh, F.S.; Ghahramani, M.; Tan, Y.P. Facial part displacement effect on template-based gender and ethnicity classification. In Proceedings of the 2010 11th International Conference on Control Automation Robotics & Vision, Singapore, 7–10 December 2010; pp. 1644–1649. [Google Scholar]
  117. Toderici, G.; O’malley, S.M.; Passalis, G.; Theoharis, T.; Kakadiaris, I.A. Ethnicity-and gender-based subject retrieval using 3-D face-recognition techniques. Int. J. Comput. Vis. 2010, 89, 382–391. [Google Scholar] [CrossRef]
  118. Wang, W.; He, F.; Zhao, Q. Facial ethnicity classification with deep convolutional neural networks. In Proceedings of the Chinese Conference on Biometric Recognition, Chengdu, China, 14–16 October 2016; pp. 176–185. [Google Scholar]
  119. Krizhevsky, A.; Hinton, G. Convolutional deep belief networks on cifar-10. Unpubl. Manuscr. 2010, 40, 1–9. [Google Scholar]
  120. Ricanek, K.; Tesafaye, T. Morph: A longitudinal image database of normal adult age-progression. In Proceedings of the 7th International Conference on Automatic Face and Gesture Recognition (FGR06), Southampton, UK, 10–12 April 2006; pp. 341–345. [Google Scholar]
  121. Gao, W.; Cao, B.; Shan, S.; Chen, X.; Zhou, D.; Zhang, X.; Zhao, D. The CAS-PEAL large-scale Chinese face database and baseline evaluations. IEEE Trans. Syst. Man Cybern.-Part A Syst. Hum. 2007, 38, 149–161. [Google Scholar]
  122. Phillips, P.J.; Moon, H.; Rizvi, S.A.; Rauss, P.J. The FERET evaluation methodology for face-recognition algorithms. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 1090–1104. [Google Scholar] [CrossRef]
  123. Katti, H.; Arun, S. Can you tell where in India I am from? Comparing humans and computers on fine-grained race face classification. arXiv 2017, arXiv:1703.07595. [Google Scholar]
  124. Vo, T.; Nguyen, T.; Le, C. Race recognition using deep convolutional neural networks. Symmetry 2018, 10, 564. [Google Scholar] [CrossRef]
  125. Lyons, M.; Akamatsu, S.; Kamachi, M.; Gyoba, J. Coding facial expressions with Gabor wavelets. In Proceedings of the Third IEEE International Conference on Automatic Face and Gesture Recognition, Nara, Japan, 14–16 April 1998; pp. 200–205. [Google Scholar]
  126. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A.A. Inception-v4, inception-resnet and the impact of residual connections on learning. In Proceedings of the Thirty-first AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017. [Google Scholar]
  127. Das, A.; Dantcheva, A.; Bremond, F. Mitigating bias in gender, age and ethnicity classification: A multi-task convolution neural network approach. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, Munich, Germany, 8–14 September 2018. [Google Scholar]
  128. Wang, Y.; Feng, Y.; Liao, H.; Luo, J.; Xu, X. Do they all look the same? Deciphering chinese, japanese and koreans by fine-grained deep learning. In Proceedings of the 2018 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), Miami, FL, USA, 10–12 April 2018; pp. 39–44. [Google Scholar]
  129. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  130. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
  131. King, D.E. Max-Margin Object Detection. arXiv 2015, arXiv:1502.00046. [Google Scholar]
  132. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  133. Liu, Z.; Luo, P.; Wang, X.; Tang, X. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 13–16 December 2015; pp. 3730–3738. [Google Scholar]
  134. Vrandečić, D.; Krötzsch, M. Wikidata: A free collaborative knowledgebase. Commun. ACM 2014, 57, 78–85. [Google Scholar] [CrossRef]
  135. Rothe, R.; Timofte, R.; Van Gool, L. Deep expectation of real and apparent age from a single image without facial landmarks. Int. J. Comput. Vis. 2018, 126, 144–157. [Google Scholar] [CrossRef]
  136. Somandepalli, K. Prediction Race from Face for Movie Data; University of Southern California: Los Angeles, CA, USA, 2017. [Google Scholar]
  137. Ryu, H.J.; Adam, H.; Mitchell, M. Inclusivefacenet: Improving face attribute detection with race and gender diversity. arXiv 2017, arXiv:1712.00193. [Google Scholar]
  138. Pastaltzidis, I.; Dimitriou, N.; Quezada-Tavarez, K.; Aidinlis, S.; Marquenie, T.; Gurzawska, A.; Tzovaras, D. Data augmentation for fairness-aware machine learning: Preventing algorithmic bias in law enforcement systems. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, Seoul, Republic of Korea, 21–24 June 2022; pp. 2302–2314. [Google Scholar]
  139. Cheng, M.; Cai, K.; Li, M. RWF-2000: An open large scale video database for violence detection. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 4183–4190. [Google Scholar]
  140. Conti, J.R.; Noiry, N.; Clemencon, S.; Despiegel, V.; Gentric, S. Mitigating gender bias in face recognition using the von mises-fisher mixture model. In Proceedings of the International Conference on Machine Learning, PMLR, Baltimore, MD, USA, 17–23 July 2022; pp. 4344–4369. [Google Scholar]
  141. Gwyn, T.; Roy, K. Examining Gender Bias of Convolutional Neural Networks via Facial Recognition. Future Internet 2022, 14, 375. [Google Scholar] [CrossRef]
  142. Tian, F.; Liu, W.; Zhao, S.; Liu, J. Face Recognition Fairness Assessment based on Data Augmentation: An Empirical Study. In Proceedings of the 2022 IEEE 22nd International Conference on Software Quality, Reliability, and Security Companion (QRS-C), Guangzhou, China, 5–9 December 2022; pp. 315–318. [Google Scholar]
  143. Avrahami, O.; Fried, O.; Lischinski, D. Blended latent diffusion. ACM Trans. Graph. (TOG) 2023, 42, 1–11. [Google Scholar] [CrossRef]
  144. Palmer, M.A.; Brewer, N.; Horry, R. Understanding gender bias in face recognition: Effects of divided attention at encoding. Acta Psychol. 2013, 142, 362–369. [Google Scholar] [CrossRef] [PubMed]
  145. Ng, C.B.; Tay, Y.H.; Goi, B.M. A review of facial gender recognition. Pattern Anal. Appl. 2015, 18, 739–755. [Google Scholar] [CrossRef]
  146. Terhörst, P.; Kolf, J.N.; Huber, M.; Kirchbuchner, F.; Damer, N.; Moreno, A.M.; Fierrez, J.; Kuijper, A. A comprehensive study on face recognition biases beyond demographics. IEEE Trans. Technol. Soc. 2021, 3, 16–30. [Google Scholar] [CrossRef]
Figure 1. Dataset Racial Distributions: Visualization of racial distributions across prevalent facial recognition datasets. Consistent with the literature, we report only the races included in the RFW dataset [15].
Figure 2. Example Images of MasterFace: MasterFace is an adversarial approach to the facial recognition task where a high proportion of generated images pass for the majority of faces on many facial recognition models [81]. Rows (a–i) show 9 different sets of MasterFaces. We include this figure to demonstrate that the highest proportion of any race generated is Caucasian faces, demonstrating the skew across race in many facial recognition models. Reprinted/adapted with permission from Ref. [81]. 2021, IEEE.
Figure 3. Example Images of Transferring Race: The top images demonstrate successful race transfers, while the bottom images demonstrate failures. The green and red bounding boxes indicate the original image. As observed in the bottom images, the GAN struggles with extreme head poses and varying lighting [22].
Figure 4. DigiFace 1M Example Image: DigiFace 1M [29], one of the most recent image generation datasets, generates the face and then changes lighting, background, and pose.
Table 1. Dataset Distribution Information: Descriptions of various racially balanced and unbalanced datasets. The datasets are sorted chronologically by release year, then alphabetically by dataset name. For those papers that did not originally report the race distribution, the distribution was taken from the work of Serna et al. [20].
Dataset | Year | Number of Images (Thousands) | Number of Identities (Thousands) | Average Images per Identity | Caucasian | African | Indian | Asian | Racially Balanced/Unbalanced
LFW [28] | 2008 | 13 | 5.7 | 2 | 77.6% | 12.9% | ** | 9.4% | Unbalanced
BioSecure [34] | 2009 | 2.7 | 0.667 | 4 | 86.1% | 5.2% | ** | 8.8% | Unbalanced
PubFig [35] | 2011 | 58 | 0.2 | 294 | 85.0% | 12.0% | ** | 3.0% | Unbalanced
YouTube Faces [36] | 2011 | 621 | 1.6 | 390 | 77.2% | 11.7% | ** | 10.9% | Unbalanced
CASIA [37] | 2014 | 500 | 10.5 | 48 | 82.0% | 12.9% | ** | 5.2% | Unbalanced
CelebA [38] | 2015 | 203 | 10.2 | 20 | 75.4% | 14.6% | ** | 9.9% | Unbalanced
VGGFace [39] | 2015 | 2600 | 2.6 | 1000 | 82.3% | 12.7% | ** | 5.0% | Unbalanced
MegaFace [40] | 2016 | 4700 | 660 | 7 | 70.3% | 10.9% | ** | 18.7% | Unbalanced
MSCeleb1M [41] | 2016 | 10,000 | 100 | 100 | 71.6% | 16% | ** | 12.2% | Unbalanced
UTKFace [42] | 2017 | 24 | – | – | 46.2% | 37.8% | ** | 16.0% | Unbalanced
IJB-C [44] | 2018 | 21 | 3.5 | 6 | 70.5% | 17.8% | ** | 11.6% | Unbalanced
VGGFace2 [43] | 2018 | 3300 | 9 | 370 | 76.1% | 16.8% | ** | 7% | Unbalanced
DemogPairs [14] | 2019 | 10.8 | 0.6 | 18 | 33.3% | 33.3% | ** | 33.3% | Balanced
FRVT2018 [45] | 2019 | 2700 | 1200 | 2 | 64.9% | 27.3% | ** | 1.6% | Unbalanced
RFW [15] | 2019 | 40 | 12 | 3.3 | 25% | 25% | 25% | 25% | Balanced
BUPT-Balancedface [10] | 2020 | 1300 | 28 | 46.4 | 25% | 25% | 25% | 25% | Balanced
BUPT-Globalface [10] | 2020 | 2000 | 38 | 52.6 | 38% | 13% | 18% | 31% | Unbalanced
CeFA [16] | 2020 | 23.5 | 1.6 | 14.7 | N/A | 33.3% | 33.3% | 33.3% | Balanced
DiveFace [11] | 2020 | 125 | 24 | 5.2 | 33.3% | 33.3% | ** | 33.3% | Balanced
FaceARG [12] | 2021 | 175 | – | – | 24.42% | 24.02% | 25.94% | 25.60% | Balanced
FairFace [13] | 2021 | 108 | – | – | * 29.7% | * 29.5% | * 26.6% | * 14.2% | * Unbalanced
BFW [17] | 2023 | 20 | 0.8 | 25 | 25% | 25% | 25% | 25% | Balanced
* The FairFace dataset [13] distributions are 14.1% Black, 19.0% Caucasian, 14.1% East Asian, 14.2% Indian, 15.3% Latino, 10.7% Middle Eastern, and 12.5% Southeast Asian. Despite its emphasis on capturing diverse racial distributions, we classify this dataset as unbalanced due to the maximum difference of 8.3 percentage points. ** The Indian column is classified along with the African column.
Table 2. Dataset Overlap: To properly use the racially conscious datasets as test datasets, there must be no overlap between the training dataset and the test dataset. This table lists which racially balanced datasets overlap with popular racially unbalanced facial recognition datasets. We note that BUPT-Balancedface and BUPT-Globalface do not report what percentage of their data came from MS-Celeb-1M.
Dataset | Overlaps with Other Datasets | Percentage Overlap
DemogPairs [14] | VGGFace2, VGGFace, and CWF | 0.13%, 0.02%, and 1.27%
RFW [15] | MS-Celeb-1M | 0.4%
BUPT-Balancedface [10] | MS-Celeb-1M | Not Available
BUPT-Globalface [10] | MS-Celeb-1M | Not Available
DiveFace [11] | MegaFace (MF2) | 2.7%
Table 3. Commercial API comparison: This table compares commercial APIs and popular algorithms trained on racially unbalanced datasets and evaluated on racially unbalanced (Labeled Faces in the Wild (LFW)) and racially balanced (Racial Faces in the Wild (RFW)) datasets as reported by Wang et al. [15]. The algorithms that are trained on racially unbalanced datasets have high accuracies on unbalanced datasets. Their generalization to balanced datasets such as RFW is limited. These accuracies and skewed error ratio (SER) values can be compared to the results of networks trained on racially balanced approaches discussed in Section 5. Reprinted/adapted with permission from Ref. Wang et al. [15]. 2019, IEEE.
Model | LFW Verification Accuracy | RFW Caucasian | RFW Indian | RFW Asian | RFW African | SER (↓)
Commercial API:
Microsoft [55] | – | 87.60% | 82.83% | 79.67% | 75.83% | 1.95
Face++ [52,56] | 99.5% | 93.90% | 88.55% | 92.47% | 87.50% | 2.05
Baidu [57] | – | 89.13% | 86.53% | 90.27% | 77.97% | 2.26
Amazon [58] | – | 90.45% | 87.20% | 84.87% | 86.27% | 1.58
Mean | – | 90.27% | 86.28% | 86.82% | 81.89% | 1.96
Algorithms trained on racially unbalanced datasets:
Center-loss [59] | 99.0% | 87.18% | 81.92% | 79.32% | 78.00% | 1.72
SphereFace [60] | 99.42% | 90.80% | 87.02% | 82.95% | 82.28% | 1.93
ArcFace [61] | 99.40% | 92.15% | 88.00% | 83.98% | 84.93% | 2.04
VGGFace2 [43] | 99.0% | 89.90% | 86.13% | 84.93% | 83.38% | 1.64
Mean | 99.21% | 90.01% | 85.77% | 82.80% | 82.15% | 1.83
Table 4. Dataset Gender Distributions: In the discussion of racial bias, gender bias is closely related. The gender distribution across multiple datasets is presented in this table. The table is sorted initially by the difference in percentage points between the overall Male and Female distributions (labeled in the table as Δ) and then by publication year. Although certain datasets are racially balanced, many remain unbalanced across gender.
Dataset | Racially Balanced | Caucasian (F/M) | African (F/M) | Indian (F/M) | Asian (F/M) | Overall (F/M) | Δ |F − M|
LFW [28] | – | 18.7%/58.9% | 3.3%/9.6% | – | 2.2%/7.2% | 24.2%/75.7% | 51.5%
RFW [15] | ✓ | 7.9%/17.2% | 1.3%/24.4% | 9.1%/14.7% | 7.7%/17.6% | 26.1%/73.9% | 47.8%
YouTube Faces [36] | – | 20.3%/56.9% | 4.0%/7.7% | – | 3.0%/7.9% | 27.3%/72.5% | 45.2%
FRVT2018 [45] | – | 16.5%/48.4% | 7.4%/19.9% | – | 0.4%/1.2% | 24.3%/69.5% | 45.2%
MSCeleb1M [41] | – | 19.2%/52.4% | 3.9%/12.1% | – | 4.5%/7.7% | 27.6%/72.2% | 44.6%
VGGFace2 [43] | – | 30.2%/45.9% | 6.3%/10.5% | – | 3.6%/3.4% | 40.1%/59.8% | 19.7%
CASIA [37] | – | 33.2%/48.8% | 5.7%/7.2% | – | 2.6%/2.6% | 41.5%/58.6% | 17.1%
PubFig [35] | – | 35.5%/49.5% | 5.5%/6.5% | – | 1.0%/2.0% | 42.0%/58.0% | 16.0%
IJB-C [44] | – | 30.2%/40.3% | 6.0%/11.8% | – | 6.2%/5.4% | 42.4%/57.5% | 15.1%
BioSecure [34] | – | 36.0%/50.1% | 2.1%/3.1% | – | 4.5%/4.3% | 42.6%/57.5% | 14.9%
MegaFace [40] | – | 30.3%/40.0% | 4.7%/6.2% | – | 8.1%/10.6% | 43.1%/56.8% | 13.7%
FaceARG [12] | ✓ | 14.5%/9.8% | 10.3%/13.8% | 15.7%/10.3% | 16.2%/9.5% | 56.7%/43.3% | 13.4%
CeFA [16] | ✓ | – | – | – | – | 43.9%/56.1% | 12.2%
CelebA [38] | – | 41.5%/33.9% | 8.2%/6.4% | – | 5.5%/4.4% | 55.2%/44.7% | 10.5%
UTKFace [42] | – | 20.0%/26.2% | 16.3%/21.5% | – | 8.9%/7.1% | 45.2%/54.8% | 9.6%
FairFace [13] | – | 12.3%/17.4% | 14.8%/14.7% | 6.8%/7.3% | 13.1%/13.6% | 47.0%/53.0% | 6%
VGGFace [39] | – | 38.6%/43.7% | 6.9%/5.8% | – | 2.9%/2.1% | 48.4%/51.6% | 3.2%
DiveFace [11] | ✓ | 19.7%/20.2% | 14.0%/15.0% | – | 16.3%/14.8% | 50.5%/49.5% | 1.0%
DemogPairs [14] | ✓ | 16.7%/16.7% | 16.7%/16.7% | – | 16.7%/16.7% | 50%/50% | 0%
BFW [17] | ✓ | 12.5%/12.5% | 12.5%/12.5% | 12.5%/12.5% | 12.5%/12.5% | 50.0%/50.0% | 0%
Table 5. RFW Accuracies Across Network Improvements: This table presents the accuracies, in percentages, achieved by state-of-the-art (SOTA) methods on the RFW dataset. Each method is trained on the BUPT-Balancedface dataset and uses the ArcFace loss, except for the IMAN and CosFace approaches. The highest accuracy in each column is bolded and enclosed in brackets. “STD” denotes the standard deviation and “SER” the skewed error ratio. “↓” indicates that smaller values are better, while “↑” indicates that larger values are better.
Method | Loss–ArcFace | White | Black | East Asian | South Asian | Average (↑) | STD (↓) | SER (↓)
IMAN [15] | – | 93.92 | 92.98 | 90.60 | 90.98 | 92.12 | 1.38 | 1.55
CosFace [87] | X | 95.12 | 93.93 | 92.98 | 92.93 | 93.74 | 0.89 | 1.45
ArcFace [61] | – | 96.18 | 94.67 | 93.72 | 93.98 | 94.67 | 0.96 | 1.64
ACNN [97] | – | 96.12 | 94.00 | 93.67 | 94.55 | 94.59 | 0.94 | 1.63
PFE [98] | – | 96.38 | 95.17 | 94.27 | 94.60 | 95.10 | 0.80 | 1.58
DebFace [99] | – | 95.95 | 93.67 | 94.33 | 94.78 | 94.68 | 0.83 | 1.56
GAC [100] | – | 96.20 | 94.77 | 94.87 | 94.98 | 95.21 | 0.58 | 1.37
RL-RBN [10] | – | 96.27 | 94.68 | 94.82 | 95.00 | 95.19 | 0.63 | 1.43
MBN [101] | – | 96.25 | 95.38 | 95.32 | 94.85 | 95.45 | 0.51 | 1.37
Rethinking [102] | – | 89.1 | 85.5 | 71.8 | 75.8 | 80.55 | 7.01 | 2.59
PCT [103] | – | 97.00 | [96.22] | 95.73 | 96.38 | 96.33 | 0.52 | 1.42
Sensitive Loss [20] | X | [97.23] | 95.82 | [96.50] | [96.95] | [96.63] | 0.53 | 1.51
GABN [104] | – | 95.78 | 94.71 | 94.51 | 95.21 | 95.05 | [0.49] | [1.30]
Table 6. Race Classification Results: This table presents the results of human race classification obtained by four networks on the FaceARG dataset [12]. The maximum value of each metric within its respective row is highlighted in bold and enclosed in brackets.
| Race | Metric | VGG-19 | Inception ResNet-v2 | SeNet | MobileNetV3 |
|---|---|---|---|---|---|
| Afro-American | Pr | 96.68 | 96.88 | 95.27 | **[96.97]** |
| Afro-American | Re | 97.96 | 98.20 | 98.32 | **[98.44]** |
| Afro-American | F1 | 97.32 | 97.54 | 96.77 | **[97.70]** |
| Asian | Pr | 98.27 | 98.43 | 97.94 | **[98.52]** |
| Asian | Re | 97.76 | 97.92 | 96.96 | **[98.20]** |
| Asian | F1 | 97.32 | 98.18 | 97.45 | **[98.36]** |
| Caucasian | Pr | 94.91 | 95.61 | 95.43 | **[96.20]** |
| Caucasian | Re | 94.00 | **[95.00]** | 94.28 | 94.12 |
| Caucasian | F1 | 94.45 | **[95.30]** | 94.85 | 95.15 |
| Indian | Pr | 94.37 | 94.51 | **[94.95]** | 94.89 |
| Indian | Re | 94.52 | 94.32 | 94.00 | **[95.80]** |
| Indian | F1 | 94.44 | 94.41 | 94.47 | **[95.34]** |
| Overall | Pr | 94.06 | 96.36 | 95.90 | **[96.64]** |
| Overall | Re | 96.06 | 96.36 | 95.89 | **[96.64]** |
| Overall | F1 | 96.06 | 96.36 | 95.89 | **[96.64]** |
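The per-race precision (Pr), recall (Re), and F1 values in Table 6 follow the standard definitions over a race-classification confusion matrix. A minimal sketch with an invented confusion matrix (these placeholder numbers are not the FaceARG results above):

```python
import numpy as np

# Minimal sketch: per-class precision, recall, and F1 from a 4-class
# confusion matrix. Rows = true race label, columns = predicted label.
# The matrix entries are hypothetical placeholders.
races = ["Afro-American", "Asian", "Caucasian", "Indian"]
cm = np.array([
    [984,   2,   8,   6],
    [  3, 979,  10,   8],
    [ 10,   6, 943,  41],
    [  5,   9,  28, 958],
])

for i, race in enumerate(races):
    tp = cm[i, i]
    precision = tp / cm[:, i].sum()  # of all predicted as this race, fraction correct
    recall = tp / cm[i, :].sum()     # of all true members of this race, fraction found
    f1 = 2 * precision * recall / (precision + recall)
    print(f"{race}: Pr={precision:.2%} Re={recall:.2%} F1={f1:.2%}")
```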
Table 7. (a) Race Classification: Race classification accuracies are presented as percentages for White and Not White categories, emphasizing the differences across racial groups [13]. (b) Gender Classification Across Race: Gender classification accuracies are presented as percentages for White and Not White categories [13]. (c) Age Classification Across Race: Age classification accuracies are presented as percentages for White and Not White categories [13]. Across tasks, the Not White category generally receives lower accuracies, and we note that the balanced training dataset achieves the highest accuracies on most test datasets. The highest accuracy in each column is highlighted in bold and enclosed in brackets. Reprinted/adapted with permission from Ref. [13]. 2021, IEEE.
(a)

| Trained on \ Tested on (White Accuracy–Not White Accuracy) | FairFace (Balanced) | UTKFace (Unbalanced) | LFWA+ (Unbalanced) |
|---|---|---|---|
| FairFace (Balanced) | **[93.7%]**–**[75.4%]** | 93.6%–80.1% | **[97.0%]**–**[96.0%]** |
| UTKFace (Unbalanced) | 80.0%–69.3% | 91.8%–**[83.9%]** | 92.5%–88.7% |
| LFWA+ (Unbalanced) | 87.9%–54.1% | **[94.7%]**–38.0% | 96.1%–86.6% |
(b)

| Trained on \ Tested on (White Accuracy–Not White Accuracy) | FairFace (Balanced) | UTKFace (Unbalanced) | LFWA+ (Unbalanced) | CelebA (Unbalanced) |
|---|---|---|---|---|
| FairFace (Balanced) | **[94.2%]**–**[94.4%]** | **[94.0%]**–**[93.9%]** | 92.0%–**[93.0%]** | **[98.1%]**–**[98.1%]** |
| UTKFace (Unbalanced) | 86.0%–82.3% | 93.5%–92.5% | 91.6%–90.8% | 96.2%–96.2% |
| LFWA+ (Unbalanced) | 76.1%–73.8% | 84.2%–83.3% | **[93.0%]**–89.4% | 94.0%–94.0% |
| CelebA (Unbalanced) | 81.2%–78.1% | 88.0%–88.6% | 90.5%–90.1% | 97.1%–92.1% |
(c)

| Trained on \ Tested on (White Accuracy–Not White Accuracy) | FairFace (Balanced) | UTKFace (Unbalanced) |
|---|---|---|
| FairFace (Balanced) | **[59.7%]**–**[60.7%]** | 56.5%–61.6% |
| UTKFace (Unbalanced) | 41.3%–41.8% | **[57.6%]**–**[61.7%]** |
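Each cell in Table 7 pairs the accuracy a trained model achieves on the White portion of a test set with its accuracy on the Not White portion. A minimal sketch of that subgroup evaluation, using randomly generated placeholder labels rather than any of the datasets above:

```python
import numpy as np

def subgroup_accuracy(y_true, y_pred, mask):
    # Accuracy restricted to the samples selected by the boolean mask.
    return (y_true[mask] == y_pred[mask]).mean()

# Hypothetical data: binary labels, a model that is right ~90% of the
# time, and a flag marking each test image's racial group.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
y_pred = np.where(rng.random(1000) < 0.9, y_true, 1 - y_true)
is_white = rng.random(1000) < 0.5

white_acc = subgroup_accuracy(y_true, y_pred, is_white)
not_white_acc = subgroup_accuracy(y_true, y_pred, ~is_white)
print(f"White: {white_acc:.1%}  Not White: {not_white_acc:.1%}")
```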
Table 8. Network Improvements Discussion: While various improvements have been made to overcome racial bias in facial recognition, we classify them into four categories: loss/training, architecture, dataset modification, and additional data. We also indicate which methods would benefit from further race classification research. The table is sorted by publication year and then alphabetically by method. A checkmark (✓) indicates that a method falls into the corresponding category.
| Method | Year | Loss/Training | Architecture | Dataset Modification | Additional Data | Benefits from Race Classification |
|---|---|---|---|---|---|---|
| ACNN [97] | 2017 | | | | | |
| InclusiveFaceNet [137] | 2017 | | | | | |
| CosFace [87] | 2018 | | | | | |
| ArcFace [61] | 2019 | | | | | |
| IMAN [15] | 2019 | | | | | |
| PFE [98] | 2019 | | | | | |
| DebFace [99] | 2020 | | | | | |
| RL-RBN [10] | 2020 | | | | | |
| GAC [100] | 2021 | | | | | |
| MBN [101] | 2021 | | | | | |
| PCT [103] | 2021 | | | | | |
| Rethinking [102] | 2021 | | | | | |
| Using VGG16 [1] | 2021 | | | | | |
| AIN [115] | 2022 | | | | | |
| Fairness aware augmentation [138] | 2022 | | | | | |
| Sensitive Loss [20] | 2022 | | | | | |
| GABN [104] | 2023 | | | | | |