**1. Introduction**

In light of increased reliance on ML in highly consequential applications such as pretrial risk assessment [1,2], occupation classification [3,4], and money lending [5], there is growing concern about the fairness of ML-powered systems [4,6–10]. Unequal performance across individuals and groups subject to a system may have unintended negative consequences for those who experience underperformance [6], potentially depriving them of opportunities, resources, or even freedoms.

Face verification (FV) and face recognition (FR) technologies are widely deployed in systems such as biometric authentication [11], face identification [12], and surveillance [13]. In FV, the input is a pair of face images, and the pair is classified as genuine (positive class) or imposter (negative class) [8]. FV/FR systems typically apply a similarity measure (often cosine similarity) to the pair of face embeddings produced by the model [12]. There has been recent interest in assessing bias via these face embeddings [10].
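To make this pipeline concrete, the sketch below implements the thresholded-similarity verification rule described above; the embedding dimensionality, threshold value, and function names are illustrative placeholders, not parameters of any particular FV system.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two face embeddings."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def verify(emb_a: np.ndarray, emb_b: np.ndarray, threshold: float) -> bool:
    """Classify a pair as genuine (True) or imposter (False) by
    thresholding the similarity score of its two embeddings."""
    return cosine_similarity(emb_a, emb_b) >= threshold

# Illustrative placeholders; in practice the embeddings come from the
# FV model (e.g., 128-dimensional vectors in the original FaceNet).
rng = np.random.default_rng(0)
emb_a, emb_b = rng.normal(size=128), rng.normal(size=128)
print(verify(emb_a, emb_b, threshold=0.5))  # two random vectors: imposter
```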

Figure 1 presents a low-dimensional depiction of face embeddings generated by FaceNet [12], which clearly groups same-race and same-gender faces closely together, indicating that the model learned to identify the similarities between same-race, same-gender faces. Exploring the connection between embedded clusters of protected groups and biased performance [10] is an open area of research.

In this paper, we (1) identify and quantify sources of bias in a pretrained FaceNet model using statistical and cluster-based measures, and (2) analyze the connection between cluster quality and biased performance.


**Figure 1.** A two-dimensional t-SNE [14] visualization of Balanced Faces in the Wild (BFW) [8] embeddings, colored by race and gender. Clusters roughly correspond to race and gender, with varied densities (e.g., Asian clusters are tighter than White clusters). Note that t-SNE embeddings are not completely representative of actual relationships due to information loss during dimensionality reduction.
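A minimal sketch of how such a visualization can be produced with scikit-learn is shown below; the embeddings and group labels are synthetic stand-ins for the BFW data, and the group names are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Stand-ins for real data: `embeddings` would be the (n_faces, d) model
# outputs and `labels` the race-gender group of each face.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(300, 128))
labels = np.array(["asian_female", "white_male", "black_female"] * 100)

# Project to 2-D with t-SNE, using cosine distance to match the
# similarity measure used in verification.
points = TSNE(n_components=2, metric="cosine", random_state=0).fit_transform(embeddings)

for group in np.unique(labels):
    pts = points[labels == group]
    plt.scatter(pts[:, 0], pts[:, 1], s=6, label=group)
plt.legend()
plt.title("t-SNE of face embeddings by group")
plt.show()
```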

### **2. Related Work**

### *2.1. Sources of Bias*

We characterize the sources of bias in an ML system as follows. For a more complete discussion of sources of bias, see the work by Suresh and Guttag [15].

**Historical Bias** arises when injustice in the world conflicts with values we want encoded in a model. Since systemic injustice creates patterns reflected in data, historical bias can exist despite perfect sampling and representation.

**Representation Bias** arises when training data under-represent a subset of the target population and the model fails to optimize for the under-represented group(s).

**Measurement Bias** arises when data are a noisy proxy for the information we desire, e.g., in FV, camera quality and discretized race categories contribute to measurement bias.

**Aggregation Bias** arises when inappropriately using a "one-size-fits-all" model on distinct populations, as a single model may not generalize well to all subgroups.

**Evaluation Bias** arises when the evaluation dataset is not representative of the target population. An evaluation may suggest good performance overall while missing a disparity for populations under-represented in the benchmark dataset.

**Deployment Bias** arises from inconsistency between the problem that a model is intended to solve and how it is used to make decisions in practice, as there is no guarantee that measured performance and fairness will persist.

### *2.2. Statistical Fairness Definitions*

We first identify attributes of the data for which the system must perform fairly. An attribute may be any qualitative or quantitative descriptor of the data, such as name, gender, or image quality for a face image. A "sensitive" attribute defines a mapping to advantaged and disadvantaged groups [6], breaking a dataset into "unprotected" and "protected" groups. For example, if race is the sensitive attribute, the dataset is broken into an unprotected group, White faces, and protected groups, other-race faces.

We define fairness according to the equal metrics criteria [6,15–18]: a fair model yields similar performance metric results for protected and unprotected subgroups. Other fairness definitions include group-independent predictions [6,15,19,20] (a fair model's decision is not influenced by group membership with respect to a sensitive attribute), individual fairness [6,15,21–23] (individuals who are similar with respect to their attributes have similar outcomes), and causal fairness [6,15,24–26] (developing requirements on a causal graph that links data/attributes to outcomes).

We quantify fairness according to the equal metrics definition using statistical fairness metrics (see Table 1). The metrics use the definitions represented by the confusion matrix in Table 3 of Verma and Rubin [7].

**Table 1.** Selected statistical fairness metrics. Notation [7,16]: *A*—sensitive attribute, *Y*—actual classification, *d*—predicted classification, and *S*—similarity score. \* PPV/NPV: Positive (Negative) Predictive Value.

| Metric | Description | Definition | References |
| --- | --- | --- | --- |
| Overall Accuracy Equality | Equal prediction accuracy across protected and unprotected groups | $P(d = Y \mid A_1) = P(d = Y \mid A_2) = \cdots = P(d = Y \mid A_N)$ | Berk et al. [27]; Mitchell et al. [6]; Verma and Rubin [7] |
| Predictive Equality | Equal FPR across protected and unprotected groups | $P(d = 1 \mid Y = 0, A_1) = P(d = 1 \mid Y = 0, A_2) = \cdots = P(d = 1 \mid Y = 0, A_N)$ | Chouldechova [17]; Corbett-Davies et al. [18]; Mitchell et al. [6]; Verma and Rubin [7] |
| Equal Opportunity | Equal FNR across protected and unprotected groups | $P(d = 0 \mid Y = 1, A_1) = P(d = 0 \mid Y = 1, A_2) = \cdots = P(d = 0 \mid Y = 1, A_N)$ | Chouldechova [17]; Hardt et al. [16]; Kusner et al. [24]; Mitchell et al. [6]; Verma and Rubin [7] |
| Conditional Use Accuracy Equality | Equal PPV and NPV \* across protected and unprotected groups | $P(Y = 1 \mid d = 1, A_1) = \cdots = P(Y = 1 \mid d = 1, A_N)$ AND $P(Y = 0 \mid d = 0, A_1) = \cdots = P(Y = 0 \mid d = 0, A_N)$ | Berk et al. [27]; Mitchell et al. [6]; Verma and Rubin [7] |
| Balance for the Positive Class | Equal avg. score *S* for the positive class across protected and unprotected groups | $AVG(S \mid Y = 1, A_1) = AVG(S \mid Y = 1, A_2) = \cdots = AVG(S \mid Y = 1, A_N)$ | Kleinberg et al. [28]; Mitchell et al. [6]; Verma and Rubin [7] |
| Balance for the Negative Class | Equal avg. score *S* for the negative class across protected and unprotected groups | $AVG(S \mid Y = 0, A_1) = AVG(S \mid Y = 0, A_2) = \cdots = AVG(S \mid Y = 0, A_N)$ | Kleinberg et al. [28]; Mitchell et al. [6]; Verma and Rubin [7] |
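As an illustration of auditing against the equal-metrics criteria, the sketch below computes per-group FPR (predictive equality) and FNR (equal opportunity) from verification decisions; the data and group names are synthetic, and a real audit would use outcomes from the model under test.

```python
import numpy as np

def group_rates(y_true, y_pred, groups):
    """Per-group false positive rate (predictive equality) and
    false negative rate (equal opportunity)."""
    rates = {}
    for g in np.unique(groups):
        m = groups == g
        yt, yp = y_true[m], y_pred[m]
        fpr = np.mean(yp[yt == 0] == 1)  # P(d=1 | Y=0, A=g)
        fnr = np.mean(yp[yt == 1] == 0)  # P(d=0 | Y=1, A=g)
        rates[g] = {"FPR": fpr, "FNR": fnr}
    return rates

# Synthetic example with two groups
rng = np.random.default_rng(0)
groups = np.array(["protected", "unprotected"]).repeat(500)
y_true = rng.integers(0, 2, size=1000)
y_pred = rng.integers(0, 2, size=1000)
for g, r in group_rates(y_true, y_pred, groups).items():
    print(f"{g}: FPR={r['FPR']:.3f}, FNR={r['FNR']:.3f}")
```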

### *2.3. Bias in the Embedding Space*

Instead of solely considering model performance across protected and unprotected groups, Glüge et al. [10] assess bias in FV models by investigating the face embeddings produced by the model. The intuition behind this approach is that the "other-race effect" observed in human FV, where people distinguish between same-race faces more reliably than between other-race faces, may have an analog in machine FV that is observable in how a model clusters face embeddings according to sensitive attributes such as race, gender, or age.

Glüge et al. [10] attempt to measure bias with respect to a sensitive attribute by quantifying how well embeddings are clustered according to that attribute. They hypothesize that a "good" clustering of embeddings (i.e., well-separated clusters) into race, gender, or age groups may indicate that the model strongly encodes race, gender, or age differences, enabling discrimination based on the respective attribute. They investigate the connection between clustering quality and bias using cluster validation measures, as sketched below.
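One standard cluster validation measure is the silhouette score, which rewards well-separated, internally compact clusters; the sketch below applies it to sensitive-attribute labels. The data are synthetic stand-ins, and this is not necessarily the exact set of measures used in [10].

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Synthetic stand-ins for real embeddings and race labels; a higher
# silhouette indicates better-separated sensitive-attribute clusters.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 128))
labels = np.array(["asian", "black", "indian", "white"]).repeat(50)
print(silhouette_score(embeddings, labels, metric="cosine"))
```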

Their results do not support a connection between well-defined sensitive-attribute clusters and bias; rather, they suggest that a *worse* clustering of embeddings into sensitive-attribute groups yields biased performance (i.e., unequal recognition rates across groups). They conjecture that between-cluster separation (how well race, gender, or age groups are separated from each other) may matter less than the within-cluster distribution of embeddings (how tightly each individual race, gender, or age group is clustered): a cluster's density indicates how similar its embeddings are to one another, so a dense cluster may produce false matches more frequently than a less dense one. We extend [10] by investigating this conjecture.
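As a simple probe of this conjecture, one could compare each group's within-cluster spread, e.g., the mean pairwise cosine distance among that group's embeddings, where a smaller value marks a denser cluster. The sketch below uses synthetic placeholders for the embeddings and group labels.

```python
import numpy as np
from scipy.spatial.distance import pdist

def within_cluster_density(embeddings, labels):
    """Mean pairwise cosine distance within each sensitive-attribute
    cluster; smaller values indicate a denser cluster."""
    return {
        g: pdist(embeddings[labels == g], metric="cosine").mean()
        for g in np.unique(labels)
    }

# Synthetic placeholder data
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 128))
labels = np.array(["asian", "black", "indian", "white"]).repeat(50)
print(within_cluster_density(embeddings, labels))
```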
