**Abbreviations**

The following abbreviations are used in this manuscript:


### **Appendix A. Pair Generation**

**Racial Faces in the Wild** Table A1 displays the breakdown of positive and negative pairs for the Racial Faces in the Wild (RFW) testing split for each race subgroup. All pairs are same-race faces (RFW has no gender attribute).

**Table A1.** The test set percentages of positive and negative pairs generated per subgroup for RFW.


**Janus-C** Table A2 details the Janus-C test set's positive and negative pairs across skin tone and gender subgroups. All pairs are same-skin-tone and same-gender faces. Because Janus-C is not balanced over sensitive attributes, we had to vary positive and negative pair generation for each skin tone and gender subgroup. The drastically different numbers of faces across skin tones and genders make it difficult to achieve parity in the number of pairs for these subgroups while maintaining a large enough sample for testing. This should be considered when interpreting Janus-C results.


**Table A2.** The test set percentages of positive and negative pairs generated per subgroup for Janus-C.

**VGGFace2 Test Set** Table A3 shows the breakdown of positive and negative pairs across gender subgroups for the VGGFace2 testing split. All pairs are same-gender faces (VGGFace2 does not have a race attribute). The VGGFace2 test set is not balanced over its sensitive attribute, so we had to vary positive and negative pair generation by gender subgroup. Because VGGFace2 has less inequality than Janus-C in the number of faces per subgroup, we achieved positive-to-negative pair ratios much closer to 25:75.
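The pair generation described in this appendix can be illustrated with a minimal sketch. It is not our exact pipeline: the `(image_id, identity, subgroup)` record layout, the function name `generate_pairs`, and the `n_pairs` and `pos_fraction` parameters are hypothetical, and the 25:75 target is inferred from the ratio mentioned above. The sketch only forms pairs within a subgroup and caps each class at the number of candidate pairs available.

```python
import itertools
import random
from collections import defaultdict


def generate_pairs(faces, n_pairs, pos_fraction=0.25, seed=0):
    """Sample same-subgroup verification pairs.

    faces: iterable of (image_id, identity, subgroup) tuples (hypothetical schema).
    Returns {subgroup: [(id_a, id_b, label), ...]}, label 1 = same identity.
    """
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for image_id, identity, subgroup in faces:
        by_group[subgroup].append((image_id, identity))

    pairs = {}
    for subgroup, members in by_group.items():
        positives, negatives = [], []
        # Enumerate every within-subgroup pair and split by identity match.
        for (id_a, who_a), (id_b, who_b) in itertools.combinations(members, 2):
            (positives if who_a == who_b else negatives).append((id_a, id_b))
        rng.shuffle(positives)
        rng.shuffle(negatives)
        # Cap each class at what the subgroup can actually supply.
        n_pos = min(len(positives), round(n_pairs * pos_fraction))
        n_neg = min(len(negatives), n_pairs - n_pos)
        pairs[subgroup] = ([(a, b, 1) for a, b in positives[:n_pos]]
                           + [(a, b, 0) for a, b in negatives[:n_neg]])
    return pairs
```

When a subgroup has too few candidate positive pairs, `n_pos` is capped and the realized ratio drifts away from the target, which is the kind of imbalance reported for the less balanced datasets.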

**Figure A1.** Statistical fairness metric results for BFW subgroups. A = Asian; I = Indian; B = Black; W = White; F = Female; M = Male.

**Table A3.** The test set percentages of positive and negative pairs generated per subgroup for the VGGFace2 test set.


### **Appendix B. Statistical Fairness Metric Experiments**

Figure A1 documents statistical metric results for BFW data that are not included in the main paper, while Figures A2 and A3 document results for RFW and VGGFace2, respectively.

**Figure A2.** Statistical fairness metric results for RFW race subgroups. A = Asian; I = Indian; B = Black; W = White.

**Figure A3.** Statistical fairness metric results for VGGFace2 test set gender subgroups. F = Female; M = Male.

We attempt to take advantage of the skin tone attribute in Janus-C to assess performance deficits relating specifically to skin color. We hypothesize that an FV system may perform worse on darker faces than lighter faces due to factors such as lighting or image quality. We attempt to measure this by running two experiments: one with a Gaussian blur filter applied to the images and one without.

We compare results on blurred and non-blurred images, expecting blur to cause a greater drop in performance for darker skin tones, which would indicate that darker faces likely appear in lower-quality images to begin with (a form of measurement bias). Figure A4 documents the results of these Janus-C experiments. We do not include these results in the main paper because (1) the inconsistent ratios of positive and negative pairs make it difficult to compare results across skin tones, and (2) we do not see significant performance changes after adding blur (the changes fall within the margin of error).
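As a rough illustration of the blur comparison, the sketch below applies a separable Gaussian blur to a grayscale image stored as nested lists and then computes the per-subgroup change in a metric. The blur strength (`sigma`), the truncation radius, the edge clamping, and the helper names are all assumptions made for this example; the experiments above do not depend on these particular choices.

```python
import math


def gaussian_kernel(sigma, radius=None):
    """Normalized 1-D Gaussian weights, truncated at about 3 sigma by default."""
    if radius is None:
        radius = max(1, int(3 * sigma))
    weights = [math.exp(-(x * x) / (2 * sigma * sigma))
               for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_1d(values, kernel):
    """Convolve one row or column, clamping indices at the borders."""
    radius = len(kernel) // 2
    last = len(values) - 1
    out = []
    for i in range(len(values)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), last)
            acc += w * values[j]
        out.append(acc)
    return out


def gaussian_blur(image, sigma=2.0):
    """Blur a 2-D grayscale image (list of rows): rows first, then columns."""
    kernel = gaussian_kernel(sigma)
    rows = [blur_1d(row, kernel) for row in image]
    cols = [blur_1d(list(col), kernel) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def metric_drop(original, blurred):
    """Per-subgroup difference (original minus blurred) for any scalar metric."""
    return {group: original[group] - blurred[group] for group in original}
```

A larger value of `metric_drop` for the darker skin tone groups than for the lighter ones would be the pattern that the hypothesis predicts.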

**Figure A4.** Statistical fairness metric results for Janus-C skin tone subgroups. Dark blue bars represent original data; light blue bars represent blurred data. Skin tone groups are labelled from 1 (lightest skin) to 6 (darkest skin).

### **Appendix C. Clustering Metrics**

Tables A4–A6 display clustering metric results for RFW, VGGFace2, and Janus-C, respectively. As stated in the main paper, these results do not lend further support to a connection between cluster quality and model performance. However, they quantify how embeddings cluster according to each sensitive attribute, which is useful for understanding each dataset's clustered embeddings.

**Table A4.** Clustering metric results for RFW. ↑ means that a higher value indicates better clustering and ↓ means that a lower value indicates better clustering.


**Table A5.** Clustering metric results for the VGGFace2 test set.



**Table A6.** Clustering metric results for Janus-C.

### **Appendix D. Clustering Visualizations**

Figures A5–A7 document intra-cluster visualizations for RFW, VGGFace2, and Janus-C, respectively. For each dataset and sensitive attribute, we include pairwise distance distributions, centroid distance distributions, and persistent homology 0th class death distributions.
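For readers who wish to reproduce comparable distributions, the following sketch computes the three quantities for a single cluster of embeddings. It assumes Euclidean distance and, for the persistent homology part, a Vietoris-Rips-style filtration in which an edge enters at its length; under that assumption the finite 0th-class death times coincide with the edge lengths of a minimum spanning tree. The function names are illustrative, and this is not necessarily the implementation used to produce the figures.

```python
import math


def pairwise_distances(points):
    """Euclidean distance for every unordered pair of points in the cluster."""
    return [math.dist(p, q)
            for i, p in enumerate(points)
            for q in points[i + 1:]]


def centroid_distances(points):
    """Euclidean distance from each point to the cluster centroid."""
    dim = len(points[0])
    centroid = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    return [math.dist(p, centroid) for p in points]


def h0_deaths(points):
    """Finite H0 death times, computed as MST edge lengths (Prim's algorithm)."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n  # cheapest known edge from the tree to each point
    best[0] = 0.0
    deaths = []
    for step in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if step > 0:  # the first point starts the tree and adds no edge
            deaths.append(best[u])
        for v in range(n):
            if not in_tree[v]:
                best[v] = min(best[v], math.dist(points[u], points[v]))
    return sorted(deaths)
```

Smaller values in all three lists correspond to a denser cluster, so histograms of these lists per subgroup can be compared directly.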

**Figure A5.** Intra-cluster visualizations for RFW. Pairwise distance distribution (**left**); centroid distance distribution (**middle**); persistent homology 0th class deaths distribution (**right**).

**Figure A6.** Intra-cluster visualizations for the VGGFace2 test set. Pairwise distance distribution (**left**); centroid distance distribution (**middle**); persistent homology 0th class deaths distribution (**right**).

**Figure A7.** Intra-cluster visualizations for Janus-C. Pairwise distance distribution (**left**); centroid distance distribution (**middle**); persistent homology 0th class deaths distribution (**right**).

Trends in RFW and Janus-C skin tone intra-cluster visualizations are similar to trends in BFW race intra-cluster visualizations: White faces (or, in Janus-C, the lightest faces, skin tone group 1) belong to less dense and more dispersed clusters than other-race faces.

Trends in VGGFace2 and Janus-C gender intra-cluster visualizations are similar to trends in BFW gender intra-cluster visualizations; there is little difference in clustering between male and female faces.

### *Intra-Cluster Distribution T-Tests*

In the main paper, we describe the calculation of *p*-values for intra-cluster distribution *t*-tests, used to determine whether the means of two subgroups' distributions are significantly different. *p*-values below the significance level of 0.05 validate observations from the intra-cluster visualizations, namely that White faces are less densely clustered in the embedding space than other-race faces. Tables A7–A10 document corrected *p*-values of the *t*-tests for BFW, RFW, VGGFace2, and Janus-C subgroup pairs, respectively.
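The shape of this computation can be sketched as follows, under several assumptions that are specific to this example: the unequal-variance (Welch) form of the statistic, a large-sample normal approximation to the *t* distribution for the two-sided *p*-value, and a Bonferroni correction over all subgroup pairs. The correction method and test variant actually used are described in the main paper and may differ. In practice, a library routine such as `scipy.stats.ttest_ind` would give the exact *t*-distribution *p*-value.

```python
import itertools
import math
from statistics import NormalDist, mean, variance


def welch_p_value(a, b):
    """Two-sided p-value for a difference in means (normal approximation)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    t = (mean(a) - mean(b)) / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))


def corrected_pairwise_p_values(samples):
    """Bonferroni-corrected p-values for every pair of subgroups.

    samples: {subgroup: list of distances} (hypothetical layout).
    """
    pairs = list(itertools.combinations(sorted(samples), 2))
    return {
        (g, h): min(1.0, len(pairs) * welch_p_value(samples[g], samples[h]))
        for g, h in pairs
    }
```

A corrected value below 0.05 for a pair such as `("B", "W")` would then be read as a significant difference in the mean intra-cluster distance between those two subgroups.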

**Table A7.** Corrected *p*-values of the 2-sample independent t-test results for BFW race (top) and gender (bottom) subgroup pairs. A: Asian; I: Indian; B: Black; W: White; F: Female; M: Male.


**Table A8.** Corrected *p*-values of the 2-sample independent t-test results for RFW race subgroup pairs. A: Asian; I: Indian; B: Black; W: White.


**Table A9.** Corrected *p*-values of the 2-sample independent t-test results for VGGFace2 test set gender subgroup pairs. F: Female; M: Male.



**Table A10.** Corrected *p*-values of the 2-sample independent t-test results for Janus-C skin tone (top) and gender (bottom) subgroup pairs. Results are for non-blurred data. Skin tone groups are labelled from 1 (lightest skin) to 6 (darkest skin). F: Female; M: Male.
