**1. Introduction**

Social stereotypes may be present in the semantics of the corpora used to pre-train large language models, including Transformer-based models. These models run the risk of learning those stereotypes and later propagating them in the tasks for which they are used. Given the dangers that may arise from such incidents, this study explores ways of detecting stereotypical gender biases in a Transformer model's representations, in addition to quantifying and measuring such biases when they manifest in a downstream task.

Word embeddings like Word2Vec [1] assign fixed vectors to words, ignoring the context of the whole input sentence. In contrast, contextual embeddings move beyond word-level semantics by mapping words to representations that take into account how the surroundings of a word can alter its meaning. In this way, contextual embeddings are capable of capturing polysemy.

It is common to use cosine-similarity-based methods to measure bias in non-contextualized embeddings [2,3]. Nevertheless, the mutable nature of contextualized embeddings can render cosine-similarity-based methods inapplicable or inconsistent for Transformer-based models [4,5].


### **2. Related Work**

### *2.1. Bias Detection in Non-Contextual Word Embeddings*

It has been shown that a global bias direction can exist in a word embedding space and, moreover, that gender-neutral words can be linearly separated from gendered words [3]. These two properties constitute the foundation of the seminal works by Caliskan et al. [6] and Bolukbasi et al. [3], who introduce word analogy tests and word association tests as bias detection methods. In a word analogy test, given two related words, e.g., man : king, the goal is to generate a word *x* that stands in a similar (usually linear) relation to a given word, e.g., woman. In this particular example, the correct answer would be *x* = queen, since man − woman ≈ king − queen. The results in [3] indicate that words like he or man are associated with higher-status jobs like doctor, whereas gendered words like she or woman are associated with professions such as homemaker and nurse. In a word association test, there is a pleasant and an unpleasant attribute, and the distances between each of them and a word, e.g., he, are measured. Ideally, if the model is unbiased with respect to gender, the difference between these two distances should equal the corresponding difference produced by the word she.
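
Both tests reduce to simple vector arithmetic on static embeddings. The following minimal sketch illustrates them with gensim's pre-trained Word2Vec vectors; the profession and pronoun choices are illustrative only.

```python
import gensim.downloader as api

# Pre-trained static Word2Vec vectors from gensim's downloader catalogue.
vectors = api.load("word2vec-google-news-300")

# Word analogy test: solve man : king :: woman : x by finding the word
# whose vector is closest to king - man + woman.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# Word-association-style probe: compare how much closer a profession sits
# to "he" than to "she"; an unbiased space would yield a gap near zero.
gap = vectors.similarity("doctor", "he") - vectors.similarity("doctor", "she")
print(f"doctor: he-vs-she association gap = {gap:+.3f}")
```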

### *2.2. Bias Detection in Contextualized Word Embeddings*

The association between certain targets (e.g., gendered words) and attributes (e.g., career-related words) in BERT [7] has been computed by utilizing the same masked-language-modeling task that BERT uses as a learning objective during pre-training [5]. That is, the model is given a sentence in which the target is masked, and the probability it assigns to he at the masked position, given the attribute, is measured; this is defined as the target probability. Then, the model is passed a sentence where both the target and the attribute are masked, in order to measure the prior probability of how likely the gendered word is for BERT. The same procedure is repeated for gendered words of the opposite sex, and the difference between the normalized predictions of the two targets is computed.
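
As a rough illustration of this procedure, the following sketch queries a masked language model through the Hugging Face transformers library; the template sentence and attribute word are hypothetical.

```python
import math

import torch
from transformers import BertForMaskedLM, BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

def mask_prob(sentence: str, word: str) -> float:
    """Probability of `word` at the first [MASK] position in `sentence`."""
    inputs = tok(sentence, return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tok.mask_token_id).nonzero()[0].item()
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits[0, mask_pos].softmax(dim=-1)[tok.convert_tokens_to_ids(word)].item()

# Target probability: the gendered target is masked, the attribute is kept.
p_target = mask_prob("[MASK] is a programmer.", "he")
# Prior probability: both the target and the attribute are masked.
p_prior = mask_prob("[MASK] is a [MASK].", "he")
# Normalized association score for "he"; repeating the procedure for "she"
# and taking the difference yields the bias measure described above.
score_he = math.log(p_target / p_prior)
```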

Nangia et al. [8] and Nadeem et al. [9] collect examples of minimally different pairs of sentences, in which one sentence stereotypes a group and the other stereotypes the same group less. As a result, each sentence in a pair has two parts: the unmodified part, composed of the tokens that overlap between the two sentences, and the modified part, which contains the non-overlapping tokens. Nangia et al. [8] estimate the probability of the unmodified tokens conditioned on the modified tokens, Pr(*U* | *M*, *θ*), by iterating over the sentence, masking a single token at a time, measuring its log likelihood, and accumulating the result in a sum. Nadeem et al. [9], on the other hand, estimate the probability of the modified tokens conditioned on the unmodified ones, Pr(*M* | *U*, *θ*). Both methods measure the degree to which the model prefers stereotyping sentences over less stereotyping ones by comparing these probabilities across the two sentences of a pair; the difference between them lies in which part of the sentence is scored.
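
The masking-and-summing procedure for Pr(*U* | *M*, *θ*) can be sketched as a pseudo-log-likelihood computation. The helper below assumes a masked language model and tokenizer like those in the previous sketch, with the positions of the unmodified tokens supplied by the caller.

```python
import torch

def pseudo_log_likelihood(model, tok, sentence: str, unmodified_positions) -> float:
    """Sum log P(u_i | rest of sentence), masking one position at a time."""
    input_ids = tok(sentence, return_tensors="pt")["input_ids"]
    total = 0.0
    for pos in unmodified_positions:
        masked = input_ids.clone()
        masked[0, pos] = tok.mask_token_id          # hide this token only
        with torch.no_grad():
            logits = model(input_ids=masked).logits
        total += logits[0, pos].log_softmax(dim=-1)[input_ids[0, pos]].item()
    return total
```

Comparing this score between the two sentences of a pair indicates which one the model finds more plausible.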

Webster et al. [10] present four different bias-detection methods that focus on gender bias. These include a co-reference resolution method, a classification task, and a template of sentences with masked tokens similar to that of [5]. Finally, they present a notable method in which they collect sentences from STS-B that start with "A man" or "A woman" and form two sentence pairs per profession, one using the word "man" and the other using the word "woman". If a model is unbiased, it should give equal similarity estimates for the two pairs. Note that these approaches do not actually quantify the biases encoded in the contextualized embeddings. Instead, they measure the extent to which the biases manifest in downstream tasks or in the probabilities associated with the model preferring male over female targets for specific attributes. Moreover, the majority of recent approaches focus on detecting biases in encoder-only Transformers such as BERT, neglecting decoder-only or encoder-decoder architectures.

### Bias Detection in Contextualized Embeddings Using Non-Contextualized Word Embeddings


Dhamala et al. [11] recently studied how to measure various kinds of societal biases in sentences produced by generative models, using a collection of prompts the authors created: the BOLD dataset. After prompting the model with the beginning of a sentence, they let it complete the sentence by generating text. For example, given the prompt "A flight nurse is a registered", the model might complete the sentence as: "A flight nurse is a registered nurse who is trained to provide medical care to her patients as they transport in air-crafts".

BOLD comes with a set of five evaluation metrics, designed to capture biases in the generated text from various angles. Amongst those metrics, the most relevant to this work is the weighted average of gender polarity, defined as

$$\text{Gender-Wavg} := \frac{\sum_{i=1}^{n} \text{sign}(b_i)\, b_i^2}{\sum_{i=1}^{n} |b_i|},\tag{1}$$

where $b_i := \frac{\vec{w}_i \cdot \vec{g}}{\|\vec{w}_i\|\,\|\vec{g}\|}$ and $\vec{g} := \vec{\text{she}} - \vec{\text{he}}$.

Initially, they compute the gender polarity $b_i$ of each word $w_i$ present in a generated sentence, and then they compute the weighted average over all words present in the sequence. An important detail is that the word vectors $\vec{w}_i$ are not the ones the language model creates; instead, each word is mapped to its corresponding vector in the Word2Vec space [11], so the vectors produced by the language model are not used at all in this approach. The goal of the Gender-Wavg metric is to detect whether a sentence is polarized towards the male or the female gender, rather than to calculate the bias of the language model's embedding space.
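
Equation (1) is straightforward to compute once the generated words are mapped into the Word2Vec space. The sketch below assumes `w2v` is a gensim-style keyed-vector lookup and that out-of-vocabulary words have already been filtered out.

```python
import numpy as np

def gender_wavg(sentence_tokens, w2v) -> float:
    """Weighted average of gender polarity, Equation (1)."""
    g = w2v["she"] - w2v["he"]                      # fixed gender direction
    g /= np.linalg.norm(g)
    # b_i: cosine of each word vector with the gender direction.
    b = np.array([float(w2v[t] @ g) / np.linalg.norm(w2v[t])
                  for t in sentence_tokens])
    denom = np.abs(b).sum()
    return float((np.sign(b) * b**2).sum() / denom) if denom > 0 else 0.0
```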

In contrast, Guo and Caliskan [12] propose a method for detecting intersectional bias in contextualized English word embeddings from ELMo, BERT, GPT, and GPT-2. First, they utilize the Word Embedding Association Test (WEAT) with static word embeddings to identify words that represent biases associated with intersectional groups. This is done by measuring the Word Embedding Factual Association Test (WEFAT) association score, defined as:

$$s(\vec{w}, \mathcal{A}, \mathcal{B}) = \frac{\hat{\mathbb{E}}_{\vec{a} \in \mathcal{A}}[\cos(\vec{w}, \vec{a})] - \hat{\mathbb{E}}_{\vec{b} \in \mathcal{B}}[\cos(\vec{w}, \vec{b})]}{\hat{\mathbb{V}}_{\vec{x} \in \mathcal{A} \cup \mathcal{B}}[\cos(\vec{w}, \vec{x})]^{1/2}},\tag{2}$$

where $\hat{\mathbb{E}}$ and $\hat{\mathbb{V}}$ denote, respectively, the empirical mean and empirical variance operators; $\mathcal{A}$ and $\mathcal{B}$ are sets of vectors encompassing concepts, e.g., male and female; and $\vec{w} \in \mathcal{W}$, where $\mathcal{W}$ is a set of target stimuli, e.g., occupations. Association scores are used to identify words that are uniquely associated with intersectional groups, in addition to words that are associated with both intersectional groups and their constituent groups. Once these words have been identified, the authors extend WEAT to contextualized embeddings by calculating a distribution of effect sizes ES(X, Y, A, B) among the sets of target words X and Y and the sets of concepts or attributes A and B. These effect sizes are measured across samples of 10,000 embeddings for each combination of targets/attributes, and a random-effects model is applied to generate a weighted mean of effect sizes. This approach finds that stronger levels of bias are associated with intersectional group members than with their constituent groups, and that the degree of overall bias is negatively correlated with the degree of contextualization in the model.
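
A direct implementation of Equation (2) only requires cosine similarities between the target vector and the two attribute sets. The sketch below treats `A` and `B` as arrays of attribute vectors and uses the sample standard deviation in the denominator.

```python
import numpy as np

def cosine(u, v) -> float:
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def association_score(w, A, B) -> float:
    """WEFAT association score of target vector `w`, Equation (2)."""
    sims_a = np.array([cosine(w, a) for a in A])
    sims_b = np.array([cosine(w, b) for b in B])
    pooled = np.concatenate([sims_a, sims_b])       # all attribute similarities
    return (sims_a.mean() - sims_b.mean()) / pooled.std(ddof=1)
```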

### *2.3. Bias Detection in Swedish Language Models*

Sahlgren and Olsson [13] identified gender bias present in both contextualized and static Swedish embeddings, though the contextual models they studied (BERT and ELMo) appeared less susceptible. They also showed that existing debiasing methods, proposed by Bolukbasi et al. [3], not only failed to mitigate bias in Swedish language models but possibly worsened existing stereotypes present in static embeddings. Similarly, Précenth [14] found evidence of gender bias in static Swedish language embeddings and introduced several methods for addressing Swedish distinctions not present in English (e.g., farmor "paternal grandmother" and mormor "maternal grandmother" vs. grandmother). While there is a dearth of research related specifically to bias in Swedish, or even North Germanic, language embeddings, some research exists for the Germanic language family more broadly. Kurpicz-Briki [15] identified bias in static German language embeddings using the Word Embedding Association Test and traced the origin of some gender biases to 18th-century stereotypes that still persist in modern embeddings. Matthews et al. [16] compare bias in static embeddings across seven languages (Spanish, French, German, English, Farsi, Urdu, and Arabic) and attempt to update the methodology of Bolukbasi et al. [3] for languages that have grammatical gender or gendered forms of the same noun (e.g., Wissenschaftler "male scientist" vs. Wissenschaftlerin "female scientist" in German). Additionally, Bartl et al. [17] evaluated whether existing techniques for identifying bias in contextualized English embeddings could apply to German. While they confirmed the results of Kurita et al. [5] with respect to English, the method was unsuccessful when applied to German, illustrating not only the need for language-specific bias detection methods but also that linguistic relatedness cannot be used as a predictor of successful applicability.

Further research is needed in evaluating cross-language bias measurement approaches, as bias can be influenced by etymology, morphology, and both syntactic and semantic context, which vary significantly across languages.

### **3. Methods**

Our method to measure gender bias in contextualized embeddings is twofold. First, we implement an extrinsic approach, in which word embeddings are assessed with respect to their contribution to a downstream task. Second, we follow an intrinsic approach, in which we directly evaluate the embeddings against a reference gender direction and detect relations between the representations of different professions.

Gender bias is a nuanced social phenomenon that involves genders beyond the woman and man binary. Nevertheless, in this work we exclusively study the correlation of professions with binary gender.

### *3.1. Extrinsic Evaluation of Gender Bias in T5 and mT5*

The downstream task used in this work is semantic text similarity. We use the Text-to-Text Transfer Transformer (T5) and multilingual T5 (mT5), and we fine-tune two mT5 models: one on the English STS-B dataset (http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark, accessed on 12 December 2021) and one on its machine-translated Swedish variant [18]. We do not have to fine-tune T5 on this task, as it has undergone both unsupervised and multi-task supervised pre-training that includes this very dataset. To conduct our experiments, we need stereotypical biases to manifest during inference for all three models. To this end, we create a new dataset by adapting the STS-B dataset.
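
Because T5 saw STS-B in its multi-task mixture, a sentence pair can be scored directly using T5's text-to-text task prefix. The following sketch assumes the generated text parses as a number, which holds for well-formed STS-B-style inputs.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

def sts_score(sentence1: str, sentence2: str) -> float:
    """Similarity in [0, 5] using T5's text-to-text STS-B format."""
    prompt = f"stsb sentence1: {sentence1} sentence2: {sentence2}"
    input_ids = tok(prompt, return_tensors="pt").input_ids
    output = model.generate(input_ids, max_new_tokens=8)
    return float(tok.decode(output[0], skip_special_tokens=True))
```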

### 3.1.1. Dataset Creation

To measure the impact of gender correlations on a semantic text similarity application, we build on the test set of the STS-B corpus and create a new dataset, inspired by the counterfactual data augmentation method introduced in [19]. We only use the test set as a base for the final dataset, since the training and development sets have already been seen by mT5 during fine-tuning.

A standard example of the STS-B dataset consists of a pair of sentences labeled with a scalar that denotes their degree of similarity. To transform STS-B into a dataset that can assess gender correlations, we collected all sentences from the test set that started with "A man" or "A woman". To ensure that all references to gender were eliminated in the final dataset, any sentences that included gendered words other than man or woman, such as pronouns (his, her, hers, etc.), were discarded; as a result, man and woman were the only words in each sample that could disclose gender information. We then extended the dataset by substituting the gendered subject with an occupation, iterating over fifty different occupations. We replaced the gendered words man and woman with he and she, since these make for more natural language use. The final dataset consists of 149 rows and 52 columns. The same process is applied to the Swedish variant of the STS-B dataset.
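
The construction can be sketched as a small filtering-and-substitution routine. The gendered-word list and helper names below are illustrative, not the exact ones used to build the dataset.

```python
# Illustrative (partial) list of gendered words that disqualify a sentence.
GENDERED = {"he", "she", "his", "her", "hers", "him", "himself", "herself",
            "boy", "girl", "father", "mother", "son", "daughter"}

def keep(sentence: str) -> bool:
    """Keep STS-B test sentences whose only gender cue is the subject."""
    starts_ok = sentence.startswith(("A man", "A woman"))
    leaks = GENDERED.intersection(sentence.lower().split())
    return starts_ok and not leaks

def variants(sentence: str, occupations) -> dict:
    """One row of the dataset: pronoun versions plus one column per job."""
    rest = sentence.split(" ", 2)[2]                # text after "A man"/"A woman"
    row = {"he": "He " + rest, "she": "She " + rest}
    for job in occupations:
        row[job] = f"The {job} " + rest
    return row
```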

### 3.1.2. Experimental Design

The trained models considered pairs of sentences built from the same sample twice: once with a gendered word (either she or he) and once with an occupation word. For example, from the source sentence "The nurse is walking", we would create two pairs of sentences to pass to the model: first, "He is walking" and "The nurse is walking", and second, "She is walking" and "The nurse is walking". The models predicted a similarity score for all 149 pairs for both genders. Computing the average similarity score over all samples yielded one average similarity score per gender. If the model is unbiased, the male and female average similarity scores should be similar for all professions. The way our dataset was created provides a clean environment in which all sentences that include professions are gender agnostic. Any gender correlations with profession that the model manifests can thereby be attributed only to inherent model bias, rather than to gender residue in the sentences. This ensures the validity and reliability of the method.

All experiments were conducted using the small, base, and large versions of both mT5 and T5, for English and Swedish. With respect to mT5, since we fine-tuned the model before making predictions, we re-ran the fine-tuning process with three different random seeds before proceeding to the inference phase. This was done for two reasons: to add statistical significance to the results and to address potential instability caused by fine-tuning large models on small datasets.
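
Putting the pieces together, the per-gender averaging can be sketched as follows, reusing the hypothetical `sts_score` and `variants` helpers from the earlier sketches; `rows` stands for the 149 constructed dataset rows.

```python
def gender_gap(rows, profession: str):
    """Average similarity of a profession's sentences to the he/she versions."""
    male = [sts_score(row["he"], row[profession]) for row in rows]
    female = [sts_score(row["she"], row[profession]) for row in rows]
    avg_m, avg_f = sum(male) / len(male), sum(female) / len(female)
    return avg_m, avg_f, avg_m - avg_f   # gap near zero for an unbiased model
```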

### *3.2. Intrinsic Evaluation of Gender Bias in T5*

The mutable nature of a Transformer's contextualized embeddings is an obstacle to evaluating them intrinsically. Another caveat is that the model itself changes every time it is fine-tuned on a different task. This is the first work to intrinsically evaluate the contextualized embeddings of a Transformer with respect to gender bias while alleviating both problems.

As a workaround for the potential instability caused by the architecture changes associated with different downstream tasks, we use T5: a model that works out of the box for a number of tasks without requiring any architecture modifications. Nevertheless, mT5 did not undergo the same multi-task pre-training as T5. We therefore chose not to include mT5 in this experimental process, as it would first have to be fine-tuned on STS-B, which would render the model more specific to this task and the results less general.

As a workaround for the embeddings not being fixed, we extended the gender polarity metric to consider multiple values per profession. These values compose a distribution, rather than the single value used in previous work. Our goal is to measure the gender polarity in the embeddings produced by T5. To this end, we were inspired by Dhamala et al. [11], who were the first to use $b_i$ as a metric in the setting of a Transformer model:

$$b_i = \frac{\vec{w}_i \cdot \vec{g}}{\|\vec{w}_i\|\,\|\vec{g}\|},\tag{3}$$

where $\vec{g} = \vec{\text{she}} - \vec{\text{he}}$. Nevertheless, the authors avoided the direct use of the contextual embeddings $\vec{w}_i$ when computing the bias $b_i$ and chose to map them to the Word2Vec space first. The motivation behind this choice is that there was no theoretical foundation in the literature to suggest that a constant gender direction, $\vec{g} = \vec{\text{she}} - \vec{\text{he}}$, exists in the embedding space of a Transformer model. Thus, they settled for the fixed embedding space of Word2Vec, in which a well-defined $\vec{g}$ can safely be established.

In this work, we hypothesize that a versatile Transformer model like T5, which already holds the knowledge of various downstream tasks due to the multi-task pre-training procedure it has undergone, can still establish a gender direction, $\vec{g} = \vec{\text{she}} - \vec{\text{he}}$. We hypothesize that this gender direction is stable enough to allow T5's contextual embeddings to be used directly in computing $b_i$. This way, we avoid losing information by mapping the embeddings to the Word2Vec space and create a solution that is tailored to a Transformer model.

To validate this hypothesis, we let T5 produce contextualized embeddings of he and she for all 149 sentences of our dataset. That is, we consider the hidden state of the model's last encoder block for each of these sentences, using the small, base, and large versions of T5. We then compute the Euclidean distances between all 149 he and she pairs, as well as their corresponding angles. For the large version of T5, the Euclidean distance has a mean and standard deviation of 2.79 ± 0.22, and the angle has a mean and standard deviation of 0.68 ± 0.04 radians. The small standard deviations, compared with the mean values, suggest that the dispersion among the 149 he and she angle values is small. This indicates that there might exist a well-defined, and perhaps constant, gender direction $\vec{g}$ between he and she in the T5 embedding space. We use the average vector $\vec{g} = \frac{1}{149} \sum_{i=1}^{149} (\vec{\text{she}}_i - \vec{\text{he}}_i)$ as the gender direction and compute the gender polarity $b_i$ for he, she, and nine selected occupations: nurse, engineer, surgeon, scientist, receptionist, programmer, teacher, officer, and homemaker. These occupations were selected based on the results of the extrinsic evaluation, which identified the professions most prone to being correlated with one of the two genders. We obtain a distribution of 149 bias values $b_i$ for every profession, instead of the single value per occupation that Word2Vec embeddings would yield.
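
The following sketch summarizes this procedure under simplifying assumptions: the `rows` structure from the dataset sketch is reused, each word of interest is assumed to occupy a single SentencePiece token at a known position, and the helper names are ours.

```python
import numpy as np
import torch
from transformers import T5EncoderModel, T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-large")
enc = T5EncoderModel.from_pretrained("t5-large").eval()

def token_state(sentence: str, pos: int) -> np.ndarray:
    """Last-encoder-block hidden state at token position `pos`."""
    input_ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        hidden = enc(input_ids=input_ids).last_hidden_state  # (1, seq, d_model)
    return hidden[0, pos].numpy()

# Pronouns sit at position 0 ("He ...", "She ..."); professions at
# position 1 ("The nurse ..."), assuming single-subword words.
he_vecs = np.stack([token_state(row["he"], 0) for row in rows])
she_vecs = np.stack([token_state(row["she"], 0) for row in rows])
g = (she_vecs - he_vecs).mean(axis=0)        # averaged gender direction

nurse = np.stack([token_state(row["nurse"], 1) for row in rows])
b = (nurse @ g) / (np.linalg.norm(nurse, axis=1) * np.linalg.norm(g))
# `b` holds the distribution of 149 b_i values (Equation (3)) for "nurse".
```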
