**4. Results**

#### *4.1. Extrinsic Evaluation of Gender Bias in T5 and mT5*

We report the average similarity score per gender for all 50 occupations. Figure 1 shows bar charts in which the height of each bar represents the average female (blue) or male (grey) similarity score per occupation for the large mT5 model. The x-axis shows the occupations and the y-axis shows the average similarity score. When fine-tuned on Swedish, the model does not associate professions with a specific gender: none of the 50 occupations exhibits a statistically significant difference between men's and women's average similarity scores. The same holds for all three sizes of the model. This contrasts with the English version of mT5, which behaves similarly to T5: its base and large versions associate specific professions, such as nurse or receptionist, with the female gender.
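As an illustration of how a per-occupation difference between female and male similarity scores could be checked for significance, the following minimal sketch uses a two-sided permutation test on the difference of means. The choice of test, the function name `permutation_test`, and the score values are assumptions made for illustration only; they are not taken from the experiments reported here.

```python
import random
from statistics import mean


def permutation_test(female_scores, male_scores, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference of group means.

    Returns the observed difference (female mean - male mean) and an
    estimated p-value. This is an illustrative choice of test, not
    necessarily the one used in the paper.
    """
    rng = random.Random(seed)
    observed = mean(female_scores) - mean(male_scores)
    pooled = list(female_scores) + list(male_scores)
    n_female = len(female_scores)
    extreme = 0
    for _ in range(n_permutations):
        # Randomly reassign the gender labels and recompute the difference.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_female]) - mean(pooled[n_female:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Hypothetical similarity scores for one occupation (not data from the paper).
female = [0.71, 0.69, 0.74, 0.70, 0.72, 0.68]
male = [0.70, 0.72, 0.69, 0.73, 0.71, 0.70]
diff, p = permutation_test(female, male)
print(f"mean difference = {diff:+.3f}, p = {p:.3f}")
```

A large p-value for an occupation would correspond to the "no statistically significant difference" outcome described above, whereas clearly separated score groups would yield a small one.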

Figure 2 presents the average difference between the mean similarity scores for men and women over the 50 occupations. Mean differences tend to grow with model size. The same applies to mT5 in Swedish, although there the differences are not statistically significant for any occupation. Incidentally, the English version of mT5 shows smaller differences between genders for the base version of the model than for the small one. For the majority of the occupations, the difference between men's and women's mean similarity scores increases with the size of the model; in other words, larger versions of the models exhibit a higher degree of gender bias. Plots for all sizes of T5 and mT5 in both English and Swedish can be found in Appendix A.
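The aggregate reported per model size can be sketched as follows. This is a minimal sketch under two assumptions: that the per-occupation gap is taken as an absolute value (so that gaps of opposite sign do not cancel when averaged), and that scores are stored in a nested dictionary. The helper name `mean_gender_gap`, the data layout, and all numbers are hypothetical.

```python
from statistics import mean


def mean_gender_gap(scores_by_occupation):
    """Average |female mean - male mean| over all occupations.

    Taking the absolute value is an assumption of this sketch.
    """
    gaps = [
        abs(mean(scores["female"]) - mean(scores["male"]))
        for scores in scores_by_occupation.values()
    ]
    return mean(gaps)


# Hypothetical similarity scores per model size (not data from the paper).
scores = {
    "small": {
        "nurse": {"female": [0.70, 0.72], "male": [0.69, 0.71]},
        "engineer": {"female": [0.66, 0.68], "male": [0.67, 0.69]},
    },
    "large": {
        "nurse": {"female": [0.80, 0.82], "male": [0.70, 0.72]},
        "engineer": {"female": [0.60, 0.62], "male": [0.68, 0.70]},
    },
}

gap_per_size = {size: mean_gender_gap(occ) for size, occ in scores.items()}
for size, gap in gap_per_size.items():
    print(f"{size}: mean gap = {gap:.3f}")
```

With real scores, comparing the values in `gap_per_size` across sizes would reproduce the kind of trend plotted in Figure 2.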

**Figure 1.** (**a**) Average similarity scores per occupation. Language: English. (**b**) Average similarity scores per occupation. Language: Swedish. The average female (blue) and male (grey) similarity scores per occupation: a comparison between English and Swedish for the large size of mT5.

**Figure 2.** (**a**) Mean difference between gender similarity scores per model size. Model: mT5. Language: English. (**b**) Mean difference between gender similarity scores per model size. Model: mT5. Language: Swedish. (**c**) Mean difference between gender similarity scores per model size. Model: T5. Language: English. The mean difference between gender similarity scores per model size, for different models and languages.

#### *4.2. Intrinsic Evaluation of Gender Bias in T5*

Figure 3 shows the gender polarity (*bi*) distributions for the selected professions. Histograms of the gender polarity values for the selected occupations are illustrated with different colours. The graph compares the three different sizes of T5. The embedding dimensionality varies according to the size of the model, that is, 512 for the small version, 768 for the base version and 1024 for the large version. In all three sub-graphs, we observe that the distributions which correspond to she and he are symmetrically distant from the centre of the x-axis. Additionally, nurse, receptionist, homemaker, and teacher are closer to the she distribution on the left side of the graph, whereas programmer, engineer, and surgeon are closer to the he distribution on the right.
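To make the notion of gender polarity concrete, the sketch below scores a word embedding by its cosine similarity with a he − she direction, so that negative values lean towards she and positive values towards he, matching the left/right layout described above. This formulation is an assumption of the sketch: the exact definition of *bi* is given elsewhere in the paper and may differ. The function names and the three-dimensional toy vectors are hypothetical and stand in for the 512-, 768- or 1024-dimensional embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def gender_polarity(word_vec, he_vec, she_vec):
    """Cosine between a word embedding and the (he - she) direction.

    Assumed formulation: negative values lean towards "she",
    positive values towards "he".
    """
    direction = [h - s for h, s in zip(he_vec, she_vec)]
    return cosine(word_vec, direction)


# Toy 3-dimensional embeddings (hypothetical, not model outputs).
he = [1.0, 0.0, 1.0]
she = [-1.0, 0.0, 1.0]
nurse = [-0.5, 1.0, 1.0]
engineer = [0.5, 1.0, 1.0]

for name, vec in [("she", she), ("nurse", nurse), ("engineer", engineer), ("he", he)]:
    print(f"{name:>8}: {gender_polarity(vec, he, she):+.3f}")
```

Under this formulation, he and she receive polarities of equal magnitude and opposite sign, which is consistent with the symmetric placement of their distributions around the centre of the x-axis.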

**Figure 3.** (**a**) The 149 *bi* values per occupation for the small size of T5. Embedding dimensionality: 512. (**b**) The 149 *bi* values per occupation for the base size of T5. Embedding dimensionality: 768. (**c**) The 149 *bi* values per occupation for the large size of T5. Embedding dimensionality: 1024. The gender polarity (*bi*) distributions of the selected occupations for the three sizes of T5.

By comparing all three sub-graphs in Figure 3, we notice that the gap between the various occupation distributions grows larger as the model's size increases. There is a high overlap of the distributions for the small size of T5, which indicates that the occupations are less gender polarized. For the base and large sizes of T5, though, there is a larger distance between the distributions, so that she attracts occupations like nurse, receptionist, and homemaker, and he gets closer to programmer, engineer, and surgeon. In contrast, the distribution of scientist keeps an equal distance from he and she for both the base and large versions of T5. We refer readers who are interested in reproducing the experiments for all occupations to our code, which has been made publicly available (https://github.com/Stellakats/Master-thesis-gender-bias, accessed on 12 December 2021).
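The degree of overlap between two polarity distributions, which is assessed visually above, could also be quantified. The following is a minimal sketch that uses histogram intersection as the overlap measure; this measure, the bin edges, the function names, and the sample values are all assumptions for illustration and are not part of the reported analysis.

```python
def histogram(values, bin_edges):
    """Normalised histogram: fraction of values falling into each bin.

    Bins are half-open [lo, hi), except the last one, which also
    includes its upper edge. Values outside the edges are ignored,
    and at least one value is assumed to fall inside them.
    """
    counts = [0] * (len(bin_edges) - 1)
    last = len(counts) - 1
    for v in values:
        for i in range(len(counts)):
            in_bin = bin_edges[i] <= v < bin_edges[i + 1]
            on_upper_edge = i == last and v == bin_edges[-1]
            if in_bin or on_upper_edge:
                counts[i] += 1
                break
    total = sum(counts)
    return [c / total for c in counts]


def overlap(values_a, values_b, bin_edges):
    """Histogram intersection in [0, 1]; 1 means identical binned distributions."""
    hist_a = histogram(values_a, bin_edges)
    hist_b = histogram(values_b, bin_edges)
    return sum(min(a, b) for a, b in zip(hist_a, hist_b))


# Hypothetical polarity values (not data from the paper).
edges = [-1.0, -0.5, 0.0, 0.5, 1.0]
she_leaning = [-0.6, -0.55, -0.7, -0.4]
he_leaning = [0.6, 0.5, 0.7, 0.4]
neutral = [-0.2, 0.1, -0.1, 0.2]

print("she-leaning vs he-leaning:", overlap(she_leaning, he_leaning, edges))
print("she-leaning vs neutral:   ", overlap(she_leaning, neutral, edges))
```

A value close to 1 would correspond to the heavily overlapping distributions seen for the small model, and a value close to 0 to the well-separated distributions seen for the larger ones.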
