1. Introduction
In recent years, we have witnessed a surge in the utilization of speech-recognition technologies and voice interaction [1,2]. One of the domains where voice interaction is being employed is that of virtual assistants. Prominent commercial voice assistants include Amazon Alexa, Google Assistant, and Apple’s Siri [3], with Alexa being the most prevalent, occupying approximately 70% of the market share [4].
In addition to the aforementioned commercial virtual assistants and open-source software like Home Assistant [5], a variety of automatic speech recognition (ASR) tools are available, enabling the implementation of systems that transcribe speech to text (STT) [6]. This capability can be harnessed for the development of voice interaction-based systems, such as virtual assistants. Among speech-recognition systems, Whisper [7] stands out: its introduction prompted similar open-source tools like Coqui STT [8] to discontinue their projects because of the improvements offered by this new tool. Whisper is, moreover, free software.
From the perspective of human–device interaction through speech, it is crucial to consider several key concepts that set it apart from other forms of interaction [9]. In our context, however, we focus on the importance of the device accurately understanding the individual in their language, dialectal variation, and accent. Furthermore, it is essential to ensure that there is no significant difference in performance when these devices are used by females and males.
Currently, it is estimated that approximately 8.1 billion people inhabit the world [10]. Nevertheless, no single language is spoken or understood by the entire global population, or even by a majority of it. According to data published by Ethnologue [11], the most widely spoken language is English, counting both native speakers and those who speak it as a second language: approximately 1.456 billion people, or roughly 18% of the world’s population. The top 10 most spoken languages (English, Mandarin Chinese, Hindi, Spanish, Standard Arabic, Bengali, French, Russian, Portuguese, and Urdu) together cover approximately 66% of the global population, leaving out over 2.7 billion people (34%). The top 200 spoken languages account for approximately 88% of the world’s population [8].
Access to information and human knowledge by all individuals, regardless of their language, is paramount and should be regarded as a fundamental right. Thanks to the Internet, access to a portion of information and human knowledge has become more democratic, yet much remains to be accomplished [12].
According to UNESCO, roughly 781 million people worldwide are illiterate, approximately two thirds of them women [13]. Hence, voice interaction technology must be capable of interpreting a wide range of languages, dialectal variations, and accents, regardless of the speaker’s gender.
Previous studies have revealed historical biases in ASR systems [14,15]. These biases hinder effective communication for certain groups of people when using voice recognition systems [16]. Some of these biases [17,18] may be attributed to cultural, social, medical, or other differences, making the gender and the dialectal variation or accent of the interacting individual two significant sources of potential bias within ASR systems [19,20].
In this context, it is relevant to consider potential gender biases [21,22] in speech-recognition tools and virtual assistants, both in the responses provided by an assistant and in the actual voice recognition. This study will specifically focus on speech recognition.
In general terms, gender identification through voice relies primarily on the fundamental frequency [23], that is, the lowest vibration frequency of the vocal cords during sound production. On average, the fundamental frequency of female voices is approximately one octave higher than that of male voices [24]. This is because female vocal cords tend to be shorter and thinner than male vocal cords, resulting in a higher fundamental frequency in female voices and a lower one in male voices [25].
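As a brief illustration of this acoustic cue, the following is a minimal sketch of fundamental-frequency estimation with the pYIN implementation in the librosa library; the file name and the frequency search range are hypothetical assumptions, not part of this study.

```python
# Minimal sketch: F0 estimation with librosa's pYIN (assumed installed).
# "sample.wav" is a hypothetical input file.
import numpy as np
import librosa

y, sr = librosa.load("sample.wav", sr=None)  # load at the native sampling rate

# Search for F0 in a range covering typical adult speaking voices (~65-300 Hz).
f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=65.0, fmax=300.0, sr=sr)

# Median F0 over voiced frames (pyin marks unvoiced frames as NaN); male
# voices typically center around ~100-120 Hz and female voices ~200-220 Hz.
print(f"Median F0: {np.nanmedian(f0):.1f} Hz")
```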
As previously mentioned, beyond the differences between languages, different dialects exist within each language. In our case, we focus on the dialects of the Spanish language. One current classification identifies eight dialectal regions of the Spanish language [26]: five in the Americas (Caribbean, Mexican-Central American, Andean, Austral, and Chilean), two in Europe (the Northern Iberian Peninsula, or Septentrional, and the Southern Iberian Peninsula, or Meridional), and one in Africa (the Canary Islands).
This research aims to ascertain whether there is any significant bias concerning gender or the main accents of the Spanish language. To achieve this, audio clips from the Common Voice 14 dataset by Mozilla [27] are analyzed using both Alexa and Whisper.
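As an illustration of the Whisper side of such an analysis, the following is a minimal sketch using the open-source whisper Python package; the file name is a hypothetical placeholder, and this is not the exact pipeline used in the study.

```python
# Minimal sketch: transcribing a Spanish clip with the open-source
# openai-whisper package. "clip.mp3" is a hypothetical file.
import whisper

model = whisper.load_model("base")  # the study also evaluates "large-v2"
result = model.transcribe("clip.mp3", language="es")
print(result["text"])
```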
Section 3 introduces the tools and datasets employed for the analyses. Section 4 presents the outcomes of the tests conducted. Section 5 provides a discussion of the obtained results. Finally, Section 6 presents the conclusions drawn from this research.
4. Results
After analyzing the various samples to calculate error rates used in automatic speech-recognition analysis, such as the WER [40], we computed various statistics: the mean, median, standard deviation, and variance. Additionally, we calculated 95% confidence intervals to ensure that the results obtained were conclusive.
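For illustration, here is a minimal sketch of these statistics over a hypothetical list of per-utterance WER values, using the usual normal-approximation 95% confidence interval:

```python
# Minimal sketch: descriptive statistics and a normal-approximation 95% CI
# for the mean of a hypothetical set of per-utterance WER values.
import numpy as np

wers = np.array([0.0, 0.25, 0.1667, 0.5, 0.0, 0.3333])  # hypothetical WERs

mean, median = wers.mean(), np.median(wers)
std, var = wers.std(ddof=1), wers.var(ddof=1)
half_width = 1.96 * std / np.sqrt(len(wers))  # 95% CI half-width for the mean

print(f"mean = {mean:.4f} ± {half_width:.4f}, median = {median:.4f}")
print(f"std = {std:.4f}, variance = {var:.4f}")
```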
A preprocessing step was applied to the character strings to obtain a WER that is as realistic as possible: punctuation marks, exclamation marks, question marks, etc., were removed, and all strings were converted to lowercase. This ensured that the WER calculation would not be affected by differing conventions between Alexa and Whisper beyond the simple transcription of the words heard.
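Recall that WER = (S + D + I) / N, where S, D, and I are the word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, and N is the number of reference words. The following is a minimal sketch of the normalization described above followed by a WER computation with the jiwer package; the sentence pair is hypothetical.

```python
# Minimal sketch: normalization (strip punctuation, lowercase) before WER,
# computed with the jiwer package. The sentence pair is hypothetical.
import string
import jiwer

def normalize(text: str) -> str:
    # Drop ASCII punctuation plus the Spanish inverted marks, then lowercase.
    table = str.maketrans("", "", string.punctuation + "¡¿")
    return text.translate(table).lower()

reference = "¡Enciende la luz, por favor!"
hypothesis = "enciende la luz por favor"

print(f"WER: {jiwer.wer(normalize(reference), normalize(hypothesis)):.2%}")  # 0.00%
```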
Table 14 displays the gender differences for each analyzed variant of Alexa and Whisper in terms of the weighted mean WER. We employ a weighted arithmetic mean that weights each value by the number of samples available for each option, thus ensuring a fair and representative assessment of the observed differences.
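For illustration, a minimal sketch of such a weighted mean with NumPy; all numbers are hypothetical, not values from Table 14:

```python
# Minimal sketch: weighted arithmetic mean of per-group mean WERs, weighted
# by the number of audio samples in each group. All values are hypothetical.
import numpy as np

mean_wer_per_group = np.array([0.28, 0.31, 0.35])  # mean WER per accent group
samples_per_group = np.array([2150, 980, 1430])    # audio clips per group

print(f"Weighted mean WER: {np.average(mean_wer_per_group, weights=samples_per_group):.2%}")
```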
For better visualization of the results obtained, the following figures compare the WER means and medians for females and males for each analyzed accent. Figure 1 illustrates the comparison of means for females, and Figure 2 shows the comparison of means for males. Figures 3 and 4 display the comparisons of medians for females and males, respectively. Additionally, Figure 5 shows the weighted mean WER by gender for each analyzed variant of Alexa and Whisper.
All data and results obtained during the research are available in the project repository. These datasets include the transcribed and correct texts and the WER for each phrase, along with other error measures: the Character Error Rate (CER), Match Error Rate (MER), Word Information Lost (WIL), and Word Information Preserved (WIP). The data are divided by the tool used, i.e., Alexa, Whisper Base, or Whisper Large-v2, and, within these categories, by gender and accent [41].
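For reference, the jiwer package used in the earlier sketch also provides these additional measures; the sentence pair below is hypothetical.

```python
# Minimal sketch: the additional error measures included in the datasets,
# computed with jiwer over a hypothetical sentence pair.
import jiwer

reference = "enciende la luz del salón"
hypothesis = "enciende la luz del balcón"

print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")
print(f"CER: {jiwer.cer(reference, hypothesis):.2%}")
print(f"MER: {jiwer.mer(reference, hypothesis):.2%}")
print(f"WIL: {jiwer.wil(reference, hypothesis):.2%}")
print(f"WIP: {jiwer.wip(reference, hypothesis):.2%}")
```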
5. Discussion
After analyzing the results, it is observed that, as a general rule, the Alexa variant for U.S. Spanish, identified here as Alexa US, performs the worst of the three Alexa variants for every analyzed accent. For the other two variants, for Mexican Spanish and Spanish from Spain, identified here as Alexa MX and Alexa ES, respectively, the comparison of means reveals similar performance, with a slight advantage for Alexa ES over Alexa MX, except in the case of females from the Canary Islands, where Alexa MX performs better. The comparison of medians shows virtually the same result, apart from several cases in which Alexa MX and Alexa ES share the same median. Additionally, there are two cases where Alexa MX outperforms Alexa ES, both involving the Canarian accent (one per gender), although the difference is more pronounced for females.
When comparing the results of the different Alexa variants with those of Whisper (Base), Whisper generally performs significantly better than Alexa for every analyzed accent, except for the Southern Iberian Peninsula variant spoken by males. Median WER values are around 30% for Alexa and about 15% for Whisper; in general, Whisper (Base) makes approximately half as many word errors as Alexa for the Spanish language.
In the case of Whisper (Large-v2), the results improve significantly compared to Whisper (Base), with a mean WER below 10% for both women and men. Particularly telling are the medians of Whisper (Large-v2) for both genders, which reach 0% in all cases except the Spanish of the Southern Iberian Peninsula.
These data reveal a clearly identifiable outlier for Whisper, in both the Base and Large-v2 models, for the Southern Iberian Peninsula accent when spoken by males. An inspection of the Common Voice 14 dataset shows that thousands of contributions could come from the same person, which may have skewed the final result for this accent.
Considering that the selected sample for males with the Southern Iberian Peninsula accent comprised 30,698 audio segments for Whisper (Base), 11,600 for Whisper (Large-v2), and 2150 for Alexa, it is plausible that this influence affected the Alexa analysis far less than the Whisper analyses. Had the results for females also deteriorated for this accent, there would have been no reason to assume that something was specifically affecting the male results.
When the same random sample used for Alexa is applied to Whisper (Base) for the Southern Iberian Peninsula accent, the results shown in Table 15 are obtained. Clearly, there is a bias in the dataset affecting the results: the median WER for Whisper (Base) becomes 16.67%, much closer to the results obtained for females with this accent. For Whisper (Large-v2), similar results were obtained, with a median of 0.00%, again quite similar to the female results; these data can be seen in Table 16.
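A minimal sketch of how such a size-matched random subsample could be drawn with pandas is shown below; the file name, column names, and seed are hypothetical assumptions.

```python
# Minimal sketch: drawing a random subsample of the Whisper (Base) results
# matched in size to the Alexa sample (n = 2150). File and column names are
# hypothetical; WER is assumed to be stored as a fraction.
import pandas as pd

df = pd.read_csv("whisper_base_results.csv")
subset = df[(df["accent"] == "southern_peninsular") & (df["gender"] == "male")]
matched = subset.sample(n=2150, random_state=42)  # fixed seed for reproducibility

print(f"Median WER on the matched sample: {matched['wer'].median():.2%}")
```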
If the weighted means previously shown in Table 14 and Figure 5 are recalculated using this correction for this accent, the resulting data are shown in Table 17 and Figure 6.
Analyzing the weighted means shown in Table 17 and Figure 6, we can observe that, in all cases, for the same Alexa variant, female voices are recognized slightly better than male voices; that is, the mean WER is lower for women than for men. For Whisper, the opposite is true: the mean WER is slightly lower for men than for women.
More specifically, in Alexa, the differences between the mean WER for female and male voices are 3.24 percentage points (pp), 2.69 pp, and 2.88 pp for Alexa MX, Alexa ES, and Alexa US, respectively, in favor of female voices. For Whisper, these differences are 1.19 pp and 1.58 pp for the Base and Large-v2 models, respectively, but in favor of male voices.
Table 18 and Figure 7 show the averaged data for each Alexa and Whisper variant, regardless of gender. In these weighted mean WER values per accent, the Alexa US model performs the worst in all cases compared to the other variants. In all remaining cases, the Alexa ES variant is slightly better than the Alexa MX variant, except for the Canarian accent, where the opposite holds.
It is noticeable that, in some instances, the difference in weighted mean WER between accents is considerable; for Alexa ES, for example, it can reach up to 6.76 pp when comparing the Northern and Southern Spanish accents. For Alexa MX, the difference between the same two accents can be as high as 7.64 pp, and for Alexa US, 7.98 pp. Comparable differences are observed across the other accents as well.
In Whisper (Base), differences of approximately 8 pp are observed between the Northern Spanish accent and the Caribbean, Mexican, and Southern Spanish accents. In Whisper (Large-v2), a difference of around 4 pp is noted between the Canarian and Caribbean accents.
When analyzing the 95% confidence intervals calculated for each available option, it is observed that, overall, the samples are well represented by these statistical measures. A wider confidence interval stands out for Whisper Large-v2 with females, especially for the Caribbean, Central American, and Central Spanish accents, and to a lesser extent for the Northern Spanish accent. Although these intervals cover a broader range than desired, they are not large enough to invalidate this portion of the results. In the remaining cases, for both genders and all analyzed tools, the confidence intervals are below 2, and in many cases below 1, with a slight exception for Canary Islands females on the Alexa variants, where we obtain 2.88 for MX, 3.13 for ES, and 3.49 for US. However, considering that the mean WER values for these variants are 35.46, 42.52, and 51.19, respectively, it is concluded that these results remain meaningful.
Previous studies have explored gender and accent bias within the same language. For instance, an examination of the DeepSpeech (STT) model using data from the Mozilla Common Voice project revealed bias between different English variants, specifically US and Indian English; however, that study concluded that there is no significant evidence of gender bias in the DeepSpeech model [33].
Another study, which assessed YouTube’s ASR system for transcribing speech to text in videos uploaded to the platform, found a gender bias disadvantaging women compared to men. It also identified accent bias among the studied variants, particularly to the disadvantage of the Scottish accent [31].
A separate study demonstrated that underrepresenting one gender, for example the female gender, in the dataset used to train an ASR system results in a higher WER for that gender, indicating the presence of gender bias [34].
6. Conclusions
Based on the observed data, it can be asserted that Alexa recognizes female speech in Spanish better, while Whisper exhibits better performance for male speech. However, with a mean difference of 2.94 pp for Alexa (favoring females) and 1.39 pp for Whisper (favoring males), we consider these differences insufficient to conclude that Alexa and Whisper exhibit gender bias for the Spanish language, or at least not a bias that significantly affects the everyday functionality of these tools. This small difference could also be attributed to sampling error, especially considering the overall WER of both Alexa and Whisper.
More significant differences are observed regarding accents, reaching up to 8 pp in some cases. As a general rule for Alexa, there appears to be a bias in favor of the Northern Spanish accent, primarily against the Southern Spanish accent but also against other accents such as the Caribbean, Central American, and Canarian ones. For Whisper (Base), there seems to be a potential bias in favor of the northern accent over the southern accent of Spain. For Whisper (Large-v2), this bias is mainly in favor of the Canarian accent as opposed to the Caribbean accent.
An interesting observation in the Whisper (Large-v2) data is that the Canarian accent is precisely the most accurately recognized, despite being the dialectal variant with the fewest speakers: the Canary Islands have just under 2.2 million inhabitants [42], compared to the nearly 500 million people with Spanish as their native language [43].
Concerning the three Alexa models tailored for Mexican Spanish, Spanish from Spain, and Spanish from the United States, there appears to be no compelling reason to maintain them as separate entities for the Spanish language. It would be preferable to maintain a single optimized model, similar to the approach taken by Whisper. Then, if the goal is for Alexa to speak in different dialects, a text-to-speech (TTS) system trained specifically for each accent in question would be more suitable.
One of the fundamental limitations of this project arises from the fact that the audio files included in the Common Voice dataset are derived from segments of read texts, in contrast to the natural way of interacting with a voice-activated virtual assistant, which typically involves conversations. Another limitation is that the number of available audio fragments is not comparable between those read by males and females, and a similar issue exists with the distribution across different accents.
Concerning future research directions, an obvious avenue would be to extend the study to other languages and their respective dialects, enabling a more comprehensive comparison. Another potential extension could be a comparison specific to Alexa, examining the speech-recognition error rate when using a set of predefined words versus the unrestricted vocabulary addressed in this study; in many cases, Alexa Skills employ predefined sets of words to facilitate the interaction flow.
Analyzing automatic speech recognition with other datasets and for the various languages available for each system would enable us to determine whether progress is genuinely being made in eliminating gender bias in ASR systems or if, in this case, these results are specific to the Spanish language and these two particular tools.