Article

Deep Speech Synthesis and Its Implications for News Verification: Lessons Learned in the RTVE-UGR Chair

1 Department of Software Engineering, University of Granada, 18071 Granada, Spain
2 Monoceros Labs, 28014 Madrid, Spain
3 Corporación Radiotelevisión Española (RTVE), 28223 Madrid, Spain
4 Research Centre for Information and Communications Technologies (CITIC-UGR), 18071 Granada, Spain
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(21), 9916; https://doi.org/10.3390/app14219916
Submission received: 10 July 2024 / Revised: 17 October 2024 / Accepted: 24 October 2024 / Published: 30 October 2024
(This article belongs to the Topic Artificial Intelligence Models, Tools and Applications)

Abstract

This paper presents the multidisciplinary work carried out in the RTVE-UGR Chair within the IVERES project, whose main objective is the development of a tool for journalists to verify the veracity of the audios that reach the newsrooms. In the current context, voice synthesis has both beneficial and detrimental applications, with audio deepfakes being a significant concern in the world of journalism due to their ability to mislead and misinform. This is a multifaceted problem that can only be tackled by adopting a multidisciplinary perspective. In this article, we describe the approach we adopted within the RTVE-UGR Chair to successfully address the challenges posed by audio deepfakes, involving a team with diverse backgrounds and a specific methodology of iterative co-creation. As a result, we present several outcomes, including the compilation and generation of audio datasets, the development and deployment of several audio fake detection models, and the development of a web audio verification tool addressed to journalists. Finally, we highlight the importance of this systematic collaborative work in the fight against misinformation and the future potential of audio verification technologies in various applications.

1. Introduction

Speech synthesis, which involves transforming text into audio, has advanced considerably in recent years thanks to the adoption of deep learning algorithms and the availability of many hours of good-quality audio recordings [1]. The synthesis approach that exploits these latest advances is known as “deep speech synthesis” [2,3,4], and it generates artificial voices that are increasingly natural and customizable, both for machines and for humans. On the one hand, it facilitates communication with assistants and robots, making their utterances more intelligible and expressive through subtle prosody elements [5]. On the other hand, these voices are also being used to give voice to people with speech production problems [6], since the generated voices can be adapted to different styles and speakers [7].
A specific aspect of speech synthesis is voice cloning, which allows for generating a synthetic voice that copies a person’s voice [8]. This has many positive applications: it eases the generation and editing of audio content, for example by generating voices with different patterns for different characters in storytelling [9] or by improving the quality of a recording [10]; it enables producing audio recordings of a target person in different languages and accents [11]; and it supports the generation of personalized voices for people with disabilities, among others [12]. However, it can also be used for malicious purposes such as impersonation.
Traditionally, producing a good clone required many hours of audio and considerable computing power. However, various techniques now allow for sufficiently natural cloning with fewer resources, e.g., by reusing voices from third parties and adapting them to the characteristics of the target voice [13]. This makes it possible to obtain quite natural recordings with relative ease, especially for people with audiovisual content accessible on the Internet, such as celebrities or politicians. In the communication sector, the rise of new tools to synthesize voices with artificial intelligence (AI), many of them free and easily accessible [14], has aggravated the problem of fake audios or “audio deepfakes” [15], posing a significant threat to the integrity of information [16]. For example, in journalism, the dissemination of manipulated audio can undermine public trust in the media, spreading misinformation and affecting public opinion on critical issues [17,18]. To give some examples, in 2019, a scammer used a deepfake voice to impersonate a German executive and order a transfer of 220,000 euros to a Hungarian supplier (https://www.wsj.com/articles/fraudsters-use-ai-to-mimic-ceos-voice-in-unusual-cybercrime-case-11567157402, accessed on 23 October 2024). More recently, in January 2024, a robocall with a fake voice of President Biden asked citizens not to vote in the New Hampshire primary (https://www.nbcnews.com/video/listen-fake-biden-robocall-tells-new-hampshire-not-to-vote-in-primary-202609733664, accessed on 23 October 2024).
The ability to distinguish between authentic and manipulated content is therefore more important than ever in the fields of security and communication, which makes audio verification especially relevant [19]. In journalism, ensuring the authenticity of an audio is essential to maintain the credibility and trust of the public [20]. In the legal field, manipulated audio can have significant implications in trials and legal procedures and can lead to decisions based on falsified evidence. In the security sector, unverified audio can be used for scams, phishing, and biometric spoofing [21]. Several studies indicate that humans remain highly confident in their ability to distinguish fake from real audio, a confidence that has proven misplaced, since there is scientific evidence that in many cases even professionals are unable to do so [21,22,23]. Therefore, it is necessary to complement our judgment with detection tools that help determine whether an audio is real or not.
To address these challenges, in recent years the scientific community has been researching and developing various audio verification techniques, datasets, and tools [21,24]. These include spectral analysis, which examines the unique characteristics of sound to identify irregularities, and voice biometrics, which uses the unique characteristics of a person’s voice to verify their identity. Additionally, artificial intelligence and machine learning play a very important role in detecting audio “deepfakes” that may not be identified by humans [25]. However, despite their importance, there is limited research dedicated specifically to audio deepfakes compared to the attention received by video or image deepfakes [25]. This highlights the need for greater attention and resources in this area to develop more effective detection methods. Furthermore, in a recent report on disinformation [26], the OECD warns that the scale, speed of generation, and improvement in the quality of deepfakes require the adoption of effective measures that include the development of infrastructure and technological tools in public administrations that allow for the integrity of the information to be safeguarded and verified.
As a coordinated multidisciplinary effort to face these challenges, the IVERES project (“Identification, Verification and Answer. The democratic state faces the challenge of self-interested disinformation”) was proposed (https://iveres.es/, accessed on 23 October 2024), financed by the Spanish Ministry of Science and Innovation within the R&D&I program oriented to the challenges of society. IVERES has recently been awarded the National Prize for Computer Engineering. The RTVE-UGR Chair has collaborated in this project with the purpose of addressing the challenges related to deep audio fakes.
The RTVE-UGR Chair in deep speech synthesis and conversational AI and its applications in news verification (https://catedrartve.ugr.es/, accessed on 23 October 2024) is a collaboration between the Spanish Radio and Television Corporation (RTVE) and the University of Granada (UGR), with the collaboration of Monoceros Labs. The chair was established in 2022 to promote research, technological development, innovation, and training in the field of speech and conversational technologies in the media. Among its objectives, those specifically related to audio fakes are to:
  • Develop and apply advanced methodologies to verify the authenticity of audio recordings, identifying manipulations, deepfakes, and any type of fake audio.
  • Test and evaluate existing fake audio detection models to determine their effectiveness.
  • Generate audio datasets using artificial intelligence techniques to train the selected models with voices of interest.
  • Train and optimize models to improve their ability to detect fake audio.
  • Create useful and accessible tools for journalists that facilitate the quick and effective verification of audios.
  • Train journalists in the use of these tools.
One of the most relevant elements of the chair is its multidisciplinary nature and the importance given to collaborative work and the creation of synergies between journalists from RTVE, companies specialized in deep speech synthesis such as Monoceros Labs, and the research carried out by the University of Granada. The remainder of this paper contextualizes our contribution to the state of the art and presents how this collaborative work allows us to promote the creation of robust resources adapted to the needs of verification teams, showing that the union of diverse capabilities and perspectives can contribute significantly to the integrity and veracity of content in the digital age. Section 2 presents a detailed state of the art of the verification and detection of deep audio fakes and its main applications in journalism. Section 3 and Section 4 describe our approach of multidisciplinary collaboration toward deep audio detection and the results achieved following our pipeline. Section 5 presents the evaluation of the resulting deepfake detection models. Finally, Section 6 offers the conclusions we have drawn and suggests several directions for future research.

2. State of the Art

The RTVE-UGR Chair presents a multidisciplinary vision of the detection of audio deepfakes, addressing the subject from both a journalistic and a technological point of view. Accordingly, the state of the art and related work are presented below from both perspectives, together with a review of current initiatives that, like IVERES and the RTVE-UGR Chair, address this challenge from an integrative perspective.

2.1. Verification of Deep Audio Fakes: A Challenge for Journalism

As described in the Introduction, concern about deepfakes has increased significantly, boosting the relevance of the journalistic work of verifying information and news. In Spain, organizations such as Verifica RTVE, the EFE Verifica agency, Maldita.es, and Newtral play a crucial role in this context, constantly working to identify and expose false or manipulated content [27]. Internationally, institutions such as the International Fact-Checking Network (IFCN) have been pioneers in this field, providing resources and standards that help newsrooms maintain the accuracy of information. These groups specialize in analyzing and verifying information, offering an essential service to maintain the integrity of public discourse and trust in the media.
Internationally, several verification organizations offer tools and resources to combat disinformation and fake news. For example, the Duke Reporters’ Lab (https://reporterslab.org/, accessed on 23 October 2024), a journalism research center at Duke University’s Sanford School of Public Policy, offers tools such as ClaimReview for tagging articles on social media and search engines, and MediaReview, Squash, and FactStream for data verification of images, videos, and speeches, respectively. Additionally, this laboratory has established important collaborations with other large fact checkers such as PolitiFact [28] and FactCheck.org, creating a robust verification ecosystem.
Likewise, the International Fact-Checking Network (IFCN), created by the Poynter Institute in 2015, promotes good journalistic practices regarding disinformation and verification of information. Organizations such as these establish codes of principles that all members must follow, thus ensuring a consistent and transparent methodology in fact-checking. This network also facilitates collaboration between its members, allowing for the exchange of advanced technologies and methodologies in the detection of disinformation.
In Spain, initiatives such as Verifica RTVE and EFE Verifica stand out in the fight against misinformation and fake news. EFE Verifica is dedicated to verifying the information that circulates on the Internet and social networks [29]. On the other hand, Verifica RTVE not only focuses on news verification, but is also actively involved in educating the public on how to identify fake news and better understand media in the digital age [30,31]. For example, Verifica RTVE has an openly accessible website in which they specify all the tools that can be used to verify all types of content (https://www.rtve.es/noticias/verificartve/, accessed on 23 October 2024).
Traditionally, the focus has been almost exclusively on video and image deepfakes, with less emphasis on audio, which until recently had a more modest presence in newsrooms [32]. However, with the growing appearance of fake audio, newsrooms are incorporating technological tools to help journalists detect these fakes. Tools such as the voice classifier from ElevenLabs, AI or Not, Voice Classifier from Play.ht, VerificAudio from PRISA media, and Resemble Detect from ResembleAI are examples of how technology is supporting journalistic work. These systems analyze the characteristics of voice recordings and determine whether they have been generated by AI, providing an additional verification signal that journalists can use to confirm the authenticity of the audio received.
Nevertheless, these tools can only be considered a complement to journalistic work: as no tool has a zero error rate, it is still necessary to understand the context in which deepfakes are generated in order to decide on the veracity of the information content. This highlights the need for continuous training of journalists to be acquainted with deepfake techniques and the adequate use of automated detectors [31], so that they are not vulnerable to increasingly sophisticated disinformation tactics [33].
Our work aims to address these gaps by defining an approach to multidisciplinary collaboration, wherein journalists and technologists contribute their complementary vision to the development and adoption of audio deepfake detection models and tools specifically designed for the journalistic environment.

2.2. Detection of Audio Fakes: A Technological Challenge

Within the technological field, fake audios have been studied for decades. Various initiatives and competitions, such as the ASVspoof Challenge or Audio Deepfake Detection (ADD), have promoted the development and evaluation of solutions to address voice spoofing [34,35]. In these challenges, the participating research groups compare the effectiveness of their solutions on a common dataset. ASVspoof focuses on voice spoofing detection, while ADD specifically addresses audio deepfakes [36].
The tasks posed in these challenges have varied over time, reflecting the main concerns of the moment regarding audio synthesis. The first editions focused on “audio spoofing”, in which attackers attempt to deceive biometric authentication systems by generating audio that imitates the voice of a person authorized to access a system, with serious implications for security and privacy. The ASVspoof 2015 antispoofing challenge pioneered the introduction of databases containing examples of TTS (text-to-speech) and VC (voice conversion) attacks, establishing an initial benchmark for detection techniques [37]. Replay attacks were also addressed (ASVspoof 2017); these simulate a more practical and common scenario wherein an attacker plays back an authentic voice recording. This posed new detection challenges, as replayed audios can be of high quality and harder to differentiate from original recordings than those obtained through synthesis or conversion [38]. Subsequently, tests were introduced for both logical access (LA; speech synthesis/conversion attacks) and physical access (PA; replay attacks) scenarios. With the participation of 63 research teams, ASVspoof 2019 marked significant progress in protecting against fake audio and impersonation threats [39].
With the development of AI and deep audio synthesis, the challenges have changed. ASVspoof 2021 was the first edition that did not provide training data, reflecting the more realistic conditions wherein spoofing attacks and deepfakes cannot be predicted with certainty [34]. In 2022, ADD, a challenge specifically devoted to detecting audio deepfakes, also appeared. ADD2022 stood out for its focus on completely and partially fake audio under realistic conditions, such as noise and background music, as well as on the generation of audio that fools detection models [36]. ADD2023, unlike the previous edition, was not limited to the binary classification of audios as real or fake: it also addressed the localization of manipulated segments within partially fake audios and the identification of the techniques used to generate them [35].
These competitions promote innovation and encourage the creation of new audio deepfake detection methods, provide benchmarks for an objective and standardized evaluation of different approaches, and encourage collaboration within the scientific community. By offering common datasets and metrics, they facilitate direct comparison of these tools and highlight the relevance of continued research in this field.
For this reason, the results of these challenges are a good indicator of the most advanced techniques in the state of the art. Several techniques presented as part of the challenges have subsequently been adopted with good results in academia or industry. For example, ASSERT, developed by the Johns Hopkins Whiting School of Engineering, combines squeeze-excitation and residual networks to detect fake audio with high precision [40]. Similarly, Attentive Filtering Networks incorporate attentive filtering techniques in combination with residual networks to improve the detection of replay attacks, achieving an error rate of 8.99% in ASVspoof 2017 [41]. Finally, the CRIM system for ASVspoof 2021 combines residual neural networks, time delay networks, and techniques such as HOSP, improving robustness in adverse conditions and providing effective solutions to counteract voice spoofing [42].
On the other hand, deep learning and biometric technologies have improved the accuracy of fake voice detection. ID R&D (https://www.idrnd.ai/id-rd-ranks-first-in-detecting-synthetic-speech-in-global-asvspoof-challenge/, accessed on 23 October 2024) achieved the highest accuracy in ASVspoof 2019 with an error rate of 0.22%, using advanced synthetic voice detection techniques under the logical access (LA) condition. RawGAT-ST and RawNet focused on the analysis of speech patterns and spectro-temporal features using graph attention networks and raw audio analysis, respectively. RawGAT-ST achieved a 1.06% error rate at ASVspoof 2019 [43], while RawNet achieved the second best overall score [44].
Finally, other tools have focused on improving robustness and detection in challenging environments. The UR Channel-Robust Synthetic Speech Detection System, presented at ASVspoof 2021, focused on robustness to variability in the transmission channel, achieving an error rate of 5.46% in the logical access condition [45]. DKU-CMRI and Biometric Vox apply outlier detection models and advanced neural network architectures to address challenges at ASVspoof2021, achieving competitive error rates in logical and physical access detection tasks [46,47].
In this context, the FastAudio tool [48], which was presented at ASVspoof 2021, was selected for our project due to its open-source nature and the good results it showed in our tests.
Initiatives like ASVspoof and ADD play a crucial role in the innovation and development of technologies to detect synthetic voices and audio deepfakes, which are critical in the fight against disinformation and fake news. As we have seen, in recent years, advances in audio deepfake detection have been significant thanks to the use of advanced AI techniques. However, existing difficulties persist and new challenges arise that limit the capabilities of these models.
One of the most complicated aspects of detecting audio deepfakes is the variety of types of manipulations. For example, detecting voice conversion (VC) attacks turns out to be more complex than detecting text-to-speech (TTS) synthesis, since VC attacks usually present greater diversity, which makes their detection difficult [49]. Detection models also face adverse conditions, such as noise in the audio, echo, or variability in the transmission channel, that further hinder the detection process.
Another important limitation that we detected in the process of developing technology for audio deepfake detection is the scarcity of datasets. Most models are trained with a set of very similar and limited datasets, which does not reflect the diversity of situations found in audio deepfake attacks in the real world. Furthermore, finding fake audio datasets for specific people or in languages other than English (in our case, Spanish) is even more complicated. Finally, training these models also requires great computing capacity and, in turn, economic resources, which is another major limitation for many projects [50].

2.3. Deep Audio in News Verification: Multidisciplinary Initiatives

The IVERES project seeks to promote cooperation between universities, technology companies, and the media to develop innovative tools and methods in the fight against disinformation. Within this framework, the RTVE-UGR Chair plays a crucial role, combining academic and practical resources to create a robust audio verification environment.
In this context of developing tools for detecting fake audio, multidisciplinary projects such as the RTVE-UGR Chair highlight their fundamental value. These collaborations, which integrate the knowledge and skills of computer scientists and journalists, both researchers and practitioners, are essential to face current and complex challenges such as audio verification.
Several projects funded by the European Union (EU) have adopted this multidisciplinary vision to address news verification, and some of the most relevant are the following:
  • InVID: Project that developed a set of software tools to help journalists verify and legitimize amateur video and social media content. Includes features such as social media analysis, video forensics, and digital rights management [51,52].
  • WeVerify: This project expanded and improved existing verification tools, including a web browser plugin, a deepfake detector, and a domain credibility service [53].
  • FANDANGO: In this project, the spread of fake news on social networks was analyzed using big data and artificial intelligence techniques. It also developed tools to detect and combat disinformation [54].
  • AI4Media: Focuses on research and training in artificial intelligence for the media sector. Its objective is to develop tools and resources that help the media combat misinformation and improve the quality of information [55].
  • IBERIFIER (https://iberifier.eu/observatorio/, accessed on 23 October 2024): A digital media observatory focused on Spain and Portugal, supported by the European Commission and part of the European Digital Media Observatory (EDMO). It is coordinated by the University of Navarra and brings together twelve universities, five organizations dedicated to news verification, several news agencies, and six multidisciplinary research centers. Its objectives include research into the characteristics and trends of the digital media ecosystem, the development of technologies for the detection of disinformation, the verification of disinformation, the preparation of reports on disinformation threats, and the promotion of media literacy initiatives.
In the RTVE-UGR Chair, we seek to integrate the strengths observed and address the weaknesses found in these projects by being more specific about the roles and collaborative tasks. Given the current challenges, describing specific approaches to multidisciplinary work to combat deepfakes represents a great opportunity to innovate and improve audio verification practices [24]. We envisioned the following objectives for the Chair:
  • Incorporate emerging technologies: Explore new applications of artificial intelligence in analysis and detection of fake audio, ensuring that this technology is useful and effective.
  • Promote multidisciplinary collaboration: We seek to contribute multidisciplinary approaches and strengthen our collaboration with other universities in the IVERES project as well as with the rest of the multidisciplinary team of the RTVE-UGR chair.
  • Training and dissemination: We aim to train journalism professionals in the use of tools for news verification as well as the ability to understand and evaluate the ethical implications of these technologies and disseminate the advances of this chair.
With these objectives, the RTVE-UGR Chair, in addition to audio verification, seeks to equip the media with knowledge about the use of these tools while offering assistance and training to verification teams. This effort is guided by the established literature, which emphasizes that automated fake detection should not be blindly trusted and must always be cross-checked by human evaluators [56].

3. Our Approach to Multidisciplinary Collaboration Toward Deep Audio Detection

The RTVE-UGR Chair team includes staff from UGR, RTVE, and Monoceros Labs. The UGR leads the research and training of deepfake detection models. Monoceros Labs contributes with its experience in generating voice clones and synthetic voices, essential for training and improving these models. RTVE guides the applicability and relevance of the developments, ensuring that they adjust to the real needs of journalists. We believe that this collaboration between academia, industry, and the media promotes innovative and effective solutions for the verification of audios, taking into account both the technical challenges and the journalistic implications.
The collaborative process of this multidisciplinary team has been enhanced through biweekly virtual meetings and some in-person workshops throughout the project. These meetings have been essential for the exchange of ideas, problem solving, and strategic decision making, ensuring cohesion and constant progress toward the objectives of the chair. The alternation between virtual and in-person meetings has facilitated fluid and effective communication, allowing for the active participation of all members, regardless of their geographical location (with members in Granada, Madrid, and Barcelona).
To document the progress of the project and ensure detailed tracking of progress, we implemented a comprehensive documentation system. Before and after each meeting, presentations are prepared and minutes are written, which are shared with all team members through a project management platform. This documentation not only serves to keep the entire team informed about the current status of the project, but also makes it easier to review past decisions and plan for the future.
Table 1 lists all the roles that have been part of this multidisciplinary team and the tasks performed within the RTVE-UGR Chair. As can be observed, eight roles have contributed their complementary experience within the Chair.
The RTVE-UGR Chair has defined a comprehensive methodology to leverage our multidisciplinary team in order to generate training and test datasets encompassing cloned voices, train artificial intelligence models to detect fake audio, and facilitate the use of these models and third-party tools through a web application integrating the whole verification ecosystem. An overview of this process is presented in Figure 1, organized into three main stages: research and identification of voices, datasets and models, and web tool. Detailed descriptions of each step are presented below in their corresponding subsections. Prior to this, it was necessary to establish a common understanding of the concepts of audio deepfake and deepfake detection, since recent studies have highlighted that even the definition of deepfake can vary significantly across disciplines [57].

3.1. Research and Identification of Voices

First, we started with a review of related projects to analyze the status of the topic and identify potential contributions of the RTVE-UGR Chair to advance the state of the art. Once we had clear objectives, we performed a systematic search and testing of the tools available to detect fake voices. In addition, we located and gathered publicly available corpora of both real and fake audios, which we later complemented with our own generated samples to train and test our models and tools (see Section 3.2 and Section 4.1). Our multidisciplinary team of computer science (CSR) and journalism (JR) researchers and experts in speech synthesis and voice cloning (SSVC) was responsible for these tasks.
The datasets available in the literature are generic, whereas we were interested not only in obtaining a generalist fake audio detector, but also in being able to discern real from fake for the specific voices of relevant persons with specific models. Thus, for the collection and generation of audios for these voices, we first needed to know which personalities were frequent subjects of fakes, and whose specific models would therefore be more useful for the verification team. The list of personalities was provided by VerificaRTVE, the RTVE News Verification team (NVJ), on the basis of their own experience in the newsroom. The list included the King of Spain, Felipe VI; the Prime Minister, Pedro Sánchez; the Minister Yolanda Díaz; and the President of the main opposition party, Alberto Núñez Feijóo.

3.2. Datasets and Models

As shown in Figure 2, we worked with a data collection composed of two different corpora: the ASVspoof 2019 and our own corpus. ASVspoof contains more than 1.5 million audios, 145,669 real and 1,420,604 fake, and was used together with our own data to generate the general fake detection model. Our own corpus was created by Monoceros Labs, as described below, and contains 1402 real and 2097 fake audios for our voices of interest.
For the collection and generation of our corpus, for each voice the experts in speech synthesis and voice cloning (SSVC) selected high-quality real voice samples, extracted mainly from speeches and parliamentary interventions. Special emphasis has been placed on selecting audios without background noise and, as far as possible, already transcribed. This careful selection ensures a good basis for the efficient and accurate creation of fake voices, facilitating the synthesis and training process.
From these recordings, Monoceros Labs generated models for each person of interest that were used to produce fake audio samples generated with different techniques, as described in more detail in Section 4.1.
Once the corpora were compiled, we proceeded to train the models. In this phase, computer science researchers (CSRs) and experts in speech synthesis and voice cloning (SSVC) intervened. Before training our own models for the selected persons of interest, we evaluated the selected FastAudio tool by replicating the results of their paper using the public audios of ASVspoof2019, which allowed us to verify the effectiveness of the initial configuration.
The next step was to train the different proposed models with a combination of different training and test partitions including varied mixtures of the general ASVspoof data and our own samples. This iterative approach allowed us to refine the models until we obtained satisfactory results.
Finally, the Backend developer (BD) and Systems engineer (SE) were in charge of deploying the models on the server to make them accessible from the web audio verification tool developed.

3.3. Web Audio Verification Tool

Within the RTVE-UGR Chair, we developed a web audio verification tool addressed to journalists. This web tool provides seamless connection to several detection models through APIs. We provide access to our own models (a general detection model and a model for each person of interest) and third-party audio verification models.
At the moment, we integrate the solution by Loccus.ai, which offered us access to its API for testing. The decision to formalize an agreement with Loccus was based on the good results obtained during the testing phase.
The integration of Loccus has enriched our workflow by allowing us to contrast audios using two different tools, which reinforces the acceptability of our results.
The Web developer (WD), Backend developer (BD), Systems engineer (SE), and experts in speech synthesis and voice cloning (SSVC) participated in this stage. Moreover, all journalists involved in the project participated in the testing and evaluation of this tool.

4. Results Achieved Following Our Pipeline

In the previous section, we described our workflow; next, we will describe in more detail the results achieved with it.

4.1. Datasets and Models Used

As mentioned in previous sections, our main datasets are composed, on the one hand, of the set of audios, both real and fake, provided by ASVspoof and, on the other hand, of a set of fake audios corresponding to persons of interest plus a small compilation of fake audios found on social networks or received in the RTVE newsroom.
The audio corpus generated by Monoceros Labs to train the fake audio classifier contains both real audio from public interventions and audio generated with synthetic voice models. For the latter, they adopted two different techniques: creating voice synthesis models (TTS) to obtain audio that imitates all the characteristics of the prosody of the voice, including intonation; and creating voice conversion (VC) models, to obtain audio that imitates only the timbre of the voice. We adopted these different techniques in order to obtain detection models that are able to identify different types of fakes.
The cleaning and preparation of the data prior to the creation of the voice models, as well as the training of the models themselves, were kept to the minimum necessary, in order to obtain results of a quality similar to what non-experts could achieve, since these are the most common types of fakes that arrive at newsrooms. In addition, Monoceros Labs has worked with these audio data and voice models in a private and secure manner, with its own local infrastructure and without access to public clouds, to ensure that these results are not used for purposes other than those intended by the Chair.
Although the audio provided by Monoceros Labs was already of high quality, a normalization process was carried out to standardize the sound with that of the authentic recordings. All audio was converted to FLAC format, with a 16 kHz sampling rate and a single (mono) channel, to meet the requirements of our detection model. Subsequently, the audios were tagged in files following specific guidelines to indicate their nature (real or fake) to the detector in the training phase.
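As an illustration, this conversion step can be reproduced with standard tooling. The following is a minimal sketch using the pydub library; the paper does not state which tool was actually used, so the library choice and function name are our assumptions:

```python
from pathlib import Path

from pydub import AudioSegment  # requires ffmpeg for FLAC export


def to_16khz_mono_flac(src: str, dst_dir: str) -> Path:
    """Convert a recording to 16 kHz mono FLAC, the format
    required by the detection model (illustrative sketch)."""
    audio = AudioSegment.from_file(src)
    audio = audio.set_frame_rate(16000).set_channels(1)
    out = Path(dst_dir) / (Path(src).stem + ".flac")
    audio.export(str(out), format="flac")
    return out
```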
Regarding the model for detecting fake audio, we selected FastAudio, a learnable audio front-end designed to improve fake audio detection. This technology differentiates itself from traditional fixed front-ends by being able to learn and adapt specifically to the characteristics of spoofing threats, resulting in greater effectiveness in identifying audio fakes.
Technologically, FastAudio replaces fixed filter banks, which are typically used in other audio processing systems, with learnable filter layers. This allows the model to dynamically adjust the characteristics it extracts from the input audio during the training process, thus being able to focus on the most appropriate characteristics of the audio to classify them as fake or true. In addition, FastAudio uses advanced machine learning techniques such as short-time Fourier transforms (STFTs) to process the audio signal and apply logarithmic compression, imitating the non-linearity in the perception of sound intensity of the human ear, that is, techniques to simulate how the human ear perceives sound [48].
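A conceptual sketch of such a learnable front-end is shown below in PyTorch. This is not the actual FastAudio implementation, only an illustration of the STFT, learnable filter bank, and log-compression stages described above; the dimensions and the random initialization are assumptions made for brevity:

```python
import torch
import torch.nn as nn


class LearnableFrontEnd(nn.Module):
    """STFT -> learnable filter bank -> log compression."""

    def __init__(self, n_fft: int = 512, n_filters: int = 64):
        super().__init__()
        self.n_fft = n_fft
        # In FastAudio-style front-ends, the filter bank would be
        # initialized from a fixed bank (e.g., Mel filters) and then
        # refined during training; random init is used here for brevity.
        self.filters = nn.Parameter(torch.rand(n_filters, n_fft // 2 + 1))

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) at 16 kHz
        window = torch.hann_window(self.n_fft, device=waveform.device)
        spec = torch.stft(waveform, n_fft=self.n_fft, window=window,
                          return_complex=True).abs() ** 2  # power spectrogram
        feats = torch.matmul(self.filters, spec)  # learnable filtering
        # Log compression mimics the ear's non-linear loudness perception.
        return torch.log(feats + 1e-6)
```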

4.2. Description of the Resulting Detection Models

After replicating the results described in the original FastAudio study, we took advantage of their training tool to use it with our data as described below. For our tests, we used FastAudio to train models using the configuration provided in its official repository (https://github.com/magnumresearchgroup/Fastaudio, accessed on 23 October 2024), without modifying the model’s hyperparameters or architecture; the only changes made were to the training and testing datasets.

Implemented Models

We developed four specific models, each trained exclusively with the data of one person of interest. This makes it possible to fine-tune the model’s detection ability to the vocal characteristics of each person.
Additionally, we created a general model that integrates the real and fake audios of the four aforementioned personalities and other fake audios collected from social networks. This model seeks to provide the possibility of classifying voices different from those considered as persons of interest.
The development of these models follows this pipeline:
  • Audio labeling: Each audio sample is labeled as “authentic” or “fake”. This process is crucial to ensure the quality of training and evaluation, since the model needs to know whether each audio is real or fake at training time.
  • Pre-processing: Audios are processed to normalize features such as volume and to extract relevant features using FastAudio. This step is critical to transform the data into a form that the models can process effectively.
  • Model training: The model is trained with 80% of the labeled and preprocessed data.
  • Testing: The models are tested with the remaining 20% of the data to evaluate their accuracy and generalization ability.
  • Deployment: Once validated, the models are deployed on a server, where they can be accessed through an API to perform verifications by sending an audio from the web tool.
Figure 3 shows a diagram of this pipeline.
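As an illustration of the partitioning used in steps 1-4, the following sketch shows how a labeled corpus could be split. The file/label representation and the stratified split are our assumptions; the actual training runs through the FastAudio toolkit:

```python
from sklearn.model_selection import train_test_split


def build_partitions(samples: list[tuple[str, str]]):
    """samples: (audio_path, label) pairs with label in
    {"authentic", "fake"}, as assigned in the labeling step."""
    paths = [path for path, _ in samples]
    labels = [1 if label == "fake" else 0 for _, label in samples]
    # 80% training / 20% testing, stratified so that both partitions
    # keep the same proportion of real and fake audios.
    return train_test_split(paths, labels,
                            test_size=0.2, stratify=labels,
                            random_state=42)
```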

4.3. Description of the Website Created as a Frontend for the Models

To facilitate access to our fake audio detection models, we designed and implemented a React-based web interface. This technology has been selected for its efficiency and flexibility, allowing for the creation of a fluid user experience.
The web interface (Figure 4) was designed with a focus on simplicity, allowing non-technical users such as journalists to easily use the tool. The page offers a very simple functionality to upload audio files. Once the audio is uploaded, the user can select which public figure it corresponds to or if they want to use the general model and then carry out the verification. The website allows for the verification of the audios with our models or the third-party Loccus tool. Finally, the results are presented visually, indicating whether the model considers that the audio is real or fake and the confidence in this prediction.
A user manual is provided within the same website, offering detailed instructions on how to use each of the platform’s features, as well as explanations of the results presented. A back-end, implemented with Flask, manages the web requests, directing them to the appropriate FastAudio models or to the Loccus API, as selected by the user.
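A minimal sketch of such a back-end is shown below. The route name, JSON fields, and the scoring stub are illustrative assumptions, not the Chair’s actual API:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def score_with_general_model(audio_file) -> float:
    """Stub standing in for a deployed FastAudio model; it would
    return the estimated probability that the audio is fake."""
    return 0.5  # placeholder value


# One entry per deployed model (general plus one per person of interest).
MODELS = {"general": score_with_general_model}


@app.route("/verify", methods=["POST"])
def verify():
    audio = request.files["audio"]                    # uploaded recording
    model_key = request.form.get("model", "general")  # user's selection
    score = MODELS[model_key](audio)
    return jsonify({"model": model_key,
                    "prediction": "fake" if score > 0.5 else "real",
                    "confidence": score})
```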
The tool pipeline is the following:
  • Audio upload: The user uploads the audio file through the interface.
  • Model selection: The user chooses whether to apply a specific model or the general one.
  • Verification: The back-end routes the request to the selected FastAudio model or to the Loccus API.
  • Results display: The results are returned to the user through the web interface.
Figure 5 shows a diagram of the previously explained pipeline.
A preliminary evaluation by RTVE journalists shows a high acceptance of the tool, which is currently being used for audio content verification.

5. Evaluation of the Deepfake Detection Models

In order to evaluate the effectiveness of the fake audio detection models integrated into our detection tool, we conducted a series of experiments using the following datasets:
  • ASVspoof2019: A dataset provided by the ASVspoof 2019 challenge, consisting of 108,978 fake audios and 12,483 real audios (https://zenodo.org/records/4837263, accessed on 23 October 2024).
  • ASVspoof2021: The evaluation dataset from the ASVspoof challenge 2021, containing 335,497 audios (https://doi.org/10.5281/zenodo.4837263, accessed on 23 October 2024). These audios are unlabeled, but the challenge provides an evaluation tool to measure model performance through submission to the ASVspoof evaluation server.
  • RTVE-UGR dataset: A dataset generated by Monoceros Labs, containing 519 real and 1,097 fake audios.
We conducted four experiments to evaluate our models under different scenarios, as summarized in Table 2.
For each experiment, we split the datasets into training and testing subsets and computed the standard evaluation metrics: accuracy, precision, and recall. To avoid confusion between the concepts of “positive” and “negative”, we clarify the following (a sketch of the corresponding metric computation is given after the list):
  • True Positive (TP) refers to a fake audio correctly identified.
  • False Positive (FP) refers to a real audio incorrectly identified as fake.
  • False Negative (FN) refers to a fake audio incorrectly identified as real.
  • True Negative (TN) refers to a real audio correctly identified.
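Under these definitions, the reported metrics reduce to the usual ratios, with “fake” as the positive class; a minimal sketch:

```python
def detection_metrics(tp: int, fp: int, fn: int, tn: int):
    """Accuracy, precision, and recall with "fake" as the positive
    class, following the definitions above."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # fraction of flagged audios that were fake
    recall = tp / (tp + fn)     # fraction of fake audios that were caught
    return accuracy, precision, recall
```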
The following subsections present the results achieved.

5.1. Experiment 1: Replicating FastAudio Experimental Results

In this experiment, we trained the FastAudio classifier using the ASVspoof2019 dataset and tested it on ASVspoof2021. We employed the FastAudio repository’s (https://github.com/magnumresearchgroup/Fastaudio?tab=readme-ov-file, accessed on 23 October 2024) evaluation script (eval.py) to calculate the quality of the results. Our results show similar trends to those reported in the FastAudio paper [48], wherein the system achieved a min t-DCF of 0.2119 and an EER of 5.27%.
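For reference, the EER can be computed generically from detection scores as sketched below. This is an illustration of the metric, not the repository’s eval.py (which also computes min t-DCF):

```python
import numpy as np
from sklearn.metrics import roc_curve


def equal_error_rate(labels, scores) -> float:
    """labels: 1 for fake, 0 for real; scores: higher means more
    likely fake. The EER is the operating point where the false
    positive and false negative rates coincide."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return float((fpr[idx] + fnr[idx]) / 2)
```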

5.2. Experiment 2: Testing an Antispoof Model with Our Dataset

In the second experiment, we trained the model on ASVspoof2019 and tested it on the RTVE-UGR dataset. As shown in Table 3, the classifier did not perform well in this scenario. All audios were assigned similar scores, making it difficult to distinguish between real and fake audios.
The accuracy was 68%, with a precision of 68% and a recall of 100% for fake audios. This indicates that the model correctly identified all fake audios but misclassified all real audios as fake. Possible reasons include differences in audio quality and language (English vs. Spanish), suggesting that the model did not generalize well to the RTVE-UGR dataset.
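These figures match a degenerate classifier that flags everything as fake. A quick check, assuming the full RTVE-UGR dataset (519 real, 1,097 fake) was scored:

```python
# Labeling every audio "fake" on 519 real + 1,097 fake samples:
tp, fp, fn, tn = 1097, 519, 0, 0
accuracy = (tp + tn) / (tp + fp + fn + tn)  # 1097/1616 ~ 0.68
precision = tp / (tp + fp)                  # 1097/1616 ~ 0.68
recall = tp / (tp + fn)                     # 1097/1097 = 1.00
print(accuracy, precision, recall)          # ~68%, ~68%, 100%
```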
Given these poor results, we opted to train specific models using the voices of interest that we had created for the Chair (see Experiment 3).

5.3. Experiment 3: Evaluation of the Specific Models Developed for Individual Speakers of Interest

In this experiment, we trained the FastAudio classifier using 80% of the RTVE-UGR dataset and tested it on the remaining 20%.
We repeated this experiment with two voices of interest, those of the King of Spain Felipe VI and the vice-president of Spain, Yolanda Díaz. The results obtained are presented in Table 4 and Table 5, respectively.
For the voice of Felipe VI (Table 4), the model exhibited improved performance compared to Experiment 2: the scores assigned to real audios were significantly higher than those for fake audios, allowing for better differentiation. Accuracy increased to 99% and precision to 98.7%, indicating a very low rate of false positives, while recall remained at 100%.
For the voice of Yolanda Díaz (Table 5), the model achieved an accuracy of 100%, with precision and recall also at 100%, indicating no misclassifications. These outcomes are very positive but may be the result of overfitting, since the model was trained and tested on similar data. Although the fake samples for each target speaker were generated using two different techniques (deep speech synthesis and voice conversion), the model may have learned artifacts introduced by the fake audio generation methods used for this dataset. We informally tested the models with a reduced number of external fakes that arrived at the RTVE verification team, and the models’ predictions were accurate (they matched the opinion of the expert human verifiers). Nevertheless, the ability of the models to generalize to other fake audio types must be further tested. This is an important challenge when using fake detection models created for specific speakers, as it is very complicated to gather sufficient fake samples to train and test the models.

5.4. Experiment 4: Evaluation of a General Model Developed with Our Dataset

After conducting experiments with specific models tailored to individual speakers, we combined all the speakers in the RTVE-UGR dataset to train a general audio deepfake detection model.
This model was then tested on a dataset consisting of 610 test samples, of which 354 were fake and the rest were real. As can be observed in Table 6, the model demonstrated strong performance in detecting fake audios, correctly classifying all the fake samples. However, it misclassified four real audios as fake, while correctly identifying the remaining real audios.
The overall accuracy of the model was 99.3%, with a precision of 98.9% and a recall of 100% for fake audios.
Across Experiments 3 and 4, we observed that the model performs exceptionally well when trained and tested on the same dataset, even more so when the data are normalized. While our models are effective in a controlled environment, their generalizability to other datasets or real-world scenarios with unobserved generation techniques must be further tested.
To address this limitation, future work will focus on incorporating audios from additional speakers, languages, and fake audio generation methods.

6. Conclusions and Future Work

Despite the current interest in the adoption and development of technologies for news verification, there is still not enough research on best practices of professional culture and multidisciplinary structures for the deployment of such technologies in the newsroom [58]. In this paper, we have presented the RTVE-UGR Chair multidisciplinary endeavor for the fight against disinformation and how we have organized the collaboration between different profiles within computing and journalism to develop effective tools for news verification, highlighting the importance and value of this multidisciplinary approach.
The state of the art on verification journalism demographics mainly describes profiles that combine hybrid skills, for example, data journalists. However, as identified in [59], journalism remains by far the primary discipline of these professionals, with a remarkable gap in confidence between journalistic and computer science skills. In contrast, we argue in favor of a real multidisciplinary approach in which specialists with complementary backgrounds in journalism and computer science adopt specific roles. This is in line with recent findings [60] emphasizing that collaboration between developers and journalists has a positive effect on both, as developers must work within the framework of journalistic logic, while journalists acquire new technical competences.
The generated voices and trained models provide a solid foundation for the detection of audio deepfakes, which is a growing need in this era of generative AI wherein information can be easily manipulated. The integration of third-party tools into our platform has enhanced the functionality and effectiveness of our application, providing the media with greater security when verifying audio and being able to work with various detection models.
During the development of the RTVE-UGR Chair, we learned valuable lessons regarding the importance of news verification, specifically audio verification, and how it is affected by the advancement of voice generation technologies in society. A significant challenge was the training of audio verification models in Spanish and for the persons of interest selected by the Chair, given the limited availability and specificity of resources in this language. To overcome this, we adopted a multidisciplinary and collaborative approach, involving both academic experts and industry professionals. This approach not only advanced the technology, but also highlighted the importance of public-private collaboration in technological innovation projects. In addition, we learned that continuous education and outreach are key to keeping pace with rapid technological advances and ensuring a practical and ethical application of these developments. This is in line with the results of the comprehensive study of AI guidelines in 17 countries presented in [61], where it is stated that the collaborative synergy between journalists and technologists is also a way to ensure that AI complements and augments journalistic efforts rather than supplanting them.
The future of the RTVE-UGR Chair focuses on the expansion and continuous improvement of our capabilities:
  • Developing more models: We will continue to develop models with more voices of persons of interest, expanding our dataset and improving the accuracy of our models.
  • Improved detection capabilities: We are working on introducing more voices generated using different fake voice creation methods to improve the detection of tampering and spoofing.
  • Audio analysis by fragments: Another area of future development is the analysis of audio in short fragments, such as 4-second intervals, to detect possible manipulations within long audios. This technique will allow us to identify fake fragments inserted into authentic recordings, increasing the precision and reliability of our verification tools (a minimal sketch of this idea follows the list).
  • Multidisciplinary collaboration: We will continue to strengthen collaboration with universities, research centers, and technology companies to enrich our methodologies and tools, ensuring a diversity of perspectives and knowledge.
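A minimal sketch of the fragment-based analysis outlined above, assuming a model_score function (a stand-in for one of the deployed models) that returns a fakeness score for a waveform:

```python
import soundfile as sf


def score_fragments(path: str, model_score, window_s: float = 4.0):
    """Split a long recording into fixed-length windows and score
    each one independently, so that a fake segment inserted into an
    otherwise authentic recording stands out."""
    audio, sr = sf.read(path)
    if audio.ndim > 1:
        audio = audio.mean(axis=1)  # down-mix to mono
    hop = int(window_s * sr)
    return [model_score(audio[start:start + hop])
            for start in range(0, len(audio), hop)]
```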

Author Contributions

Conceptualization, D.C.-G., N.Á., B.B., P.C., D.G., C.M.-R., C.P., P.V. and Z.C.; Methodology, D.C.-G., N.Á., B.B., P.C., D.G., C.M.-R., C.P., P.V. and Z.C.; Software, D.C.-G., N.Á., B.B., D.G., C.M.-R. and Z.C.; Validation, D.C.-G., B.B., P.C., C.P. and P.V.; Formal Analysis, D.C.-G., N.Á., B.B., P.C., D.G., C.M.-R., C.P., P.V. and Z.C.; Investigation, D.C.-G., D.G., N.Á., C.M.-R., P.V. and Z.C.; Resources, D.C.-G., N.Á., B.B., P.C., D.G., C.M.-R., C.P., P.V. and Z.C.; Data Curation, D.C.-G., N.Á., D.G., C.M.-R. and Z.C.; Writing—Original Draft Preparation, D.C.-G., N.Á., B.B., P.C., D.G., C.M.-R., C.P., P.V. and Z.C.; Writing—Review and Editing, D.C.-G., N.Á., B.B., P.C., D.G., C.M.-R., C.P., P.V. and Z.C.; Visualization, D.C.-G. and Z.C.; Supervision, N.Á., P.C., D.G., C.M.-R., C.P., P.V. and Z.C.; Project Administration, P.C., D.G., C.P., P.V. and Z.C.; Funding Acquisition, P.C., C.P. and P.V. All authors have read and agreed to the published version of the manuscript.

Funding

This study is part of the project IVERES (PLEC2021-008176), financed by MCIN/AEI/10.13039/501100011033 and by the European Union “NextGenerationEU”/PRTR and the RTVE-UGR Chair in Deep Speech Synthesis and Conversational AI and its Applications in News Verification (https://catedrartve.ugr.es/). IVERES has been funded by the Spanish Ministry of Science and Innovation under the 2021 call for R&D&I projects in strategic lines in public-private collaboration, within the state program of R&D&I oriented to the challenges of society in the framework of the State Plan for Scientific and Technical Research and Innovation 2017–2020.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The ASVspoof dataset is available at https://zenodo.org/records/4837263. The real recordings of the voices of the Spanish politicians used in the RTVE-UGR Chair were downloaded from the Spanish Congress website (https://www.congreso.es/es/archivo-audiovisual). The synthetic voices generated in the Chair cannot be shared, following research ethics commitments.

Acknowledgments

The authors would like to thank Antonio José Espósito for his involvement in the successful deployment of the models. We are also grateful to Loccus.ai, especially Manel Terraza, for their collaboration in integrating their voice fake detection software into the verification tool of the RTVE-UGR Chair.

Conflicts of Interest

Authors Blanca Bayo, Pedro Cánovas, Carmen Pérez and Pere Vila were employed by the company RTVE. Nieves Ábalos and Carlos Muñoz-Romero were employed by the company Monoceros Labs. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Tan, X. Neural Text-to-Speech Synthesis; Springer: Berlin/Heidelberg, Germany, 2023. [Google Scholar] [CrossRef]
  2. Cai, Z.; Yang, Y.; Li, M. Cross-lingual multi-speaker speech synthesis with limited bilingual training data. Comput. Speech Lang. 2023, 77, 101427. [Google Scholar] [CrossRef]
  3. Eren, E.; Demiroglu, C. Deep learning-based speaker-adaptive postfiltering with limited adaptation data for embedded text-to-speech synthesis systems. Comput. Speech Lang. 2023, 81, 101520. [Google Scholar] [CrossRef]
  4. Mehrish, A.; Majumder, N.; Bharadwaj, R.; Mihalcea, R.; Poria, S. A review of deep learning techniques for speech processing. Inf. Fusion 2023, 99, 101869. [Google Scholar] [CrossRef]
  5. James, J.; Balamurali, B.T.; Watson, C.I.; MacDonald, B. Empathetic Speech Synthesis and Testing for Healthcare Robots. Int. J. Soc. Robot. 2021, 13, 2119–2137. [Google Scholar] [CrossRef]
  6. Angrick, M.; Luo, S.; Rabbani, Q.; Candrea, D.N.; Shah, S.; Milsap, G.W.; Anderson, W.S.; Gordon, C.R.; Rosenblatt, K.R.; Clawson, L.; et al. Online speech synthesis using a chronically implanted brain–computer interface in an individual with ALS. Sci. Rep. 2024, 14, 9617. [Google Scholar] [CrossRef]
  7. Xie, Q.; Tian, X.; Liu, G.; Song, K.; Xie, L.; Wu, Z.; Li, H.; Shi, S.; Li, H.; Hong, F.; et al. The Multi-Speaker Multi-Style Voice Cloning Challenge 2021. In Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 8613–8617. [Google Scholar] [CrossRef]
  8. Luong, H.T.; Yamagishi, J. NAUTILUS: A Versatile Voice Cloning System. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 2967–2981. [Google Scholar] [CrossRef]
  9. Ijiga, O.M.; Idoko, I.P.; Enyejo, L.A.; Akoh, O.; Ugbane, S.I.; Ibokette, A.I. Harmonizing the voices of AI: Exploring generative music models, voice cloning, and voice transfer for creative expression. World J. Adv. Eng. Technol. Sci. 2024, 11, 372–394. [Google Scholar] [CrossRef]
  10. Hu, W.; Zhu, X. A real-time voice cloning system with multiple algorithms for speech quality improvement. PLoS ONE 2023, 18, e0283440. [Google Scholar] [CrossRef]
  11. Chadha, A.; Kumar, V.; Kashyap, S.; Gupta, M. Deepfake: An Overview. In Proceedings of the Second International Conference on Computing, Communications, and Cyber-Security, Singapore, 2–4 December 2020; pp. 557–566. [Google Scholar] [CrossRef]
  12. Nguyen, T.T.; Nguyen, Q.V.H.; Nguyen, D.T.; Nguyen, D.T.; Huynh-The, T.; Nahavandi, S.; Nguyen, T.T.; Pham, Q.V.; Nguyen, C.M. Deep learning for deepfakes creation and detection: A survey. Comput. Vis. Image Underst. 2022, 223, 103525. [Google Scholar] [CrossRef]
  13. Sadekova, T.; Gogoryan, V.; Vovk, I.; Popov, V.; Kudinov, M.; Wei, J. A Unified System for Voice Cloning and Voice Conversion through Diffusion Probabilistic Modeling. In Proceedings of the Interspeech 2022, Incheon, Republic of Korea, 18–22 September 2022; pp. 3003–3007. [Google Scholar] [CrossRef]
  14. Rodríguez-Ortega, Y.; Ballesteros, D.M.; Renza, D. A Machine Learning Model to Detect Fake Voice. In Proceedings of the ICAI 2020, Ota, Nigeria, 29–31 October 2020; Florez, H., Misra, S., Eds.; Springer: Cham, Switzerland, 2020; pp. 3–13. [Google Scholar] [CrossRef]
  15. Lyu, S. Deepfake Detection: Current Challenges and Next Steps. In Proceedings of the 2020 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), London, UK, 6–10 July 2020; pp. 1–6. [Google Scholar] [CrossRef]
  16. Helmus, T.C. Artificial Intelligence, Deepfakes, and Disinformation: A Primer; Technical Report; RAND Corporation: Santa Monica, CA, USA, 2022. [Google Scholar]
  17. Gambín, A.F.; Yazidi, A.; Vasilakos, A.; Haugerud, H.; Djenouri, Y. Deepfakes: Current and future trends. Artif. Intell. Rev. 2024, 57, 64. [Google Scholar] [CrossRef]
  18. Gregory, S. Fortify the Truth: How to Defend Human Rights in an Age of Deepfakes and Generative AI. J. Hum. Rights Pract. 2023, 15, 702–714. [Google Scholar] [CrossRef]
  19. Naitali, A.; Ridouani, M.; Salahdine, F.; Kaabouch, N. Deepfake Attacks: Generation, Detection, Datasets, Challenges, and Research Directions. Computers 2023, 12, 216. [Google Scholar] [CrossRef]
  20. Diakopoulos, N.; Johnson, D. Anticipating and addressing the ethical implications of deepfakes in the context of elections. New Media Soc. 2021, 23, 2072–2098. [Google Scholar] [CrossRef]
  21. Mcuba, M.; Singh, A.; Ikuesan, R.A.; Venter, H. The Effect of Deep Learning Methods on Deepfake Audio Detection for Digital Investigation. Procedia Comput. Sci. 2023, 219, 211–219. [Google Scholar] [CrossRef]
  22. Almutairi, Z.; Elgibreen, H. A Review of Modern Audio Deepfake Detection Methods: Challenges and Future Directions. Algorithms 2022, 15, 155. [Google Scholar] [CrossRef]
  23. Khanjani, Z.; Watson, G.; Janeja, V.P. Audio deepfakes: A survey. Front. Big Data 2023, 5, 1001063. [Google Scholar] [CrossRef]
  24. Akhtar, Z.; Pendyala, T.L.; Athmakuri, V.S. Video and Audio Deepfake Datasets and Open Issues in Deepfake Technology: Being Ahead of the Curve. Forensic Sci. 2024, 4, 289–377. [Google Scholar] [CrossRef]
  25. Wang, Y.; Huang, H. Audio–visual deepfake detection using articulatory representation learning. Comput. Vis. Image Underst. 2024, 248, 104133. [Google Scholar] [CrossRef]
  26. OECD. Facts Not Fakes: Tackling Disinformation, Strengthening Information Integrity; Organisation for Economic Co-Operation and Development: Paris, France, 2024. [Google Scholar]
  27. Guo, Z.; Schlichtkrull, M.; Vlachos, A. A Survey on Automated Fact-Checking. Trans. Assoc. Comput. Linguist. 2022, 10, 178–206. [Google Scholar] [CrossRef]
  28. Díaz-Lucena, A.; Hidalgo-Cobo, P. Verification Agencies on TikTok: The Case of MediaWise and Politifact. Societies 2024, 14, 59. [Google Scholar] [CrossRef]
  29. López-Marcos, C.; Vicente-Fernández, P. Fact Checkers Facing Fake News and Disinformation in the Digital Age: A Comparative Analysis between Spain and United Kingdom. Publications 2021, 9, 36. [Google Scholar] [CrossRef]
  30. Valero-Pastor, J. Plataformas, Consumo Mediático y Nuevas Realidades Digitales: Hacia Una Perspectiva Integradora; Dykinson: Madrid, Spain, 2021. [Google Scholar]
  31. Tejedor, S.; Vila, P. Exo Journalism: A Conceptual Approach to a Hybrid Formula between Journalism and Artificial Intelligence. Journal. Media 2021, 2, 830–840. [Google Scholar] [CrossRef]
  32. Gao, Y.; Wang, X.; Zhang, Y.; Zeng, P.; Ma, Y. Temporal Feature Prediction in Audio–Visual Deepfake Detection. Electronics 2024, 13, 3433. [Google Scholar] [CrossRef]
  33. Schäfer, K.; Choi, J.E.; Zmudzinski, S. Explore the world of audio deepfakes: A guide to detection techniques for non-experts. In Proceedings of the 3rd ACM International Workshop on Multimedia AI Against Disinformation, Phuket, Thailand, 10–13 June 2024; pp. 13–22. [Google Scholar] [CrossRef]
  34. Yamagishi, J.; Wang, X.; Todisco, M.; Sahidullah, M.; Patino, J.; Nautsch, A.; Liu, X.; Lee, K.A.; Kinnunen, T.; Evans, N.; et al. ASVspoof 2021: Accelerating progress in spoofed and deepfake speech detection. In Proceedings of the 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge, Online, 16 September 2021; pp. 47–54. [Google Scholar] [CrossRef]
  35. Yi, J.; Tao, J.; Fu, R.; Yan, X.; Wang, C.; Wang, T.; Zhang, C.Y.; Zhang, X.; Zhao, Y.; Ren, Y.; et al. ADD 2023: The Second Audio Deepfake Detection Challenge. arXiv 2023, arXiv:2305.13774. [Google Scholar] [CrossRef]
  36. Yi, J.; Fu, R.; Tao, J.; Nie, S.; Ma, H.; Wang, C.; Wang, T.; Tian, Z.; Bai, Y.; Fan, C.; et al. ADD 2022: The First Audio Deep Synthesis Detection Challenge. arXiv 2022, arXiv:2202.08433. [Google Scholar] [CrossRef]
  37. Wu, Z.; Kinnunen, T.; Evans, N.; Yamagishi, J.; Hanilçi, C.; Sahidullah, M.; Sizov, A. ASVspoof 2015: The first automatic speaker verification spoofing and countermeasures challenge. In Proceedings of the Interspeech 2015, Dresden, Germany, 6–10 September 2015; pp. 2037–2041. [Google Scholar] [CrossRef]
  38. Kinnunen, T.; Sahidullah, M.; Delgado, H.; Todisco, M.; Evans, N.; Yamagishi, J.; Lee, K.A. The ASVspoof 2017 Challenge: Assessing the Limits of Replay Spoofing Attack Detection. In Proceedings of the Interspeech 2017, Stockholm, Sweden, 20–24 August 2017; pp. 2–6. [Google Scholar] [CrossRef]
  39. Todisco, M.; Wang, X.; Vestman, V.; Sahidullah, M.; Delgado, H.; Nautsch, A.; Yamagishi, J.; Evans, N.; Kinnunen, T.; Lee, K.A. ASVspoof 2019: Future Horizons in Spoofed and Fake Audio Detection. arXiv 2019, arXiv:1904.05441. [Google Scholar] [CrossRef]
  40. Lai, C.I.; Chen, N.; Villalba, J.; Dehak, N. ASSERT: Anti-Spoofing with Squeeze-Excitation and Residual neTworks. arXiv 2019, arXiv:1904.01120. [Google Scholar] [CrossRef]
  41. Lai, C.I.; Abad, A.; Richmond, K.; Yamagishi, J.; Dehak, N.; King, S. Attentive Filtering Networks for Audio Replay Attack Detection. arXiv 2018, arXiv:1810.13048. [Google Scholar] [CrossRef]
  42. Kang, W.H.; Alam, J.; Fathan, A. CRIM’s System Description for the ASVSpoof2021 Challenge. In Proceedings of the 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge, Online, 16 September 2021; pp. 100–106. [Google Scholar] [CrossRef]
  43. Tak, H.; Jung, J.w.; Patino, J.; Kamble, M.; Todisco, M.; Evans, N. End-to-End Spectro-Temporal Graph Attention Networks for Speaker Verification Anti-Spoofing and Speech Deepfake Detection. arXiv 2021, arXiv:2107.12710. [Google Scholar] [CrossRef]
  44. Tak, H.; Patino, J.; Todisco, M.; Nautsch, A.; Evans, N.; Larcher, A. End-to-End anti-spoofing with RawNet2. In Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 6369–6373. [Google Scholar] [CrossRef]
  45. Chen, X.; Zhang, Y.; Zhu, G.; Duan, Z. UR Channel-Robust Synthetic Speech Detection System for ASVspoof 2021. arXiv 2021, arXiv:2107.12018. [Google Scholar] [CrossRef]
  46. Cáceres, J.; Font, R.; Grau, T.; Molina, J. The Biometric Vox System for the ASVspoof 2021 Challenge. In Proceedings of the 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge, Online, 16 September 2021; pp. 68–74. [Google Scholar] [CrossRef]
  47. Wang, X.; Qin, X.; Zhu, T.; Wang, C.; Zhang, S.; Li, M. The DKU-CMRI System for the ASVspoof 2021 Challenge: Vocoder based Replay Channel Response Estimation. In Proceedings of the 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge, Online, 16 September 2021; pp. 16–21. [Google Scholar] [CrossRef]
  48. Fu, Q.; Teng, Z.; White, J.; Powell, M.; Schmidt, D.C. FastAudio: A Learnable Audio Front-End for Spoof Speech Detection. arXiv 2021, arXiv:2109.02774. [Google Scholar] [CrossRef]
  49. Wang, X.; Yamagishi, J.; Todisco, M.; Delgado, H.; Nautsch, A.; Evans, N.; Sahidullah, M.; Vestman, V.; Kinnunen, T.; Lee, K.A.; et al. ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech. Comput. Speech Lang. 2020, 64, 101114. [Google Scholar] [CrossRef]
  50. Masood, M.; Nawaz, M.; Malik, K.M.; Javed, A.; Irtaza, A.; Malik, H. Deepfakes generation and detection: State-of-the-art, open challenges, countermeasures, and way forward. Appl. Intell. 2023, 53, 3974–4026. [Google Scholar] [CrossRef]
  51. Teyssou, D. Applying Design Thinking Methodology: The InVID Verification Plugin. In Video Verification in the Fake News Era; Mezaris, V., Nixon, L., Papadopoulos, S., Teyssou, D., Eds.; Springer International Publishing: Berlin/Heidelberg, Germany, 2019; pp. 263–279. [Google Scholar] [CrossRef]
  52. Teyssou, D.; Leung, J.M.; Apostolidis, E.; Apostolidis, K.; Papadopoulos, S.; Zampoglou, M.; Papadopoulou, O.; Mezaris, V. The InVID Plug-in: Web Video Verification on the Browser. In Proceedings of the First International Workshop on Multimedia Verification, New York, NY, USA, 23–27 October 2017; pp. 23–30. [Google Scholar] [CrossRef]
  53. Marinova, Z.; Spangenberg, J.; Teyssou, D.; Papadopoulos, S.; Sarris, N.; Alaphilippe, A.; Bontcheva, K. Weverify: Wider and Enhanced Verification for You Project Overview and Tools. In Proceedings of the 2020 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), London, UK, 6–10 July 2020; pp. 1–4. [Google Scholar] [CrossRef]
  54. Nucci, F.; Boi, S.; Magaldi, M. Artificial Intelligence Against Disinformation: The FANDANGO Practical Case. In Proceedings of the First International Forum on Digital and Democracy. Towards A Sustainable Evolution, Venice, Italy, 10–11 December 2020. [Google Scholar]
  55. Tsalakanidou, F.; Papadopoulos, S.; Mezaris, V.; Kompatsiaris, I.; Gray, B.; Tsabouraki, D.; Kalogerini, M.; Negro, F.; Montagnuolo, M.; de Vos, J.; et al. The AI4Media Project: Use of Next-Generation Artificial Intelligence Technologies for Media Sector Applications. In Proceedings of the Artificial Intelligence Applications and Innovations, Crete, Greece, 25–27 June 2021; Maglogiannis, I., Macintyre, J., Iliadis, L., Eds.; Springer International Publishing: Berlin/Heidelberg, Germany, 2021; pp. 81–93. [Google Scholar] [CrossRef]
  56. Pawlicka, A.; Pawlicki, M.; Kozik, R.; Andrychowicz-Trojanowska, A.; Choraś, M. AI vs linguistic-based human judgement: Bridging the gap in pursuit of truth for fake news detection. Inf. Sci. 2024, 679, 121097. [Google Scholar] [CrossRef]
  57. Whittaker, L.; Mulcahy, R.; Letheren, K.; Kietzmann, J.; Russell-Bennett, R. Mapping the deepfake landscape for innovation: A multidisciplinary systematic review and future research agenda. Technovation 2024, 123, 102784. [Google Scholar] [CrossRef]
  58. de Lima-Santos, M.F. ProPublica’s Data Journalism: How Multidisciplinary Teams and Hybrid Profiles Create Impactful Data Stories. Media Commun. 2022, 10, 5–15. [Google Scholar] [CrossRef]
  59. Bisiani, S.; Abellan, A.; Arias Robles, F.; García-Avilés, J.A. The Data Journalism Workforce: Demographics, Skills, Work Practices, and Challenges in the Aftermath of the COVID-19 Pandemic. J. Pract. 2023, 1–21. [Google Scholar] [CrossRef]
  60. Mtchedlidze, J. Technical Expertise in Newsrooms: Understanding Data Journalists’ Roles and Practices. Journal. Media 2024, 5, 1316–1328. [Google Scholar] [CrossRef]
  61. de Lima-Santos, M.F.; Yeung, W.N.; Dodds, T. Guiding the way: A comprehensive examination of AI guidelines in global media. AI Soc. 2024. [Google Scholar] [CrossRef]
Figure 1. Stages followed in our methodology and participating roles.
Figure 2. Datasets used/generated by the RTVE-UGR Chair.
Figure 3. Pipeline of the model creation process.
Figure 4. Interface of the audio deepfake detection tool developed by the RTVE-UGR Chair.
Figure 5. Verification process pipeline.
Table 1. Roles within the multidisciplinary team of the RTVE-UGR Chair.

| Roles | Experience | Tasks Performed |
|---|---|---|
| Computer science researchers (CSR) | Experts in speech technologies, AI, machine learning and signal processing | Research and development of deepfake detection models |
| Experts in speech synthesis and voice cloning (SSVC) | Specialists in speech synthesis and audio technologies | Creating synthetic voices to generate fake samples for training |
| Web developer (WD) | Experience in frontend development and UX/UI design | Design and maintenance of the user interface |
| Backend developer (BD) | Experience in databases, servers and APIs | Implementation of the backend logic and deployment of the models on the server |
| Systems engineer (SE) | Specialization in IT infrastructure and networks | Technical support and server maintenance |
| News verification journalists (NVJ) | Experience in verification journalism and fact checking | Study of the incidence of voice fakes, identification of persons who are frequent targets of fakes, testing and feedback on the use of the tools |
| Generalist journalists (GJ) | Various journalism specializations other than verification | Testing and feedback on the use of the tools |
| Journalism researchers (JR) | Specialization in journalistic investigation | Analysis of the adequacy and potential impact of the tools developed |
Table 2. Experiments.

| Experiment id | Training Set (lang.) | Test Set (lang.) | Type of Test |
|---|---|---|---|
| Exp. 1 | ASVspoof2019 (en) | ASVspoof2021 (en) | Unknown speaker, general model |
| Exp. 2 | ASVspoof2019 (en) | RTVE-UGR (es) | Unknown speaker, general model |
| Exp. 3 | RTVE-UGR (es) | RTVE-UGR (es) | Known speaker, specific model |
| Exp. 4 | RTVE-UGR (es) | RTVE-UGR (es) | Known speaker, general model |
Table 3. Results of Experiment 2.

|  | Fake Audios | Real Audios |
|---|---|---|
| True Positives (TP) | 1098 |  |
| False Negatives (FN) | 0 |  |
| False Positives (FP) |  | 520 |
| True Negatives (TN) |  | 0 |
Table 4. Results of Experiment 3 (King of Spain, Felipe VI).

|  | Fake Audios | Real Audios |
|---|---|---|
| True Positives (TP) | 220 |  |
| False Negatives (FN) | 0 |  |
| False Positives (FP) |  | 3 |
| True Negatives (TN) |  | 101 |
Table 5. Results of Experiment 3 (Vice-president of Spain, Yolanda Díaz).

|  | Fake Audios | Real Audios |
|---|---|---|
| True Positives (TP) | 97 |  |
| False Negatives (FN) | 0 |  |
| False Positives (FP) |  | 0 |
| True Negatives (TN) |  | 102 |
Table 6. Results of Experiment 4 (general model for all speakers of interest).

|  | Fake Audios | Real Audios |
|---|---|---|
| True Positives (TP) | 354 |  |
| False Negatives (FN) | 0 |  |
| False Positives (FP) |  | 4 |
| True Negatives (TN) |  | 252 |
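
For readers who want to relate these counts to the usual evaluation metrics, the short Python sketch below computes accuracy, precision and recall from the confusion counts reported in Tables 3 and 6, treating "fake" as the positive class. It is an illustrative aid only, not code from the Chair's toolchain.

```python
# Illustrative sketch: deriving standard detection metrics from the
# confusion counts in Tables 3-6 ("fake" is the positive class).

def detection_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Accuracy, precision and recall from raw confusion counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Experiment 4, general model (Table 6): TP=354, FN=0, FP=4, TN=252.
print(detection_metrics(354, 0, 4, 252))
# -> accuracy ~0.993, precision ~0.989, recall 1.0

# Experiment 2, English-trained model on Spanish audio (Table 3):
# TP=1098, FN=0, FP=520, TN=0 -- every audio is classified as fake,
# so recall is perfect while all genuine recordings are rejected.
print(detection_metrics(1098, 0, 520, 0))
# -> accuracy ~0.679, precision ~0.679, recall 1.0
```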
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
