Audio Deepfake Detection: What Has Been Achieved and What Lies Ahead
Abstract
1. Introduction
2. Overview of Audio Deepfake Methodologies
2.1. Speech Synthesis
2.2. Voice Conversion
3. Dataset
3.1. ASVspoof Challenge
3.2. ADD Challenge
3.3. Additional Datasets
Name | Language | Year | Fake Types | Key Improvement |
---|---|---|---|---|
FoR [84] | English | 2019 | TTS | Expands data volume, covers latest TTS technology |
WaveFake [85] | English, Japanese | 2021 | TTS | Encompasses state-of-the-art generative models |
HAD [86] | Chinese | 2021 | Partially fake | Focuses on detecting partially fake audio |
ITW [64] | English | 2022 | TTS | Provides audio recorded in the wild |
LibriSeVoc [87] | English | 2023 | Self-vocoding | Includes neural vocoder artefacts in audio samples |
SceneFake [88] | English | 2024 | Scene manipulation | Detects scene forgery in audio |
EmoFake [89] | English, Chinese | 2024 | EVC | Focuses on emotional change in speech |
CVoiceFake [90] | English, Chinese, German, French, Italian | 2024 | TTS | Multilingual, provides ground-truth transcriptions |
MLAAD [91] | 38 languages | 2024 | TTS | Multilingual, offers support for global applicability |
4. Evaluation Metrics
4.1. Equal Error Rate (EER)
4.2. Tandem Decision Cost Function (t-DCF)
5. Audio Deepfake Detection
5.1. Frontend Features
5.1.1. Handcrafted Features
5.1.2. Learning-Based Features
5.2. Backend Model
5.2.1. Machine Learning
5.2.2. Deep Learning
5.3. Discussion
- Data augmentation (DA): Data augmentation expands the training set by systematically altering the original data, diversifying it and reducing overfitting to improve model generalisation. Common audio DA techniques include frequency masking, time-shifting, and reverberation, alongside more recent methods such as SpecAug and RawBoost. As the results tables below show, many of the high-performing systems of recent years, particularly on the ASVspoof 2021 dataset, have incorporated DA to achieve superior results. Notably, the MelSpec+ABC-CapsNet system proposed by Wani et al. [105] achieved SOTA performance on the ASVspoof 2019 LA dataset without employing any DA, highlighting that the choice of input features and model architecture can outweigh the impact of augmentation techniques in certain scenarios.
- Frontend and backend selection: The frontend performs feature extraction and thus forms the foundation of system performance. SSL frontends such as WavLM and W2V2 have dominated the recent audio deepfake detection landscape: their ability to capture rich contextual information across both temporal and spectral domains has allowed them to significantly surpass traditional handcrafted features like LFCC and CQCC. While handcrafted features remain relevant in resource-constrained scenarios owing to their simplicity and efficiency, the prevailing trend clearly favours SSL features for their superior adaptability to unseen attacks. The backend classifier determines how effectively the extracted features are used for classification. Traditional machine learning models such as the GMM and SVM are no longer the focus of contemporary research because of their limited capacity to handle complex spoofing attacks. In contrast, advanced deep learning models such as CapsNet and GNN, as well as hybrid architectures, have shown remarkable ability to capture intricate temporal and spectral dependencies. Studies by Wani et al. [105] and Ulutas et al. [142] further demonstrate that combining SOTA backend models with traditional handcrafted features can outperform systems that use learning-based features alone, underscoring the importance of a well-chosen backend in achieving optimal performance.
- Generalisation: Table 7 presents the generalisation ability of various models on the ITW dataset. It highlights the pivotal role of DA not only in improving in-domain performance but also in enhancing cross-domain robustness. For instance, models employing DA, such as XLS-R and W2V2+MoE Fusion, achieved markedly lower EERs on out-of-domain data than those without DA, underscoring that well-designed DA strategies can effectively strengthen a model's resilience against unseen attacks. Furthermore, SSL feature extractors, which comprehensively capture temporal and spectral information, exhibited superior adaptability and robustness compared to traditional handcrafted features. This demonstrates the advantage of SSL features in addressing complex deepfake attacks and underlines their critical role in improving model robustness in unseen scenarios.
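The DA techniques named above (time-shifting, frequency masking) can be illustrated in a few lines of NumPy. This is a minimal, hypothetical sketch of the general idea, not the exact augmentation pipelines used by any of the cited systems; the function names and parameters are illustrative:

```python
import numpy as np

def time_shift(wave: np.ndarray, max_shift: int, rng: np.random.Generator) -> np.ndarray:
    """Circularly shift the waveform by a random number of samples."""
    shift = int(rng.integers(-max_shift, max_shift + 1))
    return np.roll(wave, shift)

def freq_mask(spec: np.ndarray, max_width: int, rng: np.random.Generator) -> np.ndarray:
    """Zero out a random contiguous band of frequency bins (rows) in a spectrogram."""
    masked = spec.copy()
    width = int(rng.integers(1, max_width + 1))
    start = int(rng.integers(0, spec.shape[0] - width + 1))
    masked[start:start + width, :] = 0.0
    return masked

rng = np.random.default_rng(0)
wave = rng.standard_normal(16000)  # toy signal: 1 s of noise at 16 kHz
# Toy magnitude "spectrogram": 100 frames of 160 samples -> (freq_bins, frames)
spec = np.abs(np.fft.rfft(wave.reshape(100, 160), axis=1)).T
augmented = freq_mask(spec, max_width=8, rng=rng)
```

Real systems typically apply such transforms on the fly during training, so each epoch sees a differently perturbed copy of every utterance.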
Year | DA | Frontend | Backend | EER (%) | t-DCF |
---|---|---|---|---|---|
2024 [105] | | MelSpec+VGG18 | ABC-CapsNet | 0.06 | |
2023 [142] | | CQT | ViT | 0.19 | 0.1102 |
2024 [131] | ✓ | | SLIM Framework | 0.2 | |
2022 [153] | | W2V2 | VIB | 0.4 | 0.0107 |
2024 [129] | | WavLM | MFA | 0.42 | 0.0126 |
2023 [149] | ✓ | LFB | GCN | 0.71 | 0.0192 |
2024 [154] | ✓ | W2V2 | MoE Fusion | 0.74 | |
2024 [127] | | W2V2 | SVM | 0.9 | |
2022 [155] | | | W2V2+Light-DARTS | 1.08 | |
2024 [127] | | W2V2 | MLP | 1.11 | |
2022 [151] | | | AASIST | 1.13 | 0.0347 |
2024 [156] | | | RawBMamba | 1.19 | 0.036 |
2024 [127] | | W2V2 | LR | 1.25 | |
2021 [150] | ✓ | | RawGAT-ST | 1.39 | 0.0443 |
2024 [156] | | | RawMamba | 1.47 | 0.0467 |
2021 [108] | | Spec | Attn-CNN-OC-Softmax | 1.87 | 0.051 |
2024 [115] | | LFCC+MPE | SENet | 1.94 | |
2021 [114] | | LFCC | CapsNet | 1.97 | 0.0538 |
2021 [113] | | LFCC | ResNet18-OC-Softmax | 2.19 | 0.059 |
2024 [115] | | LFCC+MPE | LCNN | 2.41 | |
2022 [122] | | RawNet2 | STATNet | 2.45 | 0.062 |
2024 [130] | | WavLM+MTB | MLP | 2.47 | |
2021 [114] | | LPS | CapsNet | 3.19 | 0.0982 |
2021 [110] | ✓ | LPS | TE-ResNet | 6.02 | |
2021 [110] | ✓ | MFCC | TE-ResNet | 6.54 | |
2021 [110] | ✓ | CQCC | TE-ResNet | 7.14 | |
2024 [130] | | W2V2+MTB | MLP | 16.4 | |
Year | DA | Frontend | Backend | LA (EER %) | LA (t-DCF) | DF (EER %) |
---|---|---|---|---|---|---|
2024 [117] | ✓ | XLS-R | SLS | 3.88 | | 2.09 |
2024 [154] | ✓ | W2V2 | MoE Fusion | 2.96 | | 2.54 |
2024 [129] | | WavLM | MFA | 5.08 | | 2.56 |
2022 [152] | ✓ | W2V2 | SA | 1 | 0.2066 | 3.69 |
2024 [131] | ✓ | | SLIM Framework | | | 4.4 |
2022 [126] | ✓ | W2V2 | MLP+ASP | 3.54 | | 4.98 |
2022 [155] | | | W2V2+Light-DARTS | | | 7.86 |
2024 [130] | | WavLM+MTB | MLP | | | 9.87 |
2024 [156] | | | RawBMamba | 3.28 | 0.2709 | 15.85 |
2023 [119] | ✓ | MelSpec+SincNet | Transformer-Bi-Level | | | 20.24 |
2024 [156] | | | RawMamba | 2.84 | 0.2517 | 22.48 |
2024 [130] | | W2V2+MTB | MLP | | | 27.1 |
Year | DA | Frontend | Backend | Training Data | EER (%) |
---|---|---|---|---|---|
2024 [117] | ✓ | XLS-R | SLS | ASVspoof 2019 LA (T) | 8.87 |
2024 [154] | ✓ | W2V2 | MoE Fusion | ASVspoof 2019 LA (T) | 9.17 |
2024 [131] | ✓ | | SLIM Framework | S1: Common Voice, RAVDESS; S2: ASVspoof 2019 LA (T) | 12.5 |
2024 [116] | | Fusion | ResNet18 | ASVspoof 2019 LA (T, D) | 24.27 |
2023 [107] | ✓ | C-CQT | CVNN | ASVspoof 2019 LA (all splits) | 26.95 ± 3.12 |
2024 [116] | | HuBERT | ResNet18 | ASVspoof 2019 LA (T, D) | 27.48 |
2024 [115] | | MPE | SENet | ASVspoof 2019 LA (T) | 29.62 |
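The EERs reported throughout these tables correspond to the operating point at which the miss rate on bona fide speech equals the false-alarm rate on spoofed speech. A minimal NumPy sketch of how EER can be computed from countermeasure scores (the score distributions below are toy examples, not results from any cited system):

```python
import numpy as np

def compute_eer(bona_scores: np.ndarray, spoof_scores: np.ndarray) -> float:
    """Equal Error Rate: threshold where the miss rate on bona fide trials
    equals the false-alarm rate on spoofed trials (higher score = more bona fide)."""
    thresholds = np.sort(np.concatenate([bona_scores, spoof_scores]))
    # Miss: bona fide scores below threshold; false alarm: spoof scores at/above it.
    miss = np.array([(bona_scores < t).mean() for t in thresholds])
    fa = np.array([(spoof_scores >= t).mean() for t in thresholds])
    idx = np.argmin(np.abs(miss - fa))  # closest crossing of the two rates
    return float((miss[idx] + fa[idx]) / 2)

rng = np.random.default_rng(0)
bona = rng.normal(2.0, 1.0, 1000)    # bona fide scores, higher on average
spoof = rng.normal(-2.0, 1.0, 1000)  # spoofed scores, lower on average
eer = compute_eer(bona, spoof)       # a few percent for these toy distributions
```

A lower EER therefore reflects better separation between the bona fide and spoofed score distributions, which is why it serves as the primary ranking metric in the tables above.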
6. Emerging Research Directions
6.1. Privacy
6.2. Fairness
6.3. Adaptability
6.4. Explainability
6.5. Anti-Spoofing
6.6. Robustness Against Compression
6.7. Real-World Implications
7. Future Research
8. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Somers, M. Deepfakes, Explained. 2020. Available online: https://mitsloan.mit.edu/ideas-made-to-matter/deepfakes-explained (accessed on 5 February 2025).
- Kietzmann, J.; Lee, L.W.; McCarthy, I.P.; Kietzmann, T.C. Deepfakes: Trick or treat? Bus. Horizons 2020, 63, 135–146. [Google Scholar] [CrossRef]
- Regan, G. A Brief History of Deepfakes. 2024. Available online: https://www.realitydefender.com/blog/history-of-deepfakes (accessed on 5 February 2025).
- Chadha, A.; Kumar, V.; Kashyap, S.; Gupta, M. Deepfake: An overview. In Proceedings of the Second International Conference on Computing, Communications, and Cyber-Security: IC4S 2020, Ghaziabad, India, 3–4 October 2020; Springer: Singapore, 2021; pp. 557–566. [Google Scholar]
- Whitty, M.T. Drug mule for love. J. Financ. Crime 2023, 30, 795–812. [Google Scholar] [CrossRef]
- Frick, N.R.; Wilms, K.L.; Brachten, F.; Hetjens, T.; Stieglitz, S.; Ross, B. The perceived surveillance of conversations through smart devices. Electron. Commer. Res. Appl. 2021, 47, 101046. [Google Scholar] [CrossRef]
- Liu, Y.; Zhang, H.; Li, Y.; Wang, Q.; Liu, C. Multimodal Identity Recognition of Spiking Neural Network Based on Brain-Inspired Reward Propagation. In Proceedings of the 2024 IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA), Kaifeng, China, 30 October–2 November 2024; IEEE ComSoc: New York, NY, USA, 2024; pp. 1657–1662. [Google Scholar]
- Gao, S.; Yan, D.; Yan, Y. MULiving: Towards Real-time Multi-User Survival State Monitoring Using Wearable RFID Tags. In Proceedings of the ICASSP 2025—2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6–11 April 2025; IEEE SPS: Piscataway, NJ, USA, 2025; pp. 1–5. [Google Scholar]
- Kim, J.; Lee, H.; Jeong, S.; Ahn, S.H. Sound-based remote real-time multi-device operational monitoring system using a convolutional neural network (CNN). J. Manuf. Syst. 2021, 58, 431–441. [Google Scholar] [CrossRef]
- Karpagam, M.; Sahana, S.; Sivadharini, S.; Soundhariyasri, S. Night Safety Patrolling Robot. In Proceedings of the 2024 4th Asian Conference on Innovation in Technology (ASIANCON), Pune, India, 23–25 August 2024; pp. 1–4. [Google Scholar]
- Lu, L.; Yu, J.; Chen, Y.; Liu, H.; Zhu, Y.; Kong, L.; Li, M. Lip reading-based user authentication through acoustic sensing on smartphones. IEEE/ACM Trans. Netw. 2019, 27, 447–460. [Google Scholar] [CrossRef]
- Bi, H.; Sun, Y.; Liu, J.; Cao, L. SmartEar: Rhythm-based tap authentication using earphone in information-centric wireless sensor network. IEEE Internet Things J. 2021, 9, 885–896. [Google Scholar] [CrossRef]
- Itani, S.; Kita, S.; Kajikawa, Y. Multimodal personal ear authentication using acoustic ear feature for smartphone security. IEEE Trans. Consum. Electron. 2021, 68, 77–84. [Google Scholar] [CrossRef]
- Wang, Z.; Wang, Y.; Yang, J. Earslide: A secure ear wearables biometric authentication based on acoustic fingerprint. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 2024, 8, 1–29. [Google Scholar] [CrossRef]
- Ayeswarya, S.; Singh, K.J. A comprehensive review on secure biometric-based continuous authentication and user profiling. IEEE Access 2024, 12, 82996–83021. [Google Scholar] [CrossRef]
- Amerini, I.; Barni, M.; Battiato, S.; Bestagini, P.; Boato, G.; Bruni, V.; Caldelli, R.; De Natale, F.; De Nicola, R.; Guarnera, L.; et al. Deepfake Media Forensics: Status and Future Challenges. J. Imaging 2025, 11, 73. [Google Scholar] [CrossRef]
- Fraunhofer AISEC. Audio Deep Fake: Demonstrator Entwickelt am Fraunhofer AISEC. 2021. Available online: https://www.youtube.com/watch?v=MZTF0eAALmE (accessed on 19 December 2024).
- The Telegraph. Deepfake Video of Volodymyr Zelensky Surrendering Surfaces on Social Media. 2022. Available online: https://www.youtube.com/watch?v=X17yrEV5sl4 (accessed on 19 December 2024).
- Devine, C.; O’Sullivan, D.; Lyngaas, S. A Fake Recording of a Candidate Saying He’d Rigged the Election Went Viral. Experts Say it’s Only the Beginning. 2024. Available online: https://edition.cnn.com/2024/02/01/politics/election-deepfake-threats-invs/index.html (accessed on 19 December 2024).
- BBC SOUNDS. The Mexican Mayor and a Deepfake Scandal. 2024. Available online: https://www.bbc.co.uk/sounds/play/w3ct5d9g (accessed on 19 December 2024).
- Matza, M. Fake Biden Robocall Tells Voters to skip New Hampshire Primary Election. 2024. Available online: https://www.bbc.com/news/world-us-canada-68064247 (accessed on 19 December 2024).
- Stupp, C. Fraudsters Used AI to Mimic CEO’s Voice in Unusual Cybercrime Case. 2019. Available online: https://www.wsj.com/articles/fraudsters-use-ai-to-mimic-ceos-voice-in-unusual-cybercrime-case-11567157402 (accessed on 19 December 2024).
- Bilika, D.; Michopoulou, N.; Alepis, E.; Patsakis, C. Hello me, meet the real me: Audio deepfake attacks on voice assistants. arXiv 2023, arXiv:2302.10328. [Google Scholar]
- Ubert, J. Fake it: Attacking Privacy Through Exploiting Digital Assistants Using Voice Deepfakes. Ph.D. Thesis, Marymount University, Arlington, VA, USA, 2023. [Google Scholar]
- Nagothu, D.; Poredi, N.; Chen, Y. Evolution of Attacks on Intelligent Surveillance Systems and Effective Detection Techniques. In Intelligent Video Surveillance-New Perspectives; IntechOpen: London, UK, 2022. [Google Scholar]
- Mai, K.T.; Bray, S.; Davies, T.; Griffin, L.D. Warning: Humans cannot reliably detect speech deepfakes. PLoS ONE 2023, 18, e0285333. [Google Scholar] [CrossRef]
- Warren, K.; Tucker, T.; Crowder, A.; Olszewski, D.; Lu, A.; Fedele, C.; Pasternak, M.; Layton, S.; Butler, K.; Gates, C.; et al. “Better Be Computer or I’m Dumb”: A Large-Scale Evaluation of Humans as Audio Deepfake Detectors. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security (CCS), Salt Lake City, UT, USA, 14–18 October 2024; ACM: New York, NY, USA, 2024; pp. 2696–2710. [Google Scholar]
- Westerlund, M. The emergence of deepfake technology: A review. Technol. Innov. Manag. Rev. 2019, 9, 39–52. [Google Scholar]
- Zhao, H.; Zhou, W.; Chen, D.; Wei, T.; Zhang, W.; Yu, N. Multi-attentional deepfake detection. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; IEEE ComSoc: New York, NY, USA, 2021; pp. 2185–2194. [Google Scholar]
- Rana, M.S.; Nobi, M.N.; Murali, B.; Sung, A.H. Deepfake detection: A systematic literature review. IEEE Access 2022, 10, 25494–25513. [Google Scholar]
- Lin, Y.; Xie, Z.; Chen, T.; Cheng, X.; Wen, H. Image privacy protection scheme based on high-quality reconstruction DCT compression and nonlinear dynamics. Expert Syst. Appl. 2024, 257, 124891. [Google Scholar] [CrossRef]
- Khanjani, Z.; Watson, G.; Janeja, V.P. Audio deepfakes: A survey. Front. Big Data 2023, 5, 1001063. [Google Scholar]
- Almutairi, Z.; Elgibreen, H. A review of modern audio deepfake detection methods: Challenges and future directions. Algorithms 2022, 15, 155. [Google Scholar] [CrossRef]
- Patel, Y.; Tanwar, S.; Gupta, R.; Bhattacharya, P.; Davidson, I.E.; Nyameko, R.; Aluvala, S.; Vimal, V. Deepfake generation and detection: Case study and challenges. IEEE Access 2023, 11, 143296–143323. [Google Scholar]
- Yi, J.; Wang, C.; Tao, J.; Zhang, X.; Zhang, C.Y.; Zhao, Y. Audio deepfake detection: A survey. arXiv 2023, arXiv:2308.14970. [Google Scholar]
- Ren, Y.; Liu, C.; Liu, W.; Wang, L. A survey on speech forgery and detection. J. Signal Process. 2021, 37, 2412–2439. [Google Scholar]
- Mubarak, R.; Alsboui, T.; Alshaikh, O.; Inuwa-Dutse, I.; Khan, S.; Parkinson, S. A survey on the detection and impacts of deepfakes in visual, audio, and textual formats. IEEE Access 2023, 11, 144497–144529. [Google Scholar] [CrossRef]
- Masood, M.; Nawaz, M.; Malik, K.M.; Javed, A.; Irtaza, A.; Malik, H. Deepfakes generation and detection: State-of-the-art, open challenges, countermeasures, and way forward. Appl. Intell. 2023, 53, 3974–4026. [Google Scholar] [CrossRef]
- Tabet, Y.; Boughazi, M. Speech synthesis techniques. A survey. In Proceedings of the 7th International Workshop on Systems, Signal Processing and Their Applications (WOSSPA 2011), Tipaza, Algeria, 9–11 May 2011; pp. 67–70. [Google Scholar]
- Van Den Oord, A.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.; Kavukcuoglu, K. Wavenet: A generative model for raw audio. arXiv 2016, arXiv:1609.03499. [Google Scholar]
- Ping, W.; Peng, K.; Gibiansky, A.; Arik, S.O.; Kannan, A.; Narang, S.; Raiman, J.; Miller, J. Deep voice 3: Scaling text-to-speech with convolutional sequence learning. arXiv 2017, arXiv:1710.07654. [Google Scholar]
- Shen, J.; Pang, R.; Weiss, R.J.; Schuster, M.; Jaitly, N.; Yang, Z.; Chen, Z.; Zhang, Y.; Wang, Y.; Skerry-Ryan, R.; et al. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; IEEE SPS: Piscataway, NJ, USA, 2018; pp. 4779–4783. [Google Scholar]
- Ren, Y.; Hu, C.; Tan, X.; Qin, T.; Zhao, S.; Zhao, Z.; Liu, T.Y. Fastspeech 2: Fast and high-quality end-to-end text to speech. arXiv 2020, arXiv:2006.04558. [Google Scholar]
- Kong, Z.; Ping, W.; Huang, J.; Zhao, K.; Catanzaro, B. Diffwave: A versatile diffusion model for audio synthesis. arXiv 2020, arXiv:2009.09761. [Google Scholar]
- Huang, R.; Lam, M.W.; Wang, J.; Su, D.; Yu, D.; Ren, Y.; Zhao, Z. Fastdiff: A fast conditional diffusion model for high-quality speech synthesis. arXiv 2022, arXiv:2204.09934. [Google Scholar]
- Lee, S.G.; Ping, W.; Ginsburg, B.; Catanzaro, B.; Yoon, S. Bigvgan: A universal neural vocoder with large-scale training. arXiv 2022, arXiv:2206.04658. [Google Scholar]
- Liao, S.; Lan, S.; Zachariah, A.G. EVA-GAN: Enhanced Various Audio Generation via Scalable Generative Adversarial Networks. arXiv 2024, arXiv:2402.00892. [Google Scholar]
- Lee, S.H.; Choi, H.Y.; Lee, S.W. Periodwave: Multi-period flow matching for high-fidelity waveform generation. arXiv 2024, arXiv:2408.07547. [Google Scholar]
- Liu, P.; Dai, D.; Wu, Z. RFWave: Multi-band Rectified Flow for Audio Waveform Reconstruction. arXiv 2024, arXiv:2403.05010. [Google Scholar]
- Shibuya, T.; Takida, Y.; Mitsufuji, Y. Bigvsan: Enhancing gan-based neural vocoders with slicing adversarial network. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; IEEE SPS: Piscataway, NJ, USA, 2024; pp. 10121–10125. [Google Scholar]
- Lee, S.H.; Choi, H.Y.; Lee, S.W. Accelerating High-Fidelity Waveform Generation via Adversarial Flow Matching Optimization. arXiv 2024, arXiv:2408.08019. [Google Scholar]
- Godulla, A.; Hoffmann, C.P.; Seibert, D. Dealing with deepfakes–an interdisciplinary examination of the state of research and implications for communication studies. SCM Stud. Commun. Media 2021, 10, 72–96. [Google Scholar] [CrossRef]
- Sisman, B.; Yamagishi, J.; King, S.; Li, H. An overview of voice conversion and its challenges: From statistical modeling to deep learning. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 29, 132–157. [Google Scholar] [CrossRef]
- Veaux, C.; Yamagishi, J.; King, S. Towards personalised synthesised voices for individuals with vocal disabilities: Voice banking and reconstruction. In Proceedings of the Fourth Workshop on Speech and Language Processing for Assistive Technologies (SLPAT), Grenoble, France, 21–22 August 2013; ACL: Stroundsburg, PA, USA, 2013; pp. 107–111. [Google Scholar]
- Srivastava, B.M.L.; Vauquier, N.; Sahidullah, M.; Bellet, A.; Tommasi, M.; Vincent, E. Evaluating voice conversion-based privacy protection against informed attackers. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; IEEE SPS: Piscataway, NJ, USA, 2020; pp. 2802–2806. [Google Scholar]
- Mohammadi, S.H.; Kain, A. An overview of voice conversion systems. Speech Commun. 2017, 88, 65–82. [Google Scholar]
- Sisman, B.; Zhang, M.; Sakti, S.; Li, H.; Nakamura, S. Adaptive wavenet vocoder for residual compensation in gan-based voice conversion. In Proceedings of the 2018 IEEE Spoken Language Technology Workshop (SLT), Athens, Greece, 18–21 December 2018; IEEE SPS: Piscataway, NJ, USA, 2018; pp. 282–289. [Google Scholar]
- Sisman, B.; Zhang, M.; Li, H. Group sparse representation with wavenet vocoder adaptation for spectrum and prosody conversion. IEEE/ACM Trans. Audio Speech Lang. Process. 2019, 27, 1085–1097. [Google Scholar]
- Kaneko, T.; Kameoka, H. Cyclegan-vc: Non-parallel voice conversion using cycle-consistent adversarial networks. In Proceedings of the 2018 26th European Signal Processing Conference (EUSIPCO), Rome, Italy, 3–7 September 2018; IEEE SPS: Piscataway, NJ, USA, 2018; pp. 2100–2104. [Google Scholar]
- Qian, K.; Zhang, Y.; Chang, S.; Yang, X.; Hasegawa-Johnson, M. Autovc: Zero-shot voice style transfer with only autoencoder loss. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; PMLR: New York, NY, USA, 2019; pp. 5210–5219. [Google Scholar]
- Wu, D.Y.; Chen, Y.H.; Lee, H.Y. Vqvc+: One-shot voice conversion by vector quantization and u-net architecture. arXiv 2020, arXiv:2006.04154. [Google Scholar]
- Li, J.; Tu, W.; Xiao, L. Freevc: Towards high-quality text-free one-shot voice conversion. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; IEEE SPS: Piscataway, NJ, USA, 2023; pp. 1–5. [Google Scholar]
- Huang, J.; Zhang, C.; Ren, Y.; Jiang, Z.; Ye, Z.; Liu, J.; He, J.; Yin, X.; Zhao, Z. MulliVC: Multi-lingual Voice Conversion with Cycle Consistency. arXiv 2024, arXiv:2408.04708. [Google Scholar]
- Müller, N.M.; Czempin, P.; Dieckmann, F.; Froghyar, A.; Böttinger, K. Does audio deepfake detection generalize? arXiv 2022, arXiv:2203.16263. [Google Scholar]
- Wu, Z.; Evans, N.; Kinnunen, T.; Yamagishi, J.; Alegre, F.; Li, H. Spoofing and countermeasures for speaker verification: A survey. Speech Commun. 2015, 66, 130–153. [Google Scholar]
- Wu, Z.; Kinnunen, T.; Evans, N.; Yamagishi, J. ASVspoof 2015: Automatic speaker verification spoofing and countermeasures challenge evaluation plan. Training 2014, 10, 3750. [Google Scholar]
- Kinnunen, T.; Evans, N.; Yamagishi, J.; Lee, K.A.; Sahidullah, M.; Todisco, M.; Delgado, H. ASVspoof 2017: Automatic speaker verification spoofing and countermeasures challenge evaluation plan. Training 2017, 10, 1508. [Google Scholar]
- Delgado, H.; Todisco, M.; Sahidullah, M.; Evans, N.; Kinnunen, T.; Lee, K.A.; Yamagishi, J. ASVspoof 2017 Version 2.0: Meta-data analysis and baseline enhancements. In Proceedings of the The Speaker and Language Recognition Workshop, Les Sables d’Olonne, France, 26–29 June 2018; pp. 296–303. [Google Scholar]
- Lee, K.A.; Larcher, A.; Wang, G.; Kenny, P.; Brümmer, N.; Van Leeuwen, D.; Aronowitz, H.; Kockmann, M.; Vaquero, C.; Ma, B.; et al. The RedDots data collection for speaker recognition. In Proceedings of the Interspeech 2015, Dresden, Germany, 6–10 September 2015. [Google Scholar]
- Kinnunen, T.; Sahidullah, M.; Falcone, M.; Costantini, L.; Hautamäki, R.G.; Thomsen, D.; Sarkar, A.; Tan, Z.H.; Delgado, H.; Todisco, M.; et al. Reddots replayed: A new replay spoofing attack corpus for text-dependent speaker verification research. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; IEEE SPS: Piscataway, NJ, USA, 2017; pp. 5395–5399. [Google Scholar]
- Todisco, M.; Wang, X.; Vestman, V.; Sahidullah, M.; Delgado, H.; Nautsch, A.; Yamagishi, J.; Evans, N.; Kinnunen, T.; Lee, K.A. ASVspoof 2019: Future horizons in spoofed and fake audio detection. arXiv 2019, arXiv:1904.05441. [Google Scholar]
- Wang, X.; Yamagishi, J.; Todisco, M.; Delgado, H.; Nautsch, A.; Evans, N.; Sahidullah, M.; Vestman, V.; Kinnunen, T.; Lee, K.A.; et al. ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech. Comput. Speech Lang. 2020, 64, 101114. [Google Scholar]
- Veaux, C.; Yamagishi, J.; MacDonald, K. CSTR VCTK Corpus: English Multi-Speaker Corpus for CSTR Voice Cloning Toolkit; The Centre for Speech Technology Research (CSTR)-University of Edinburgh: Edinburgh, UK, 2017; Volume 6, p. 15. [Google Scholar]
- Yamagishi, J.; Wang, X.; Todisco, M.; Sahidullah, M.; Patino, J.; Nautsch, A.; Liu, X.; Lee, K.A.; Kinnunen, T.; Evans, N.; et al. ASVspoof 2021: Accelerating progress in spoofed and deepfake speech detection. arXiv 2021, arXiv:2109.00537. [Google Scholar]
- Liu, X.; Wang, X.; Sahidullah, M.; Patino, J.; Delgado, H.; Kinnunen, T.; Todisco, M.; Yamagishi, J.; Evans, N.; Nautsch, A.; et al. ASVspoof 2021: Towards spoofed and deepfake speech detection in the wild. IEEE/ACM Trans. Audio Speech Lang. Process. 2023, 31, 2507–2522. [Google Scholar]
- Wang, X.; Delgado, H.; Tak, H.; Jung, J.W.; Shim, H.J.; Todisco, M.; Kukanov, I.; Liu, X.; Sahidullah, M.; Kinnunen, T.; et al. ASVspoof 5: Crowdsourced speech data, deepfakes, and adversarial attacks at scale. arXiv 2024, arXiv:2408.08739. [Google Scholar]
- Pratap, V.; Xu, Q.; Sriram, A.; Synnaeve, G.; Collobert, R. Mls: A large-scale multilingual dataset for speech research. arXiv 2020, arXiv:2012.03411. [Google Scholar]
- Yi, J.; Fu, R.; Tao, J.; Nie, S.; Ma, H.; Wang, C.; Wang, T.; Tian, Z.; Bai, Y.; Fan, C.; et al. Add 2022: The first audio deep synthesis detection challenge. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; IEEE SPS: Piscataway, NJ, USA, 2022; pp. 9216–9220. [Google Scholar]
- Bu, H.; Du, J.; Na, X.; Wu, B.; Zheng, H. Aishell-1: An open-source Mandarin speech corpus and a speech recognition baseline. In Proceedings of the 2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA), Seoul, Republic of Korea, 1–3 November 2017; pp. 1–5. [Google Scholar]
- Shi, Y.; Bu, H.; Xu, X.; Zhang, S.; Li, M. Aishell-3: A multi-speaker Mandarin TTS corpus and the baselines. arXiv 2020, arXiv:2010.11567. [Google Scholar]
- Fu, Y.; Cheng, L.; Lv, S.; Jv, Y.; Kong, Y.; Chen, Z.; Hu, Y.; Xie, L.; Wu, J.; Bu, H.; et al. Aishell-4: An open source dataset for speech enhancement, separation, recognition and speaker diarization in conference scenario. arXiv 2021, arXiv:2104.03603. [Google Scholar]
- Yi, J.; Tao, J.; Fu, R.; Yan, X.; Wang, C.; Wang, T.; Zhang, C.Y.; Zhang, X.; Zhao, Y.; Ren, Y.; et al. Add 2023: The second audio deepfake detection challenge. arXiv 2023, arXiv:2305.13774. [Google Scholar]
- Wang, D.; Zhang, X. THCHS-30: A free Chinese speech corpus. arXiv 2015, arXiv:1512.01882. [Google Scholar]
- Reimao, R.; Tzerpos, V. For: A dataset for synthetic speech detection. In Proceedings of the 2019 International Conference on Speech Technology and Human-Computer Dialogue (SpeD), Timisoara, Romania, 10–12 October 2019; pp. 1–10. [Google Scholar]
- Frank, J.; Schönherr, L. WaveFake: A data set to facilitate audio deepfake detection. arXiv 2021, arXiv:2111.02813. [Google Scholar]
- Yi, J.; Bai, Y.; Tao, J.; Ma, H.; Tian, Z.; Wang, C.; Wang, T.; Fu, R. Half-truth: A partially fake audio detection dataset. arXiv 2021, arXiv:2104.03617. [Google Scholar]
- Sun, C.; Jia, S.; Hou, S.; Lyu, S. AI-synthesized voice detection using neural vocoder artifacts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 904–912. [Google Scholar]
- Yi, J.; Wang, C.; Tao, J.; Zhang, C.Y.; Fan, C.; Tian, Z.; Ma, H.; Fu, R. SceneFake: An initial dataset and benchmarks for scene fake audio detection. Pattern Recognit. 2024, 152, 110468. [Google Scholar] [CrossRef]
- Zhao, Y.; Yi, J.; Tao, J.; Wang, C.; Dong, Y. EmoFake: An initial dataset for emotion fake audio detection. In Proceedings of the 23rd China National Conference on Chinese Computational Linguistics, Taiyuan, China, 25–28 July 2024; Springer: Singapore, 2024; pp. 419–433. [Google Scholar]
- Li, X.; Li, K.; Zheng, Y.; Yan, C.; Ji, X.; Xu, W. SafeEar: Content privacy-preserving audio deepfake detection. arXiv 2024, arXiv:2409.09272. [Google Scholar]
- Müller, N.M.; Kawa, P.; Choong, W.H.; Casanova, E.; Gölge, E.; Müller, T.; Syga, P.; Sperl, P.; Böttinger, K. MLAAD: The multi-language audio anti-spoofing dataset. arXiv 2024, arXiv:2401.09512. [Google Scholar]
- Teh, P.S.; Zhang, N.; Teoh, A.B.J.; Chen, K. A survey on touch dynamics authentication in mobile devices. Comput. Secur. 2016, 59, 210–235. [Google Scholar] [CrossRef]
- Kinnunen, T.; Lee, K.A.; Delgado, H.; Evans, N.; Todisco, M.; Sahidullah, M.; Yamagishi, J.; Reynolds, D.A. t-DCF: A detection cost function for the tandem assessment of spoofing countermeasures and automatic speaker verification. arXiv 2018, arXiv:1804.09618. [Google Scholar]
- Kinnunen, T.H.; Lee, K.A.; Tak, H.; Evans, N.; Nautsch, A. t-EER: Parameter-free tandem evaluation of countermeasures and biometric comparators. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 46, 2622–2637. [Google Scholar] [CrossRef]
- Shim, H.J.; Jung, J.W.; Kinnunen, T.; Evans, N.; Bonastre, J.F.; Lapidot, I. a-DCF: An architecture agnostic metric with application to spoofing-robust speaker verification. arXiv 2024, arXiv:2403.01355. [Google Scholar]
- Sahidullah, M.; Kinnunen, T.; Hanilçi, C. A comparison of features for synthetic speech detection. In Proceedings of the 16th Annual Conference of the International Speech Communication Association (INTERSPEECH 2015), Dresden, Germany, 6–10 September 2015. [Google Scholar]
- Das, R.K.; Yang, J.; Li, H. Long Range Acoustic Features for Spoofed Speech Detection. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019; pp. 1058–1062. [Google Scholar]
- Wang, X.; Yamagishi, J. Investigating self-supervised front ends for speech spoofing countermeasures. arXiv 2021, arXiv:2111.07725. [Google Scholar]
- Li, M.; Ahmadiadli, Y.; Zhang, X.P. A comparative study on physical and perceptual features for deepfake audio detection. In Proceedings of the 1st International Workshop on Deepfake Detection for Audio Multimedia, Lisboa, Portugal, 14 October 2022; ACM: New York, NY, USA, 2022; pp. 35–41. [Google Scholar]
- Ambaye, G.A. Time and Frequency Domain Analysis of Signals: A Review. Int. J. Eng. Res. Technol. 2020, 9, 271–276. [Google Scholar]
- Zhang, Y.; Wang, W.; Zhang, P. The effect of silence and dual-band fusion in anti-spoofing system. In Proceedings of the Interspeech 2021, Brno, Czech Republic, 30 August–3 September 2021; pp. 4279–4283. [Google Scholar]
- Chen, T.; Kumar, A.; Nagarsheth, P.; Sivaraman, G.; Khoury, E. Generalization of Audio Deepfake Detection. In Proceedings of the Speaker and Language Recognition Workshop (Odyssey 2020), Tokyo, Japan, 1–5 November 2020; pp. 132–137. [Google Scholar]
- Tak, H.; Jung, J.w.; Patino, J.; Todisco, M.; Evans, N. Graph attention networks for anti-spoofing. arXiv 2021, arXiv:2104.03654. [Google Scholar]
- Fathan, A.; Alam, J.; Kang, W.H. Mel-spectrogram image-based end-to-end audio deepfake detection under channel-mismatched conditions. In Proceedings of the 2022 IEEE International Conference on Multimedia and Expo (ICME), Taipei, Taiwan, 18–22 July 2022; IEEE ComSoc: New York, NY, USA, 2022; pp. 1–6. [Google Scholar]
- Wani, T.M.; Gulzar, R.; Amerini, I. ABC-CapsNet: Attention based Cascaded Capsule Network for Audio Deepfake Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024, Seattle, WA, USA, 17–21 June 2024; IEEE ComSoc: New York, NY, USA, 2024; pp. 2464–2472. [Google Scholar]
- Li, X.; Li, N.; Weng, C.; Liu, X.; Su, D.; Yu, D.; Meng, H. Replay and synthetic speech detection with res2net architecture. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; IEEE SPS: Piscataway, NJ, USA, 2021; pp. 6354–6358. [Google Scholar]
- Müller, N.M.; Sperl, P.; Böttinger, K. Complex-valued neural networks for voice anti-spoofing. arXiv 2023, arXiv:2308.11800. [Google Scholar]
- Ling, H.; Huang, L.; Huang, J.; Zhang, B.; Li, P. Attention-Based Convolutional Neural Network for ASV Spoofing Detection. In Proceedings of the Interspeech 2021, Brno, Czech Republic, 30 August–3 September 2021; pp. 4289–4293. [Google Scholar]
- Randall, R.B. A history of cepstrum analysis and its application to mechanical problems. Mech. Syst. Signal Process. 2017, 97, 3–19. [Google Scholar]
- Zhang, Z.; Yi, X.; Zhao, X. Fake speech detection using residual network with transformer encoder. In Proceedings of the 2021 ACM Workshop on Information Hiding and Multimedia Security, Online, 22–25 June 2021; ACM: New York, NY, USA, 2021; pp. 13–22. [Google Scholar]
- Hamza, A.; Javed, A.R.; Iqbal, F.; Kryvinska, N.; Almadhor, A.S.; Jalil, Z.; Borghol, R. Deepfake audio detection via MFCC features using machine learning. IEEE Access 2022, 10, 134018–134028. [Google Scholar]
- Firc, A.; Malinka, K.; Hanáček, P. Deepfake Speech Detection: A Spectrogram Analysis. In Proceedings of the 39th ACM/SIGAPP Symposium on Applied Computing, Avila, Spain, 8–12 April 2024; ACM: New York, NY, USA, 2024; pp. 1312–1320. [Google Scholar]
- Zhang, Y.; Jiang, F.; Duan, Z. One-class learning towards synthetic voice spoofing detection. IEEE Signal Process. Lett. 2021, 28, 937–941. [Google Scholar]
- Luo, A.; Li, E.; Liu, Y.; Kang, X.; Wang, Z.J. A capsule network based approach for detection of audio spoofing attacks. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; IEEE SPS: Piscataway, NJ, USA, 2021; pp. 6359–6363. [Google Scholar]
- Wang, C.; He, J.; Yi, J.; Tao, J.; Zhang, C.Y.; Zhang, X. Multi-Scale Permutation Entropy for Audio Deepfake Detection. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 1406–1410. [Google Scholar]
- Yang, Y.; Qin, H.; Zhou, H.; Wang, C.; Guo, T.; Han, K.; Wang, Y. A robust audio deepfake detection system via multi-view feature. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 13131–13135. [Google Scholar]
- Zhang, Q.; Wen, S.; Hu, T. Audio deepfake detection with self-supervised XLS-R and SLS classifier. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, Australia, 28 October–1 November 2024; ACM: New York, NY, USA, 2024; pp. 6765–6773. [Google Scholar]
- Ravanelli, M.; Bengio, Y. Speaker recognition from raw waveform with sincnet. In Proceedings of the 2018 IEEE Spoken Language Technology Workshop (SLT), Athens, Greece, 18–21 December 2018; IEEE SPS: Piscataway, NJ, USA, 2018; pp. 1021–1028. [Google Scholar]
- Li, M.; Ahmadiadli, Y.; Zhang, X.P. Robust Deepfake Audio Detection via Bi-Level Optimization. In Proceedings of the 2023 IEEE 25th International Workshop on Multimedia Signal Processing (MMSP), Poitiers, France, 27–29 September 2023; IEEE SPS: Piscataway, NJ, USA, 2023; pp. 1–6. [Google Scholar]
- Jung, J.W.; Heo, H.S.; Kim, J.H.; Shim, H.J.; Yu, H.J. Rawnet: Advanced end-to-end deep neural network using raw waveforms for text-independent speaker verification. arXiv 2019, arXiv:1904.08104. [Google Scholar]
- Jung, J.W.; Kim, S.B.; Shim, H.J.; Kim, J.H.; Yu, H.J. Improved rawnet with feature map scaling for text-independent speaker verification using raw waveforms. arXiv 2020, arXiv:2004.00526. [Google Scholar]
- Ranjan, R.; Vatsa, M.; Singh, R. Statnet: Spectral and temporal features based multi-task network for audio spoofing detection. In Proceedings of the 2022 IEEE International Joint Conference on Biometrics (IJCB 2022), Abu Dhabi, United Arab Emirates, 10–13 October 2022; pp. 1–9. [Google Scholar]
- Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. Adv. Neural Inf. Process. Syst. 2020, 33, 12449–12460. [Google Scholar]
- Conneau, A.; Baevski, A.; Collobert, R.; Mohamed, A.; Auli, M. Unsupervised cross-lingual representation learning for speech recognition. arXiv 2020, arXiv:2006.13979. [Google Scholar]
- Babu, A.; Wang, C.; Tjandra, A.; Lakhotia, K.; Xu, Q.; Goyal, N.; Singh, K.; Von Platen, P.; Saraf, Y.; Pino, J.; et al. XLS-R: Self-supervised cross-lingual speech representation learning at scale. arXiv 2021, arXiv:2111.09296. [Google Scholar]
- Martín-Doñas, J.M.; Álvarez, A. The vicomtech audio deepfake detection system based on wav2vec2 for the 2022 add challenge. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; IEEE SPS: Piscataway, NJ, USA, 2022; pp. 9241–9245. [Google Scholar]
- Saha, S.; Sahidullah, M.; Das, S. Exploring Green AI for Audio Deepfake Detection. arXiv 2024, arXiv:2403.14290. [Google Scholar]
- Chen, S.; Wang, C.; Chen, Z.; Wu, Y.; Liu, S.; Chen, Z.; Li, J.; Kanda, N.; Yoshioka, T.; Xiao, X.; et al. Wavlm: Large-scale self-supervised pre-training for full stack speech processing. IEEE J. Sel. Top. Signal Process. 2022, 16, 1505–1518. [Google Scholar] [CrossRef]
- Guo, Y.; Huang, H.; Chen, X.; Zhao, H.; Wang, Y. Audio Deepfake Detection With Self-Supervised Wavlm And Multi-Fusion Attentive Classifier. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 12702–12706. [Google Scholar]
- Zhu, Y.; Powar, S.; Falk, T.H. Characterizing the temporal dynamics of universal speech representations for generalizable deepfake detection. In Proceedings of the 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW), Seoul, Republic of Korea, 14–19 April 2024; IEEE SPS: Piscataway, NJ, USA, 2024; pp. 139–143. [Google Scholar]
- Zhu, Y.; Koppisetti, S.; Tran, T.; Bharaj, G. Slim: Style-linguistics mismatch model for generalized audio deepfake detection. arXiv 2024, arXiv:2407.18517. [Google Scholar]
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
- Mcuba, M.; Singh, A.; Ikuesan, R.A.; Venter, H. The effect of deep learning methods on deepfake audio detection for digital investigation. Procedia Comput. Sci. 2023, 219, 211–219. [Google Scholar] [CrossRef]
- Alegre, F.; Vipperla, R.; Evans, N. Spoofing countermeasures for the protection of automatic speaker recognition from attacks with artificial signals. In Proceedings of the Interspeech 2012, Portland, OR, USA, 9–13 September 2012; pp. 54–58. [Google Scholar]
- Chingovska, I.; Anjos, A.; Marcel, S. Anti-spoofing in action: Joint operation with a verification system. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Portland, OR, USA, 23–28 June 2013; IEEE ComSoc: New York, NY, USA, 2013; pp. 98–104. [Google Scholar]
- Villalba, J.; Miguel, A.; Ortega, A.; Lleida, E. Spoofing detection with DNN and one-class SVM for the ASVspoof 2015 challenge. In Proceedings of the Interspeech 2015, Dresden, Germany, 6–10 September 2015; Volume 2015, pp. 2067–2071. [Google Scholar]
- Iqbal, F.; Abbasi, A.; Javed, A.R.; Jalil, Z.; Al-Karaki, J.N. Deepfake Audio Detection Via Feature Engineering And Machine Learning. In Proceedings of the CIKM 2022 Workshops, Atlanta, GA, USA, 17–21 October 2022; CEUR Workshop Proceedings; 2022. Volume 3318, pp. 1–12. [Google Scholar]
- Javed, A.; Malik, K.M.; Malik, H.; Irtaza, A. Voice spoofing detector: A unified anti-spoofing framework. Expert Syst. Appl. 2022, 198, 116770. [Google Scholar] [CrossRef]
- Karo, M.; Yeredor, A.; Lapidot, I. Compact Time-Domain Representation for Logical Access Spoofed Audio. IEEE/ACM Trans. Audio Speech Lang. Process. 2023, 32, 946–958. [Google Scholar] [CrossRef]
- Hinton, G.E.; Krizhevsky, A.; Wang, S.D. Transforming auto-encoders. In Proceedings of the Artificial Neural Networks and Machine Learning–ICANN 2011: 21st International Conference on Artificial Neural Networks, Espoo, Finland, 14–17 June 2011; Proceedings; Springer: Berlin/Heidelberg, Germany, 2011; pp. 44–51. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- Ulutas, G.; Tahaoglu, G.; Ustubioglu, B. Deepfake audio detection with vision transformer based method. In Proceedings of the 2023 46th International Conference on Telecommunications and Signal Processing (TSP), Online, 12–14 July 2023; pp. 244–247. [Google Scholar]
- Goel, C.; Koppisetti, S.; Colman, B.; Shahriyari, A.; Bharaj, G. Towards Attention-based Contrastive Learning for Audio Spoof Detection. In Proceedings of the Interspeech 2023, Dublin, Ireland, 20–24 August 2023; Volume 2023, pp. 2758–2762. [Google Scholar]
- Gong, Y.; Chung, Y.A.; Glass, J. Ast: Audio spectrogram transformer. arXiv 2021, arXiv:2104.01778. [Google Scholar]
- Gong, Y.; Lai, C.I.; Chung, Y.A.; Glass, J. Ssast: Self-supervised audio spectrogram transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; AAAI: Menlo Park, CA, USA, 2022; Volume 36, pp. 10699–10709. [Google Scholar]
- Wu, Z.; Pan, S.; Chen, F.; Long, G.; Zhang, C.; Philip, S.Y. A comprehensive survey on graph neural networks. IEEE Trans. Neural Netw. Learn. Syst. 2020, 32, 4–24. [Google Scholar]
- Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907. [Google Scholar]
- Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph attention networks. arXiv 2017, arXiv:1710.10903. [Google Scholar]
- Chen, F.; Deng, S.; Zheng, T.; He, Y.; Han, J. Graph-based spectro-temporal dependency modeling for anti-spoofing. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; IEEE SPS: Piscataway, NJ, USA, 2023; pp. 1–5. [Google Scholar]
- Tak, H.; Jung, J.W.; Patino, J.; Kamble, M.; Todisco, M.; Evans, N. End-to-end spectro-temporal graph attention networks for speaker verification anti-spoofing and speech deepfake detection. arXiv 2021, arXiv:2107.12710. [Google Scholar]
- Jung, J.W.; Heo, H.S.; Tak, H.; Shim, H.J.; Chung, J.S.; Lee, B.J.; Yu, H.J.; Evans, N. Aasist: Audio anti-spoofing using integrated spectro-temporal graph attention networks. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; IEEE SPS: Piscataway, NJ, USA, 2022; pp. 6367–6371. [Google Scholar]
- Tak, H.; Todisco, M.; Wang, X.; Jung, J.W.; Yamagishi, J.; Evans, N. Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation. arXiv 2022, arXiv:2202.12233. [Google Scholar]
- Eom, Y.; Lee, Y.; Um, J.S.; Kim, H. Anti-spoofing using transfer learning with variational information bottleneck. arXiv 2022, arXiv:2204.01387. [Google Scholar]
- Wang, Z.; Fu, R.; Wen, Z.; Tao, J.; Wang, X.; Xie, Y.; Qi, X.; Shi, S.; Lu, Y.; Liu, Y.; et al. Mixture of Experts Fusion for Fake Audio Detection Using Frozen wav2vec 2.0. arXiv 2024, arXiv:2409.11909. [Google Scholar]
- Wang, C.; Yi, J.; Tao, J.; Sun, H.; Chen, X.; Tian, Z.; Ma, H.; Fan, C.; Fu, R. Fully automated end-to-end fake audio detection. In Proceedings of the 1st International Workshop on Deepfake Detection for Audio Multimedia, Lisboa, Portugal, 14 October 2022; ACM: New York, NY, USA, 2022; pp. 27–33. [Google Scholar]
- Chen, Y.; Yi, J.; Xue, J.; Wang, C.; Zhang, X.; Dong, S.; Zeng, S.; Tao, J.; Zhao, L.; Fan, C. RawBMamba: End-to-End Bidirectional State Space Model for Audio Deepfake Detection. arXiv 2024, arXiv:2406.06086. [Google Scholar]
- Haselton, T. Google Admits Partners Leaked More Than 1000 Private Conversations with Google Assistant. 2019. Available online: https://www.cnbc.com/2019/07/11/google-admits-leaked-private-voice-conversations.html (accessed on 6 January 2025).
- Yadav, A.K.S.; Bhagtani, K.; Salvi, D.; Bestagini, P.; Delp, E.J. FairSSD: Understanding Bias in Synthetic Speech Detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; IEEE ComSoc: New York, NY, USA, 2024; pp. 4418–4428. [Google Scholar]
- Ardila, R.; Branson, M.; Davis, K.; Henretty, M.; Kohler, M.; Meyer, J.; Morais, R.; Saunders, L.; Tyers, F.M.; Weber, G. Common voice: A massively-multilingual speech corpus. arXiv 2019, arXiv:1912.06670. [Google Scholar]
- Lea, C.; Mitra, V.; Joshi, A.; Kajarekar, S.; Bigham, J.P. Sep-28k: A dataset for stuttering event detection from podcasts with people who stutter. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; IEEE SPS: Piscataway, NJ, USA, 2021; pp. 6798–6802. [Google Scholar]
- Zhang, X.; Yi, J.; Wang, C.; Zhang, C.Y.; Zeng, S.; Tao, J. What to remember: Self-adaptive continual learning for audio deepfake detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; AAAI: Menlo Park, CA, USA, 2024; Volume 38, pp. 19569–19577. [Google Scholar]
- Dong, F.; Tang, Q.; Bai, Y.; Wang, Z. Advancing Continual Learning for Robust Deepfake Audio Classification. arXiv 2024, arXiv:2407.10108. [Google Scholar]
- Zhang, X.; Yi, J.; Tao, J. EVDA: Evolving Deepfake Audio Detection Continual Learning Benchmark. arXiv 2024, arXiv:2405.08596. [Google Scholar]
- Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. arXiv 2017, arXiv:1705.07874. [Google Scholar]
- Ge, W.; Patino, J.; Todisco, M.; Evans, N. Explaining deep learning models for spoofing and deepfake detection with SHapley Additive exPlanations. In Proceedings of the ICASSP 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; IEEE SPS: Piscataway, NJ, USA, 2022; pp. 6387–6391. [Google Scholar]
- Ge, W.; Todisco, M.; Evans, N. Explainable deepfake and spoofing detection: An attack analysis using SHapley Additive exPlanations. arXiv 2022, arXiv:2202.13693. [Google Scholar]
- Müller, N.M.; Dieckmann, F.; Czempin, P.; Canals, R.; Böttinger, K.; Williams, J. Speech is silver, silence is golden: What do ASVspoof-trained models really learn? arXiv 2021, arXiv:2106.12914. [Google Scholar]
- Channing, G.; Sock, J.; Clark, R.; Torr, P.; de Witt, C.S. Toward Robust Real-World Audio Deepfake Detection: Closing the Explainability Gap. arXiv 2024, arXiv:2410.07436. [Google Scholar]
- Yu, Z.; Zhai, S.; Zhang, N. Antifake: Using adversarial audio to prevent unauthorized speech synthesis. In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, Copenhagen, Denmark, 26–30 November 2023; ACM: New York, NY, USA, 2023; pp. 460–474. [Google Scholar]
- Juvela, L.; Wang, X. Collaborative Watermarking for Adversarial Speech Synthesis. In Proceedings of the ICASSP 2024–2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 11231–11235. [Google Scholar]
- Liu, C.; Zhang, J.; Zhang, T.; Yang, X.; Zhang, W.; Yu, N. Detecting Voice Cloning Attacks via Timbre Watermarking. arXiv 2023, arXiv:2312.03410. [Google Scholar]
- Wang, B.; Tang, Y.; Wei, F.; Ba, Z.; Ren, K. FTDKD: Frequency-Time Domain Knowledge Distillation for Low-Quality Compressed Audio Deepfake Detection. IEEE/ACM Trans. Audio Speech Lang. Process. 2024, 32, 4905–4918. [Google Scholar]
- Xiang, Z.; Yadav, A.K.S.; Tubaro, S.; Bestagini, P.; Delp, E.J. Extracting efficient spectrograms from mp3 compressed speech signals for synthetic speech detection. In Proceedings of the 2023 ACM Workshop on Information Hiding and Multimedia Security, Chicago, IL, USA, 28–30 June 2023; ACM: New York, NY, USA, 2023; pp. 163–168. [Google Scholar]
- Yadav, A.K.S. Machine Learning Approaches for Speech Forensics. Ph.D. Thesis, Purdue University Graduate School, West Lafayette, IN, USA, 2024. [Google Scholar]
- Robins-Early, N. CEO of World’s Biggest ad Firm Targeted by Deepfake Scam. 2024. Available online: https://www.theguardian.com/technology/article/2024/may/10/ceo-wpp-deepfake-scam (accessed on 6 February 2025).
- DCVC. Deep Tech to Defeat Deepfakes: Reality Defender Is on the Case. 2024. Available online: https://www.dcvc.com/news-insights/deep-tech-to-defeat-deepfakes/ (accessed on 6 February 2025).
- RESEMBLE.AI. Introducing State-of-the-Art in Multimodal Deepfake Detection. 2024. Available online: https://www.resemble.ai/multimodal-deepfake-detection/ (accessed on 6 February 2025).
- Turner, M. How Clever App Can Spot Deepfake Videos in Latest Battle Against ‘Lethal’ AI Clones on Facebook and YouTube. 2024. Available online: https://www.thescottishsun.co.uk/tech/13570234/deepfake-detector-app-mcafee-ai-videos/ (accessed on 6 February 2025).
- Taylor, J.; Whitty, M. An exploration of the awareness and attitudes of psychology students regarding their psychological literacy for working in the cybersecurity industry. Psychol. Learn. Teach. 2024, 23, 298–314. [Google Scholar]
Authors | Year | Main Contributions | Limitations |
---|---|---|---|
Chadha et al. [4] | 2021 | Introduces deepfake technology, types, and detection methods, focusing mainly on image and video deepfakes. | Limited focus on audio deepfake detection, lacks evaluation of modern detection models and benchmark comparisons. |
Ren et al. [36] | 2021 | Discusses deepfake attacks on both human perception and ASV systems. | Lacks analysis of advanced detection models, limited discussion on explainability or privacy in deepfake detection. |
Almutairi and Elgibreen [33] | 2022 | Categorises ML/DL techniques, analyses datasets, and discusses future challenges. | Limited exploration of advanced detection models. |
Khanjani et al. [32] | 2023 | Categorises generation and detection methods, reviews over 150 studies, and discusses societal threats. | Lacks a quantitative comparison of detection models, does not cover recent advancements beyond 2021. |
Patel et al. [34] | 2023 | Provides a multimodal deepfake analysis, introduces a case study on detection inconsistencies. | Lacks a detailed technical evaluation of audio deepfake detection methods, does not provide quantitative model benchmarking. |
Yi et al. [35] | 2023 | Performs an experimental comparison on benchmark datasets. | Lacks real-world application discussions and in-depth analysis of robustness and explainability. |
Mubarak et al. [37] | 2023 | Emphasises societal impacts and holistic mitigation strategies, discusses detection techniques and emerging threats. | Takes a generalised approach across media types, does not include recent technological advances in audio detection models. |
Masood et al. [38] | 2023 | Provides a detailed review of deepfakes across modalities, discusses datasets, detection methods, and future challenges. | Lacks in-depth technical analysis specific to audio deepfake detection, does not benchmark detection models across datasets. |
Ours | 2024 | Most up-to-date survey (through 2024); provides a quantitative comparison of detection models; first to analyse privacy, fairness, adaptability, and explainability in audio deepfake detection. Expands beyond detection methods to robustness, real-world deployment, and future research directions. | |
Attribute | Speech Synthesis | Voice Conversion |
---|---|---|
Objective | Generate speech from nonspeech input | Transform input speech into target speech |
Input | Text, phonemes, etc. | Source speech signal |
Output | Speech signal | Speech signal with consistent content but altered style or attributes |
Complexity | Language understanding and speech generation | Feature transformation and target reconstruction |
Representative models | DeepVoice, Tacotron, FastSpeech | CycleGAN-VC, AutoVC, MulliVC, VQVC, FreeVC |
Main challenges | Enhancing speech naturalness and fluency | Maintaining content consistency while changing target attributes |
Challenge | Language | Year | Track | System (Frontend + Backend) | Results |
---|---|---|---|---|---|
ASVspoof | English | 2019 | LA | CQCC + GMM | t-DCF 0.2839, EER 9.57% |
ASVspoof | English | 2019 | LA | LFCC + GMM | t-DCF 0.2605, EER 8.09% |
ASVspoof | English | 2021 | LA / DF | CQCC + GMM | t-DCF 0.4974, EER 15.62% (LA); EER 25.56% (DF) |
ASVspoof | English | 2021 | LA / DF | LFCC + GMM | t-DCF 0.5758, EER 19.30% (LA); EER 25.25% (DF) |
ASVspoof | English | 2021 | LA / DF | LFCC + LCNN | t-DCF 0.3445, EER 9.26% (LA); EER 23.48% (DF) |
ASVspoof | English | 2021 | LA / DF | RawNet2 | t-DCF 0.4257, EER 9.50% (LA); EER 22.38% (DF) |
ASVspoof | English | 2024 | CM | RawNet2 | minDCF 0.8266, EER 36.04% |
ASVspoof | English | 2024 | CM | AASIST | minDCF 0.7106, EER 29.12% |
ASVspoof | English | 2024 | SASV | Fusion-based | a-DCF 0.6806 |
ASVspoof | English | 2024 | SASV | Single integrated | a-DCF 0.5741 |
ADD | Chinese | 2022 | LF / PF | LFCC + GMM | EER 25.2% (LF); 45.8% (PF) |
ADD | Chinese | 2022 | LF / PF | LFCC + GMM | EER 24.1% (LF); 47.5% (PF) |
ADD | Chinese | 2022 | LF / PF | LFCC + LCNN | EER 32.3% (LF); 47.8% (PF) |
ADD | Chinese | 2022 | LF / PF | LFCC + LCNN | EER 29.9% (LF); 48.1% (PF) |
ADD | Chinese | 2022 | LF / PF | RawNet2 | EER 35.2% (LF); 50.1% (PF) |
ADD | Chinese | 2022 | LF / PF | RawNet2 | EER 33.9% (LF); 50.2% (PF) |
ADD | Chinese | 2023 | FG-D | LFCC + GMM | WEER 53.04% |
ADD | Chinese | 2023 | FG-D | LFCC + LCNN | WEER 66.72% |
ADD | Chinese | 2023 | FG-D | Wav2Vec2 + LCNN | WEER 30.05% |
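The EER values reported above can be reproduced in principle from a system's raw detection scores. As a minimal illustration (a simple threshold sweep over pooled scores, not the official challenge scoring toolkits, and with a hypothetical function name), EER can be estimated as the operating point where the false-acceptance and false-rejection rates coincide:

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """Estimate the Equal Error Rate (EER) from detection scores.

    Higher scores are assumed to indicate bona fide speech.
    """
    thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
    # False acceptance rate: spoofed audio scored at or above the threshold.
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])
    # False rejection rate: bona fide audio scored below the threshold.
    frr = np.array([(bonafide_scores < t).mean() for t in thresholds])
    # EER is the point where the two error rates are (closest to) equal.
    idx = np.argmin(np.abs(far - frr))
    return float((far[idx] + frr[idx]) / 2.0)

# Toy scores: a perfectly separated system achieves EER = 0.
bona = np.array([0.9, 0.8, 0.7])
spoof = np.array([0.1, 0.2, 0.3])
print(compute_eer(bona, spoof))  # 0.0
```

In practice, challenge toolkits interpolate the ROC curve rather than sweeping the finite score set, so values may differ slightly at small sample sizes; the crossing-point principle is the same.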
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Zhang, B.; Cui, H.; Nguyen, V.; Whitty, M. Audio Deepfake Detection: What Has Been Achieved and What Lies Ahead. Sensors 2025, 25, 1989. https://doi.org/10.3390/s25071989