Article

DDFNet: A Dual-Domain Fusion Network for Robust Synthetic Speech Detection

Jing Lu, Qiang Zhang, Jialu Cao and Hui Tian
1 College of Computer Science and Technology, Huaqiao University, Xiamen 361021, China
2 Xiamen Key Laboratory of Data Security & Blockchain Technology, Huaqiao University, Xiamen 361021, China
* Author to whom correspondence should be addressed.
Big Data Cogn. Comput. 2025, 9(3), 58; https://doi.org/10.3390/bdcc9030058
Submission received: 16 January 2025 / Revised: 25 February 2025 / Accepted: 27 February 2025 / Published: 3 March 2025

Abstract

The detection of synthetic speech has become a pressing challenge due to the potential societal risks posed by synthetic speech technologies. Existing methods primarily focus on either the time or frequency domain of speech, limiting their ability to generalize to new and diverse speech synthesis algorithms. In this work, we present a novel and scientifically grounded approach, the Dual-domain Fusion Network (DDFNet), which synergistically integrates features from both the time and frequency domains to capture complementary information. The architecture consists of two specialized single-domain feature extraction networks, each optimized for the unique characteristics of its respective domain, and a feature fusion network that effectively combines these features at a deep level. Moreover, we incorporate multi-task learning to simultaneously capture rich, multi-faceted representations, further enhancing the model’s generalization capability. Extensive experiments on the ASVspoof 2019 Logical Access corpus and ASVspoof 2021 tracks demonstrate that DDFNet achieves strong performance, maintaining competitive results despite the challenges posed by channel changes and compression coding, highlighting its robust generalization ability.

1. Introduction

Thanks to rapid progress in deep learning and neural network models, synthetic speech has achieved a remarkable level of realism and naturalness, making it nearly indistinguishable from human speech in many cases. This improvement has significantly enhanced the user experience in various applications, such as virtual assistants, customer service bots, and content creation tools, fostering more natural and engaging human-computer interaction. However, as these advancements usher in new possibilities, they also open the door for malicious actors to exploit synthetic speech technologies. Cybercriminals can use deepfake voices to deceive victims, impersonate trusted individuals, or bypass automatic speaker verification systems, thus undermining the security of voice-based authentication and verification methods. This has raised substantial concerns about privacy, security, and the overall trustworthiness of voice-driven technologies. Therefore, it is imperative to develop robust and reliable detection techniques to distinguish between real and synthetic speech, ensuring that trust and security are maintained in applications where speech authentication is critical.
To date, two primary techniques have been employed to generate synthetic speech: text-to-speech (TTS) and voice conversion (VC). The former focuses on converting text (linguistic content) into speech in the style of the target speaker, and it typically involves two main steps: text analysis and speech generation [1]. The latter, on the other hand, aims to transform the speech of a source speaker into the speech of another target speaker, encompassing three main steps: speech analysis, feature mapping, and speech reconstruction [2]. During the speech generation or reconstruction process, various acoustic features (such as Mel-spectrograms [3,4] and Mel-frequency cepstral coefficients [5,6]) are utilized to enhance the naturalness and clarity of synthetic speech. These features help improve the quality of the generated output but also unavoidably introduce certain traces or artifacts in the speech signal, such as slight imperfections in prosody, cadence, or frequency patterns. These residual artifacts provide a potential avenue for detecting synthetic speech, as they can be identified by sophisticated detection models that analyze discrepancies between synthetic and authentic human speech.
To accurately detect synthetic speech, researchers have made significant efforts to uncover effective hidden artifacts that distinguish machine-generated speech from human speech. Depending on the source of the artifacts, there are two primary approaches for synthetic speech detection: frequency domain information (FDI)-based approaches [7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23] and time domain information (TDI)-based approaches [24,25,26,27,28]. The FDI-based approaches primarily focus on extracting acoustic features in the frequency domain (such as linear frequency cepstral coefficients (LFCCs) [12], constant-Q cepstral coefficients (CQCCs) [13], or the log power spectrum of the short-time Fourier transform (STFT) [14]), which are then used as input to detection models. These methods have achieved initial success in ASVspoof challenges [29,30], demonstrating their effectiveness in identifying synthetic speech. However, these approaches often neglect other critical information, such as magnitude information [10] and phase information [31], which are essential for building robust deep neural network-based detection models [26]. By overlooking such features, these models may fail to capture subtle but significant artifacts that help in distinguishing synthetic from human speech. In contrast, the TDI-based approaches directly capture time-domain artifacts from the raw speech waveform, enabling end-to-end synthetic speech detection [26]. These methods are capable of analyzing the entire speech signal in the time domain, which can be particularly useful for detecting low-level anomalies. However, time-domain features may struggle to effectively capture changes in the low-frequency region, an area identified as crucial for uncovering important artifacts in synthetic speech [32]. This limitation highlights the challenge in relying solely on time-domain analysis to fully characterize synthetic speech traces. Both FDI-based and TDI-based approaches use features from a single domain, which may limit their ability to comprehensively capture the subtle and multi-faceted nature of artifacts present in synthetic speech. Given this, a hybrid approach that jointly models both the time and frequency domains may offer a significant advantage. By combining information from both domains, such an approach could uncover more robust and effective artifacts, potentially improving the accuracy and reliability of synthetic speech detection systems. This multi-domain modeling holds promise for developing more sensitive and adaptive detection models capable of identifying synthetic speech with greater precision.
In view of this, we introduce a novel framework termed Dual-Domain Fusion Network (DDFNet), specifically designed for synthetic speech detection. This network consists of four key sub-networks: a time-domain feature extraction network (TDFEN), a frequency-domain feature extraction network (FDFEN), a feature fusion network (FFN), and a classification network. Each component plays a crucial role in improving the performance of synthetic speech detection by leveraging both time and frequency domain features. The TDFEN is responsible for obtaining highly distinguishable time-domain detection features in an end-to-end manner, directly processing the raw speech waveform. By focusing on time-domain characteristics, it captures temporal artifacts that are critical for distinguishing synthetic speech. The FDFEN, on the other hand, takes frequency-domain acoustic features as input, such as Mel-spectrograms or Mel-frequency cepstral coefficients, and through training, it extracts highly expressive frequency-domain detection features that highlight the subtle discrepancies between real and synthetic speech. The FFN plays a pivotal role by organically fusing the two types of features—time-domain and frequency-domain—into a dual-domain joint representation. This fusion results in more distinguishable and robust features that provide richer information for the final detection task. Finally, the classification network uses the joint features generated by the FFN to perform the ultimate detection task, classifying the input speech as either real or synthetic. To ensure that each sub-network performs optimally, we introduce multi-task learning, which guides the training process for each network. This strategy ensures that all sub-networks learn complementary features in a coherent and unified manner, leading to improved performance across all stages of the detection pipeline. We comprehensively evaluate the performance of DDFNet by comparing it with previous related work using the ASVspoof 2019 [33] and ASVspoof 2021 [34] datasets. The experimental results demonstrate that DDFNet achieves strong performance, maintaining competitiveness even when facing challenges such as channel changes and compression coding, showcasing its robust generalization ability.

2. Related Work

The detection methods for synthetic speech can be broadly classified into two main categories: those based on FDI [7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23] and those based on TDI [24,25,26,27,28]. Each approach leverages different aspects of acoustic signals to identify the artifacts introduced during speech synthesis. This section provides a detailed review of these two categories and highlights how our work advances beyond the existing methods.
FDI-based methods [7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23] rely on time-frequency transformations to extract key acoustic features from speech signals, such as fundamental frequency, formants, and energy. These transformations are typically performed using techniques like the STFT [14] and Constant-Q Transform (CQT) [8,15,26]. The resulting features, such as LFCCs [12], CQCCs [13], and logarithmic power spectra [14], capture essential information regarding the spectral structure of the signal. While these features are effective at identifying local patterns in both time and frequency domains through frame-based segmentation, they have notable limitations. Specifically, these methods often fail to account for long-term temporal dependencies, which can lead to incomplete or distorted representations of the signal’s dynamic characteristics. Additionally, operations involved in feature extraction—such as squaring, dimensionality reduction, or pooling—may result in the loss of valuable information. Despite these challenges, frequency-domain features remain foundational in synthetic speech detection, primarily focusing on identifying artifacts in the spectral domain that arise due to the synthesis process.
In contrast, TDI-based methods [24,25,26,27,28] leverage the raw waveform of speech for detection. With the advent of deep learning, particularly convolutional neural networks (CNNs), significant progress has been made in this area. Early work by Muckenhirn et al. [27] employed neural networks to directly learn distinguishing features from raw speech waveforms, bypassing the need for manual feature engineering. This allowed the network to autonomously identify patterns indicative of synthetic artifacts. Building on these ideas, Hua et al. [26] introduced more sophisticated architectures, such as the Inc-TSSDNet and Resc-TSSDNet models, incorporating skip connections and parallel convolutions to improve generalization. However, the Inc-TSSDNet model was found to struggle with cross-dataset generalization, prompting Wang et al. [28] to enhance the model with attention mechanisms, specifically channel and 1-D spatial attention, to focus on the most relevant regions of the signal. Despite these improvements, the method still relies heavily on large volumes of labeled data for effective training, highlighting the need for more robust and data-efficient solutions. Further advancements were made by Jung et al. [35], who introduced sinc-convolution layers, transforming the raw waveform into single-channel two-dimensional images. This allowed the model to simultaneously capture both spectral and temporal information, providing a more holistic view of the synthetic speech signal. However, this approach, while innovative, still primarily focuses on either time- or frequency-domain features and does not fully leverage the complementary nature of both domains.
As previously discussed, frequency-domain methods excel at capturing synthetic traces embedded in the spectral characteristics of speech, whereas time-domain methods are more effective at detecting temporal artifacts in the waveform. However, both approaches have their limitations: FDI-based methods struggle with capturing subtle temporal discrepancies, while TDI-based methods may miss frequency-related anomalies that signal synthetic speech. Current research predominantly focuses on one domain or the other, often overlooking the potential benefits of combining both domains. By relying exclusively on either time- or frequency-domain features, existing methods may fail to capture critical information that could be identified by jointly analyzing both. This gap in the literature underscores the need for more comprehensive approaches that integrate time-domain and frequency-domain features, thus enabling a fuller understanding of the synthetic speech signal. To address the limitations of single-domain approaches, we propose a novel framework, the DDFNet, which integrates both time-domain and frequency-domain information. By leveraging the complementary strengths of both domains, DDFNet offers a more robust and holistic approach to synthetic speech detection. Our framework aims to capture both the fine-grained temporal variations and the spectral inconsistencies that characterize synthetic speech, thereby improving detection accuracy and generalization across diverse datasets.

3. The Proposed Method

Our proposed DDFNet aims to achieve synthetic speech detection through joint modeling of both time-domain and frequency-domain features. As illustrated in Figure 1, the process begins with the TDFEN (Time-Domain Feature Extraction Network) and FDFEN (Frequency-Domain Feature Extraction Network), which extract domain-specific features from their corresponding inputs, i.e., time-domain speech waveforms and frequency-domain acoustic features, respectively. The next step involves the FFN (Feature Fusion Network), which combines these extracted domain features into a unified joint representation. This fusion process allows the model to leverage complementary information from both domains, enhancing its ability to distinguish between real and synthetic speech. Finally, the resulting joint features are passed to the classification network for the ultimate task of synthetic speech detection. In the following sections, we will describe the key components of DDFNet in more detail, explaining how each sub-network contributes to improving the overall performance of the system.

3.1. Time-Domain Feature Extraction Network

The TDFEN is designed to extract distinguishable time-domain features (denoted as $v_T$) from the speech waveform (denoted as $x_T$). To accomplish this, researchers have developed several effective schemes for time-domain feature extraction [24,25,26], with RW-ResNet [25] and Res-TSSDNet [26] being among the most typical and widely used approaches. Both methods share the key characteristic of employing residual networks to learn feature representations from the time-domain speech signal. However, there are important differences between these two approaches. Specifically, Res-TSSDNet introduces a larger receptive field in its residual structure, making it more generalizable and efficient in terms of the number of parameters. This allows the model to learn more comprehensive temporal patterns while reducing the risk of overfitting. As a result, we opt for Res-TSSDNet as the time-domain feature extractor in our framework, given its superior ability to capture temporal dependencies in speech. The structure of Res-TSSDNet used in this work is illustrated in Figure 2. In contrast to the original design of Res-TSSDNet, we make a modification by eliminating the projection shortcut in the last residual module. This modification helps reduce the number of training parameters, thereby mitigating the potential risk of overfitting and improving the model’s ability to generalize to unseen data.
Specifically, the speech waveform is first passed through a 1-dimensional convolutional layer, followed by a max-pooling layer. These initial operations serve to extract local features and downsample the signal, capturing essential temporal patterns while reducing the data dimensionality. The resulting feature maps are then processed by four ResNet-like blocks, each consisting of three convolutional layers with Batch Normalization and ReLU activations. These blocks are designed to progressively refine the feature representations, learning increasingly abstract patterns in the speech signal. Between two adjacent blocks, a max-pooling layer is employed to further reduce the spatial dimensionality of the feature maps, helping to increase the network’s invariance to small variations in the input. Finally, the feature maps are transformed into distinguishable features $v_T \in \mathbb{R}^{d_T}$ through two fully connected (FC) layers. Here, $d_T$ denotes the dimension of the extracted features. This structure enables the model to efficiently capture temporal dependencies while ensuring that the extracted features are compact and discriminative for the task of synthetic speech detection.
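To make this data flow concrete, the following PyTorch sketch mirrors the structure just described: an initial 1-D convolution with max-pooling, four ResNet-like blocks of three convolutional layers with Batch Normalization and ReLU, max-pooling between adjacent blocks, and two FC layers producing $v_T$. It is a simplified illustration rather than the exact Res-TSSDNet configuration; the channel width, kernel sizes, pooling factors, and the global average pooling before the FC layers are our assumptions.

```python
import torch
import torch.nn as nn

class ResBlock1D(nn.Module):
    """Three 1-D convolutions with BatchNorm/ReLU and an identity shortcut
    (the constant channel width is an illustrative simplification)."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        self.body = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size, padding=pad),
            nn.BatchNorm1d(channels), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size, padding=pad),
            nn.BatchNorm1d(channels), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size, padding=pad),
            nn.BatchNorm1d(channels),
        )
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(x + self.body(x))          # identity shortcut, no projection

class TDFEN(nn.Module):
    """Raw waveform -> conv/max-pool -> 4 residual blocks (pooled in between)
    -> global pooling -> two FC layers -> v_T."""
    def __init__(self, channels: int = 16, d_t: int = 128):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=7, padding=3),
            nn.BatchNorm1d(channels), nn.ReLU(), nn.MaxPool1d(4),
        )
        self.blocks = nn.ModuleList([ResBlock1D(channels) for _ in range(4)])
        self.pool = nn.MaxPool1d(4)                # applied between adjacent blocks
        self.fc = nn.Sequential(nn.Linear(channels, d_t), nn.ReLU(),
                                nn.Linear(d_t, d_t))

    def forward(self, x_t: torch.Tensor) -> torch.Tensor:   # x_t: (batch, 1, samples)
        h = self.stem(x_t)
        for i, block in enumerate(self.blocks):
            h = block(h)
            if i < len(self.blocks) - 1:
                h = self.pool(h)
            h = h if i < len(self.blocks) - 1 else h
        h = h.mean(dim=-1)                         # global average pooling over time
        return self.fc(h)                          # v_T with dimension d_T
```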

3.2. Frequency-Domain Feature Extraction Network

The frequency domain is another crucial aspect of speech that can provide complementary clues for synthetic speech detection. Given its potential, we utilize the FDFEN to extract features from the frequency domain. Specifically, the FDFEN extracts distinguishable frequency-domain features (denoted as $v_F$) from the frequency-domain acoustic features (denoted as $x_F$) of the speech signal. To achieve this, we conducted a thorough evaluation of several related works [10,14,32]. These studies typically follow a similar approach: they extract an acoustic feature from the frequency domain and then pass it through a classification model for detection. After careful consideration of various methods, we selected SENet as our backbone network for this task. SENet is known for its superior performance in extracting discriminative features and its efficient parameter utilization, making it a suitable choice for our network. To mitigate the potential risk of overfitting, we made adjustments to the number of ResNet-like blocks, as shown in Figure 3. This design modification helps ensure that the model maintains a balance between complexity and generalization.
The process for extracting frequency-domain features begins with a 2-dimensional convolutional layer that takes the acoustic feature $x_F$ as input and outputs feature maps. These maps are then passed through a max-pooling operation to reduce their dimensionality, which helps to retain the most important information while discarding less relevant details. To further enhance the feature extraction process, we employ ResNet-like blocks with channel attention mechanisms, allowing the model to focus on the most informative parts of the frequency-domain features. Ultimately, these operations result in the extraction of distinguishable frequency-domain features $v_F \in \mathbb{R}^{d_F}$, where $d_F$ denotes the dimensionality of the features. This comprehensive extraction process ensures that the frequency-domain features provide rich, informative representations for detecting synthetic speech.
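A minimal sketch of this pipeline is shown below, using squeeze-and-excitation (SE) channel attention inside ResNet-like 2-D blocks. The number of blocks, channel width, reduction ratio, and the global average pooling before the output layer are illustrative assumptions rather than the exact configuration of Figure 3.

```python
import torch
import torch.nn as nn

class SEBlock2D(nn.Module):
    """Squeeze-and-excitation channel attention over 2-D feature maps."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: (batch, C, F, T)
        w = self.fc(x.mean(dim=(2, 3)))                       # squeeze: global pooling
        return x * w.unsqueeze(-1).unsqueeze(-1)              # excite: re-weight channels

class SEResBlock2D(nn.Module):
    """ResNet-like 2-D block followed by channel attention."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.se = SEBlock2D(channels)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(x + self.se(self.body(x)))

class FDFEN(nn.Module):
    """Frequency-domain input -> 2-D conv + max-pool -> SE residual blocks -> v_F."""
    def __init__(self, channels: int = 16, num_blocks: int = 3, d_f: int = 128):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(1, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.blocks = nn.Sequential(*[SEResBlock2D(channels) for _ in range(num_blocks)])
        self.fc = nn.Linear(channels, d_f)

    def forward(self, x_f: torch.Tensor) -> torch.Tensor:    # x_f: (batch, 1, freq, frames)
        h = self.blocks(self.stem(x_f))
        return self.fc(h.mean(dim=(2, 3)))                    # v_F with dimension d_F
```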

3.3. Feature Fusion Network

The FFN is designed to produce highly distinguishable dual-domain joint features by effectively integrating multiple single-domain features. This integration is achieved through a process of mapping single-domain features from their respective representation spaces into a shared representation space using a linear projection. To enhance the non-linearity and expressiveness of the resulting joint representations, we incorporate the ReLU activation function into the fusion process.
Specifically, the fusion begins with a concatenation operation to combine the single-domain features into a unified representation. This aggregated feature vector is then passed through a linear projection layer, which transforms it into a shared representation space. To further refine the joint features, the output is processed by the ReLU activation function, introducing non-linear capabilities that enrich the representation and make it more effective for detection tasks. The resulting joint features, denoted as $v_D \in \mathbb{R}^{d_D}$, have a dimensionality of $d_D$. This process ensures that the FFN captures the complementary information from both the time and frequency domains, enabling a more robust detection capability. The procedure for obtaining the joint features can be formally described as:
$$ v_D = \delta\left( (v_T \oplus v_F)\, W_D \right) $$
where $\delta$ denotes the ReLU activation function, $W_D$ denotes the weights of the linear projection layer, and $\oplus$ signifies the concatenation operation that aggregates single-domain features into a unified representation.
Following this process, the resulting joint features $v_D$ are further refined through a fully-connected layer, which serves as the final step in the pipeline to accomplish synthetic speech detection. This layer leverages the distinguishable characteristics of the dual-domain joint features to make an accurate classification, distinguishing between genuine and synthetic speech signals effectively. This combination of linear projection, non-linear activation, and fully-connected classification ensures that the FFN maximizes the utility of both time-domain and frequency-domain information for robust detection.
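As a sketch, the module below implements the fusion rule above together with the fully-connected classification layer, assuming a bias-free projection for $W_D$ and a single-logit output head; both choices are our simplifications.

```python
import torch
import torch.nn as nn

class FFNWithClassifier(nn.Module):
    """Fusion v_D = ReLU((v_T ⊕ v_F) W_D), followed by the fully-connected
    classification layer (here a single-logit head)."""
    def __init__(self, d_t: int, d_f: int, d_d: int):
        super().__init__()
        self.project = nn.Linear(d_t + d_f, d_d, bias=False)   # W_D
        self.relu = nn.ReLU()                                   # δ
        self.classify = nn.Linear(d_d, 1)                       # P_C, outputs a logit

    def forward(self, v_t: torch.Tensor, v_f: torch.Tensor):
        v_d = self.relu(self.project(torch.cat([v_t, v_f], dim=-1)))  # joint feature v_D
        return v_d, self.classify(v_d)             # joint feature and detection logit
```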

3.4. Multi-Task Learning

In practice, training a dual-domain fusion network with a single loss function may encounter suboptimal performance, particularly in scenarios involving network degradation. This limitation arises because a single loss often fails to effectively guide the training of multiple single-domain networks, leading to reduced specialization in their feature extraction capabilities. Specifically, when relying on a single loss, both the TDFEN and FDFEN may inadvertently focus on extracting common features, which undermines their ability to capture rich, domain-specific information essential for robust detection. Furthermore, the FFN plays a critical role in synthesizing comprehensive features from the time and frequency domains. To achieve this, it is crucial for the FFN to leverage highly distinguishable single-domain features rather than diluted, overlapping representations. This necessitates a more nuanced approach to network optimization that encourages diversity and richness in the features extracted by the single-domain networks while ensuring their effective integration in the joint domain.
To address these challenges, we adopt a multi-task learning strategy to optimize the network. Specifically, we introduce three distinct loss functions: (1) Time-domain loss to incentivize the TDFEN to learn richer, domain-specific features from the time domain; (2) Frequency-domain loss to drive the FDFEN to extract detailed, complementary features from the frequency domain; (3) Classification loss, which guides the FFN to effectively integrate features from both domains and produce comprehensive representations for synthetic speech detection.
This multi-task learning framework ensures that the network components are optimized to specialize and collaborate, enabling the extraction of rich, domain-specific features and their seamless fusion for enhanced detection performance. Below, we provide a detailed description of each loss function employed in this strategy.

3.4.1. Time-Domain Loss

The time-domain loss is designed to enhance the ability of the TDFEN to extract rich and distinguishable features directly from the speech waveform. By minimizing this loss, the network is encouraged to capture subtle time-domain artifacts that are crucial for synthetic speech detection. To calculate the time-domain loss $L_T$, we employ a linear projection $P_T$ that maps the time-domain feature representation $v_T$ to a probability value representing the likelihood of the speech being genuine. This approach allows the network to evaluate the quality of time-domain features in distinguishing between real and synthetic speech. Mathematically, the time-domain loss can be formulated as:
$$ L_T = -\,\mathbb{E}_{(v_T, y) \sim (V_T, Y)} \left[ y \cdot \log\left(P_T(v_T)\right) + (1-y) \cdot \log\left(1 - P_T(v_T)\right) \right] $$
where $V_T$ denotes the set of time-domain features extracted by the TDFEN, and $Y$ denotes the corresponding set of ground-truth labels, with $y = 1$ indicating real speech and $y = 0$ indicating synthetic speech.

3.4.2. Frequency-Domain Loss

The frequency-domain loss serves to optimize the FDFEN, ensuring it extracts complementary and meaningful features from the frequency-domain acoustic representations. Similar to the time-domain loss, the frequency-domain loss $L_F$ also follows a classification-based objective, defined as:
$$ L_F = -\,\mathbb{E}_{(v_F, y) \sim (V_F, Y)} \left[ y \cdot \log\left(P_F(v_F)\right) + (1-y) \cdot \log\left(1 - P_F(v_F)\right) \right] $$
where $V_F$ denotes the set of frequency-domain features extracted by the FDFEN, $P_F$ is the linear projection that maps the frequency-domain features $v_F$ to the probability of the speech being real, and $Y$ denotes the corresponding set of ground-truth labels.

3.4.3. Classification Loss

The classification loss $L_C$ plays a central role in the FFN, guiding the network to effectively integrate features from both the time and frequency domains. This loss ensures that the joint features extracted from both domains are highly discriminative and optimized for accurate synthetic speech detection. Let $P_C$ denote the fully-connected layer in the classification network, which processes the joint features. The classification loss $L_C$ can be computed as follows:
$$ L_C = -\,\mathbb{E}_{(v_D, y) \sim (V_D, Y)} \left[ y \cdot \log\left(P_C(v_D)\right) + (1-y) \cdot \log\left(1 - P_C(v_D)\right) \right] $$
where $V_D$ denotes the set of joint features obtained after the fusion of time-domain and frequency-domain features, $P_C$ is the fully-connected layer that maps the joint features $v_D$ to a probability value indicating whether the speech is real or synthetic, and $Y$ denotes the corresponding set of ground-truth labels.

3.4.4. Total Loss

The total loss function combines these three individual losses to ensure a balanced optimization of each network component, ultimately driving the entire system toward more accurate synthetic speech detection. For the objective function of our DDFNet, the total loss L is defined as:
$$ L = L_C + \lambda_1 L_T + \lambda_2 L_F $$
where $\lambda_1$ and $\lambda_2$ are weighting parameters for the TDFEN and FDFEN, respectively. These parameters control the relative contribution of the time-domain and frequency-domain losses to the total loss. In our implementation, the default values for these parameters are set to $\lambda_1 = \lambda_2 = 1$, ensuring equal emphasis on both domains.
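For illustration, the sketch below assembles the total loss above from three binary cross-entropy terms, assuming that the auxiliary projections $P_T$ and $P_F$ and the classifier $P_C$ all output logits; the function and argument names are ours, not the authors'.

```python
import torch.nn.functional as F

def ddfnet_total_loss(v_t, v_f, c_logit, p_t, p_f, y,
                      lam1: float = 1.0, lam2: float = 1.0):
    """Total loss L = L_C + λ1·L_T + λ2·L_F, with each term a binary cross-entropy.
    `p_t` and `p_f` are the auxiliary linear projections P_T and P_F; `c_logit`
    is the classifier output for the joint feature v_D; `y` holds float labels
    (1 = real speech, 0 = synthetic speech)."""
    loss_c = F.binary_cross_entropy_with_logits(c_logit.squeeze(-1), y)
    loss_t = F.binary_cross_entropy_with_logits(p_t(v_t).squeeze(-1), y)
    loss_f = F.binary_cross_entropy_with_logits(p_f(v_f).squeeze(-1), y)
    return loss_c + lam1 * loss_t + lam2 * loss_f
```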
By incorporating these loss functions into a multi-task learning framework, we ensure that each component of the network is optimized to its full potential. This approach not only improves the specialization of each individual network (time-domain and frequency-domain feature extractors) but also fosters seamless collaboration between them. The result is a more robust and effective detection system for synthetic speech, capable of leveraging both time-domain and frequency-domain features to achieve higher performance.

4. Performance Evaluation

4.1. Experiment Settings

4.1.1. Dataset

The rapid evolution of synthetic speech generation demands detection frameworks capable of both generalizing across new attack types and maintaining robustness against real-world distortions. The primary goal of this study is to enhance generalization capability, which remains a key challenge in spoof detection. To achieve this, we focus on the ASVspoof 2019 LA corpus [33] as our main evaluation platform. This dataset features 19 distinct spoofing algorithms (denoted by A01–A19), with mutually exclusive attack types between the training/development (A01–A06) and evaluation (A07–A19) subsets, providing an authoritative benchmark for assessing cross-algorithm generalization, particularly for text-to-speech (TTS) and voice conversion (VC) attacks. By concentrating on this dataset, we aim to evaluate and improve the model’s ability to generalize across different spoofing techniques, which is the central focus of our work.
To complement this, and to establish a foundation for future work on robustness, we include supplementary experiments on the ASVspoof 2021 corpus [34], specifically focusing on the logical access (LA) and deepfake (DF) tracks. These tracks introduce real-world challenges such as codec compression and channel distortions. Although the primary focus of this paper is on generalization, we explore these challenges to build a foundational understanding of how DDFNet performs under real-world conditions. By testing DDFNet’s resilience to these distortions, we aim to lay the groundwork for future robustness research, providing essential insights into how our model can be adapted for handling more complex, real-world interference. These exploratory experiments in ASVspoof 2021 serve as a preliminary step, setting the stage for investigating robustness and transmission-induced degradations in subsequent studies.
While newer datasets such as the ASVspoof 5 [36] and the Multilingual Audio Anti-Spoofing Dataset (MLAAD) [37] offer expanded evaluation scenarios, including adversarial attacks and cross-lingual challenges, their inclusion in the current study would introduce additional complexities that extend beyond the core objective of enhancing generalization. The ASVspoof 5’s incorporation of neural codec-based environmental variations and the MLAAD’s focus on 23-language diversity, though valuable for future research, would introduce variables that might obscure the investigation into fundamental generalization mechanisms. Therefore, we deliberately exclude these datasets from the current analysis to maintain methodological clarity and focus. We plan to explore these datasets in future work, particularly as we address more multi-dimensional detection challenges, including adversarial robustness and cross-lingual adaptation.
This dataset selection strategy ensures that our current study provides valuable insights into improving the generalization ability against evolving synthetic speech threats, while also laying a solid foundation for future research on robustness and cross-lingual adaptability.

4.1.2. Metrics

All approaches were evaluated using two key metrics: equal error rate (EER) and minimum tandem detection cost function (min t-DCF) [38]. The EER is a widely used metric that reflects the overall performance of a detection approach by measuring the point at which the false acceptance rate (FAR) equals the false rejection rate (FRR). A lower EER indicates better overall detection accuracy. Let $s$ represent the detection score output by a spoof detection model, and let $\theta$ denote a decision threshold. At this threshold, the false acceptance rate (FAR, denoted by $P_{fa}(\theta)$) and false rejection rate (FRR, denoted by $P_{fr}(\theta)$) can be calculated as follows.
$$ P_{fa}(\theta) = \frac{N_f(s > \theta)}{N_f}, \qquad P_{fr}(\theta) = \frac{N_r(s \le \theta)}{N_r} $$
where $N_f$ and $N_r$ represent the total number of spoofed and genuine speech samples, respectively. $N_f(s > \theta)$ denotes the number of spoofed speech samples whose detection score $s$ exceeds the threshold $\theta$. Similarly, $N_r(s \le \theta)$ represents the number of genuine speech samples whose detection score $s$ is less than or equal to the threshold $\theta$. Additionally, $P_{fr}(\theta)$ is a monotonically increasing function with respect to the threshold $\theta$, while $P_{fa}(\theta)$ is monotonically decreasing. The EER is achieved when these two error rates intersect:
$$ \mathrm{EER} = P_{fa}(\theta^*) = P_{fr}(\theta^*) $$
where $\theta^*$ is the threshold satisfying $P_{fa}(\theta^*) = P_{fr}(\theta^*)$.
On the other hand, the min t-DCF is particularly useful in assessing the impact of a detection approach on automatic speaker verification (ASV) systems. This metric quantifies the cost of false positive and false negative errors in a manner that accounts for the operational requirements of ASV, such as the trade-off between system accuracy and security. A lower min t-DCF value indicates a more effective detection approach with less negative impact on the ASV system’s performance. The formula for calculating the min t-DCF is as follows.
$$ \text{t-DCF}(\theta) = C_{fa} \times (1 - P_{ta}) \times P_{fa}(\theta) + C_{fr} \times P_{ta} \times P_{fr}(\theta) $$
where $C_{fa}$ and $C_{fr}$ represent the weights for FAR and FRR, respectively, while $P_{ta}$ and $1 - P_{ta}$ represent the prior probabilities of the genuine speaker and the impersonating attacker, respectively.
In summary, a lower EER signifies better overall detection performance, while a lower min t-DCF indicates less disruption to ASV systems, reflecting more robust synthetic speech detection.
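As a reference implementation, the sketch below estimates the EER with a simple threshold sweep and evaluates the simplified t-DCF expression above; the cost weights and target prior are illustrative placeholders rather than the official ASVspoof t-DCF parameters.

```python
import numpy as np

def compute_eer(genuine_scores: np.ndarray, spoof_scores: np.ndarray) -> float:
    """EER: sweep thresholds until the FAR (spoof accepted as genuine) meets the
    FRR (genuine rejected). Higher scores are assumed to mean 'more likely genuine'."""
    thresholds = np.sort(np.concatenate([genuine_scores, spoof_scores]))
    far = np.array([(spoof_scores > t).mean() for t in thresholds])     # P_fa(θ)
    frr = np.array([(genuine_scores <= t).mean() for t in thresholds])  # P_fr(θ)
    idx = int(np.argmin(np.abs(far - frr)))                             # crossing point θ*
    return float((far[idx] + frr[idx]) / 2)

def simplified_t_dcf(p_fa: float, p_fr: float,
                     c_fa: float = 1.0, c_fr: float = 1.0, p_ta: float = 0.95) -> float:
    """Simplified t-DCF of the expression above; costs and prior are placeholders."""
    return c_fa * (1 - p_ta) * p_fa + c_fr * p_ta * p_fr
```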

4.1.3. Data Preprocessing

In this paper, we employ both the speech waveform and the log power spectrum (LPS) derived from the STFT as our time-domain feature and frequency-domain acoustic feature, respectively. The processing of the speech waveform follows a similar approach to that employed in Res-TSSDNet [26], with the speech duration set to 6 s.
For the LPS extraction, the speech waveform is first passed through the short-time Fourier transform using a Blackman window. Subsequently, the absolute value, square, and logarithmic operations are applied to obtain the LPS. The number of FFT bins is set to 1728, with a window size of 1728 and a hop length of 130. In line with previous work [32], we focus primarily on the low-frequency part of the LPS, as it is often where the most significant artifacts indicative of synthetic speech are concentrated.
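A possible implementation of this LPS extraction, assuming PyTorch's STFT, a small numerical epsilon before the logarithm, and an illustrative number of retained low-frequency bins, is sketched below.

```python
import torch

def log_power_spectrum(waveform: torch.Tensor, n_fft: int = 1728,
                       hop_length: int = 130, low_bins: int = 500) -> torch.Tensor:
    """LPS as described above: Blackman-windowed STFT, magnitude squared, then log.
    `low_bins` (how much of the low-frequency band is kept) is an assumed value."""
    window = torch.blackman_window(n_fft)
    spec = torch.stft(waveform, n_fft=n_fft, hop_length=hop_length,
                      win_length=n_fft, window=window, return_complex=True)
    lps = torch.log(spec.abs() ** 2 + 1e-9)     # epsilon avoids log(0) on silent frames
    return lps[..., :low_bins, :]               # keep only the low-frequency part
```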

4.1.4. Implementation Details

We employ the Adam optimizer with betas = (0.9, 0.999) for training our model. The initial learning rate is set to 0.001, and we use a cosine annealing strategy to gradually decrease the learning rate during training. The minimum learning rate is set to $5 \times 10^{-5}$, and the maximum number of iterations corresponds to the total number of epochs. Our model was trained for a total of 80 epochs, with a batch size of 128. This setup ensures efficient training while preventing overfitting, enabling the model to learn effectively from both the time-domain and frequency-domain features over multiple iterations.
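A minimal sketch of this training configuration is given below; `model`, `train_loader`, and `compute_losses` are placeholders for the full network, the data pipeline, and the total multi-task loss described in Section 3.4.

```python
import torch

def train_ddfnet(model, train_loader, compute_losses, epochs: int = 80):
    """Adam with betas (0.9, 0.999), initial LR 0.001, cosine-annealed to 5e-5."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=epochs, eta_min=5e-5)
    for _ in range(epochs):
        for batch in train_loader:               # batch size 128 in the paper
            optimizer.zero_grad()
            loss = compute_losses(model, batch)  # total multi-task loss
            loss.backward()
            optimizer.step()
        scheduler.step()                         # one scheduler step per epoch
```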

4.2. Ablation Study

To evaluate the effectiveness of dual-domain fusion and multi-task learning, we conducted experiments on the ASVspoof 2019 LA evaluation set using our proposed DDFNet. For comparison, we also evaluated the performance of single-domain feature extraction networks with classification heads and DDFNet without time-domain and frequency-domain losses. The results of these experiments are presented in Table 1. (In the tables, an upward arrow after a metric indicates that larger values are better, while a downward arrow indicates that smaller values are better; this convention applies to all subsequent tables.)
On one hand, the DDFNet consistently outperforms the TDFEN and FDFEN networks, each with a classification head. Specifically, DDFNet achieves an EER of 0.69% and min t-DCF of 0.0203 on the evaluation set. For comparison:
  • The DDFNet outperforms the TDFEN by 84% in terms of EER.
  • The DDFNet shows a 73% improvement over the FDFEN in terms of EER.
  • DDFNet also surpasses the average performance by 31.68% in terms of EER.
These results clearly highlight that integrating both time-domain and frequency-domain features through dual-domain fusion significantly enhances the generalization ability of the model. Moreover, the dual-domain fusion method effectively exploits the complementarity between different domains, further boosting the overall performance of the detection system. These findings suggest that dual-domain fusion is a more robust approach for multi-information fusion in synthetic speech detection, making it a superior alternative to single-domain systems.
On the other hand, the experimental results also underscore the effectiveness of multi-task learning in improving the performance of DDFNet. When comparing DDFNet with and without auxiliary losses (i.e., the time-domain and frequency-domain losses), we observe a substantial performance improvement. Specifically, DDFNet with multi-task learning achieves a 72% reduction in EER and a 69% reduction in min t-DCF compared to DDFNet without auxiliary losses. These results strongly suggest that multi-task learning helps regularize the training process, guiding the network to learn richer, more discriminative feature representations from both domains. This leads to better generalization and improved detection accuracy.
Overall, the experiments clearly demonstrate the advantages of dual-domain fusion and multi-task learning. Dual-domain fusion enables the network to leverage the complementary information from both time and frequency domains, while multi-task learning optimizes each network component and enhances the learning of discriminative features. The significant improvements in both EER and min t-DCF demonstrate the potential of these strategies in advancing synthetic speech detection.

4.3. Performance Comparison with Previous Methods

To evaluate the effectiveness of our proposed DDFNet, we conducted a performance comparison with several state-of-the-art approaches on the ASVspoof 2019 LA evaluation set. This comparison aims to assess how DDFNet performs, in terms of both EER and min t-DCF—two critical metrics for synthetic speech detection—against leading FDI-based and TDI-based methods.
As shown in Table 2, the best FDI-based method, the Dual-Branch Network, achieves an EER of 0.80% and a min t-DCF of 0.0214. The best TDI-based method, Res-TSSDNet, achieves an EER of 1.64% and a min t-DCF of 0.0482. These results set solid baselines for comparison, highlighting the performance of the top approaches in the field.
In contrast, DDFNet outperforms both methods, achieving an EER of 0.69% and a min t-DCF of 0.0203 on the evaluation set. Compared to the Dual-Branch Network, DDFNet improves the EER by 13.75% and reduces the min t-DCF by 0.0011. More notably, DDFNet achieves a 57.93% improvement in EER and a 0.0279 reduction in min t-DCF compared to Res-TSSDNet. These significant performance improvements underline the potential of dual-domain fusion—integrating both time-domain and frequency-domain features—to enhance the generalization ability of synthetic speech detection systems. The 13.75% improvement in EER over the best FDI-based method and the 57.93% improvement over the best TDI-based method demonstrate that DDFNet more effectively leverages the complementary strengths of both domains. Additionally, the reduction in min t-DCF further supports these findings, with DDFNet surpassing all previous methods, particularly Res-TSSDNet. The lower min t-DCF indicates that DDFNet not only improves detection accuracy but also minimizes the impact on downstream tasks, such as automatic speaker verification (ASV), making it a more robust solution for real-world applications.
In conclusion, the performance comparison highlights DDFNet’s superior capabilities over both FDI-based and TDI-based methods. The results demonstrate that dual-domain fusion, combined with multi-task learning, significantly enhances synthetic speech detection. By effectively integrating time-domain and frequency-domain features, DDFNet achieves notable improvements in both EER and min t-DCF. These findings confirm the effectiveness of DDFNet and suggest that dual-domain fusion is a promising avenue for advancing synthetic speech detection technologies.
With the continuous progress in synthetic speech detection, the robustness of methods has gained increasing attention. Therefore, we further conduct experiments on the logical access (LA) and deepfake (DF) tracks of ASVspoof 2021 to evaluate the ability of DDFNet to withstand interference from channel changes and compression coding. These experiments provide foundational insights for future work focused on robustness. To assess the impact of channel changes and compression coding on algorithm performance, we compare DDFNet with LFCC+OCT [23], TSSDNet [26], and AASIST [35].
As shown in Table 3, all methods exhibit a noticeable performance degradation when compared to the ASVspoof 2019 LA track, underscoring the adverse impact of channel changes and compression coding on the generalization ability of detection methods to previously unseen speech synthesis algorithms. Specifically, on the LA 2021 track, AASIST performs the best, followed by DDFNet. On the DF 2021 track, LFCC+OCT leads, with DDFNet outperforming TSSDNet, although it still lags behind LFCC+OCT.
Although DDFNet’s performance is slightly lower compared to the ASVspoof 2019 LA track, this decline is primarily attributed to the added complexities of channel changes and compression coding in the 2021 dataset, which challenge the generalization capabilities of all methods. Despite this, the results still demonstrate the competitive advantages of DDFNet in maintaining relatively strong performance even under these challenging conditions, setting the stage for further improvements in future robustness and generalization studies.

5. Conclusions

In this paper, we propose DDFNet, a novel approach for synthetic speech detection that integrates both time-domain and frequency-domain features during training. This dual-domain fusion enhances generalization by enabling the model to adapt to unseen synthetic speech algorithms. Additionally, we incorporate multi-task learning, which enriches feature representation and improves detection accuracy. Our experimental results on the ASVspoof 2019 LA evaluation set demonstrate the effectiveness of our method. DDFNet outperforms state-of-the-art approaches, achieving the best performance with an EER of 0.69% and min t-DCF of 0.0203, surpassing the best FDI-based and TDI-based methods by significant margins (13.75% and 57.93%, respectively). Moreover, additional experiments on the ASVspoof 2021 tracks reveal that while DDFNet’s performance slightly declines due to channel changes and compression coding, it still maintains competitive results, outperforming certain baseline methods. This indicates that DDFNet, while excelling in generalization, also holds strong potential for future advancements in robustness under real-world distortions. These findings highlight the dual-domain fusion and multi-task learning as powerful strategies for improving both generalization and robustness in synthetic speech detection. Our approach not only boosts performance but also offers strong adaptability, making it well-suited for handling future developments in speech synthesis. This work lays the groundwork for more robust and versatile detection systems that can be applied to security and biometrics applications, addressing both evolving synthetic speech algorithms and real-world challenges.

Author Contributions

J.L.: Writing—review & editing, Methodology, Validation. Q.Z.: Writing—original draft, Methodology, Investigation. J.C.: Writing—review & editing, Investigation. H.T.: Writing—review & editing, Supervision, Funding acquisition. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Major Science and Technology Project of Xiamen (Industry and Information Technology Area) under Grant Number: 3502Z20231007.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Kumar, Y.; Koul, A.; Singh, C. A deep learning approaches in text-to-speech system: A systematic review and recent research perspective. Multimed. Tools Appl. 2023, 82, 15171–15197. [Google Scholar] [CrossRef]
  2. Sisman, B.; Yamagishi, J.; King, S.; Li, H. An overview of voice conversion and its challenges: From statistical modeling to deep learning. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 132–157. [Google Scholar] [CrossRef]
  3. Shen, J.; Pang, R.; Weiss, R.J.; Schuster, M.; Jaitly, N.; Yang, Z.; Chen, Z.; Zhang, Y.; Wang, Y.; Skerrv-Ryan, R.; et al. Natural TTS Synthesis by Conditioning Wavenet on MEL Spectrogram Predictions. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 4779–4783. [Google Scholar]
  4. Zhang, J.X.; Ling, Z.H.; Dai, L.R. Non-Parallel Sequence-to-Sequence Voice Conversion With Disentangled Linguistic and Speaker Representations. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 540–552. [Google Scholar] [CrossRef]
  5. Zen, H.; Tokuda, K.; Masuko, T.; Kobayasih, T.; Kitamura, T. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst. 2007, 90, 825–834. [Google Scholar] [CrossRef]
  6. Kaneko, T.; Kameoka, H.; Tanaka, K.; Hojo, N. Cyclegan-VC2: Improved Cyclegan-based Non-parallel Voice Conversion. In Proceedings of the ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 6820–6824. [Google Scholar]
  7. Alzantot, M.; Wang, Z.; Srivastava, M.B. Deep residual neural networks for audio spoofing detection. In Proceedings of the 20th Annual Conference of the International Speech Communication Association, Interspeech 2019, Graz, Austria, 15–19 September 2019; Kubin, G., Kacic, Z., Eds.; ISCA: Kolkata, Indian, 2019; pp. 1078–1082. [Google Scholar] [CrossRef]
  8. Kwak, I.Y.; Kwag, S.; Lee, J.; Huh, J.H.; Lee, C.H.; Jeon, Y.; Hwang, J.; Yoon, J.W. ResMax: Detecting Voice Spoofing Attacks with Residual Network and Max Feature Map. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 4837–4844. [Google Scholar]
  9. Tak, H.; Patino, J.; Nautsch, A.; Evans, N.W.D.; Todisco, M. Spoofing attack detection using the non-linear fusion of sub-band classifiers. In Proceedings of the 21st Annual Conference of the International Speech Communication Association, Interspeech 2020, Virtual Event, Shanghai, China, 25–29 October 2020; Meng, H., Xu, B., Zheng, T.F., Eds.; ISCA: Kolkata, Indian, 2020; pp. 1106–1110. [Google Scholar] [CrossRef]
  10. Wang, Z.; Cui, S.; Kang, X.; Sun, W.; Li, Z. Densely connected convolutional network for audio spoofing detection. In Proceedings of the 2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Auckland, New Zealand, 7–10 December 2020; pp. 1352–1360. [Google Scholar]
  11. Wu, Z.; Das, R.K.; Yang, J.; Li, H. Light convolutional neural network with feature genuinization for detection of synthetic speech attacks. In Proceedings of the 21st Annual Conference of the International Speech Communication Association, Interspeech 2020, Virtual Event, Shanghai, China, 25–29 October 2020; pp. 1101–1105. [Google Scholar]
  12. Luo, A.; Li, E.; Liu, Y.; Kang, X.; Wang, Z.J. A capsule network based approach for detection of audio spoofing attacks. In Proceedings of the ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 6359–6363. [Google Scholar]
  13. Zhang, Z.; Yi, X.; Zhao, X. Fake speech detection using residual network with transformer encoder. In Proceedings of the 2021 ACM Workshop on Information Hiding and Multimedia Security, Online, 22–25 June 2021; pp. 13–22. [Google Scholar] [CrossRef]
  14. Li, X.; Li, N.; Weng, C.; Liu, X.; Su, D.; Yu, D.; Meng, H. Replay and synthetic speech detection with res2net architecture. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2021, Toronto, ON, Canada, 6–11 June 2021; pp. 6354–6358. [Google Scholar] [CrossRef]
  15. Li, X.; Wu, X.; Lu, H.; Liu, X.; Meng, H. Channel-wise gated res2net: Towards robust detection of synthetic speech attacks. In Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, 30 August–3 September 2021; Hermansky, H., Cernocký, H., Burget, L., Lamel, L., Scharenborg, O., Motlícek, P., Eds.; ISCA: Kolkata, Indian, 2021; pp. 4314–4318. [Google Scholar] [CrossRef]
  16. Ma, X.; Liang, T.; Zhang, S.; Huang, S.; He, L. Improved lightcnn with attention modules for asv spoofing detection. In Proceedings of the 2021 IEEE International Conference on Multimedia and Expo (ICME), Shenzhen, China, 5–9 July 2021; pp. 1–6. [Google Scholar]
  17. Ray, R.; Karthik, S.; Mathur, V.; Kumar, P.; Maragatham, G.; Tiwari, S.; Shankarappa, R.T. Feature genuinization based residual squeeze-and-excitation for audio anti-spoofing in sound AI. In Proceedings of the 2021 12th International Conference on Computing Communication and Networking Technologies (ICCCNT), Kharagpur, India, 6–8 July 2021; pp. 1–5. [Google Scholar]
  18. Tak, H.; Jung, J.; Patino, J.; Todisco, M.; Evans, N.W.D. Graph attention networks for anti-spoofing. In Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, 30 August–3 September 2021; Hermansky, H., Cernocký, H., Burget, L., Lamel, L., Scharenborg, O., Motlícek, P., Eds.; ISCA: Kolkata, Indian, 2021; pp. 2356–2360. [Google Scholar] [CrossRef]
  19. Wang, X.; Yamagishi, J. A comparative study on recent neural spoofing countermeasures for synthetic speech detection. In Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, 30 August–3 September 2021; pp. 4259–4263. [Google Scholar]
  20. Cui, S.; Huang, B.; Huang, J.; Kang, X. Synthetic Speech Detection Based on Local Autoregression and Variance Statistics. IEEE Signal Process. Lett. 2022, 29, 1462–1466. [Google Scholar] [CrossRef]
  21. Lei, Z.; Yan, H.; Liu, C.; Ma, M.; Yang, Y. Two-Path GMM-ResNet and GMM-SENet for ASV Spoofing Detection. In Proceedings of the ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 6377–6381. [Google Scholar]
  22. Yue, F.; Chen, J.; Su, Z.; Wang, N.; Zhang, G. Audio spoofing detection using constant-q spectral sketches and parallel-attention se-resnet. In Proceedings of the Computer Security—ESORICS 2022—27th European Symposium on Research in Computer Security, Copenhagen, Denmark, 26–30 September 2022; Proceedings, Part III, ser. Lecture Notes in Computer Science. Atluri, V., Pietro, R.D., Jensen, C.D., Meng, W., Eds.; Springer: Berlin/Heidelberg, Germany, 2022; Volume 13556, pp. 756–762. [Google Scholar] [CrossRef]
  23. Li, C.; Yang, F.; Yang, J. The Role of Long-Term Dependency in Synthetic Speech Detection. IEEE Signal Process. Lett. 2022, 29, 1142–1146. [Google Scholar] [CrossRef]
  24. Muckenhirn, H.; Magimai-Doss, M.; Marcel, S. End-to-End convolutional neural network-based voice presentation attack detection. In Proceedings of the 2017 IEEE International Joint Conference on Biometrics (IJCB), Denver, CO, USA, 1–4 October 2017; pp. 335–341. [Google Scholar]
  25. Ma, Y.; Ren, Z.; Xu, S. Rw-resnet: A novel speech anti-spoofing model using raw waveform. In Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, 30 August–3 September 2021; Hermansky, H., Cernocký, H., Burget, L., Lamel, L., Scharenborg, O., Motlícek, P., Eds.; ISCA: Kolkata, Indian, 2021; pp. 4144–4148. [Google Scholar] [CrossRef]
  26. Hua, G.; Teoh, A.B.J.; Zhang, H. Towards end-to-end synthetic speech detection. IEEE Signal Process. Lett. 2021, 28, 1265–1269. [Google Scholar] [CrossRef]
  27. Muckenhirn, H.; Abrol, V.; Magimai-Doss, M.; Marcel, S. Understanding and visualizing raw waveform-based cnns. In Proceedings of the 20th Annual Conference of the International Speech Communication Association, Interspeech 2019, Graz, Austria, 15–19 September 2019; Kubin, G., Kacic, Z., Eds.; ISCA: Kolkata, Indian, 2019; pp. 2345–2349. [Google Scholar] [CrossRef]
  28. Wang, J.; Hua, G.; Huang, S. End-to-end Synthetic Speech Detection Based on Attention Mechanism. J. Signal Process. 2022, 38, 1975–1987. [Google Scholar]
  29. Wu, Z.; Evans, N.; Kinnunen, T.; Yamagishi, J.; Alegre, F.; Li, H. Spoofing and countermeasures for speaker verification: A survey. Speech Commun. 2015, 66, 130–153. [Google Scholar] [CrossRef]
  30. Todisco, M.; Wang, X.; Vestman, V.; Sahidullah, M.; Delgado, H.; Nautsch, A.; Yamagishi, J.; Evans, N.W.D.; Kinnunen, T.H.; Lee, K.A. Asvspoof 2019: Future horizons in spoofed and fake audio detection. In Proceedings of the 20th Annual Conference of the International Speech Communication Association, Interspeech 2019, Graz, Austria, 15–19 September 2019; Kubin, G., Kacic, Z., Eds.; ISCA: Kolkata, Indian, 2019; pp. 1008–1012. [Google Scholar]
  31. Sanchez, J.; Saratxaga, I.; Hernáez, I.; Navas, E.; Erro, D.; Raitio, T. Toward a Universal Synthetic Speech Spoofing Detection Using Phase Information. IEEE Trans. Inf. Forensics Secur. 2015, 10, 810–820. [Google Scholar] [CrossRef]
  32. Zhang, Y.; Wang, W.; Zhang, P. The effect of silence and dual-band fusion in anti-spoofing system. In Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, 30 August–3 September 2021; Hermansky, H., Cernocký, H., Burget, L., Lamel, L., Scharenborg, O., Motlícek, P., Eds.; ISCA: Kolkata, Indian, 2021; pp. 4279–4283. [Google Scholar] [CrossRef]
  33. Wang, X.; Yamagishi, J.; Todisco, M.; Delgado, H.; Nautsch, A.; Evans, N.; Sahidullah, M.; Vestman, V.; Kinnunen, T.; Lee, K.A.; et al. ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech. Comput. Speech Lang. 2020, 64, 101114. [Google Scholar] [CrossRef]
  34. Yamagishi, J.; Wang, X.; Todisco, M.; Sahidullah, M.; Patino, J.; Nautsch, A.; Liu, X.; Lee, K.A.; Kinnunen, T.; Evans, N.; et al. ASVspoof 2021: Accelerating progress in spoofed and deepfake speech detection. arXiv 2021, arXiv:2109.00537. [Google Scholar]
  35. Jung, J.w.; Heo, H.S.; Tak, H.; Shim, H.j.; Chung, J.S.; Lee, B.J.; Yu, H.J.; Evans, N. AASIST: Audio anti-spoofing using integrated spectro-temporal graph attention networks. In Proceedings of the ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 6367–6371. [Google Scholar]
  36. Wang, X.; Delgado, H.; Tak, H.; Jung, J.w.; Shim, H.j.; Todisco, M.; Kukanov, I.; Liu, X.; Sahidullah, M.; Kinnunen, T.; et al. ASVspoof 5: Crowdsourced speech data, deepfakes, and adversarial attacks at scale. arXiv 2024, arXiv:2408.08739. [Google Scholar]
  37. Müller, N.M.; Kawa, P.; Choong, W.H.; Casanova, E.; Gölge, E.; Müller, T.; Syga, P.; Sperl, P.; Böttinger, K. Mlaad: The multi-language audio anti-spoofing dataset. In Proceedings of the 2024 International Joint Conference on Neural Networks (IJCNN), Yokohama, Japan, 30 June–5 July 2024; pp. 1–7. [Google Scholar]
  38. Kinnunen, T.; Lee, K.; Delgado, H.; Evans, N.W.D.; Todisco, M.; Sahidullah, M.; Yamagishi, J.; Reynolds, D.A. t-dcf: A detection cost function for the tandem assessment of spoofing countermeasures and automatic speaker verification. In Proceedings of the Odyssey 2018: The Speaker and Language Recognition Workshop, Les Sables d’Olonne, France, 26–29 June 2018; Larcher, A., Bonastre, J., Eds.; ISCA: Kolkata, Indian, 2018; pp. 312–319. [Google Scholar] [CrossRef]
  39. Zhang, Y.; Jiang, F.; Duan, Z. One-class learning towards synthetic voice spoofing detection. IEEE Signal Process. Lett. 2021, 28, 937–941. [Google Scholar] [CrossRef]
  40. Ge, W.; Patino, J.; Todisco, M.; Evans, N. Raw differentiable architecture search for speech deepfake and spoofing detection. In Proceedings of the 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge, Brno, Czechia, 30 August–3 September 2021; pp. 22–28. [Google Scholar]
  41. Fu, Q.; Teng, Z.; White, J.; Powell, M.E.; Schmidt, D.C. FastAudio: A Learnable Audio Front-End For Spoof Speech Detection. In Proceedings of the ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 3693–3697. [Google Scholar]
  42. Ma, K.; Feng, Y.; Chen, B.; Zhao, G. End-to-End Dual-Branch Network Towards Synthetic Speech Detection. IEEE Signal Process. Lett. 2023, 30, 359–363. [Google Scholar] [CrossRef]
Figure 1. The dual-domain fusion network for synthetic speech detection.
Figure 2. The structure of the TDFEN.
Figure 3. The structure of the FDFEN.
Table 1. The results of the ablation experiments that demonstrate the effectiveness of dual-domain fusion and multi-task learning.

Configuration                 EER (%) ↓    t-DCF ↓
DDFNet                        0.69         0.0203
DDFNet w/o auxiliary loss     2.46         0.0665
TDFEN                         4.34         0.1283
FDFEN                         2.55         0.0760
Average                       1.01         0.0386
Table 2. Performance comparison of our proposed DDFNet and state-of-the-art approaches on ASVspoof 2019 LA evaluation set (Note: the methods with asterisks employ multiple frequency domain acoustic features).

Approaches                     EER (%) ↓    t-DCF ↓
CQT+ResMax [8]                 2.19         0.0600
LFCC+ResNet18 [39]             2.19         0.0590
DenseNet * [10]                1.98         0.0469
LFCC+LCNN-LSTM-sum [19]        1.92         0.0520
SE-Res2Net50 * [14]            1.89         0.0452
LFCC+GMM-ResNet [21]           1.80         0.0498
CQT+MCG-Res2Net50 [15]         1.78         0.0520
Raw PC-DARTS [40]              1.77         0.0517
FastAudio-Tri+X-vector [41]    1.73         0.0491
LPS+SENet [32]                 1.14         0.0368
Capsule * [12]                 1.07         0.0328
LFCC+OCT [23]                  1.06         0.0345
scDenseNet * [20]              0.98         0.0320
PA-SE-ResNet * [22]            0.96         0.0307
AASIST [35]                    0.83         0.0275
Dual-Branch Network * [42]     0.80         0.0214
RW-ResNet [25]                 2.98         0.0817
Res-TSSDNet [26]               1.64         0.0482
Ours: DDFNet                   0.69         0.0203
Table 3. Performance comparison of our proposed DDFNet and state-of-the-art approaches on the ASVspoof 2021 LA and DF evaluation sets.

Approaches          LA 2021 EER (%) ↓    LA 2021 t-DCF ↓    DF 2021 EER (%) ↓
DDFNet              14.45                0.5924             26.91
LFCC+OCT [23]       15.68                0.6200             20.95
TSSDNet [26]        15.24                0.6075             30.07
AASIST [35]         11.47                0.5081             21.07