1. Introduction
Recent advances in deep neural networks (DNNs) have improved the performance of speaker verification (SV) systems, including in short-duration and far-field scenarios [1, 2, 3, 4, 5]. However, SV systems are known to be vulnerable to various presentation attacks, such as replay attacks, voice conversion, and speech synthesis. These vulnerabilities have inspired research into presentation attack detection (PAD), which classifies given utterances as spoofed or not spoofed [6, 7, 8], and for which many DNN-based systems have achieved promising results [9, 10, 11].
Table 1 demonstrates the vulnerability of conventional SV systems when faced with presentation attacks. The performance is reported using the three types of equal error rates (EERs) described in Table 2 [12]. Table 2 shows the target and non-target trials used to calculate each EER, represented by 1 and 0, respectively. Zero-effort (ZE)-EER describes conventional SV performance without considering the presence of presentation attacks. PAD-EER denotes the EER for PAD, which only considers whether an input is spoofed. Integrated speaker verification (ISV)-EER describes overall performance, considering both speaker identity and spoofing. We refer to “replay spoofing-aware SV” as an ISV task and report its performance using ISV-EER. The results show that the EER of SV degrades to 33.72% with replayed utterances; this severe performance degradation supports the necessity of a spoofing-aware ISV system. In this paper, PAD refers to replay attack detection, because the ASVspoof2017 dataset focuses only on replay attacks, which are known to be the easiest attack to mount yet still an effective one. Three tasks are considered (SV, PAD, and ISV), and their performance is evaluated using ZE-EER, PAD-EER, and ISV-EER, respectively.
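To make the three EERs concrete, the following is a minimal Python sketch of how one scoring routine can serve all three protocols by changing only which trials count as target (1) and non-target (0). The trial kinds, the PROTOCOLS mapping, and the function names are our assumptions read from the prose description above, not a reproduction of Table 2. The scores are toy values, not results from this paper, and the threshold sweep is one simple way of approximating the EER.

```python
# Assumed label assignment per protocol; kinds missing from a protocol are excluded.
PROTOCOLS = {
    "ZE": {"target": 1, "zero_effort": 0},
    "PAD": {"target": 1, "replay": 0},
    "ISV": {"target": 1, "zero_effort": 0, "replay": 0},
}


def compute_eer(target_scores, nontarget_scores):
    """Approximate the equal error rate (as a fraction) by sweeping thresholds."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection: target trials scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: non-target trials scored at or above the threshold.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


def eer_for(protocol, trials):
    """Select the trials a protocol uses and compute its EER."""
    labels = PROTOCOLS[protocol]
    targets = [s for kind, s in trials if labels.get(kind) == 1]
    nontargets = [s for kind, s in trials if labels.get(kind) == 0]
    return compute_eer(targets, nontargets)


# Toy (kind, score) trials from a hypothetical SV system that is not spoofing-aware:
# some replayed utterances score as high as genuine target utterances.
trials = [
    ("target", 0.9), ("target", 0.8), ("target", 0.7), ("target", 0.6),
    ("zero_effort", 0.1), ("zero_effort", 0.2),
    ("zero_effort", 0.3), ("zero_effort", 0.4),
    ("replay", 0.85), ("replay", 0.75), ("replay", 0.3), ("replay", 0.2),
]

ze_eer = eer_for("ZE", trials)
isv_eer = eer_for("ISV", trials)
```

In this toy setting the ZE protocol sees only well-separated trials, whereas the ISV protocol also counts the high-scoring replayed trials as non-targets, so its EER is higher. In practice, PAD-EER would be computed on the scores of the PAD system rather than on SV scores.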
While a number of studies have worked to develop independent systems for SV and PAD, few have sought to integrate the SV and PAD systems [12, 13, 14, 15, 16, 17]. More specifically, this handful of studies proposed approaches such as cascaded, parallel [12, 13], and joint systems [14, 16, 17]. Most existing studies used common features to integrate the two tasks for system efficiency. Section 2 further takes up this existing body of work.
In this paper, we propose two spoofing-aware frameworks for the ISV task, illustrated in Figure 1. We use a light convolutional neural network (CNN) architecture, LCNN [18], for both frameworks; this choice is based on its success in various PAD studies [11, 19]. The first proposed framework extends existing work with a monolithic end-to-end (E2E) architecture. More specifically, it conducts speaker identification (SID) and PAD to train a common feature using multi-task learning (MTL) [20]. Concurrently, it uses the embeddings to compose trials and conduct the ISV task. The entire DNN is jointly optimized using the sum of the SID, PAD, and ISV losses. However, based on tendencies observed during internal experiments, we hypothesize that training a common feature for the ISV task may not be ideal because the properties required for each task differ: the PAD task representation uses device and channel information, while SV needs to remove it (further discussed in Section 3).
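As a rough illustration of the joint objective, the sketch below sums three per-task losses for a single training example. Only the unweighted sum is taken from the description above; the specific loss functions (softmax cross-entropy for SID, binary cross-entropy for PAD and ISV) and all function and argument names are assumptions made for illustration.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for one example (assumed here for the SID task)."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def binary_cross_entropy(prob, label, eps=1e-12):
    """Binary cross-entropy for one probability (assumed for PAD and ISV)."""
    prob = min(max(prob, eps), 1.0 - eps)  # keep log() away from zero
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def joint_loss(sid_logits, speaker_id, pad_prob, is_bonafide, isv_prob, is_target):
    """Unweighted sum of the SID, PAD, and ISV losses."""
    l_sid = cross_entropy(sid_logits, speaker_id)
    l_pad = binary_cross_entropy(pad_prob, is_bonafide)
    l_isv = binary_cross_entropy(isv_prob, is_target)
    return l_sid + l_pad + l_isv


# Hypothetical outputs for one example: three candidate speakers,
# a bona fide utterance, and a target trial.
example_loss = joint_loss([2.0, 0.5, -1.0], 0, 0.9, 1, 0.8, 1)
```

Because all three terms are backpropagated through the same shared layers, the common feature is pulled toward whatever each task rewards, which is where the conflict hypothesized above would arise.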
Based on our hypothesis, we propose a novel modular approach using a separate DNN. This approach takes two speaker embeddings (one each for enrollment and test) and a PAD prediction as input to make the ISV decision. It adopts a two-phase approach. In the first phase, the speaker identifier and PAD system are trained separately. In the second phase, speaker embeddings are extracted from a pretrained speaker identifier [21], and the embeddings and PAD prediction results are fed to a separate DNN module. Using this framework, we achieved a 21.77% relative improvement in terms of ISV-EER. (We use the trials at https://www.asvspoof.org/index2017.html to calculate ISV-EER.)
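The data flow of the second phase can be sketched as a small feed-forward network over the two embeddings and the PAD prediction. The following Python sketch is illustrative only: combining the inputs by plain concatenation, the layer sizes, the ReLU and sigmoid activations, and all names are our assumptions, and the weights are random and untrained.

```python
import math
import random


def dense(x, weights, bias):
    """Fully connected layer: one output per row of weights."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def backend_score(enroll_emb, test_emb, pad_score, params):
    """Map (enrollment embedding, test embedding, PAD score) to a score in (0, 1)."""
    # Assumed input composition: plain concatenation of the three inputs.
    x = list(enroll_emb) + list(test_emb) + [pad_score]
    hidden = [max(0.0, v) for v in dense(x, params["w1"], params["b1"])]  # ReLU
    (logit,) = dense(hidden, params["w2"], params["b2"])
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid


def init_params(in_dim, hidden_dim, seed=0):
    """Random, untrained weights; the sizes are illustrative only."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "w1": mat(hidden_dim, in_dim), "b1": [0.0] * hidden_dim,
        "w2": mat(1, hidden_dim), "b2": [0.0],
    }


emb_dim = 4  # hypothetical embedding size, far smaller than a real speaker embedding
params = init_params(in_dim=2 * emb_dim + 1, hidden_dim=8)
score = backend_score([0.1, -0.2, 0.3, 0.0], [0.1, -0.1, 0.2, 0.1], 0.9, params)
```

In this arrangement the speaker identifier and the PAD system stay fixed, and only the back-end parameters would be trained on ISV trials, so the two front-ends need not share a feature.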
The contributions of this paper are as follows.
Propose a novel E2E framework that jointly optimizes SID, PAD, and the ISV task.
Experimentally validate the hypothesis that the discriminative information required for the SV and the PAD task may be distinct, requiring separate front-end modeling.
Propose a separate modular back-end DNN that takes speaker embeddings and PAD predictions as input to make ISV decisions.
The remainder of the paper is organized as follows. Section 2 details related work on integrated SV and PAD systems. Section 3 introduces the two proposed frameworks. Section 4 presents our experiments and results, and Section 5 concludes the paper.
2. Related Work
In this section, we introduce the two lines of work most relevant to this study [12, 16, 17]. First, Todisco et al. [12] proposed separately modeling two Gaussian back-end systems with a unified threshold for both SV and PAD tasks. Their study explored various acoustic features to find those best suited to both tasks simultaneously. As organizers of the ASVspoof challenges, the authors also released official trials for the ISV task in that study. For our purposes, it is important to highlight that these trials, which we use throughout this paper, include both ZE and replayed non-target trials. However, Todisco et al. [12] reported the average of two EERs (ZE-EER and PAD-EER) because they trained separate Gaussian mixture models for each task.
Li et al. [16, 17] extended Todisco et al.’s work [12] by proposing an integrated ISV system; theirs is the first study to report an ISV-EER. More specifically, they proposed a three-phase training framework for extracting an embedding for the ISV task, followed by a probabilistic linear discriminant analysis (PLDA) back-end. In the first phase, an MTL [20] framework is employed to train a common embedding for both SV and PAD tasks. In the second and third phases, the embedding is adapted to fit the ISV task. However, because the DNN is adapted in the third phase to fit the enrollment speakers, the approach has limitations in real-world scenarios. In addition, because the reported performance does not use the organizers’ official trials, it is difficult to compare with the rest of the literature.
In this paper, we first propose an E2E framework, illustrated in Figure 1a, that extends the work of Li et al. [16, 17] in two aspects. First, we adopt a single-phase training approach by using three loss functions for SID, PAD, and ISV. Second, our framework directly outputs a spoofing-aware score without using a separate back-end system.