Article

Replay Attack Detection Using Integrated Glottal Excitation Based Group Delay Function and Cepstral Features

by Amol Chaudhari 1,*, Dnyandeo Shedge 1, Vinayak Bairagi 1 and Aziz Nanthaamornphong 2,*
1 Department of Electronics and Telecommunication Engineering, AISSMS Institute of Information Technology, Pune 411001, India
2 College of Computing, Prince of Songkla University, Phuket Campus, Phuket 83120, Thailand
* Authors to whom correspondence should be addressed.
Symmetry 2024, 16(7), 788; https://doi.org/10.3390/sym16070788
Submission received: 19 May 2024 / Revised: 10 June 2024 / Accepted: 17 June 2024 / Published: 23 June 2024
(This article belongs to the Section Engineering and Materials)

Abstract

The automatic speaker verification system is susceptible to replay attacks. Recent literature has focused on score-level integration of multiple features, phase information-based features, high frequency-based features, and glottal excitation for the detection of replay attacks. This work presents glottal excitation-based all-pole group delay function (GAPGDF) features for replay attack detection. The essence of a group delay function based on the all-pole model is to exploit information from the speech signal phase spectrum in an effective manner. Further, the performance of integrated high-frequency-based CQCC features with cepstral features, subband spectral centroid-based features (SCFC and SCMC), APGDF, and LPC-based features is evaluated on the ASVspoof 2017 version 2.0 database. On the development set, an EER of 3.08% is achieved, and on the evaluation set, an EER of 9.86% is achieved. The proposed GAPGDF features provide an EER of 10.5% on the evaluation set. Finally, integrated GAPGDF and GCQCC features provide an EER of 8.80% on the evaluation set. The computation time required for the ASV systems based on various integrated features is compared to ensure symmetry between the integrated features and the classifier.

1. Introduction

Speaker recognition refers to recognizing a person from their voice. It involves speaker identification and speaker verification: speaker identification identifies a person from a known set of voices, whereas speaker verification verifies the claimed identity of an individual. A recent challenge for the Automatic Speaker Verification (ASV) system is the detection of spoofing attacks. Among the different spoofing attacks, replay attack detection is particularly challenging for the ASV system. It is also easy for attackers to mount a replay attack, as it does not require any specific expertise. In recent years, the research community has made many efforts toward countermeasures for and recognition of replay attacks. Major efforts to detect replay signals were made in the ASVspoof 2017 challenge, organized during INTERSPEECH 2017 [1]. Afterwards, ASVspoof 2017 version 2.0 was made available, with certain meta-data modifications [2]. For the ASVspoof 2017 database, several phase- and magnitude-based features, along with fused combination strategies, were suggested. CQCC, MFCC, IMFCC, SCMC, and SCFC are a few of the magnitude-based features proposed in [3]. A combined system of CQCC, Mel-RP, and PBSFVT features is presented in [4]. At INTERSPEECH 2018, frequency modulation features [5] and frequency-domain linear prediction features [6] were demonstrated. In [7], score-level fusion of CQCC and AWFCC is reported to improve replay attack detection performance. Score-level fusion of power function-based features and CQCC is proposed in [8].
Recently, many systems have been proposed for distinguishing replayed signals from genuine speech. Such systems fall into two categories: those that focus on features and those that focus on classifiers [9]. One attribute of the taxonomy presented in [9] is the use of multiple features for improved performance. Another attribute mentioned in [9] is fusion, which can be performed at the score level or at the feature level. Serial feature-level fusion of multiple features has been demonstrated by the current authors in [10]. Many ASV systems have been proposed that use multiple features and combine them at the score level for replay attack detection. These efforts are reviewed in the following section.

2. Related Work

Several approaches have been proposed based on the use of phase information. The effectiveness of relative phase information-based features, namely LPR-RP and LPAES-RP, is demonstrated in [11], where the authors show that LPR-RP and LPAES-RP features fused with CQCC at the score level are effective. The authors in [12] proposed RMFCC features, based on the linear prediction residual, which carries excitation source information. RMFCC and CQCC features fused at the score level yielded better performance in [12]. Hilbert envelope-based and residual phase features have also been fused at the score level [13].
The efficacy of instantaneous frequency-based features has been noted in recent work. Features based on instantaneous amplitude and instantaneous frequency are proposed in [14]; these features are extracted using the energy separation algorithm, and the resulting ESA-IFCC features are fused with CQCC at the score level. Other approaches based on instantaneous frequency features are demonstrated in [15,16].
Score-level fusion of source, instantaneous frequency, and cepstral features is proposed in [15]. In [16], instantaneous frequency-based Cochlear Filter Cepstral Coefficients are proposed, and multiple features (CQCC, CFCC, CFCCIF, CFCCIF-ESA, and CFCCIF-QESA) are fused at the score level. The use of the multiple features AFCCsFAF, ARPDBF, and CQCC at the score level is proposed in [17]. Glottal MFCC and shifted CQCC-based features are fused at the score level in [18]. Teager energy-based features are fused at the score level with cepstral features in [19]. Autoencoder-reconstructed features are effective when fused at the score level with MFCCs, CQCCs, SCMCs, CCCs, LPCCs, IMFCCs, RFCCs, LFCCs, SCFCs, and spectrograms, as demonstrated in [20]. The effectiveness of gammatone-scale relative phase combined with CQCC at the score level is presented in [21].
Recent studies have also proposed enhancements of existing feature extraction approaches. Based on the constant Q-transform, three concatenated features, CQSPIC, CQEPIC, and CESPIC, are proposed in [22]. In [23], improved ETECC features are proposed for replay attack detection. Glottal information based CQCC features, emphasizing the importance of the high-frequency band, are proposed in [24], where the 7–8 kHz band is considered. The importance of high-frequency features in the computation of CQCCs is demonstrated in [25,26]. Replay attack detection using 2D-ILRCC features, an enhancement of source-based RAD features, is proposed in [27].
From the extensive literature survey, it is observed that the use of multiple features with fusion at the score level is widely practiced. Some approaches have focused on sub-band analysis and on enhancing baseline feature extraction schemes. Recent approaches have focused on glottal information-based features using iterative adaptive inverse filtering [18,24]. As noted in [9], fusion can enhance the performance of a replay attack detection system, but it also adds fusion computation, increasing computation time. The motivation for this work is to analyze the trade-off between computation time and detection rate. This work examines the score-level fusion approach while considering computation time, and reports the computation time of the systems that performed better in terms of detection rate. Symmetry between the integrated features and the classifier is necessary to keep the computation time reasonable.
This work evaluates the performance of the ASV system by integrating cepstral features and linear prediction-based features at the score level. Unlike the approach in [10], score-level fusion is evaluated by integrating different systems based on cepstral and LPC based features. In this work, CQCC features are evaluated with the high-frequency band (6–8 kHz), and score-level integration is evaluated considering cepstral and linear prediction-based features. This work also evaluates IMFCC, SCMC, SCFC, RFCC [3], and APGDF [28] features. The performance of IMFCC, SCMC, SCFC, and RFCC features has been evaluated in [3]; however, all-pole group delay function-based features are less explored for replay attack detection. This work proposes two approaches: the first is GCQCC with cepstral mean and variance normalization (CMVN), and the second is glottal information based APGDF (GAPGDF) features with CMVN.
The structure of this paper is as follows. Section 1 presented the introduction and Section 2 the related work, followed by the methods in Section 3. The experimentation and outcomes are covered in Section 4. Section 5 provides a summary and conclusion of the paper.

3. Method

3.1. Glottal Flow Derivative

In terms of signal processing, speech is thought of as the time-varying vocal-tract system's response to a stream of airflow passing through the glottis, the slit-like opening between the vocal folds [18,24,29]. Airflow through the glottis, or glottal flow, excites the vocal tract system. Lip radiation, corresponding to first-order differentiation, further filters the vocal tract system's output [18,24,29]. Due to the lip's differentiation property, the glottal flow derivative (GFD) is the form in which the excitation is observed in uttered speech. Thus, the derivative of the glottal flow is frequently regarded as the excitation signal representation in speech processing tasks [18,24,29]. Glottal flow and its derivative waveforms can be referred to in [18,24,29]. The open, closed, and return phases are the three distinct sections that make up a full glottal cycle.
The closed phase is the state in which there is no airflow and the vocal folds are completely closed. During the open phase, there is non-zero airflow and the vocal folds are fully or partially open. The return phase is the period between the glottal closure time and the glottal flow derivative's most negative value. A sequence of glottal cycles forms the excitation. Consequently, the excitation information at three levels, i.e., within, between, and across glottal cycles, is reflected in the glottal flow derivative. The information within a glottal cycle comprises timed events, such as the glottal flow duration and the instants of closure and opening [18,24,29]. Between subsequent glottal cycles, information about the average pitch and epoch strengths is reflected. High-level information, such as prosody and intonation, can be detected across many glottal cycles [18,24,29]. As mentioned in [18,24], the various devices and components employed in the replay configuration affect all of this information, and these disturbances may alter the amplitude, frequency, and shape of the GFD signal. It can therefore be expected that GFD signals extracted from replay and genuine voice samples may be helpful for detecting replay signals [18,24].

3.2. IA-IF

Various methods of GFD estimation are compared in [30]. These methods include DYPSA [30,31], the Hilbert envelope-based method [30,32], SEDREAMS [30,33], YAGA [30,34], and ZFR [30,35]. Like these methods, the dynamic plosion index (DPI) algorithm proposed in [36] needs precise glottal closure instant (GCI) estimation for GFD signals [18]. Replay signals have higher noise levels than the genuine voice signal, mostly from reverberation and the impact of intermediate devices [18,37], and it is challenging to accurately estimate GCIs from noisy speech [18]. As an alternative, GFD estimation of replay signals is performed in [18,24] using the iterative adaptive inverse filtering (IAIF) technique, which is independent of GCI locations. In the IAIF approach, the GFD signal is approximated from the speech signal using iterative LP analysis [38].
An estimate of the glottal contribution is obtained in the first iteration by the computation of a first order LPC model. To produce a more accurate model for the glottal contribution, the higher order LPC model is computed in the second iteration. The details of the IAIF method can be found in [24]. To examine how the replay mechanism affects the information from the excitation source, an analysis is conducted on the temporal and spectral representation of the replay and genuine signal. The GFD signal representations in the temporal and spectral domains, as computed from genuine (T_1000003.wav) and replay (T_1001511.wav) speech pairs obtained from the ASVspoof 2017 version 2.0 database, are displayed in Figure 1.
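The two-iteration structure above maps to a short inverse-filtering routine. The following Python sketch is an illustrative approximation, not the COVAREP implementation used in this work; the single simplified pass and the model orders are assumptions.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc_coeffs(x, order):
    """Autocorrelation-method LPC via the normal (Yule-Walker) equations."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    a = solve_toeplitz((r[:-1], r[:-1]), r[1:])  # predictor coefficients
    return np.concatenate(([1.0], -a))           # A(z) = 1 - sum_k a_k z^-k

def iaif(frame, p_glottis=1, p_tract=24):
    """Simplified IAIF: estimate the glottal flow derivative (GFD).

    Iteration 1: a first-order LPC model captures the glottal spectral
    tilt, which is inverse-filtered away before modelling the vocal tract.
    Iteration 2: a higher-order LPC model of the vocal tract is inverse-
    filtered from the speech, leaving an estimate of the GFD.
    Assumes a voiced, non-silent frame.
    """
    w = frame * np.hamming(len(frame))
    g1 = lpc_coeffs(w, p_glottis)          # first-order glottal model
    tilt_removed = lfilter(g1, [1.0], w)   # cancel the glottal contribution
    vt = lpc_coeffs(tilt_removed, p_tract) # vocal-tract model
    return lfilter(vt, [1.0], w)           # inverse filter -> GFD estimate
```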
As seen in Figure 1a, the replayed/spoof speech signal is distorted in amplitude and periodicity in the temporal representation, and its spectral representation is also more distorted. Figure 2 shows the spectrograms of genuine (Figure 2a) and spoofed (Figure 2b) speech signals after processing with the IAIF method. It is evident from the spectrograms that the spoof speech signal loses periodicity (marked by the rectangle) and that its energy concentration is lower (marked by the oval) compared to the genuine speech signal. These differences may help distinguish a genuine signal from a spoofed one.

3.3. Cepstral Mean and Variance Normalization

Robustness is a crucial factor in assessing how well replay detection methods work [39]. As mentioned in [39], when a spoof/replay detection method trained on one dataset is applied to sounds from another dataset, the mismatch between the two datasets usually degrades the detector's performance because of different background/channel noises. CMVN is commonly used in replay attack detection approaches, for example in [12,17,18,19,23,40]. “CMVN eliminates the convolution noise in the temporal domain, such as channel distortion, and the channel noise in the cepstral domain, which corresponds to the additive deviation of the cepstral domain” [39]. Each training and test sample is converted to zero mean and unit variance using CMVN [39,41].
Let $y_t$ be the $N$-dimensional cepstral feature vector at time $t$, and let $y_t(i)$ denote the $i$th component of $y_t$. Then $Y = \{y_1, y_2, \ldots, y_t, \ldots, y_T\}$ represents a voice segment of length $T$. “CMVN first calculates the mean μ and the variance σ² using the maximum likelihood estimate for each feature dimension” [39]:
$$\mu(i) = \frac{1}{T}\sum_{t=1}^{T} y_t(i), \quad 1 \le i \le N \quad (1)$$

$$\sigma^2(i) = \frac{1}{T-1}\sum_{t=1}^{T} \left( y_t(i) - \mu(i) \right)^2, \quad 1 \le i \le N \quad (2)$$
Now, each dimension of the feature vector is normalized:
$$\hat{y}_t(i) = \frac{y_t(i) - \mu(i)}{\sigma(i)}, \quad 1 \le i \le N, \; 1 \le t \le T \quad (3)$$
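Equations (1)–(3) map directly to a few lines of array code. The following Python sketch is an illustrative stand-in for the MSR Identity Toolbox routine used later in this paper; it normalizes a T × N matrix of cepstral features per utterance.

```python
import numpy as np

def cmvn(features, eps=1e-12):
    """Per-utterance cepstral mean and variance normalization.

    features: array of shape (T, N) -- T frames, N cepstral coefficients.
    Implements Equations (1)-(3): subtract the per-dimension mean and
    divide by the per-dimension standard deviation.
    """
    mu = features.mean(axis=0)              # Equation (1)
    sigma = features.std(axis=0, ddof=1)    # square root of Equation (2)
    return (features - mu) / (sigma + eps)  # Equation (3)
```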

3.4. Group Delay Function

The phase spectrum’s negative derivative is known as the group delay function [42]. If x(n) is a frame of speech, then its Fourier transform representation in polar form is as follows,
$$x(n) \xrightarrow{\;\mathcal{F}\;} X(\omega) = |X(\omega)|\, e^{j\theta(\omega)} \quad (4)$$
Here, the magnitude spectrum is |X(ω)| and θ(ω) represents the phase spectrum. The continuous phase function’s negative derivative is the group delay function.
$$\tau(\omega) = -\frac{d\theta(\omega)}{d\omega} \quad (5)$$
Figure 3 shows the group delay spectra of the genuine (Figure 3a) and spoof (Figure 3b) signals. Closely spaced higher formants are visible for both genuine and spoof speech. It is evident from Figure 3 that peaks are more dominant in the spectrum of the spoof signal. This might be due to channel noise produced by the recording device being added to the speech signal.
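In practice, Equation (5) is rarely evaluated by unwrapping and differentiating the phase. A standard FFT identity gives the group delay directly; the sketch below is a generic illustration of that identity, not code from this work.

```python
import numpy as np

def group_delay(x, n_fft=512, eps=1e-12):
    """Group delay tau(w) = -d(theta)/dw via the FFT identity
    tau(w) = Re{Y(w)/X(w)}, where Y is the transform of n*x(n).
    This avoids explicit phase unwrapping."""
    n = np.arange(len(x))
    X = np.fft.rfft(x, n_fft)
    Y = np.fft.rfft(n * x, n_fft)
    return (X.real * Y.real + X.imag * Y.imag) / (np.abs(X) ** 2 + eps)
```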

3.5. Group Delay Function of All-Pole Models

The speech spectrum is approximated in linear prediction analysis using an all-pole model [42,43]. “An equivalent representation of the vocal tract as a cascade of several second-order and first-order all-pole filters is possible when it is thought of as an all-pole filter” [42,44]. The product of the magnitude spectra of each individual filter yields the overall magnitude spectrum in this representation. Individual phase spectra are added together to form the overall phase spectrum, which in turn forms the group delay spectrum [42].
$$H(\omega) = \frac{G}{1 - \sum_{k=1}^{p} a(k)\, e^{-j\omega k}} \quad (6)$$
For Equation (6), the linear prediction problem can be stated as follows: given the power spectrum $|X(\omega)|^2$, find the coefficient set $a(k)$ such that the power spectrum of $H(\omega)$ matches the speech power spectrum in a least-squares sense [42,43]. Here, $G$ is a signal-dependent gain and $p$ is the model order. The filter formed by $H(\omega)$ has both a magnitude response and a phase response, and its group delay function is known as the all-pole group delay function. “Closely spaced higher formants can be captured because of the group delay function's high-resolution characteristic” [42]. In the group delay spectrum, the higher order formants are more noticeable, especially under low and high vocal effort conditions [42]. As mentioned in [42,45], the APGDF is converted into cepstral coefficients by applying the DCT. Figure 4 shows the feature extraction process.
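Combining the pieces above, a minimal APGDF extractor fits an all-pole model, takes the group delay of $H(\omega)$, and applies a DCT. This sketch reuses lpc_coeffs() and group_delay() from the earlier snippets; the parameter defaults loosely follow Table 7 but remain illustrative.

```python
import numpy as np
from scipy.fftpack import dct

def apgdf(frame, lpc_order=20, n_fft=512, n_coeff=20):
    """All-pole group delay features (illustrative sketch).

    Since H(w) = G / A(w) and G is a real gain, the phase of H is the
    negative of the phase of A(z), so the group delay of the all-pole
    filter is minus the group delay of a = [1, -a_1, ..., -a_p].
    """
    a = lpc_coeffs(frame * np.hamming(len(frame)), lpc_order)
    tau = -group_delay(a, n_fft)             # group delay of 1/A(z)
    return dct(tau, norm="ortho")[:n_coeff]  # DCT -> cepstral coefficients
```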

3.6. Proposed IA-IF Based APGDF Feature Extraction

The block diagram of the training phase and testing phase framework is shown in Figure 5. The training phase involves IA-IF based APGDF feature extraction from genuine and spoof speech signals, followed by training a Gaussian mixture model for each class. In the testing phase, IA-IF based APGDF features are extracted from the test samples in the development and evaluation sets. Scores are computed using the log-likelihood ratio, from which the Equal Error Rate (EER) is computed. In this work, the IA-IF based APGDF features are named GAPGDF.
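As a rough end-to-end illustration (not the authors' exact code), the GAPGDF front end can be composed from the sketches in the preceding subsections: IAIF excitation per frame, APGDF on the excitation, then CMVN over the utterance. The LPC order below is an assumption.

```python
import numpy as np

def gapgdf_features(frames, lpc_order=40):
    """GAPGDF sketch: APGDF computed on the IAIF glottal excitation of
    each frame, followed by per-utterance CMVN. `frames` is an iterable
    of fixed-length speech frames."""
    feats = np.stack([apgdf(iaif(f), lpc_order=lpc_order) for f in frames])
    return cmvn(feats)  # rows = frames, columns = coefficients
```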

3.7. Classifier

GMM is still the most widely utilized classifier [9], and a GMM baseline (CQCC + GMM) is provided by the ASVspoof 2017 challenge. This work employs a GMM classifier to discriminate between genuine and spoofed speech. One GMM for genuine utterances and another for spoofed utterances are trained, each with 512 components, using the Expectation-Maximization (EM) algorithm with random initialization. The score for a test utterance is the log-likelihood ratio computed against the genuine and spoofed speech models. The GMM implementation is available from VLFeat [46].
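The paper trains the GMMs with VLFeat's EM in MATLAB; as a hedged illustration only, an equivalent setup with scikit-learn (an assumed substitute library) looks like this, where genuine_feats and spoof_feats are placeholder frame-level feature matrices.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder frame-level training matrices (rows = frames); in practice
# these come from the feature extractors described above.
genuine_feats = np.random.randn(5000, 20)
spoof_feats = np.random.randn(5000, 20)

gmm_genuine = GaussianMixture(n_components=512, covariance_type="diag",
                              init_params="random", max_iter=100)
gmm_spoof = GaussianMixture(n_components=512, covariance_type="diag",
                            init_params="random", max_iter=100)
gmm_genuine.fit(genuine_feats)  # EM with random initialization
gmm_spoof.fit(spoof_feats)

def llr_score(utt_feats):
    """Average per-frame log-likelihood ratio: genuine vs. spoof model."""
    return gmm_genuine.score(utt_feats) - gmm_spoof.score(utt_feats)
```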

3.8. Evaluation Metric

Using the BOSARIS toolkit [47], the ASVspoof 2017 challenge [1] provided the Equal Error Rate (EER) as the evaluation metric for the baseline systems. In recent years, EER has been the main criterion for evaluating replay attack detection systems [9], and it is used as the evaluation metric in this work. More details on the EER can be found in [18].
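This work computes EER with the BOSARIS toolkit; for intuition only, a simple threshold-sweep approximation (not the toolkit's ROC convex hull method) is sketched below.

```python
import numpy as np

def equal_error_rate(genuine_scores, spoof_scores):
    """Approximate EER: sweep thresholds over all observed scores and
    return the point where false acceptance ~= false rejection."""
    thresholds = np.sort(np.concatenate([genuine_scores, spoof_scores]))
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])
    frr = np.array([(genuine_scores < t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))
    return 100.0 * (far[i] + frr[i]) / 2.0  # EER in percent
```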

4. Results and Discussion

4.1. Database

The ASVspoof 2017 version 2.0 database [2] is used for the experimentation. This database contains genuine utterances from the RedDots corpus [2]; spoofed utterances are produced by replaying and recording the bona fide utterances through a variety of heterogeneous devices and acoustic environments [2]. The database has three non-overlapping subsets: training, development, and evaluation. More information about the database is available in [2]. Table 1 describes the database.

4.2. Experiment Setup

The experiments are conducted using MATLAB R2021a on the Windows 11 operating system, on a Lenovo Legion laptop with a 10th-generation Intel Core i7 processor.

4.3. Baseline System

The current authors have evaluated the baseline CQCC [2] and LFCC [48] methods in [10], along with the MFCC, LPC, and LPCC methods. The results of that work are shown in Table 2. In all experiments, DevEER (%) denotes the equal error rate (%) achieved on the development set and EvalEER (%) denotes the equal error rate (%) achieved on the evaluation set.
In [10], the current authors have integrated cepstral and LPC based features for various combinations at the serial level. This work integrates cepstral and LPC based features at the score-level.

4.4. Score-Level Integration of Cepstral and LPC Based Features

This section presents the results of integrating cepstral and LPC based features at the score level. The parameters, listed in Table 3, are the same as those used in [10] for the evaluation of cepstral and LPC based features. The MFCC, LPC, and LPCC implementations are taken from the VOICEBOX toolbox [49].
In addition to the cepstral and LPC based features, a timbre feature set formed from various descriptors is demonstrated in [10]. This work explores the efficacy of these features under score-level integration. A 90-dimensional zero-crossing rate feature set (ZCR) is also evaluated along with the timbre feature set. The timbre feature set includes brightness [50], entropy, event density, flatness, inharmonicity, kurtosis, pitch, irregularity [50], rolloff [50], RMS, skewness, and spread; these features are extracted using the MIR toolbox [51]. The basic idea of score-level integration is to combine the scores computed by different systems: for example, CQCC + GMM is one system, MFCC + GMM is another, and so on (a sketch of this combination is given below). Table 4 presents the results of the experiments carried out by integrating the cepstral domain, LPC, zero-crossing, and timbre feature sets at the score level.
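A minimal sketch of this score-level integration step follows; equal weighting is an assumption, since the fusion weights are not specified here.

```python
import numpy as np

def fuse_scores(system_scores, weights=None):
    """Score-level fusion: weighted sum of per-system LLR scores.

    system_scores: list of 1-D arrays, one score per trial per system
    (e.g., [cqcc_gmm_scores, mfcc_gmm_scores, ...]).
    """
    S = np.vstack(system_scores)  # shape: (n_systems, n_trials)
    if weights is None:
        weights = np.full(S.shape[0], 1.0 / S.shape[0])  # equal weights
    return weights @ S            # fused score per trial
```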
From Table 4, the integration of CQCC, LFCC, MFCC, LPC, and LPCC features achieved an EER of 5.44% on the development set, and the integration of ZCR, MFCC, CQCC, LFCC, and LPC features achieved an EER of 18.33% on the evaluation set. It is evident that cepstral and LPC based features are more prominent than timbre features when integrated at the score level.

4.5. High Frequency Band

In the process of spoof signal generation, the original speech is degraded by the impulse responses of the microphone and playback device as well as by the acoustic properties of the environment [26]. This hypothesis is exploited in [26] to discriminate replayed speech from genuine speech using spectral features derived from several frequency subbands. The work presented in [24,25,26] focuses on information present in the high-frequency band: the 6–8 kHz band is shown to be more useful for replay attack detection, with CQCC features extracted over this band [24,25,26]. This work evaluates the performance of CQCC features over the 6–8 kHz band and integrates various cepstral and LPC based features with the high frequency based CQCC features. Table 5 presents the parameterization used for extracting the high frequency based CQCC features; this parameterization has been demonstrated in [52]. A rough sketch of the extraction is given below.
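The following Python sketch outlines the shape of a high-band CQCC front end using librosa's CQT. It is a simplification: the reference CQCC implementation also resamples the log spectrum on a uniform frequency grid before the DCT, and the bins-per-octave value and the fmax slightly below the Nyquist frequency are assumptions made to keep the sketch light (Table 5 uses 1024 bins per octave and fmax = 8000 Hz).

```python
import numpy as np
import librosa
from scipy.fftpack import dct

def hf_cqcc(y, sr=16000, fmin=6000.0, fmax=7900.0, bpo=96, n_coeff=29):
    """High-frequency-band CQCC sketch: CQT -> log power -> DCT."""
    n_bins = int(np.floor(bpo * np.log2(fmax / fmin)))
    C = np.abs(librosa.cqt(y, sr=sr, fmin=fmin, n_bins=n_bins,
                           bins_per_octave=bpo))
    log_pow = np.log(C ** 2 + 1e-12)
    # The reference CQCC also resamples the log spectrum on a uniform
    # frequency grid before the DCT; this sketch omits that step.
    return dct(log_pow, axis=0, norm="ortho")[:n_coeff]
```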
Table 6 presents the experimental results for the integration of high frequency based CQCC features with cepstral and LPC based features.
As observed in Table 6, high frequency based CQCC features result in better performance than the baseline CQCC features. The integrated CQCC, LFCC, MFCC, LPC, and LPCC features achieved an EER of 4.81% on the development set and 13.45% on the evaluation set. From these results, it is observed that integrating high frequency based CQCC features with cepstral and LPC based features performs better under score-level fusion.

4.6. APGDF, IMFCC, RFCC, SCFC and SCMC Features

For replay attack detection, spectral sub-band centroid based features (SCMC and SCFC), inverted MFCC, and RFCC features are evaluated in [3]. In [28], APGDF features are examined for synthetic speech detection. This work evaluates these features to examine their effectiveness; Table 7 shows the parameters considered for each. Implementations of the features in Table 7 are available in [28,53,54].
Figure 6 shows the experimental results for the APGDF, IMFCC, RFCC, SCFC, and SCMC features.
As observed from Figure 6, although the results are less prominent on the evaluation set, the results achieved on the development set are better than those of the baseline methods. These features can therefore provide complementary information, so they are integrated with the cepstral domain and LPC domain features. As the group delay function based on the all-pole model captures closely spaced higher formants [42], this work uses APGDF features for replay attack detection.

4.7. Score Level Integration of All Features

It is evident that high frequency based CQCC features result in better performance than the baseline CQCC. Moreover, integrating high frequency based CQCC with LFCC, MFCC, and LPC based features (LPC and LPCC) improved performance on both the development and evaluation sets, showing that score-level feature integration is effective. Further, to obtain complementary information, the sub-band centroid based features (SCMC and SCFC), inverted MFCC, RFCC, and APGDF features are also integrated. Table 8 shows the result of this integration.
As observed in Table 8, an EER of 3.08% is attained on the development set for high frequency based CQCC features integrated with cepstral features, LPC based features, IMFCC, RFCC, APGDF, SCMC, and SCFC features. On the evaluation set, an EER of 9.86% is achieved for the same integrated features.

4.8. Glottal Excitation Based CQCC Features with CMVN

Next, this work evaluates the performance of glottal excitation based CQCC features, which were originally evaluated in [24]. In this work, however, the performance of GCQCC features is evaluated with the CMVN technique. The CMVN implementation is taken from the MSR Identity Toolbox [55]. For glottal excitation computation, the IA-IF method from the COVAREP repository [56] (version 1.4.2) is used. The CQCC parameters are the same as for the baseline CQCC. In addition, this work evaluates the performance of GCQCC features by varying the number of cepstral coefficients, including static, delta, and double delta coefficients. Figure 7 shows the results of GCQCC with the CMVN technique; a sketch of the feature pipeline follows.
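As an illustration only (assumptions flagged in the comments), GCQCC with CMVN can be composed from the earlier sketches:

```python
def gcqcc_cmvn(y, sr=16000):
    """GCQCC sketch: CQCC computed on the IAIF glottal excitation,
    followed by per-utterance CMVN.

    Approximations: the IAIF sketch is applied to the whole utterance
    rather than frame-by-frame, and the high-band extractor above is
    reused, whereas this work uses baseline (full-band) CQCC parameters
    for GCQCC."""
    gfd = iaif(y)                # glottal flow derivative estimate
    feats = hf_cqcc(gfd, sr=sr)  # CQT-based cepstra on the excitation
    return cmvn(feats.T)         # rows = frames, columns = coefficients
```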
As shown in Figure 7, 120 cepstral coefficients (40 static + 40 Δ + 40 ΔΔ) result in better performance on the development and evaluation sets. So, in subsequent experiments, GCQCC with the CMVN technique is used with 120 cepstral coefficients.

4.9. All Pole Group Delay Function Features (APGDF) with CMVN

The APGDF features with the CMVN technique are evaluated on the development dataset by varying the frame length and LPC order. The work presented in [57,58] has shown that frame lengths beyond 20 milliseconds are useful for speaker recognition performance. This work evaluates different frame lengths that may be useful for the detection of a replay attack. Figure 8 shows the % EER computed for APGDF with the CMVN technique by varying the LPC order with a frame length of 20 msec; Figure 9, Figure 10 and Figure 11 show the corresponding results for frame lengths of 30, 40, and 50 msec, respectively.
From Figure 8, Figure 9, Figure 10 and Figure 11, it is observed that an EER of 7.49% is achieved on the development set for a frame length of 40 milliseconds with LPC order 40. Hence, a frame length of 40 milliseconds and LPC order 40 are considered for the evaluation set.

4.10. Glottal Excitation Based APGDF Features with CMVN

IA-IF based APGDF features with the CMVN technique are evaluated on the evaluation set, with experiments carried out for frame lengths of 40 and 50 milliseconds while varying the LPC order. Figure 12 shows the % EER computed for GAPGDF with the CMVN technique by varying the LPC order with a frame length of 40 msec, and Figure 13 shows the corresponding results for a frame length of 50 msec.
From Figure 12 and Figure 13, it is observed that an EER of 10.5% is achieved on the evaluation set for a frame length of 50 milliseconds with LPC order 80. Hence, a frame length of 50 milliseconds and LPC order 80 are considered for the evaluation set.
The proposed IA-IF APGDF (GAPGDF) features with the CMVN technique are integrated with the IA-IF CQCC (GCQCC) features with the CMVN technique. The best cases are considered while integrating the features, as reported in Table 9.
The APGDF features are best on the development set, while the GAPGDF features show better performance on the evaluation set. Integrated APGDF and GCQCC features result in an EER of 5.48% on the development set, and an EER of 8.80% is obtained for integrated GAPGDF and GCQCC features on the evaluation set. These results show that GAPGDF features are effective for replay attack detection.

4.11. Computation Time

This section describes the computation time required to execute systems based on the score-level integration of different speech features. From the results presented in the previous sections, it is evident that different speech features integrated at the score level improve the performance of the ASV system against replay attacks. A particular system represents one algorithm: features are extracted in the training phase from genuine and spoof signals, followed by GMM modelling; in the testing phase, features extracted from the development or evaluation data are scored against the GMM models trained in the training phase. The scores of multiple such systems, each based on different features, are computed, and every system requires a certain amount of time for complete execution. The time taken by each algorithm implementing a particular system is reported: for example, system 1 may be the score computed using CQCC features, system 2 the score computed using MFCC features, and so on. Table 10 shows the computation time required to compute the scores of the fused systems. Additionally, the serial feature fusion approach presented in [10] is included for comparison. Computation time is measured using the tic and toc functions available in MATLAB.
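For readers reproducing the timing in Python rather than MATLAB, an equivalent wall-clock pattern (illustrative only; test_utterance_features is a hypothetical placeholder) is:

```python
import time

# test_utterance_features: placeholder list of (frames x dims) arrays.
start = time.perf_counter()            # analogous to MATLAB's tic
scores = [llr_score(f) for f in test_utterance_features]
elapsed = time.perf_counter() - start  # analogous to MATLAB's toc
print(f"Scoring took {elapsed:.2f} s for {len(scores)} utterances")
```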
From Table 10, it is evident that systems based on the serial integration of speech features require more computation time. As the feature vector size increases, the GMM requires more time for training in serial integration, whereas score-level integration requires less execution time because each system is trained separately on low-dimensional feature vectors and only the resulting scores are combined. However, more time is required as more systems are trained to compute scores. As shown in Table 10, the proposed system based on the integrated GAPGDF and GCQCC features requires less time than the other approaches.

4.12. Computation Complexity

An algorithm's computational complexity, i.e., the number of steps it takes to complete, is approximated in Table 11 based on the size of the input.
Table 11 shows the computational complexity approximated for the proposed approach, i.e., the integrated GAPGDF and GCQCC features. The computational complexity increases when more features are integrated. In terms of computational complexity, computation time, and achieved results, the proposed integrated features perform comparably better than integrating many features.

4.13. Performance Comparison between the Proposed Algorithm and Recent Approaches

Table 12 shows the performance comparison between the proposed algorithm and recent approaches on the development and evaluation sets of ASVspoof 2017 version 2.
From Table 12, an EER of 3.08% on the development set and 9.86% on the evaluation set is achieved for the integration of cepstral features, LPC based features, and the APGDF, RFCC, SCMC, SCFC, and IMFCC features; however, this integration requires more computation time. The proposed system based on the integration of APGDF and GCQCC features results in an EER of 5.48% on the development set and 12.44% on the evaluation set. Better results are achieved by the proposed system based on the integration of GAPGDF and GCQCC features, with an EER of 8.80% on the evaluation set. The proposed GAPGDF features are thus prominent in terms of both results and computation time.

5. Conclusions

This work evaluated the performance of integrated features based on the cepstral domain, LPC domain, and subband spectral centroid at the score level. The trials were conducted with the goal of lowering EER, which indicates a better defense against replay attacks. The results achieved for high frequency based CQCC features (HF-CQCC), considering a frequency band of 6 kHz to 8 kHz, are promising as compared to baseline CQCC and other cepstral domain features. Using these integrated features based on HF-CQCC, cepstral domain, LPC domain, and subband spectral centroid, on the development set, an EER of 3.08% and on the evaluation set, an EER of 9.86% is achieved. Also, glottal excitation based CQCC (GCQCC) features resulted in better performance as compared to baseline CQCC and HF-CQCC. This work proposed all pole group delay function-based features extracted using glottal excitation (GAPGDF). With an EER of 7.49%, APGDF features have demonstrated superior performance on the development set, and on the evaluation set, GAPGDF features performed better, with an EER of 10.5% as compared to baseline methods, GCQCC, and HF-CQCC features. EER of 5.48% was obtained on the development set when APGDF features were integrated with GCQCC, and 8.80% was obtained on the evaluation set when GAPGDF features were integrated with GCQCC. This shows that APGDF features are promising for improvement in the performance of the ASV system against replay attack.
Although the integration of multiple features based on the cepstral domain, LPC domain, and subband spectral centroid resulted in better performance on the development set, it requires more computation time than the integrated GAPGDF and GCQCC features, especially on the evaluation set, whose size is larger than that of the development set. It is concluded that multiple features can be integrated at the score level for improved performance, but such integration increases computation time. Therefore, the extracted features should be efficient and effective so that the required computation time is as low as possible while performance improves. The integrated GAPGDF and GCQCC features are computationally effective considering the results achieved on the evaluation set and the computation time. An advanced version of the algorithm may include experimentation on a real-time database; future research will concentrate on integrating glottal excitation based APGDF features with high-frequency glottal excitation based CQCC features on a real-time database.

Author Contributions

Conceptualization, A.C., D.S. and V.B.; methodology, A.C., D.S. and V.B.; software, A.C.; validation, D.S., V.B. and A.N.; formal analysis, V.B. and A.N.; investigation, A.C.; resources, A.C.; data curation, A.C., D.S., V.B. and A.N.; writing—original draft preparation, A.C.; writing—review and editing, A.C., D.S. and V.B.; visualization, A.C. and V.B.; supervision, D.S., V.B. and A.N.; project administration, D.S. and V.B.; funding acquisition, A.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the College of Computing at Prince of Songkla University, Thailand.

Data Availability Statement

The database and baseline system are available from the organizers of the ASVspoof challenge at https://www.asvspoof.org/ (accessed on 8 February 2021). The toolboxes used for the code are mentioned in the paper.

Acknowledgments

The authors express their gratitude to everyone who helped with and supported this study, whether directly or indirectly.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Kinnunen, T.; Evans, N.; Yamagishi, J.; Lee, K.A.; Todisco, M.; Delgado, H. ASVspoof 2017: Automatic Speaker Verification Spoofing and Countermeasures Challenge Evaluation Plan. 2018. Available online: http://www.asvspoof.org/index2017.html (accessed on 8 February 2021).
  2. Delgado, H.; Todisco, M.; Sahidullah, M.; Evans, N.; Kinnunen, T.; Lee, K.A.; Yamagishi, J. ASVspoof 2017 Version 2.0: Meta-data analysis and baseline enhancements. In Proceedings of the Speaker and Language Recognition Workshop (Odyssey 2018), Les Sables d’Olonne, France, 26–29 June 2018; pp. 296–303. [Google Scholar] [CrossRef]
  3. Font, R.; Espín, J.M.; Cano, M.J. Experimental analysis of features for replay attack detection-Results on the ASVspoof 2017 Challenge. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, International Speech Communication Association, Stockholm, Sweden, 20–24 August 2017; pp. 7–11. [Google Scholar] [CrossRef]
  4. Li, D.; Wang, L.; Dang, J.; Liu, M.; Oo, Z.; Nakagawa, S.; Guan, H.; Li, X. Multiple Phase Information Combination for Replay Attacks Detection. In Proceedings of the Interspeech, Hyderabad, India, 2–6 September 2018; pp. 656–660. [Google Scholar] [CrossRef]
  5. Gunendradasan, T.; Wickramasinghe, B.; Le, P.N.; Ambikairajah, E.; Epps, J. Detection of replay-spoofing attacks using frequency modulation features. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, International Speech Communication Association, Hyderabad, India, 2–6 September 2018; pp. 636–640. [Google Scholar] [CrossRef]
  6. Wickramasinghe, B.; Irtza, S.; Ambikairajah, E.; Epps, J. Frequency domain linear prediction features for replay spoofing attack detection. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, International Speech Communication Association, Hyderabad, India, 2–6 September 2018; pp. 661–665. [Google Scholar] [CrossRef]
  7. Kamble, M.R.; Patil, H.A. Novel Amplitude Weighted Frequency Modulation Features for Replay Spoof Detection. In Proceedings of the 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), Taipei, Taiwan, 26–29 November 2018; pp. 185–189. [Google Scholar] [CrossRef]
  8. Tapkir, P.A.; Kamble, M.R.; Patil, H.A.; Madhavi, M. Replay Spoof Detection using Power Function Based Features. In Proceedings of the 2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Honolulu, HI, USA, 12–15 November 2018; pp. 1019–1023. [Google Scholar] [CrossRef]
  9. Tan, C.B.; Hijazi, M.H.A.; Khamis, N.; Nohuddin, P.N.E.B.; Zainol, Z.; Coenen, F.; Gani, A. A survey on presentation attack detection for automatic speaker verification systems: State-of-the-art, taxonomy, issues and future direction. Multimed. Tools Appl. 2021, 80, 32725–32762. [Google Scholar] [CrossRef]
  10. Chaudhari, A.A.; Shedge, D.K.; Bairagi, V.K. Integration of Timbrel, Cepstral Domain and Linear Prediction-Based Features for Replay Attack Detection. SSRG Int. J. Electr. Electron. Eng. 2023, 10, 108–125. [Google Scholar] [CrossRef]
  11. Phapatanaburi, K.; Wang, L.; Nakagawa, S.; Iwahashi, M. Replay Attack Detection Using Linear Prediction Analysis-Based Relative Phase Features. IEEE Access 2019, 7, 183614–183625. [Google Scholar] [CrossRef]
  12. Singh, M.; Pati, D. Usefulness of linear prediction residual for replay attack detection. AEU-Int. J. Electron. Commun. 2019, 110, 152837. [Google Scholar] [CrossRef]
  13. Singh, M.; Pati, D. Combining evidences from Hilbert envelope and residual phase for detecting replay attacks. Int. J. Speech Technol. 2019, 22, 313–326. [Google Scholar] [CrossRef]
  14. Kamble, M.R.; Tak, H.; Patil, H.A. Amplitude and Frequency Modulation-based features for detection of replay Spoof Speech. Speech Commun. 2020, 125, 114–127. [Google Scholar] [CrossRef]
  15. Jelil, S.; Das, R.K.; Prasanna, S.R.M.; Sinha, R. Spoof detection using source, instantaneous frequency and cepstral features. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, International Speech Communication Association, Stockholm, Sweden, 20–24 August 2017; pp. 22–26. [Google Scholar] [CrossRef]
  16. Gupta, P.; Chodingala, P.K.; Patil, H.A. Replay spoof detection using energy separation based instantaneous frequency estimation from quadrature and in-phase components. Comput. Speech Lang. 2023, 77, 101423. [Google Scholar] [CrossRef]
  17. Liu, M.; Wang, L.; Dang, J.; Lee, K.A.; Nakagawa, S. Replay attack detection using variable-frequency resolution phase and magnitude features. Comput. Speech Lang. 2021, 66, 101161. [Google Scholar] [CrossRef]
  18. Dutta, K.; Singh, M.; Pati, D. Detection of replay signals using excitation source and shifted CQCC features. Int. J. Speech Technol. 2021, 24, 497–507. [Google Scholar] [CrossRef]
  19. Kamble, M.R.; Patil, H.A. Detection of replay spoof speech using teager energy feature cues. Comput. Speech Lang. 2021, 65, 101140. [Google Scholar] [CrossRef]
  20. Balamurali, B.T.; Lin, K.E.; Lui, S.; Chen, J.M.; Herremans, D. Toward robust audio spoofing detection: A detailed comparison of traditional and learned features. IEEE Access 2019, 7, 84229–84241. [Google Scholar] [CrossRef]
  21. Oo, Z.; Wang, L.; Phapatanaburi, K.; Liu, M.; Nakagawa, S.; Iwahashi, M.; Dang, J. Replay attack detection with auditory filter-based relative phase features. EURASIP J. Audio Speech Music Process 2019, 2019, 8. [Google Scholar] [CrossRef]
  22. Liu, L.; Yang, J. Study on Feature Complementarity of Statistics, Energy, and Principal Information for Spoofing Detection. IEEE Access 2020, 8, 141170–141181. [Google Scholar] [CrossRef]
  23. Patil, A.T.; Acharya, R.; Patil, H.A.; Guido, R.C. Improving the potential of Enhanced Teager Energy Cepstral Coefficients (ETECC) for replay attack detection. Comput. Speech Lang. 2022, 72, 101281. [Google Scholar] [CrossRef]
  24. Bharath, K.P.; Kumar, M.R. New replay attack detection using iterative adaptive inverse filtering and high frequency band. Expert Syst. Appl. 2022, 195, 116597. [Google Scholar] [CrossRef]
  25. Witkowski, M.; Kacprzak, S.; Zelasko, P.; Kowalczyk, K.; Gałka, J. Audio replay attack detection using high-frequency features. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, International Speech Communication Association, Stockholm, Sweden, 20–24 August 2017; pp. 27–31. [Google Scholar] [CrossRef]
  26. Garg, S.; Bhilare, S.; Kanhangad, V. Subband Analysis for Performance Improvement of Replay Attack Detection in Speaker Verification Systems. In Proceedings of the 2019 IEEE 5th International Conference on Identity, Security, and Behavior Analysis (ISBA), Hyderabad, India, 22–24 January 2019; pp. 1–7. [Google Scholar] [CrossRef]
  27. Jelil, S.; Sinha, R.; Prasanna, S.R.M. Spectro-Temporally Compressed Source Features for Replay Attack Detection. IEEE Signal Process Lett. 2024, 31, 721–725. [Google Scholar] [CrossRef]
  28. Sahidullah, M.; Kinnunen, T.; Hanilçi, C. A Comparison of Features for Synthetic Speech Detection. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, International Speech Communication Association, Dresden, Germany, 6–10 September 2015; pp. 2087–2091. [Google Scholar] [CrossRef]
  29. Plumpe, M.D.; Quatieri, T.F.; Reynolds, D.A. Modeling of the glottal flow derivative waveform with application to speaker identification. IEEE Trans. Speech Audio Process. 1999, 7, 569–586. [Google Scholar] [CrossRef]
  30. Drugman, T.; Thomas, M.; Gudnason, J.; Naylor, P.; Dutoit, T. Detection of glottal closure instants from speech signals: A quantitative review. IEEE Trans. Audio Speech Lang. Process. 2012, 20, 994–1006. [Google Scholar] [CrossRef]
  31. Naylor, P.A.; Kounoudes, A.; Gudnason, J.; Brookes, M. Estimation of glottal closure instants in voiced speech using the DYPSA algorithm. IEEE Trans. Speech Audio Process. 2007, 15, 34–43. [Google Scholar] [CrossRef]
  32. Ananthapadmanabha, T.V.; Yegnanarayana, B. Epoch extraction from linear prediction residual for identification of closed glottis interval. IEEE Trans. Acoust. Speech Signal Process 1979, ASSP-27, 309–319. [Google Scholar] [CrossRef]
  33. Drugman, T.; Dutoit, T. Glottal closure and opening instant detection from speech signals. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, International Speech Communication Association, Brighton, UK, 6–10 September 2009; pp. 2891–2894. [Google Scholar] [CrossRef]
  34. Thomas, M.R.P.; Gudnason, J.; Naylor, P.A. Estimation of Glottal Closing and Opening Instants in Voiced Speech Using the YAGA Algorithm. IEEE Trans. Audio Speech Lang. Process. 2012, 20, 82–91. [Google Scholar] [CrossRef]
  35. Murty, K.S.R.; Yegnanarayana, B. Epoch extraction from speech signals. IEEE Trans. Audio Speech Lang. Process. 2008, 16, 1602–1613. [Google Scholar] [CrossRef]
  36. Prathosh, A.P.; Ananthapadmanabha, T.V.; Ramakrishnan, A.G. Epoch extraction based on integrated linear prediction residual using plosion index. IEEE Trans. Audio Speech Lang. Process. 2013, 21, 2471–2480. [Google Scholar] [CrossRef]
  37. Wu, Z.; Evans, N.; Kinnunen, T.; Yamagishi, J.; Alegre, F.; Li, H. Spoofing and countermeasures for speaker verification: A survey. Speech Commun. 2015, 66, 130–153. [Google Scholar] [CrossRef]
  38. Alku, P. Glottal wave analysis with Pitch Synchronous Iterative Adaptive Inverse Filtering. Speech Commun. 1992, 11, 109–118. [Google Scholar] [CrossRef]
  39. Ye, Y.; Lao, L.; Yan, D.; Lin, L. Detection of Replay Attack Based on Normalized Constant Q Cepstral Feature. In Proceedings of the 2019 IEEE 4th International Conference on Cloud Computing and Big Data Analysis (ICCCBDA), Chengdu, China, 12–15 April 2019; pp. 407–411. [Google Scholar] [CrossRef]
  40. Delgado, H.; Todisco, M.; Sahidullah, M.; Sarkar, A.K.; Evans, N.; Kinnunen, T.; Tan, Z.-H. Further optimisations of constant Q cepstral processing for integrated utterance and text-dependent speaker verification. In Proceedings of the 2016 IEEE Spoken Language Technology Workshop (SLT), San Diego, CA, USA, 13–16 December 2016; pp. 179–185. [Google Scholar] [CrossRef]
  41. Prasad, N.V.; Umesh, S. Improved cepstral mean and variance normalization using Bayesian framework. In Proceedings of the 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, Olomouc, Czech Republic, 8–12 December 2013; pp. 156–161. [Google Scholar] [CrossRef]
  42. Rajan, P.; Kinnunen, T.; Hanilçi, C.; Pohjalainen, J.; Alku, P. Using Group Delay Functions from All-Pole Models for Speaker Recognition. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Lyon, France, 25–29 August 2013; pp. 2489–2493. [Google Scholar]
  43. Makhoul, J. Linear prediction: A tutorial review. Proc. IEEE 1975, 63, 561–580. [Google Scholar] [CrossRef]
  44. Yegnanarayana, B. Formant extraction from linear-prediction phase spectra. J. Acoust. Soc. Am. 1978, 63, 1638–1640. [Google Scholar] [CrossRef]
  45. Murthy, H.A.; Gadde, V. The modified group delay function and its application to phoneme recognition. In Proceedings of the 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP’03), Hong Kong, China, 6–10 April 2003; pp. 1–68. [Google Scholar] [CrossRef]
  46. Vedaldi, A.; Fulkerson, B. VLFeat—An Open and Portable Library of Computer Vision Algorithms. Available online: https://www.vlfeat.org/index.html (accessed on 14 February 2021).
  47. Brümmer, N.; De Villiers, E. The BOSARIS Toolkit: Theory, Algorithms and Code for Surviving the New DCF. arXiv 2013, arXiv:1304.2865. [Google Scholar]
  48. Yamagishi, J.; Todisco, M.; Sahidullah, M.; Delgado, H.; Wang, X.; Evans, N.; Kinnunen, T.; Lee, K.A.; Vestman, V.; Nautsch, A. ASVspoof 2019: Automatic speaker verification spoofing and countermeasures challenge evaluation plan. 2019. Available online: https://www.asvspoof.org/index2019.html (accessed on 16 March 2022).
  49. Brookes, M. VOICEBOX: Speech Processing Toolbox for MATLAB. Software. Available online: http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html (accessed on 14 February 2021).
  50. Sardar, V.M.; Shirbahadurkar, S.D. Timbre features for speaker identification of whispering speech: Selection of optimal audio descriptors. Int. J. Comput. Appl. 2021, 43, 1047–1053. [Google Scholar] [CrossRef]
  51. Lartillot, O.; Toiviainen, P.; Eerola, T. A Matlab Toolbox for Music Information Retrieval. In Data Analysis, Machine Learning and Applications; Preisach, C., Burkhardt, H., Schmidt-Thieme, L., Decker, R., Eds.; Springer: Berlin/Heidelberg, Germany, 2008; pp. 261–268. [Google Scholar]
  52. Chen, K. Auto Speech Tech Project2. Available online: https://github.com/azraelkuan/asvspoof2017 (accessed on 21 January 2024).
  53. Chettri, B.; Benetos, E.; Sturm, B.L.T. Dataset Artefacts in Anti-Spoofing Systems: A Case Study on the ASVspoof 2017 Benchmark. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 3018–3028. [Google Scholar] [CrossRef]
  54. Chettri, B. TASLP-Study-on-Dataset-Artefact. Available online: https://github.com/BhusanChettri/TASLP-study-on-dataset-artefact (accessed on 18 February 2023).
  55. Sadjadi, S.O.; Slaney, M.; Heck, L. MSR Identity Toolbox v1.0: A MATLAB Toolbox for Speaker Recognition Research. 2013. Available online: https://www.microsoft.com/en-us/download/details.aspx?id=52279 (accessed on 21 March 2024).
  56. Degottex, G.; Kane, J.; Drugman, T.; Raitio, T.; Scherer, S. COVAREP—A collaborative voice analysis repository for speech technologies. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014. [Google Scholar]
  57. González, D.C.; Luan Ling, L.; Violaro, F. Analysis of the Multifractal Nature of Speech Signals. In Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, CIARP 2012; Alvarez, L., Mejail, M., Gomez, L., Jacobo, J., Eds.; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2012; Volume 7441. [Google Scholar] [CrossRef]
  58. Zhao, H.; He, S. Analysis of speech signals’ characteristics based on MF-DFA with moving overlapping windows. Phys. A Stat. Mech. Its Appl. 2016, 442, 343–349. [Google Scholar] [CrossRef]
Figure 1. Temporal (a) and Spectral (b) representation of IAIF based Genuine and Spoof signal respectively.
Figure 2. Spectrogram of genuine (a) and spoof (b) speech signal after processing with IAIF method.
Figure 3. Group delay spectrum of genuine (a) and spoof signal (b).
Figure 4. APGDF feature extraction process.
Figure 5. Block diagram of training and testing phase for the proposed work.
Figure 6. Performance of APGDF, IMFCC, RFCC, SCFC and SCMC Features.
Figure 7. % EER by varying number of cepstral coefficients for GCQCC with CMVN technique.
Figure 8. % EER computed for APGDF technique by varying LPC order with frame length 20 msec.
Figure 9. % EER computed for APGDF technique by varying LPC order with frame length 30 msec.
Figure 10. % EER computed for APGDF technique by varying LPC order with frame length 40 msec.
Figure 11. % EER computed for APGDF technique by varying LPC order with frame length 50 msec.
Figure 12. % EER computed for GAPGDF technique by varying LPC order with frame length 40 msec.
Figure 13. % EER computed for GAPGDF technique by varying LPC order with frame length 50 msec.
Table 1. ASVspoof 2017 dataset version 2 details.

| Subset | Number of Speakers | Genuine Utterances | Spoofed Utterances |
|---|---|---|---|
| Training | 10 | 1507 | 1507 |
| Development | 8 | 760 | 950 |
| Evaluation | 24 | 1298 | 12,008 |
Table 2. Performance evaluated in [10] for baseline CQCC, LFCC and other features.

| Sr. No. | Method | DevEER (%) | EvalEER (%) |
|---|---|---|---|
| 1 | Baseline CQCC | 12.42 | 24.17 |
| 2 | MFCC + delta + Double delta | 16.78 | 23.52 |
| 3 | LPC | 7.18 | 33.93 |
| 4 | LPCC | 16.18 | 31.69 |
| 5 | LFCC | 7.55 | 30.11 |
Table 3. Parameters for respective features.

| Sr. No. | Features | Frame Length | Overlap | Window | Bins per Octave | Uniform Samples in First Octave | Number of Coefficients |
|---|---|---|---|---|---|---|---|
| 1 | CQCC [1] | - | - | Hamming | 96 | 16 | 90 including Δ + ΔΔ |
| 2 | LFCC [48] | 320 samples | 160 samples | Hamming | - | - | 90 including Δ + ΔΔ |
| 3 | MFCC | 256 samples | 128 samples | Hamming | - | - | 90 including Δ + ΔΔ |
| 4 | LPC | 512 samples | 50 samples | Hamming | - | - | 90 |
| 5 | LPCC | 512 samples | 50 samples | Hamming | - | - | 90 |
Table 4. Results of experiments carried out by integrating various cepstral domain, LPC, zero-crossing and timbre feature sets at score level.

| Sr. No. | Method | DevEER (%) | EvalEER (%) |
|---|---|---|---|
| 1 | LPC + MFCC | 6.93 | 23.1 |
| 2 | LPC + CQCC | 7.41 | 22.37 |
| 3 | LPC + MFCC + CQCC | 7.14 | 20.18 |
| 4 | LPC + Timbre | 34.61 | 47.92 |
| 5 | LPC + CQCC + Timbre | 10.38 | 31.31 |
| 6 | LPC + MFCC + Timbre | 11.23 | 33.12 |
| 7 | LPCC | 16.18 | 31.69 |
| 8 | LPCC + MFCC | 16.38 | 23.35 |
| 9 | LPCC + CQCC | 11.25 | 23.9 |
| 10 | LPCC + MFCC + CQCC | 10.94 | 20.46 |
| 11 | LPCC + Timbre | 46.9 | 48.62 |
| 12 | LPCC + CQCC + Timbre | 15.28 | 33.4 |
| 13 | LPCC + MFCC + Timbre | 21.32 | 35.52 |
| 14 | LFCC + CQCC | 7.4 | 24.11 |
| 15 | LFCC + MFCC | 7.11 | 23.98 |
| 16 | LFCC + MFCC + CQCC | 6.86 | 20.35 |
| 17 | LFCC + MFCC + CQCC + LPCC | 6.85 | 20.09 |
| 18 | LFCC + LPC | 6.15 | 29.96 |
| 19 | LFCC + LPCC | 7.25 | 30.37 |
| 20 | LFCC + Timbre | 47.1 | 48.62 |
| 21 | LFCC + CQCC + Timbre | 12.7 | 33.8 |
| 22 | LFCC + MFCC + Timbre | 16.72 | 35.9 |
| 23 | LFCC + CQCC + MFCC + Timbre | 48.48 | 32.09 |
| 24 | LFCC + LPC + Timbre | 8.68 | 32.08 |
| 25 | LFCC + CQCC + LPC + Timbre | 45.94 | 29.37 |
| 26 | LFCC + MFCC + LPC + Timbre | 46.73 | 30.76 |
| 27 | LFCC + CQCC + MFCC + LPC + Timbre | 44.51 | 47.7 |
| 28 | LFCC + LPCC + Timbre | 14.71 | 34.33 |
| 29 | LFCC + CQCC + LPCC + Timbre | 48 | 30.35 |
| 30 | LFCC + MFCC + LPCC + Timbre | 48.45 | 32.36 |
| 31 | LFCC + CQCC + MFCC + LPCC + Timbre | 47.21 | 48.07 |
| 32 | LFCC + MFCC + CQCC + LPCC + LPC | 5.44 | 19.63 |
| 33 | LFCC + MFCC + CQCC + LPC | 5.81 | 19.72 |
| 34 | ZCR + CQCC | 40.66 | 23.94 |
| 35 | ZCR + MFCC + Delta + Double Delta | 41.58 | 25.95 |
| 36 | ZCR + MFCC + CQCC | 42.93 | 28.51 |
| 37 | ZCR + LFCC | 40.9 | 25.47 |
| 38 | ZCR + LFCC + CQCC | 41.01 | 28.41 |
| 39 | ZCR + MFCC + CQCC + LFCC | 39.83 | 49.93 |
| 40 | ZCR + LPC | 38.9 | 25.39 |
| 41 | ZCR + LPCC | 40.96 | 24.46 |
| 42 | ZCR + MFCC + CQCC + LFCC + LPCC + LPC | 24.39 | 26.88 |
| 43 | ZCR + MFCC + CQCC + LFCC + LPC | 38.36 | 18.33 |
| 44 | ZCR + MFCC + CQCC + LFCC + LPCC | 39.44 | 20 |
| 45 | CQCC + MFCC + delta + Double delta | 11.29 | 21.83 |
| 46 | CQCC + Timbre | 46.45 | 48.6 |
| 47 | CQCC + MFCC + Timbre | 16.96 | 35.06 |
| 48 | CQCC + MFCC + LPC + Timbre | 46.36 | 29.8 |
| 49 | CQCC + MFCC + LPCC + Timbre | 48.31 | 31.52 |
Table 5. Parameters used for high frequency based CQCC features.

| Parameter | Statistics |
|---|---|
| Number of bins per octave | 1024 |
| Sampling frequency in Hz | 16,000 |
| Highest frequency, fmax in Hz | 8000 |
| Lowest frequency, fmin in Hz | 6000 |
| Number of uniform samples in the first octave | 1024 |
| CQCC dimensions | 29 coefficients including 0th coefficient + Δ + ΔΔ |
Table 6. Results of the experiments carried out by integrating high frequency based CQCC features with cepstral and LPC based features.

| Sr. No. | Method | DevEER (%) | EvalEER (%) |
|---|---|---|---|
| 1 | Baseline CQCC | 10 | 17.8 |
| 2 | CQCC + MFCC + delta + Double delta | 8.15 | 16.17 |
| 3 | LPC + CQCC | 5.11 | 18.08 |
| 4 | LPC + MFCC + CQCC | 5.32 | 14.68 |
| 5 | LPCC + CQCC | 8.81 | 17.77 |
| 6 | LPCC + MFCC + CQCC | 8.24 | 16.06 |
| 7 | LFCC + CQCC | 5.79 | 17.09 |
| 8 | LFCC + MFCC + CQCC | 5.66 | 16.24 |
| 9 | LFCC + MFCC + CQCC + LPCC | 5.63 | 15.4 |
| 10 | ZCR + CQCC | 40.06 | 18.37 |
| 11 | ZCR + MFCC + CQCC | 41.45 | 28 |
| 12 | ZCR + LFCC + CQCC | 39.48 | 27.89 |
| 13 | ZCR + MFCC + CQCC + LFCC | 39.37 | 49.73 |
| 14 | CQCC + Timbre | 40.9 | 47.77 |
| 15 | CQCC + MFCC + Timbre | 14.12 | 31.72 |
| 16 | CQCC + MFCC + LPC + Timbre | 45.14 | 27.47 |
| 17 | LPC + CQCC + Timbre | 6.94 | 28.52 |
| 18 | LPCC + CQCC + Timbre | 13.4 | 29.98 |
| 19 | CQCC + MFCC + LPCC + Timbre | 47.68 | 28.35 |
| 20 | LFCC + CQCC + Timbre | 9.01 | 30.16 |
| 21 | LFCC + CQCC + MFCC + Timbre | 47.86 | 28.56 |
| 22 | LFCC + CQCC + LPC + Timbre | 44.52 | 26.91 |
| 23 | LFCC + CQCC + MFCC + LPC + Timbre | 43.06 | 47.16 |
| 24 | LFCC + CQCC + LPCC + Timbre | 47.06 | 27.25 |
| 25 | LFCC + CQCC + MFCC + LPCC + Timbre | 46.2 | 47.81 |
| 26 | LFCC + MFCC + CQCC + LPCC + LPC | 4.81 | 13.45 |
| 27 | ZCR + MFCC + CQCC + LFCC + LPCC + LPC | 25.21 | 26.33 |
| 28 | LFCC + MFCC + CQCC + LPC | 5.1 | 19.72 |
| 29 | ZCR + MFCC + CQCC + LFCC + LPC | 37.67 | 18.33 |
| 30 | ZCR + MFCC + CQCC + LFCC + LPCC | 38.73 | 20 |
Table 7. Features with their parameters.

| Sr. No. | Method | Frame Length (ms) | Overlap (ms) | FFT | Window | Number of Filters/LPC Order for APGDF | Frequency Band (Hz) |
|---|---|---|---|---|---|---|---|
| 1 | APGDF [28] | 20 | 10 | 512 | Hamming | 20 | - |
| 2 | IMFCC [3] | 20 | 10 | 512 | Hamming | 40 | 200–8000 |
| 3 | RFCC [3] | 20 | 10 | 512 | Hamming | 20 | 200–8000 |
| 4 | SCFC [3] | 20 | 10 | 512 | Hamming | 20 | 100–8000 |
| 5 | SCMC [3] | 20 | 10 | 512 | Hamming | 40 | 100–8000 |
Table 8. Performance evaluation by integrating high frequency based CQCC features with cepstral, LPC based, spectral sub-band centroid, and APGDF features.

| Sr. No. | Method | DevEER (%) | EvalEER (%) |
|---|---|---|---|
| 1 | CQCC + MFCC + LFCC + LPCC + LPC + APGDF + RFCC + SCMC + IMFCC + SCFC | 3.08 | 9.86 |
Table 9. Score level integration of GAPGDF and GCQCC features and APGDF and GCQCC features.

| Sr. No. | Method | DevEER (%) | EvalEER (%) |
|---|---|---|---|
| 1 | GCQCC | 10.29 | 12.69 |
| 2 | APGDF | 7.49 | 25.80 |
| 3 | GAPGDF | 38.99 | 10.5 |
| 4 | APGDF + GCQCC | 5.48 | 12.44 |
| 5 | GAPGDF + GCQCC | 9.63 | 8.80 |
Table 10. Computation time required for the score computation considering fused systems.

| Sr. No. | Fused System | Computation Time (s), Development Set | Computation Time (s), Evaluation Set |
|---|---|---|---|
| 1 | ZCR + LPC [10] | 3165.41 | - |
| 2 | ZCR + MFCC + CQCC + LFCC + LPCC [10] | - | 87,198.52 |
| 3 | LFCC + MFCC + CQCC + LPCC + LPC | 4973.38 | 12,775.91 |
| 4 | APGDF + RFCC + SCMC + IMFCC + SCFC | 1914.82 | 3234.98 |
| 5 | LFCC + MFCC + CQCC + LPCC + LPC + APGDF + RFCC + SCMC + IMFCC + SCFC | 6888.2 | 16,010.89 |
| 6 | APGDF | 237.96 | 448.84 |
| 7 | GAPGDF | 317.68 | 677.51 |
| 8 | GCQCC | 848.42 | 2097.61 |
| 9 | GAPGDF + GCQCC | 1166.1 | 2775.12 |
| 10 | APGDF + GCQCC | 1086.38 | 2546.45 |
Table 11. Approximated computational complexity for the techniques.

| Sr. No. | Method | Dominated by | Approximated Computational Complexity |
|---|---|---|---|
| 1 | IA-IF [18,24,55] | Filtering | O(N × m), where N is the length of the input signal and m is the length of the filter coefficients |
| 2 | CQCC [2] | CQT | O(N × logN), where N is the length of the signal |
| 3 | CMVN [55] | Number of frames and coefficients | O(N × m), where N is the number of frames and m is the number of coefficients |
| 4 | APGDF [28,53,54] | LPC; LP group delay using forward difference | O(N × p²), where N is the number of frames and p is the order of the LPC model; O(NFFT × LPC order × M), where NFFT is the number of FFT points and M is the number of speech frames |
| 5 | IA-IF based CQCC with CMVN (GCQCC) | Filtering and CQT | O(N × m) + O(N × logN) + O(N × m) |
| 6 | IA-IF based APGDF with CMVN (GAPGDF) | Filtering, LPC, and LP group delay using forward difference | O(N × m) + O(NFFT × LPC order × M) + O(N × p²) + O(N × m) |
| 7 | GAPGDF + GCQCC | Filtering, CQT, LPC, and LP group delay using forward difference | O(N × m) + O(NFFT × LPC order × M) + O(N × p²) + O(N × m) + O(N × m) + O(N × logN) + O(N × m) |
Table 12. Performance comparison between the proposed algorithm and recent techniques.

| Sr. No. | Authors | Features | DevEER (%) | EvalEER (%) |
|---|---|---|---|---|
| 1 | Jelil et al., 2024 [27] | 2D-ILRCC | 11.90 | 10.87 |
| 2 | Gupta et al., 2023 [16] | Score level fusion of CQCC, CFCC, CFCCIF, CFCCIF-ESA, CFCCIF-QESA | 9.21 | 11.24 |
| 3 | Kamble and Patil, 2021 [19] | CQCC + LFCC + MFCC + TECC | 6.68 | 10.45 |
| 4 | Kamble et al., 2020 [14] | CQCC + ESA-IACC | 7.03 | 12.11 |
|  |  | CQCC + ESA-IFCC | 7.74 | 10.12 |
| 5 | Balamurali et al., 2019 [20] | MFCCs, CQCCs, SCMCs, CCCs, LPCCs, IMFCCs, RFCCs, LFCCs, SCFCs, spectrogram, autoencoder reconstructed features | - | 10.8 |
| 6 | Proposed approach | LFCC + MFCC + CQCC + LPCC + LPC + APGDF + RFCC + SCMC + IMFCC + SCFC | 3.08 | 9.86 |
| 7 | Proposed approach | APGDF + GCQCC | 5.48 | 12.44 |
| 8 | Proposed approach | GAPGDF + GCQCC | 9.63 | 8.80 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
