5.1. Dataset
The speech data used in this work were obtained from the public VoxCeleb1 audio dataset. The VoxCeleb1 [25] dataset contains 153,516 utterances collected from 1251 speakers. The utterances are extracted from videos of celebrities uploaded to YouTube. The dataset is reasonably gender-balanced, with 55% male speakers, and the speakers span a wide range of ethnicities, accents, professions and ages. The identification and verification splits of the VoxCeleb1 dataset are shown in Table 4.
Although the VoxCeleb1 dataset is not strictly noise-free, we treated it as the clean dataset and generated noisy versions of VoxCeleb1 by adding real-world noises (Babble, Street and Restaurant noise) to the clean utterances. The noise recordings used in this study were obtained from [26].
For speaker identification, the original development and test splits of the VoxCeleb1 dataset are first merged. The merged dataset is then divided into training, validation and test splits with a ratio of 80%, 10% and 10%, respectively. During training for speaker identification, a randomly selected noise is added to each utterance of the training and validation splits at a random SNR level between −5 dB and 20 dB. The speaker identification performance is evaluated by adding a randomly selected noise to each utterance of the test split at SNR levels of −5 dB, 0 dB, 5 dB, 10 dB, 15 dB and 20 dB.
For speaker verification, the original development and test splits of the VoxCeleb1 dataset are used without modification. The training split consists of 148,642 utterances from 1211 speakers, and the test split consists of 4874 utterances from 40 speakers, which produces a total of 37,720 trials. During training for speaker verification, a noisy version of each clean utterance is generated by adding a randomly selected noise at a random SNR level between −5 dB and 20 dB. The speaker verification performance is evaluated on the noise-added verification test split, where a randomly selected noise is added to each test utterance at SNR levels of −5 dB, 0 dB, 5 dB, 10 dB, 15 dB and 20 dB.
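As an illustration of the noise-mixing step described above, the following sketch scales a noise recording so that the clean-to-noise power ratio matches a target SNR before adding it to the clean utterance. The function name and the use of NumPy are our own illustrative choices, not the exact code used in the experiments.

```python
import numpy as np

def add_noise_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise power ratio equals `snr_db` (in dB),
    then add it to `clean`. Both inputs are 1-D float arrays at the same rate."""
    # Tile or trim the noise to the length of the clean utterance.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]

    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Noise power required to reach the target SNR.
    target_noise_power = clean_power / (10 ** (snr_db / 10))
    noise = noise * np.sqrt(target_noise_power / (noise_power + 1e-12))
    return clean + noise

# Training/validation: random noise type, random SNR drawn from [-5, 20] dB.
# Test: fixed SNR levels of -5, 0, 5, 10, 15 and 20 dB.
# snr = np.random.uniform(-5, 20)
# noisy = add_noise_at_snr(clean_utterance, babble_noise, snr)
```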
5.2. Implementation Details and Training
In this study, the Cochleogram and Mel Spectrogram generated from each utterance are used as input to the CNN architectures in our experiments. For Cochleogram and Mel Spectrogram generation, a 30 ms Hamming window with an overlap of 15 ms is used, together with 128 filters and a 2048-point FFT. A Cochleogram and a Mel Spectrogram of size 1088 × 288 (frequency × time) are thus generated for each utterance and used as input to each of the models. Since the aim of this study is to analyze the robustness of the Cochleogram and Mel Spectrogram features themselves, no audio preprocessing such as voice activity detection or silence removal is applied, and no normalization or data augmentation is applied to the Cochleogram or Mel Spectrogram during model training.
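The Mel Spectrogram described above can be generated with standard tools; a minimal sketch using librosa with the stated window, hop, filter and FFT settings is given below. The Cochleogram computation (a gammatone, ERB-scale filterbank in place of the Mel filterbank) is only indicated in a comment, since the exact implementation is not reproduced here.

```python
import librosa

SR = 16000                      # VoxCeleb1 audio is sampled at 16 kHz
WIN = int(0.030 * SR)           # 30 ms Hamming window
HOP = int(0.015 * SR)           # 15 ms overlap (hop size)
N_FFT = 2048
N_FILTERS = 128

def mel_spectrogram(path):
    """Log-compressed Mel Spectrogram of shape (filters, frames)."""
    y, _ = librosa.load(path, sr=SR)
    mel = librosa.feature.melspectrogram(
        y=y, sr=SR, n_fft=N_FFT, win_length=WIN, hop_length=HOP,
        window="hamming", n_mels=N_FILTERS)
    return librosa.power_to_db(mel)

# For the Cochleogram, a gammatone (ERB-scale) filterbank replaces the
# Mel filterbank applied to the same framed signal; the resulting
# log-energy map has the same (filters, frames) layout.
```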
The implementation of this study uses the TensorFlow deep learning framework written in Python and executed on a graphics processing unit (GPU); our experiments were conducted on an NVIDIA TITAN Xp GPU. The experiments use the CNN architectures described in Table 1, Table 2 and Table 3 and in Figure 3 and Figure 5, namely the basic 2D CNN, VGG-16, ResNet-50, ECAPA-TDNN and TitaNet architectures. To evaluate the performance of the Cochleogram and Mel Spectrogram, separate models are trained for speaker identification and verification on each of the CNN architectures, and separate models are trained for the Cochleogram and Mel Spectrogram features for each task. To evaluate the two features at different SNR levels (−5 dB to 20 dB) and under different noise types (Babble, Street and Restaurant noise), a single model is trained for each of the Cochleogram and Mel Spectrogram features. To evaluate the speaker identification and verification performance of the two features without additive noise, a separate model is trained for each feature. Publicly available Python code was customized into appropriate forms for our experiments.
During both speaker identification and verification, the models are trained for 20 epochs with a mini-batch size of 32, and the training pairs are re-shuffled at the start of each epoch. Each model uses the RMSprop optimizer with a minimum learning rate of 0.0001 and categorical cross-entropy as the loss function, and training is performed with a Softmax output. The weights of the models are initialized randomly at the start of training and updated progressively throughout. The validation split is used for hyper-parameter tuning and early stopping. Speaker identification performance is measured using accuracy on the training, validation and test splits of the dataset, while verification performance is measured using the Equal Error Rate (EER) on the test split.
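A minimal Keras-style sketch of the training configuration described above (RMSprop with learning rate 0.0001, categorical cross-entropy, Softmax output, 20 epochs, mini-batch size 32, early stopping on the validation split) is shown below. The tiny stand-in network and the placeholder data tensors are illustrative only and do not reproduce the actual architectures of Tables 1-3.

```python
import tensorflow as tf

NUM_SPEAKERS = 1251

# Minimal stand-in classifier; in the experiments this is replaced by the
# basic 2D CNN, VGG-16, ResNet-50, ECAPA-TDNN or TitaNet architectures.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(1088, 288, 1)),   # (frequency, time, channel)
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(NUM_SPEAKERS, activation="softmax"),
])

model.compile(
    optimizer=tf.keras.optimizers.RMSprop(learning_rate=1e-4),
    loss="categorical_crossentropy",
    metrics=["accuracy"])

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", restore_best_weights=True)

# x_train / x_val are placeholders for the Cochleogram or Mel Spectrogram
# images, and y_train / y_val for the one-hot speaker labels.
model.fit(x_train, y_train,
          validation_data=(x_val, y_val),   # used for tuning and early stopping
          epochs=20, batch_size=32,
          shuffle=True,                      # pairs re-shuffled every epoch
          callbacks=[early_stop])
```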
5.3. Results
This section presents the results of the noise robustness analysis of the Cochleogram and Mel Spectrogram features in speaker identification and verification. The performance of the Cochleogram and Mel Spectrogram features in speaker identification and verification is presented in Table 5 and Table 6, respectively. Sample speaker identification performance of both features using the VGG-16 architecture is shown graphically in Figure 6, Figure 7 and Figure 8 for the dataset without additive noise, with medium noise ratio and with high noise ratio, respectively. For clarity, the ratio of noise added to the VoxCeleb1 dataset is classified into three categories: low noise ratio (without additive noise), medium noise ratio (10 dB, 15 dB and 20 dB) and high noise ratio (−5 dB, 0 dB and 5 dB). At each SNR level, the performance of both the Cochleogram and Mel Spectrogram features in speaker identification and verification is analyzed and presented in Table 5 and Table 6, respectively. Figure 6 shows that the speaker identification performance of the Cochleogram and Mel Spectrogram is approximately equal on the dataset without additive noise. Figure 7 shows that the Cochleogram features achieve better performance than the Mel Spectrogram features on the datasets with medium noise ratio. Figure 8 shows that the Cochleogram features clearly outperform the Mel Spectrogram features on the datasets with high noise ratio.
Table 5 gives more detail on the identification performance of the Cochleogram and Mel Spectrogram at different noise ratios and with the different deep learning architectures (basic 2D CNN, VGG-16, ResNet-50, ECAPA-TDNN and TitaNet). The results in Table 5 show that the Cochleogram features achieve superior performance to the Mel Spectrogram features on the dataset with high noise ratio. For example, the accuracy of the Cochleogram features using VGG-16 at SNRs of −5 dB, 0 dB and 5 dB is 75.77%, 89.38% and 93.94%, respectively, which is much better than the accuracy of the Mel Spectrogram features at the same SNRs, which is 51.96%, 70.82% and 85.3%. The results in Table 5 also show that the Cochleogram features achieve better performance than the Mel Spectrogram features on the dataset with medium noise ratio. For example, the accuracy of the Cochleogram using VGG-16 at SNRs of 10 dB, 15 dB and 20 dB is 95.96%, 96.79% and 97.32%, respectively, which is better than the accuracy of the Mel Spectrogram at the same SNRs, which is 91.64%, 92.81% and 95.77%, respectively. On the dataset without additive noise, the Cochleogram features achieve accuracy comparable to the Mel Spectrogram features: the accuracy of the Cochleogram and Mel Spectrogram using the VGG-16 network is 98% and 97%, respectively, which is approximately equal. Overall, the Cochleogram features achieve better speaker identification performance than the Mel Spectrogram features on the noisy datasets.
Table 6 presents the speaker verification performance of both the Cochleogram and Mel Spectrogram features at SNR levels from −5 dB to 20 dB using the deep learning architectures described in Table 1, Table 2 and Table 3 and in Figure 3 and Figure 5. The results in Table 6 show that the Cochleogram features achieve superior performance to the Mel Spectrogram features in speaker verification at the high noise ratio (−5 dB, 0 dB and 5 dB). For instance, using the VGG-16 architecture, the Cochleogram features achieve an EER of 15.42%, 12.86% and 9.10% at SNR levels of −5 dB, 0 dB and 5 dB, respectively, which is a lower error rate than the EER of the Mel Spectrogram at the same SNR levels, which is 18.83%, 15.71% and 11.92%. The Cochleogram features also show better performance than the Mel Spectrogram features in speaker verification at the medium noise ratio and without additive noise. For example, the VGG-16 results in Table 6 show that the Cochleogram achieves an EER of 7.95%, 6.61% and 4.55% at SNR levels of 10 dB, 15 dB and 20 dB, respectively, which is a lower error rate than the EER of the Mel Spectrogram at the same SNR levels, which is 9.74%, 8.37% and 5.86%.
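For reference, the EER values reported in Table 6 correspond to the operating point where the false-acceptance and false-rejection rates are equal. A minimal sketch of this computation, assuming scikit-learn's roc_curve and illustrative trial labels and scores rather than our exact evaluation code, is given below.

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """labels: 1 for same-speaker trials, 0 for different-speaker trials.
    scores: similarity scores for each trial (e.g., cosine similarity)."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    # EER is taken where the false-positive and false-negative rates cross.
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2

# Example on the 37,720 VoxCeleb1 verification trials:
# eer = equal_error_rate(trial_labels, trial_scores)
```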
Overall, the results in Table 5 and Table 6 show that the Cochleogram features achieve superior performance in both speaker identification and verification on the noisy data compared to the Mel Spectrogram features. In addition, the Cochleogram features also show better performance with the improved deep learning architectures.
The comparison of the speaker identification and verification performance of the Cochleogram features with existing works is presented in Table 7. The baselines CNN-256-Pair Selection [27], CNN [24], Adaptive VGG-M [28], CNN-LDE [29], ECAPA-TDNN [9] and TitaNet [10] are selected for comparison with the experimental results of our work. The results in Table 7 show that the Cochleogram features achieve better speaker identification and verification performance under noisy conditions. For instance, the identification accuracy of the Cochleogram features using the ResNet-50, VGG-16, ECAPA-TDNN and TitaNet architectures is 97.85%, 98.04%, 97.89% and 98.02%, respectively, which is better than the performance of the baselines CNN [24], Adaptive VGG-M [28] and CNN-LDE [29], with accuracies of 92.10%, 95.31% and 95.70%, respectively. Similarly, the Cochleogram features also achieve better performance in speaker verification compared to the Mel Spectrogram features. For example, the ECAPA-TDNN and TitaNet architectures using Cochleogram features achieve an EER of 0.61% and 0.54%, respectively, which is lower than the EER of the Mel Spectrogram features, which is 0.87% and 0.68%.
Overall, the experimental results of this study and the comparison with existing works show that the Cochleogram features outperform the Mel Spectrogram features in deep learning-based speaker identification and verification at the high noise ratio. The Cochleogram features also achieve better performance than the Mel Spectrogram features at the medium noise ratio and comparable performance at the low noise ratio.