1. Introduction
Deep learning has made significant advancements in various fields, including computer vision, natural language processing, speech recognition, and recommendation systems. The widespread application of deep neural networks is evident in tasks such as image classification, object detection, and text generation [1]. The complex structure and learning capabilities of deep neural networks enable them to outperform traditional methods in accuracy and performance. With the increase in computing power and the availability of architectures that support parallel processing, training deep networks with numerous parameters on massive datasets has become feasible within reasonable timeframes. Consequently, the predictive accuracy of machine learning models has improved remarkably.
Deep learning has demonstrated success in speech recognition, particularly in end-to-end deep learning models [2]. Examples of these accomplishments include Baidu’s Deep Speech, which has displayed favorable results in speech recognition tasks. Using deep learning models, Google’s WaveNet can generate high-quality, natural-sounding speech, while Tacotron can directly convert text into natural speech. Additionally, deep learning technology has enabled the development of intelligent voice assistants such as Apple’s Siri and Amazon’s Alexa. Microsoft has developed a range of deep learning-based speech recognition technologies, such as the Cortana voice assistant and Azure voice service, which offer highly accurate and efficient speech recognition capabilities.
However, in 2013, Szegedy et al. revealed the vulnerability of neural networks, demonstrating their high sensitivity to small perturbations, and proposed the concept of adversarial samples. Using slightly perturbed images as input, they successfully deceived image recognition models, resulting in misclassifications [3]. Adversarial samples are generated by applying small perturbations to existing data samples; these perturbations are undetectable to humans but can cause neural network models to misclassify [4]. Models trained under normal conditions possess generalizability. To defend against adversarial attacks, many studies have incorporated adversarial samples into the training of neural networks to enhance model robustness. However, this approach often leads to a loss in generalizability [5].
Subsequent research has revealed that the threat of adversarial attacks is not restricted to image recognition but also extends to other domains, such as speech recognition. Speech adversarial samples are now extensively employed to safeguard against personal data breaches in speech recognition systems [6] and to enhance the security of call equipment and voice assistants [7]. Generating adversarial samples in the speech domain is more challenging than in computer vision. Speech recognition systems must contend with temporal variation in audio, and speech files are typically sampled at rates on the order of 10,000 or more data points per second; compared with image recognition, speech recognition therefore processes a significantly larger volume of data [8,9]. Moreover, sampled speech data require decoding after being output by the neural network [10,11]. Existing research on adversarial examples primarily concentrates on image recognition, with limited investigation into adversarial examples in speech.
In 2018, Moustafa Alzantot demonstrated an adversarial attack on a speech classification model [12].
Figure 1 depicts the principle behind speech adversarial samples: an attacker introduces imperceptible noise to the speech, preserving its acoustic characteristics as perceived by the human ear while causing the speech recognition model to classify it as a different type of speech. These manipulated audio samples represent speech adversarial samples.
This study introduces a novel approach for detecting speech adversarial samples, enabling neural networks to defend against adversarial attacks generated by black-box methods. By conducting adversarial detection before speech data enter the recognition model, identified speech adversarial samples are rejected rather than misclassified.
2. Related Work
In recent years, there have been numerous advancements in speech recognition technologies that are built upon deep learning models. Graves [13] employed connectionist temporal classification (CTC) to develop a state-of-the-art end-to-end speech recognition model that directly maps input acoustic feature sequences to output word sequences. Building upon this work, Baidu successfully commercialized speech recognition models for both English and Chinese through extensive data training [14]. Kubanek et al. proposed a novel approach that utilizes three independent convolutional layers: traditional temporal convolution and two different frequency convolutions. This technique enables the creation of sound patterns in the form of RGB images and proposes a method for segmenting continuous speech into syllables [15].
The generation of speech adversarial samples primarily involves two methods: gradient-based methods and black-box optimization-based methods [16,17]. Current research on adversarial samples is rooted in the study of neural network robustness. This research began with the work of Biggio and Szegedy et al., who explored adversarial samples for deep neural networks. They utilized gradient descent and L-BFGS to implement optimization-based attack strategies, which resulted in the generation of adversarial image samples. This breakthrough paved the way for generating adversarial samples in the machine learning domain.
In 2014, Goodfellow et al. [18] discovered that the linear characteristics of neural networks render them vulnerable to adversarial perturbations. They proposed the fast gradient sign method (FGSM), the first approach for generating adversarial samples for adversarial training, effectively enhancing network robustness. Building upon these findings, Alexey Kurakin [19] and others advanced the FGSM to the iterative gradient sign method (IGSM) through a more sophisticated iterative optimization strategy. Additionally, in 2017, Aleksander Madry et al. proposed the projected gradient descent (PGD) method [20], which generates adversarial samples by repeatedly applying gradient descent steps during training. As a result, the generated adversarial samples exhibited improved performance and convincingly deceived the neural network model.
To address the issue that the defense system might analyze the output class of nontarget adversarial examples to determine the original class, Hyun Kwon et al. [21] proposed a method for generating nontarget adversarial examples. In the field of speech, Dan Iter and colleagues utilized adversarial samples generated by the fast gradient sign method and the fooling gradient sign method [22] to successfully deceive the automatic speech recognition model WaveNet. Notably, the generated adversarial audio exhibited imperceptible differences to the human ear. Furthermore, they proposed a method to convert adversarial mel-scale frequency cepstral coefficient (MFCC) features back into audio. This demonstrated that adversarial sample generation methods developed for image recognition are also effective in the domain of speech recognition.
Carlini and Wagner [23] further validated the existence of adversarial samples in the speech domain by applying a white-box iterative optimization-based attack algorithm to the end-to-end implementation of Mozilla’s DeepSpeech. This experiment provided evidence that adversarial samples within the speech field can be utilized for targeted attacks.
Adversarial training [24] is a technique to bolster the robustness of speech recognition models against adversarial disturbances, which is achieved by incorporating adversarial examples during the training process. However, in 2018, Tsipras et al. [25] found that while adversarial training increased model robustness, it concurrently reduced the model’s recognition accuracy.
Constructing an adversarial sample detector is a defensive method that screens for adversarial samples before they enter the recognition models. Samizade [17] employed a convolutional neural network (CNN) to detect minute perturbations in speech adversarial samples, while Li et al. [26] introduced a detection network akin to a VGG network structure, utilizing convolutional operations to capture subtle discrepancies between adversarial and genuine samples. Nonetheless, both approaches are limited to defending against specifically targeted white-box attack models.
The study of learning the manifold where data points are located is an area of research that has attracted significant attention. Manifold learning refers to a technique of nonlinear dimensionality reduction that aims to comprehend the inherent structure of high-dimensional data and map it to a lower-dimensional space, thereby facilitating better visualization, understanding, and analysis of the data. The primary objective of manifold learning is to reduce the dimensionality of data while preserving the local characteristics inherent to the data. Consequently, it is often employed for dimensionality reduction and feature extraction of high-dimensional datasets.
In 2000, Roweis and Saul proposed the LLE method [27] for manifold learning, which reconstructs the local structure of data based on local linear relationships and maps it to a lower-dimensional space while preserving these relationships. Shortly thereafter, Tenenbaum et al. proposed the ISOMAP method [28], which maintains the geodesic distance between data points by constructing a graph based on nearest-neighbor relationships and utilizes the geometric structure of the graph for dimensionality reduction. In 2008, van der Maaten and Hinton suggested the t-distributed stochastic neighbor embedding (t-SNE) method [29], which constructs a probability distribution on pairs of high-dimensional data points in a manner that assigns a higher probability to similar objects. This method is often employed to preserve the similarity between data points in high-dimensional space while mapping the data to a lower-dimensional space, particularly for visualizing high-dimensional datasets.
In 2018, Leland McInnes proposed uniform manifold approximation and projection (UMAP), one of the most advanced manifold learning methods [30]. UMAP is a practical and scalable algorithm that builds upon the theoretical foundations of Riemannian geometry and algebraic topology and is capable of processing real-world data. In terms of visualization quality, the UMAP algorithm is a strong competitor to t-SNE and can preserve more global structure. Additionally, UMAP does not impose any computational restrictions on embedding dimensions, making it a versatile dimensionality reduction technique for machine learning applications.
Figure 2 illustrates the visual results of dimensionality reduction using manifold learning methods on the MNIST dataset. Tanay et al. [31] discovered that various types of data exhibit remarkable similarity in high-dimensional space, yet neural networks are capable of correctly classifying them. They proposed a boundary-tilted view, suggesting that adversarial samples tend to reside in close proximity to the classification boundary of the training data manifold. With regard to the manifold hypothesis of adversarial examples [32,33], it is assumed that adversarial examples deviate from the low-dimensional data manifold.
3. Audio Adversarial Sample Detection Method Based on Manifold Learning
Recent research has demonstrated that adversarial examples are located near the classification boundary of the training data manifold or deviate from the manifold. Building on this finding, this study proposes a speech adversarial sample detection method based on manifold learning.
In this section, we first discuss the manifold dimensionality reduction technique employed in this article. We will then introduce the adversarial sample detection method that relies on the results obtained from manifold learning dimensionality reduction. Finally, we present a speech adversarial sample detection method grounded in manifold learning.
3.1. Low-Dimensional Manifold Embedding of Speech Data
We utilize two leading manifold learning techniques, namely, t-SNE (t-distributed stochastic neighbor embedding) and UMAP (uniform manifold approximation and projection), to perform low-dimensional embedding on the speech dataset and compare the results.
3.1.1. t-SNE
t-SNE adopts a probability-based approach to measure the similarity between high-dimensional data points to preserve these similarities in low-dimensional space. By employing a specific probability distribution (t-distribution), t-SNE effectively handles outliers in high-dimensional data and generates improved clustering effects in low-dimensional space.
The calculation of similarity between data points in high-dimensional space involves computing a probability distribution for each data point, which determines its similarity to other data points. This distribution can be interpreted as a “neighbor relationship”.
Similarly, t-SNE assigns each data point a new position in the low-dimensional space and computes a probability distribution for each data point there, which measures the similarity between data points in the reduced space.
t-SNE leverages the Kullback–Leibler divergence (KL divergence) optimization technique to minimize the disparity between probability distributions in high- and low-dimensional spaces. By employing algorithms such as gradient descent, t-SNE aims to minimize this disparity and ensure that the distribution of data points in the low-dimensional space preserves as many of the similar relationships present in the high-dimensional space as possible.
The “t-distribution” used in t-SNE is a specific probability distribution function that effectively retains the local structure in low-dimensional space while placing more emphasis on distant data points. This approach greatly enhances the representation of the data structure in the reduced space.
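As an illustration of the procedure described above, the following sketch reduces a toy high-dimensional dataset to two dimensions with scikit-learn's t-SNE implementation. The synthetic dataset and the parameter values are assumptions for illustration only, not the configuration used in this study.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Two synthetic "classes" of 64-dimensional points (stand-ins for speech features)
X = np.vstack([rng.normal(0.0, 1.0, (50, 64)),
               rng.normal(5.0, 1.0, (50, 64))])

# Perplexity controls the effective neighborhood size used when building the
# high-dimensional similarity distribution; KL divergence is minimized by
# gradient descent internally.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
X_2d = tsne.fit_transform(X)
print(X_2d.shape)  # (100, 2)
```

In practice, the two synthetic clusters remain well separated in the 2-D map, reflecting t-SNE's tendency to preserve local neighbor relationships.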
3.1.2. UMAP
UMAP identifies the nearest neighbors of each data point using the nearest neighbor descent algorithm and then constructs a graph by connecting these nearest neighbors. UMAP operates under the assumption that data points are uniformly distributed on the manifold, causing the spacing between points to stretch or compress based on local density. Consequently, the distance metric is not uniform across space but instead varies across regions. To control the dimensionality reduction process, UMAP employs the n_neighbors hyperparameter, which specifies the number of neighbors to consider.
During graph construction, it is essential to avoid numerous disconnected points that may hinder the learning of the desired manifold structure. To address this concern, UMAP utilizes the local_connectivity parameter (default value of 1). Setting local_connectivity to 1 ensures that each data point in the high-dimensional space is connected to at least one other point. The strength of the connections between data points in the graph is represented by edge weights (w). Because UMAP adopts a locally varying distance metric, the edge weights computed from the perspectives of the two endpoints may disagree: the weight from point A to point B may differ from the weight in the opposite direction. UMAP resolves this issue by taking the union of both directed edges, resulting in a connected neighborhood graph.
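The union of the two directed edges can be sketched numerically. UMAP's symmetrization is the fuzzy set union w_ab + w_ba − w_ab·w_ba; the toy weight matrix below is an illustrative assumption, not data from this study.

```python
import numpy as np

# Toy directed weight matrix: W[i, j] is the edge weight from point i to point j.
W = np.array([[0.0, 0.9, 0.0],
              [0.2, 0.0, 0.7],
              [0.0, 0.0, 0.0]])

# Fuzzy set union of the two directed edges: symmetric by construction.
W_sym = W + W.T - W * W.T
print(W_sym[0, 1])  # 0.9 + 0.2 - 0.9*0.2 ≈ 0.92, same in both directions
```

Note that a strong edge in either direction survives the union, which is what keeps the neighborhood graph connected.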
UMAP calculates the distance between data points on the manifold using the standard Euclidean distance relative to the global coordinate system. The conversion from variable distance to standard distance also influences the distance between a data point and its nearest neighbor. Consequently, UMAP introduces a hyperparameter called min_dist (with a default value of 0.1) to define the minimum distance between the embedded points.
Upon specifying the minimum distance, UMAP proceeds to identify a superior low-dimensional representation of the manifold by minimizing the following cost function, also known as the cross-entropy (CE):

CE = \sum_{e \in E} \left[ w_h(e) \log \frac{w_h(e)}{w_l(e)} + \left(1 - w_h(e)\right) \log \frac{1 - w_h(e)}{1 - w_l(e)} \right]

where $e$ represents the edge connecting each pair of nearest neighbors, $w_h(e)$ represents the known edge weight from the high-dimensional manifold approximation, and $w_l(e)$ represents the edge weight to be discovered for the low-dimensional representation.

Whenever the high-dimensional weight $w_h(e)$ is larger, the first term $w_h(e) \log \left( w_h(e) / w_l(e) \right)$ acts as the “attraction”: it is minimized when $w_l(e)$ is as large as possible, which occurs when the distance between the points is as small as possible.

When the high-dimensional weight $w_h(e)$ is small, the second term acts as a “repulsive force”: it is minimized by making $w_l(e)$ as small as possible, i.e., by pushing the points apart.
Ultimately, the interplay between these two “forces” brings the low-dimensional representation closer to an accurate representation of the overall topology of the original data.
The optimal weights of edges in a low-dimensional representation are sought as the ultimate goal. These weights are obtained by minimizing the cross-entropy function mentioned earlier. Finally, UMAP calculates the coordinates of each data point in the designated low-dimensional space.
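The attraction/repulsion behaviour of the cross-entropy described above can be checked numerically. The helper function and the specific weight values below are illustrative assumptions, not part of UMAP's implementation.

```python
import numpy as np

def edge_ce(w_h, w_l, eps=1e-12):
    """Per-edge cross-entropy between a high-dimensional weight w_h and a
    candidate low-dimensional weight w_l (eps guards against log(0))."""
    return (w_h * np.log((w_h + eps) / (w_l + eps))
            + (1 - w_h) * np.log((1 - w_h + eps) / (1 - w_l + eps)))

# Strong high-dimensional edge: cost drops as w_l grows (attraction).
print(edge_ce(0.9, 0.1) > edge_ce(0.9, 0.8))  # True
# Weak high-dimensional edge: cost drops as w_l shrinks (repulsion).
print(edge_ce(0.1, 0.8) > edge_ce(0.1, 0.1))  # True
```

The two comparisons mirror the prose: matching the low-dimensional weight to the high-dimensional one always lowers the cost, pulling connected points together and pushing unconnected ones apart.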
3.1.3. Comparison
While t-SNE is limited to embedding dimensions of two and three, UMAP imposes no such restriction and can preserve both local and global structure.
The speech dataset and the target sample are projected onto a low-dimensional manifold via manifold learning-based dimensionality reduction, and the sample to be detected is then embedded on this low-dimensional manifold. By examining the geometric relationship between the sample to be detected and the speech dataset on the manifold, it is possible to determine whether it is a speech adversarial sample.
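The embedding step for the sample to be detected can be sketched as follows. Since t-SNE offers no out-of-sample transform, this sketch stands in a simple distance-weighted nearest-neighbor placement for the "embed the test sample on the learned manifold" step; it is an illustration under stated assumptions, not the authors' exact procedure, and the random feature data are placeholders for speech features.

```python
import numpy as np

def embed_new_point(x, X_high, X_low, k=5):
    """Place x in the low-dimensional map as the inverse-distance-weighted
    average of the embeddings of its k nearest high-dimensional neighbors."""
    d = np.linalg.norm(X_high - x, axis=1)
    idx = np.argsort(d)[:k]
    w = 1.0 / (d[idx] + 1e-12)          # closer neighbors get more weight
    return (X_low[idx] * w[:, None]).sum(axis=0) / w.sum()

rng = np.random.default_rng(0)
X_high = rng.normal(size=(200, 40))     # stand-in for high-dimensional speech features
X_low = X_high[:, :2]                   # stand-in for a learned 2-D embedding
x_new = X_high[0] + 0.01                # sample to be detected, near a known point
z = embed_new_point(x_new, X_high, X_low)
print(z.shape)  # (2,)
```

Once `z` is obtained, the geometric tests of the next subsection operate entirely in the low-dimensional space.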
3.2. Detection Method for Speech Adversarial Samples
Suppose there are $n$ classes of original audio data. After dimensionality reduction, the centroid of each class is calculated, and the point $p_i$ nearest to the centroid is selected as the center point of class $i$. The minimum distance $D$ between the center points of the different classes is then found. The first indicator is set to determine the geometric relationship between the sample to be detected and the original audio samples:

d_{min} > \alpha D

In this formula, $\alpha$ is set to 0.9 for this experiment. The distance between the audio sample to be detected and the center point of each class of original audio data is calculated in the reduced low-dimensional space, and the minimum of these distances is $d_{min}$. Research suggests that adversarial samples are located near the classification boundary of the training data manifold or deviate from the manifold; speech adversarial samples should therefore be relatively far from the centroids of all classes of original audio data. If $d_{min} > \alpha D$, the sample to be detected is far from the centroid of every class of original audio data and is suspected to be a speech adversarial sample.
However, because the manifold structure of speech data is unknown and irregular, many normal audio samples are also situated far from the centroids of the original audio classes. Therefore, another decision criterion is set: a maximum neighbor search range $r$ (set to 0.5 in this experiment). First, the neighbors of every center point within this range are counted, and the minimum of these neighbor counts, $N_{min}$, is selected.
Next, when $d_{min} > \alpha D$ is met, we search for other data points within the range $r$ of the sample to be detected. If either of the following two conditions holds:

(1) The number of data points within the search range is much smaller than $N_{min}$.
(2) There are multiple categories of data points within the search range, with no single category accounting for more than 60% of the total.

Then it can be inferred that the speech sample to be detected is an adversarial audio sample.
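The two-stage decision rule described above can be sketched as follows. The symbol names (`alpha`, `r`, `D`, `d_min`, `N_min`), the reading that the neighborhood check is applied to samples flagged by the first indicator, and the factor 0.5 used to operationalize "much smaller than N_min" are assumptions of this sketch; only alpha = 0.9, r = 0.5, and the 60% purity threshold come from the text.

```python
import numpy as np

def is_adversarial(z, centers, labels, embedded, alpha=0.9, r=0.5):
    """z: embedded sample to be detected; centers: per-class center points in
    the low-dimensional space; labels/embedded: class labels and embeddings
    of the clean reference dataset."""
    # First indicator: distance to the nearest class center, compared with
    # alpha * D, where D is the minimum distance between class centers.
    d_min = min(np.linalg.norm(z - c) for c in centers)
    D = min(np.linalg.norm(a - b)
            for i, a in enumerate(centers) for b in centers[i + 1:])
    if d_min <= alpha * D:
        return False  # close to a class center: treated as a normal sample

    # Second criterion: examine the neighborhood of z within radius r.
    # N_min is the smallest neighbor count among all class centers.
    N_min = min(int((np.linalg.norm(embedded - c, axis=1) <= r).sum())
                for c in centers)
    mask = np.linalg.norm(embedded - z, axis=1) <= r
    n = int(mask.sum())
    if n == 0 or n < 0.5 * N_min:   # sparse neighborhood ("much smaller")
        return True
    _, counts = np.unique(labels[mask], return_counts=True)
    if counts.max() / n <= 0.6:     # no single class exceeds 60% of neighbors
        return True
    return False

# Illustrative usage on two tight synthetic clusters in the 2-D embedding.
rng = np.random.default_rng(1)
A = rng.normal([0.0, 0.0], 0.1, size=(50, 2))
B = rng.normal([5.0, 0.0], 0.1, size=(50, 2))
embedded = np.vstack([A, B])
labels = np.array([0] * 50 + [1] * 50)
centers = [A.mean(axis=0), B.mean(axis=0)]
print(is_adversarial(np.array([0.0, 0.0]), centers, labels, embedded))  # False
print(is_adversarial(np.array([2.5, 5.0]), centers, labels, embedded))  # True
```

A point inside a cluster passes the first indicator and is accepted, while a point far from both centers lands in an empty neighborhood and is flagged as adversarial.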
5. Conclusions
This study proposes a novel method for detecting speech adversarial samples by analyzing the geometric relationship between the sample to be detected and the original audio sample on a low-dimensional manifold. The experiment focuses on speech adversarial samples generated through black-box attack methods, as these adversaries do not require an understanding of the internal structure of the target network and can achieve model attacks simply by inputting speech adversarial samples into the target network. Furthermore, the commonly used method of training neural networks with adversarial samples to improve model robustness is not effective against adversarial samples generated through black-box attack methods.
Through the use of manifold learning, important features can be extracted, and data representation can be simplified, leading to a significant reduction in data dimensionality and the amount of computation required for the processing, storage, and analysis of data. This advantage allows manifold learning to train models more efficiently when dealing with large-scale datasets or high-dimensional data.
Additionally, this paper demonstrates through experiments that manifold learning can provide intuitive and easily understandable visualization results for high-dimensional speech data. By visualizing high-dimensional speech features in a more comprehensible space, the inherent structure and characteristics of speech data can be observed more clearly.
Recent research has shown that training neural networks using adversarial examples can effectively enhance model robustness. However, this approach also reduces model accuracy due to the incorporation of adversarial examples as inputs. The method proposed in this article allows for the detection of speech data adversarial examples prior to their input into the machine learning model, thus avoiding the negative impact of training on adversarial samples.