1. Introduction
Adversarial examples are a hot topic in current machine learning research. They consist of slight modifications of a neural network input (barely noticeable by human perception) which cause huge differences in its output. When applied to images, for example, small perturbations in carefully selected pixels can cause a classifier to predict a different class, even when the original and the perturbed image remain practically indistinguishable to human observers [1]. For example, Figure 1 shows an image which, with a small perturbation, is classified as a speedboat rather than a ballpoint.
This phenomenon has strong implications in terms of security, as more and more systems are relying on artificial intelligence. There is the risk that models can exhibit such absurd flaws in real life, with serious implications in their effective utility.
This threat is even more relevant given the domains in which the deep learning methodology is applied. Since it was first proposed in [2], the field has grown steadily. In that seminal work, the main concepts of convolution and pooling layers, their interaction and effects on the results, along with the parameters involved in the training procedure were described and analyzed. This first work was applied to a complex task, i.e., real-world image classification on ImageNet (more than a million images with 1000 different categories). From that moment on, the research and application of continuously evolving, increasingly complex convolutional neural networks has grown rapidly. One of the most interesting fields of application is healthcare. In this respect, huge efforts have been made to advance deep learning applied to medicine. For example, [3] covers this topic, with interesting insights about the main contributions in tomography, resonance image analysis or even cancer assessment. Regarding industry, deep learning has also been applied to quality control of manufacturing processes. For example, [4] covers the application of fruit quality evaluation for food factories. Finally, another interesting field of application is robotics, which is closely related to computer vision. A combination of both could potentially produce automated robots capable of full environment understanding. For example, [5] considers the main advances in 3D data interpretation algorithms. This is interesting for a wide variety of applications such as automated reconstruction of building diagrams or environment-aware robots that perform automated tasks in both indoor and outdoor conditions. Given this widespread application of the methodology in our society, the need for research into threatening phenomena such as adversarial examples is understandable.
The literature available on adversarial examples has grown significantly in the past few years. In parallel with the so-called attack methods (algorithms to generate adversarial examples), several defense methods have been proposed which intend to build more robust networks. This can be achieved, for example, by performing more training stages with adversarial examples [6], or by modifying the weights that the network learns during training [7]. Other methods aim at performing a pre-classification of a potential adversarial input before it is fed into the network [8].
We note that deep convolutional networks can be regarded as chaotic systems, since adversarial examples are tiny changes in the input that produce wildly different outputs. In the context of chaos theory, the Lyapunov exponents are a set of numerical values that measure how chaotic a system is [9]. The largest one is usually known as the maximal Lyapunov exponent (MLE) and it determines a notion of predictability for the dynamical system. In our case, the MLE (the first exponent) and the three subsequent exponents are extracted.
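To make the notion of a maximal Lyapunov exponent concrete, the following minimal sketch estimates it for the logistic map, a textbook one-dimensional system that is not part of this work. The function name and parameter values are illustrative assumptions; the estimate averages the logarithm of the local stretching factor along an orbit.

```python
import math

def logistic_mle(r, x0=0.1, n_transient=1000, n_iter=20000):
    """Estimate the maximal Lyapunov exponent of the map x -> r*x*(1-x).

    Illustrative helper (not from the paper): averages log|f'(x)| along an orbit.
    """
    x = x0
    # Discard the transient so the orbit settles on the attractor.
    for _ in range(n_transient):
        x = r * x * (1.0 - x)
    total = 0.0
    for _ in range(n_iter):
        # |f'(x)| = |r*(1-2x)| is the local stretching factor of a tiny perturbation.
        deriv = abs(r * (1.0 - 2.0 * x))
        total += math.log(max(deriv, 1e-300))  # guard against log(0) at x = 0.5
        x = r * x * (1.0 - x)
    return total / n_iter

chaotic = logistic_mle(4.0)   # chaotic regime: exponent is positive
periodic = logistic_mle(3.2)  # period-2 regime: exponent is negative
print(chaotic, periodic)
```

A positive value indicates that nearby trajectories diverge exponentially (chaos), whereas a negative value indicates that perturbations die out.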
Lyapunov exponents cannot, in general, be calculated analytically, so different estimation algorithms are available. In the algorithm of Eckmann et al. [9], the one used in our work, the input is a flattened version of the original image matrix, i.e., a vector of 784 values (since images are 28 × 28 pixels in size). A Jacobian matrix is then estimated and the Lyapunov exponents are obtained from its eigenvalues.
The spectrum of Lyapunov exponents can be defined as in Equation (1), with the first one being the so-called MLE.
The Lyapunov exponents describe the behavior of vectors (here, the image) in the tangent space of the phase space and are defined from the Jacobian matrix given in Equation (2).
This Jacobian defines the evolution of the tangent vectors, given by the matrix Y, via Equation (3).
The preliminary condition is that, at the initial point x(0), the distance between different vectors in the space is infinitesimal. However, if the system is chaotic, the matrix Y describes how a small change at the point x(0) propagates to the final point x(t). The limit described in Equation (4) defines a matrix Λ. Then, the Lyapunov exponents λ_i are defined by the eigenvalues of Λ.
Given that each Λ_i is an eigenvalue of Λ, the corresponding Lyapunov exponent λ_i is obtained as defined in Equation (5). Usually, a single positive value is indicative enough of the presence of chaos in the system. To ensure all possible positive exponents are calculated, the first four exponents are estimated from any image.
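For reference, the standard continuous-time formulation of these quantities can be sketched as follows. This is the textbook form, assumed here rather than copied from the numbered equations of this paper; the symbols f (the evolution function of the system), x(t) (the trajectory) and I (the identity matrix) are part of that assumption.

```latex
\begin{align}
  J_{ij}(t) &= \left.\frac{\partial f_i(x)}{\partial x_j}\right|_{x(t)}
      && \text{(Jacobian along the trajectory)} \\
  \dot{Y}(t) &= J(t)\,Y(t), \qquad Y(0) = I
      && \text{(evolution of the tangent vectors)} \\
  \Lambda &= \lim_{t\to\infty} \frac{1}{2t}\,
      \log\!\left(Y(t)\,Y^{\mathsf{T}}(t)\right)
      && \text{(limit matrix)}
\end{align}
```

In this form, the exponents follow from the eigenvalues of the limit matrix, ordered so that the first one is the largest.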
Despite the appealing analogy with chaotic systems, few works have leveraged chaos theory in the context of adversarial examples. The seminal work [10] used the Lyapunov exponents extracted from the flattened input image to classify it as a legitimate or an adversarial example. In that work, the authors showed that adversarial examples produce consistently higher values of the first Lyapunov exponents, which allowed for a good classification between legit images and adversarial images.
Chaoticity and entropy are known to be related. For example, [11] states that the relationship between the maximal Lyapunov exponent and the ‘permutation entropy’ has been observed both analytically and experimentally. Lyapunov exponents are also related to the Kolmogorov-Sinai entropy (KSE) and the Shannon entropy through Pesin’s theorem [12]. The theorem states that the KSE is equivalent to the sum of the positive Lyapunov exponents. Moreover, the same author formalizes in [13] the connection between KSE as a measure of randomness and Shannon entropy as a metric for disorder. Following the same line of research, the work in [14] also relates the fields of dynamical systems and information theory. In this case, chaoticity in hidden Markov models can be described with Shannon entropy. Finally, [15] employs both entropy and Lyapunov exponents to study chaos in the architecture of a layered deep neural network.
The connection described above suggests that entropy-changing image transforms (such as a simple histogram equalization) may have an effect on an adversarial classification system that is based on Lyapunov exponents. The experimentation performed in the aforementioned paper [10] tested adversarial classification using a number of well-known adversarial attack methods. However, it did not consider the effect of other kinds of image noise on the robustness of the method, particularly those that may change image entropy. This is precisely the hypothesis that we test in this work. Based on our results, we further propose entropy itself as a useful feature to distinguish adversarial examples from legit images.
In [16], an entropy-based adversarial classification method was proposed. Saliency maps are computed for a three-channel input from the network gradients obtained in a backward pass through the network. These maps are useful to visualize the pixels that contribute the most to the classifier decision. In legit images, they highlight the most discriminant regions of the images, such as edges and important features. However, adversarial saliency maps are less sparse, with more spatially distributed highlighted areas. When entropy is computed on these single-channel maps, values are slightly higher in adversarial images since more information needs to be encoded. However, their experimentation is not conclusive, as only one adversarial method with few parameters is tested, and entropy values do not differ much between both classes. In our experimentation, entropy is calculated over the raw images, so the information of the perturbations is kept with more fidelity and entropy values differ more between legit and adversarial images. Moreover, a comprehensive collection of adversarial attacks and image processing methods is tested to obtain significant results.
Regarding similar adversarial classification methods in the state of the art, there are some interesting proposals to be discussed. In [17], the image spaces of legit and adversarial images are compared, along with the predicted class, to check whether the image has potentially been perturbed. The reported results reach 99% accuracy in detecting adversarials with different levels of perturbation, but only raw perturbations are studied and only the PGD attack method is tested. Using the internal parameters and layer activations to detect adversarial examples, [18] provides an interesting study that achieves a certifiable detection rate under certain restrictions in the norm. Five different common attacks are tested, although the method is able to achieve good results only on the MNIST dataset. In [19], an autoencoder is built for detection, where the reconstruction of the candidate image determines the probability of an adversarial origin. Moreover, the approach is reused to build an adversarially robust network with up to 76% accuracy for the Sparse attack. Similar to [17], the work proposed in [20] studies the image data to classify an image as a potential adversarial. In this case, through feature alignment with an external model, the method is able to detect the adversarial image and fix the classifier decision. However, the employed dataset is a customized mix of PASCAL VOC, ImageNet and images scraped from the internet, which makes comparison difficult. Finally, an approach comparable to ours is given by [21], where statistical analysis is used to predict whether an image behaves as an adversarial or not. Gaussian noise is compared to regular adversarial attacks to check the validity of the method. However, the studied attacks do not include the latest and most powerful proposals in the state of the art.
In order to compare the different features of the aforementioned methods, Table 1 shows a comparison of their approaches, studied datasets, and the tested attacks or methods.
This paper is organized as follows: Section 2 describes the dataset and specific adversarial methods that are used in this work. Section 3 provides a comprehensive description of the experiments that have been carried out, including Lyapunov exponents for adversarial classification and the proposal of entropy for the same purpose, as well as a comparison of both methods. Finally, Section 4 discusses the main results obtained, also providing insight into relevant topics for future work.
2. Materials and Methods
In this work, we employed three datasets to compare the method on a wide range of images and conditions. The first is the well-known MNIST dataset [22], which contains handwritten digits from 10 different classes (from 0 to 9) as 28 × 28 grayscale single-channel images, as shown in Figure 2. The dataset has 60,000 training images and 10,000 test images. It is the same dataset used in the reference paper [10] and one of the most employed datasets in adversarial example research. The reason is that the images are simple enough to study the perturbations in detail.
This work also used the Fashion-MNIST dataset. It was developed by the Zalando company to improve research in image processing [23]. Inspired by the MNIST dataset, it contains 10 classes of popular kinds of clothing articles instead of digits, with 60,000 training and 10,000 test images. The structure is the same, with 28 × 28 single-channel images. A sample of this dataset is shown in Figure 3.
Finally, a third widely used dataset was employed in this work. CIFAR-10 was developed as a labeled subset of the 80 Million Tiny Images dataset [24], to make research possible with less computational power and less time-consuming processes. It contains 32 × 32 color images of common objects, such as birds, ships or airplanes. This dataset contains 50,000 images for training and 10,000 images for testing. Figure 4 shows a sample for the different classes. For the purpose of this work, the grayscale variant of the dataset was employed.
To craft the adversarial examples from these datasets, the target model is a LeNet architecture [25], one of the most common architectures in adversarial research. It was used because it has enough parameters for its adversarial images to also be adversarial for other, more complex networks [26], a property known as adversarial transferability. Furthermore, its input size is sufficient for the datasets employed in this work while keeping the computational complexity of the Lyapunov exponents affordable and bounded.
In order to cover a wide range of adversarial attacks, twelve different attack methods have been tested, from the most common in adversarial research comparisons to the latest contributions on the topic. These have been proposed during the recent years in which adversarial example research has grown significantly.
There are two main approaches in adversarial attacks. The first one is the so-called white box, in which the algorithm has full access to the data and internal parameters of the model under threat. The other approach is called black box. In this case, the algorithm crafts the adversarials estimating the model behaviour with different techniques. For example, by querying the model and observing its response to different perturbations, or by training a similar model to the original to develop the adversarials.
In the publication that serves as the reference for this study [
10], five of the most common attacks are employed, which are included in this work for reproducibility and comparison purposes. In the following, those attacks are described.
One of the most widely used attacks is the Carlini and Wagner (CW) attack, proposed in [27]. It defines a cost function to modify the input of the network, in which the distance to the original sample is minimized at the same time as the perturbations are introduced. As a result, the examples crafted with this method are much closer to the original inputs and, in consequence, they are more difficult to detect visually or through defense techniques such as the defensive distillation of [28]. As one of the first steps in adversarial attacks, the Fast Gradient Sign Method (FGSM) was introduced by [29]. This method was one of the first to show the phenomenon of adversarial examples in deep neural networks. It calculates the gradient of the cost function with respect to the input of the network. Using this information, the method computes a single perturbation of the image in the direction of the sign of that gradient, which increases the cost for the ground-truth class and pushes the prediction towards a different class. As a variant of this attack, the Madry method [30] applies some restrictions to the random initialization of the starting data point. The different values for the initial conditions in the algorithm can lead to more robust adversarials, depending on the network complexity. Finally, the Jacobian-based Saliency Map Attack (JSMA), introduced in [31], is a method that takes advantage of the features that have the most impact on the network decision. By building saliency maps, the method discovers the key pixels that have a significant impact on the decision when perturbed, so it can minimize the number of pixels and the amount of perturbation needed to perform the adversarial attack.
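The single-step idea behind FGSM can be illustrated with a minimal sketch. It uses a toy logistic-regression "network" with hand-picked weights instead of the LeNet model of this work, so that the input gradient has a closed form; all names and values are illustrative assumptions.

```python
import math

def predict_proba(w, b, x):
    """Probability of class 1 for a toy logistic-regression model (assumed model)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def loss_gradient(w, b, x, y):
    """Gradient of the cross-entropy loss w.r.t. the input: (p - y) * w."""
    p = predict_proba(w, b, x)
    return [(p - y) * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(w, b, x, y, eps):
    """One step of size eps along the sign of the input gradient, clipped to [0, 1]."""
    grad = loss_gradient(w, b, x, y)
    return [min(1.0, max(0.0, xi + eps * sign(gi))) for xi, gi in zip(x, grad)]

# Hypothetical 4-"pixel" input with true label 1.
w, b = [2.0, -3.0, 1.5, -1.0], -0.1
x, y = [0.6, 0.2, 0.5, 0.3], 1
x_adv = fgsm(w, b, x, y, eps=0.2)
print(predict_proba(w, b, x), predict_proba(w, b, x_adv))
```

In this toy case, a perturbation bounded by 0.2 per pixel is enough to move the predicted probability of the true class below 0.5.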
In this work, a wider range of attack methods is considered. Variants of the FGSM and Madry methods are tested. They have been developed to make the approach even more robust or adaptive. For example, the Basic Iterative Method (BIM), as described in [32], is a revision that performs the gradient estimation iteratively, so the adversarial example is built using several small steps. For this reason, the crafted perturbations are more effective while being smaller. Projected Gradient Descent (PGD), as described in [30], is a further variant of the Basic Iterative Method [32]. In this attack, after each iteration, the selected perturbation is projected onto an Lp-ball of specified radius (for a selected distance such as L0, L2 or Linf), thus keeping the perturbation small and within the range of the input data. As the most recent step in this family of attacks, the Sparse-L1 Descent (SL1D) [33] computes its perturbations through a projection based on the L1 distance. This has significant advantages in terms of adversarial robustness, since images are less perturbed (and therefore less detectable).
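The iterate-then-project loop shared by BIM and PGD can be sketched as follows for the Linf case. As before, a toy logistic-regression model with hand-picked weights stands in for the real network, and the function names, step size and number of steps are illustrative assumptions. Without a random starting point, as here, the loop corresponds to BIM; PGD additionally starts from a random point inside the ball.

```python
import math

def input_gradient(w, b, x, y):
    """Cross-entropy gradient w.r.t. the input of a toy logistic model: (p - y) * w."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-z))
    return [(p - y) * wi for wi in w]

def project_linf(x_adv, x_orig, eps):
    """Project onto the Linf ball of radius eps around x_orig, then onto [0, 1]."""
    out = []
    for xa, xo in zip(x_adv, x_orig):
        xa = min(xo + eps, max(xo - eps, xa))
        out.append(min(1.0, max(0.0, xa)))
    return out

def iterative_linf_attack(w, b, x, y, eps, step, n_steps):
    """Several small signed-gradient steps, each followed by a projection."""
    x_adv = list(x)
    for _ in range(n_steps):
        g = input_gradient(w, b, x_adv, y)
        x_adv = [xa + step * ((gi > 0) - (gi < 0)) for xa, gi in zip(x_adv, g)]
        x_adv = project_linf(x_adv, x, eps)
    return x_adv

w, b = [2.0, -3.0, 1.5, -1.0], -0.1
x, y = [0.6, 0.2, 0.5, 0.3], 1
x_adv = iterative_linf_attack(w, b, x, y, eps=0.2, step=0.05, n_steps=10)
score_adv = sum(wi * xi for wi, xi in zip(w, x_adv)) + b
print(x_adv, score_adv)
```

The projection guarantees that, regardless of the number of steps, no pixel moves further than eps from its original value.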
Regarding more classic approaches, DeepFool was one of the first attack methods to successfully craft adversarials in large-scale datasets (such as ImageNet), as presented in [34]. It estimates the decision boundary within which the classifier keeps predicting the same class, and then calculates the perturbations necessary to cross this boundary and produce the adversarial example. Coming from a defensive approach, Virtual Adversarial (Virtual), proposed in [35], applies a specific algorithm extracted from adversarial training to produce the adversarials even without the information of the output labels, using only the gradients and parameters. In this respect, it was one of the most successful methods in taking advantage of adversarial training to craft better adversarials. With a different approach, the Elastic-Net Attack (EAD), described in [36], formulates a regularized optimization problem to compute small L1 distances (which can be extended to the L2 domain). In comparison to other methods, it is able to break some popular defenses such as distillation and adversarial training, although the general performance is similar to that of other methods such as C&W.
Finally, regarding the black box paradigm, we first consider the Spatial Transformations method (Spatial) [37]. Without any knowledge about the internal parameters of the model, this method computes slight spatial transformations such as translations and rotations, checking whether the output class changes or not, to iteratively build the adversarial examples. The most recent approach in this paradigm is the HopSkipJump attack [38] (previously called Boundary Attack++). Only by querying the model with slight perturbations, this algorithm is able to estimate the responses of the model to new perturbations and compute them efficiently.
The aforementioned attacks have been applied with the specific parameters that are detailed in Table 2. Most of them are common to several attacks, such as the number of iterations or the maximum perturbation allowed (usually called epsilon). These are the most important parameters to control the quality of the adversarial examples. For this reason, they have been set to the default values proposed by their respective authors in each referenced publication. These configurations are therefore taken to be the most suitable for producing adversarials that are as robust as possible, with reasonable computation time. If the recommended values were not employed, the attacks might produce over-perturbed images, which would lead to sub-optimal adversarials. Moreover, increasing the number of iterations or search steps would considerably increase the processing time, with marginal benefits. For example, axis and angle rotation limits in the Spatial attack have been chosen to stay in a visually acceptable range (over-rotated numbers or objects may look suspicious to humans). Other parameters, such as gamma/xi/tradeoff, common in several attacks, are used as corrective factors of the algorithms and do not have a substantial influence on the final results. Finally, secondary parameters, not mentioned in the detailed table, are used to adapt the adversarials to each specific dataset. Usually they are employed to clip and bound the resulting adversarials to the representation range of the images. These are low-level implementation details and are not relevant here.
On the other hand, image processing methods can produce noise and perturb an image, in some cases even producing misclassifications. In this respect, an optimal adversarial discriminant method should be robust to those, neglecting the effect of those other sources of noise when they do not affect the class decision, and flagging the corresponding adversarials when the model is fooled. In this work, these image processing transforms are performed over both clean (legit) test images and adversarial images. The objective is to check whether the Lyapunov-exponents method for detecting adversarial images is robust to such simple sources of noise. That is, we intend to analyse failure rates when confronted with such simple image transformations. Such a robustness assessment was also done in [10]. In this respect, if the method classifies a (transformed) legit test image as an adversarial, then this legit image will be rejected by the system, which must be considered an error. Similarly, if a (transformed) adversarial image is no longer detected by the method, that is also an error. Another purpose of the image processing is to modify the entropy of both legit test images and adversarial examples. This also allows us to check the robustness of the Lyapunov method, since we know that entropy and Lyapunov exponents are related [11]. That is, we want to detect cases in which entropy-altering transformations (that do not change the class) make the method classify legit images as adversarials and vice versa.
Regarding the collection of image processing methods, the ones employed in this work are explained below. Histogram Equalization (EQ) is usually employed to enhance image contrast. In EQ, the image intensity histogram is transformed to be equally distributed over the intensity range. As a variant of this method, Contrast Limited Adaptive Histogram Equalization (CLAHE) applies a rule in which the maximum value of the histogram for a pixel is clipped in relation to the values of its vicinity. Other methods, such as Gaussian filtering, apply a kernel over the image. The filtering is performed with a normally distributed unidimensional convolution, with a more or less intense blurring effect depending on the parameters. In our experiments, the standard deviation of the kernel distribution (sigma parameter) is 1, and the order of magnitude is 0. With these values, the result is partially diffused without making the object unrecognizable. Finally, there are methods that introduce more random effects. Gaussian noise is additive noise that follows the normal distribution. It can be applied over the whole image, or according to the variance of the vicinity at each pixel (local variance). Another example is Poisson noise (Poisson distributed additive noise), which is only defined for positive integers. To apply this noise type, the number of unique values in the image is found and the next power of two is used to scale up the floating-point result, after which it is scaled back down to the floating-point image range. Other kinds of noise perform simpler operations, such as ‘pepper’ noise (random pixels set to 0), ‘salt’ noise (random pixels set to 1) and ‘salt & pepper’ noise (random pixels set to 0 or 1). Finally, ‘speckle’ noise is produced by a multiplicative factor (n × image). In this case, to preserve the image integrity, the factor n is chosen to have 0 mean and 0.01 variance.
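Three of the transformations described above can be sketched in a few lines of plain Python. This is an illustrative re-implementation on flat pixel lists, not the library code used in the experiments; the function names and the per-pixel noise probability are assumptions.

```python
import random

def equalize(pixels, levels=256):
    """Histogram equalization of a flat list of integer intensities in [0, levels-1]."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    cdf, running = [], 0
    for h in hist:
        running += h
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)
    n = len(pixels)
    if n == cdf_min:  # constant image: nothing to spread out
        return list(pixels)
    # Map each intensity through the normalized cumulative histogram.
    return [round((cdf[p] - cdf_min) / (n - cdf_min) * (levels - 1)) for p in pixels]

def salt_and_pepper(pixels, amount, rng):
    """Set each pixel (float in [0, 1]) to 0 or 1 with probability `amount`."""
    out = list(pixels)
    for i in range(len(out)):
        if rng.random() < amount:
            out[i] = 1.0 if rng.random() < 0.5 else 0.0
    return out

def speckle(pixels, var, rng):
    """Multiplicative noise: out = x + n * x with n ~ N(0, var), clipped to [0, 1]."""
    sd = var ** 0.5
    return [min(1.0, max(0.0, x + rng.gauss(0.0, sd) * x)) for x in pixels]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
eq = equalize([10, 10, 20, 30])
sp = salt_and_pepper([0.2, 0.4, 0.6, 0.8] * 50, amount=0.1, rng=rng)
sk = speckle([0.0, 0.5, 1.0], var=0.01, rng=rng)
print(eq, sk)
```

Note that speckle noise leaves black pixels untouched, since the perturbation is proportional to the pixel value, while salt and pepper noise can alter any pixel.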
Again, the objective is to alter the adversarial examples with these transformations and check if the detector can still distinguish between legit and adversarial images.
A good adversarial classifier should be robust to different sources of noise, such as those described above and shown in Figure 5. Regarding the other datasets, Figure 6 shows the examples for the CW attack in the Fashion-MNIST dataset and Figure 7 shows the same methods for a sample in the CIFAR dataset.
3. Experiments
In order to compare the original methods proposed in the reference publication and the claims of our approach, several experiments have been carried out. First, all the attacks and processing methods described in the previous sections are compared in the task of using the Lyapunov exponents to classify between non-adversarial images (legit test images) and adversarially perturbed images (adversarial examples, obtained by attacking the legit test images). For this purpose, a state-of-the-art LeNet-like architecture (i.e., an end-to-end deep network) is used for all the adversarial-related processes (testing of datasets, adversarial crafting with the different attacks, etc.). In consequence, the method is able to obtain the best generalizable model possible, therefore reducing the impact of overfitting on the results. Then, using the test set of each dataset, a random subset of 100 images is taken to craft the adversarials and apply the processing methods, in each experimental set. This is applied in a 10-fold cross-validation procedure, so the results are shown as the average of the whole set of runs. Finally, in each execution, a classic machine learning method is employed to classify the features (Lyapunov exponents, entropy values, ...) extracted from the images, as a simple linear classification problem with 1, 4 or 5 features (if entropy, Lyapunov exponents, or both are used, respectively) to characterize each sample, and 2 classes: adversarial or legit image. The employed method is a Support Vector Machine (SVM) classifier, since it is one of the most powerful and optimized methods for this kind of task (linearly classifying a small set of features). This decision is in line with other works [10]. Regarding other options, no further benefits have been observed (in either accuracy or computation time) in the initial tests by using different classifiers from the supervised machine learning family, such as linear discriminant, k-nearest neighbors or boosted trees. For the kind of data employed in this work, SVM is powerful enough to obtain the maximum accuracy with the available features. The specific parameters to obtain the best results from the method are the following: a Gaussian kernel function (as the most suitable to fit the data in only two classes), a kernel scale of 0.5, standardization applied to normalize the values to a closer range (which helps the Gaussian function to fit the data) and a box constraint of 1, which controls the penalty on margin violations and thereby the number of support vectors. As a summary of this experimentation, Table 3, Table 4 and Table 5 show, for the three datasets, the classification accuracy. In the first column, we can observe the “base” adversarial images, with no processing method applied (i.e., the raw adversarially perturbed images). The rest of the columns show the accuracy when adversarial images are further processed with the corresponding methods.
In general, all of the image processing operations have a detrimental effect on the adversarial example classification method. We illustrate this in the case of MNIST. Figure 8 shows the first two Lyapunov exponents for some of the processing methods applied to the CW attack in this dataset. As can be observed, the classification is more accurate when no processing is applied, since the test images (blue points) are better separated from the adversarial images (red points). This is mainly because the values of the first exponents are higher (more positive), and shifted to the right on the x-axis. Nevertheless, when image processing transformations are applied, such as equalization (normal or adaptive), adversarial images are no longer so clearly distinguished from test images. For this reason, the classification performance drops from 100% to 57.71% and 84.85%, respectively. Finally, local variance Gaussian noise does not show any impact on the Lyapunov exponents and, in consequence, the classification of adversarial images keeps a perfect score.
In contrast, for the PGD attack (Figure 9) there are no major changes in the distribution of the Lyapunov exponents when image processing methods are applied, which is the reason for the near 100% adversarial classification accuracy for every single method (as observed in the PGD row of Table 3). The most recent black box attack also exhibits good performance, ranging from 99% to 100% in all combinations. Finally, the attacks that reduce the Lyapunov accuracy the most are EAD and JSMA. The powerful Sparse attack is also affected, but with better values than the former attacks. A visualization of the Lyapunov exponents from test and adversarial images is shown in Figure 10. The adversarial images have, for the attacks considered, negative exponent values that are contained within the same region as those of the legit test images.
As observed in Table 3, Table 4 and Table 5, depending on the combination of attack and processing, the results can vary substantially. In consequence, we have conducted an additional investigation to explain the different behaviour depending on the attacks. As a result of this investigation, we have discovered that the L0 metric (L0 represents the total number of pixels that have been modified) shows a strong correlation with the accuracy of the Lyapunov exponents method. Table 6 shows the mean L0 value for the base (unprocessed) adversarial images along with the accuracy obtained by the Lyapunov exponents method.
Over a maximum L0 value of 784 (since images are 28 × 28 pixels in size), we can see that when the method modifies less than half of the pixels, the accuracy is penalized. Thus, it can be inferred that the Lyapunov exponents are very sensitive to the number of pixels that are modified. The attacks FGM, PGD, BIM and Madry belong to the same family of algorithms, which may be the reason why they behave in a similar way, producing a similar kind of adversarials.
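As a minimal sketch of how the L0 value of an adversarial example can be obtained, the following counts the pixels that differ between a flattened original image and its perturbed version. The function name, the tolerance argument and the toy data are illustrative assumptions.

```python
def l0_distance(original, adversarial, tol=0.0):
    """Number of pixels whose value changed by more than `tol`."""
    return sum(1 for a, b in zip(original, adversarial) if abs(a - b) > tol)

# Toy flattened 28 x 28 image (784 pixels) with three modified pixels.
original = [0.0] * 784
adversarial = list(original)
for idx in (10, 200, 783):
    adversarial[idx] = 0.5

l0 = l0_distance(original, adversarial)
fraction = l0 / len(original)  # share of the image that was touched
print(l0, fraction)
```

Dividing by the number of pixels gives the normalized value in the 0–1 range, which makes images of different sizes comparable.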
When we extend the observation of L0 values to the whole set of attacks and image processing methods in the MNIST dataset, the correlation is even clearer. In Figure 11, the 84 data points on the horizontal axis represent the 84 different combinations of attack method and image processing effect. Regarding the vertical axis, note that the L0 data has been normalized between 0 and 1 for visualization purposes. Accuracies are also presented in the 0–1 range, for the same purpose. As observed, both follow the same trend, in which valleys in L0 (fewer perturbed pixels) correspond to drops in accuracy and vice versa.
The same pattern is observed in the three datasets, with a slight overall decrease in accuracy for both the Fashion-MNIST and CIFAR datasets. The complexity of those datasets makes it more difficult to classify the perturbations in the adversarial images. However, some combinations are still robust, such as the Poisson, salt and pepper and speckle transformations for Fashion-MNIST, along with the PGD, BIM and Madry attacks. Regarding the CIFAR dataset, the PGD and particularly the Spatial attacks keep a high performance in adversarial classification.
3.1. Effects of Entropy
Since entropy is known to be related to Lyapunov exponents [11], we propose to study how the aforementioned image processing methods affect the entropy of the images. Different image processing methods affect entropy differently, so we first checked whether the changes in entropy are correlated with the accuracy of classification using Lyapunov exponents.
Entropy is computed from the probability (relative frequency) with which each possible intensity value occurs in the image. Depending on the choice of the logarithmic base (b), the formula described in Equation (6) can be used to compute the Shannon entropy (base 2) [39], natural unit (base e) or Hartley (base 10).

$$H = -\sum_{i=0}^{255} p_i \log_b p_i \qquad (6)$$

In this expression, $p_i$ represents the probability of a pixel having the intensity value i. In this case, i ranges from 0 to 255, as the images are encoded as 8-bit single-channel greyscale. These probabilities can be extracted from the image histogram. As a result, an image with a more compact histogram has a lower Shannon entropy value and vice versa.
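As an illustration of the histogram-based computation, the following sketch estimates the entropy of an 8-bit greyscale image with NumPy. It assumes integer pixel values in the range 0 to 255; the function name `shannon_entropy` and its `base` parameter are choices made for this example and are not taken from the original work.

```python
import numpy as np

def shannon_entropy(image, base=2.0):
    """Entropy of an 8-bit greyscale image, computed from its histogram."""
    pixels = np.asarray(image).ravel().astype(np.int64)
    counts = np.bincount(pixels, minlength=256)   # histogram over 0..255
    probs = counts / counts.sum()                 # relative frequencies
    probs = probs[probs > 0]                      # treat 0 * log(0) as 0
    return float(-np.sum(probs * np.log(probs)) / np.log(base))

# A two-level image has a compact histogram and therefore low entropy.
two_level = np.array([[0, 255], [0, 255]], dtype=np.uint8)
print(shannon_entropy(two_level))
```

Passing `base=np.e` or `base=10` gives the value in natural or Hartley units, respectively.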
We can observe that images have a different “base” entropy value depending on the attack method. When comparing the test images (first row) with a given attack (the remaining rows), it is observed that when the image processing increases entropy, Lyapunov exponents seem to classify the images more accurately. When the entropy of an adversarial attack row remains at the same average level as the test row, it is more difficult to classify these images using the exponents. Conversely, the processing methods whose Lyapunov exponents fall in the same distribution as those of the base images exhibit a marked decrease in their entropy values.
To further show that there is a relationship between Lyapunov accuracy in adversarial prediction and entropy, we build a normalization method to show empirically whether an increment in entropy correlates with better adversarial predictability with Lyapunov exponents. The aim of the normalization process is to show that variations in entropy for the different attacks and processings (in comparison to the base (unaltered) images) are related to the accuracy of the Lyapunov exponents. However, due to the different kinds of attacks and numbers of perturbations, the range of entropy values for each attack is very wide (between 2.75 and 9.3 in FGM, or 2.37 and 6.19 in EAD, for example). To compare them in relative terms, each row of entropy from Table 7 is normalized between 0 and 1. After that, each processing value for a given attack is subtracted from the corresponding base value. Each resulting value thus represents the proportional increment/decrement in entropy for a given processing with respect to the base images. The result is Table 10.
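The normalization just described can be sketched as follows. Two assumptions are made here that the text does not state explicitly: each row of the entropy table corresponds to one attack with the base (unprocessed) value in column 0, and the sign convention is processing minus base, so that a positive value means an entropy increase. The function name and the numbers in the example are illustrative, not values from the tables.

```python
import numpy as np

def entropy_change_relative_to_base(entropy_table, base_col=0):
    """Row-wise min-max normalization followed by subtraction of the base column.

    Assumed layout: one row per attack, one column per processing,
    with the unprocessed (base) entropy in column `base_col`.
    """
    table = np.asarray(entropy_table, dtype=float)
    row_min = table.min(axis=1, keepdims=True)
    row_max = table.max(axis=1, keepdims=True)
    # Avoid division by zero for rows whose values are all equal.
    span = np.where(row_max > row_min, row_max - row_min, 1.0)
    normalized = (table - row_min) / span
    # Assumed sign convention: processing minus base.
    change = normalized - normalized[:, [base_col]]
    return np.delete(change, base_col, axis=1)

# Illustrative numbers only: two attacks, base value plus two processings.
print(entropy_change_relative_to_base([[2.0, 4.0, 6.0], [5.0, 5.0, 5.0]]))
```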
Using this data, the correlation between the entropy data and the accuracy obtained with Lyapunov exponents has been studied. There are a total of 6 (methods) × 10 (attacks) = 60 different cases in which the accuracy of the Lyapunov exponents can be related with its corresponding increasing or decreasing mean value of the entropy for these images. Using the Pearson correlation index [
40] with these pairs of data, the result shows a positive correlation of +0.43 and a
p-value of 0.1 × 10
. This shows that indeed there is a moderate (and statistically significant) correlation between the two quantities: when the entropy of the adversarial images is reduced, the accuracy of the Lyapunov exponents method is also lower.
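A correlation test of this kind can be reproduced with `scipy.stats.pearsonr`, which returns both the coefficient and a two-sided p-value. The sketch below is only an illustration: the helper name `entropy_accuracy_correlation` is hypothetical and the five data pairs are made up, whereas the study uses 60 pairs of entropy variation and accuracy.

```python
import numpy as np
from scipy import stats

def entropy_accuracy_correlation(entropy_change, accuracy):
    """Pearson coefficient and two-sided p-value for paired observations."""
    entropy_change = np.asarray(entropy_change, dtype=float)
    accuracy = np.asarray(accuracy, dtype=float)
    r, p_value = stats.pearsonr(entropy_change, accuracy)
    return float(r), float(p_value)

# Made-up pairs for illustration; not data from the paper.
r, p = entropy_accuracy_correlation([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
print(r, p)
```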
The previous results lead us to check if entropy can be a feature that can be employed for adversarial prediction (instead of Lyapunov exponents). Following the same experimental conditions of the Lyapunov exponents, we extract the Shannon entropy of each image, as a single feature to determine whether an image is legit or adversarial (with or without any additional processing method). In this case the data has been classified using a linear discriminant analysis, which is enough for a single feature model. The results are shown in
Table 11,
Table 12 and
Table 13 for the reference datasets.
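For a single feature, linear discriminant analysis reduces to comparing one linear score per class, which amounts to a threshold on the entropy value. The sketch below is a minimal NumPy version under the usual LDA assumptions (Gaussian classes with a shared variance and priors estimated from class frequencies). The class name `OneFeatureLDA` and the toy entropy values are hypothetical; the implementation used in the original experiments is not specified in the text.

```python
import numpy as np

class OneFeatureLDA:
    """Linear discriminant analysis for a single scalar feature."""

    def fit(self, x, y):
        x = np.asarray(x, dtype=float).ravel()
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        self.means_ = np.array([x[y == c].mean() for c in self.classes_])
        self.priors_ = np.array([np.mean(y == c) for c in self.classes_])
        # Pooled within-class variance shared by all classes.
        residuals = x - self.means_[np.searchsorted(self.classes_, y)]
        self.var_ = np.sum(residuals ** 2) / (x.size - self.classes_.size)
        return self

    def predict(self, x):
        x = np.asarray(x, dtype=float).ravel()
        # Discriminant: x * mu / var - mu^2 / (2 * var) + log(prior).
        scores = (x[:, None] * self.means_ / self.var_
                  - self.means_ ** 2 / (2.0 * self.var_)
                  + np.log(self.priors_))
        return self.classes_[np.argmax(scores, axis=1)]

# Toy entropies: label 0 = legit, label 1 = adversarial (illustrative only).
model = OneFeatureLDA().fit([1.0, 1.2, 0.8, 3.0, 3.2, 2.8], [0, 0, 0, 1, 1, 1])
print(model.predict([0.9, 3.1]))
```

With equal priors the decision boundary sits halfway between the two class means, which is why a single scalar feature needs no more elaborate model.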
The absolute difference in performance between Lyapunov exponent-based classification and Shannon entropy-based adversarial classification is reported in
Table 14,
Table 15 and
Table 16.
Using the entropy as a feature to classify between legit and adversarial inputs for the MNIST dataset has some benefits but also some drawbacks. In total, 38% of the cases (32/84) increase the accuracy with respect to Lyapunov exponents, 24% perform the same (20/84) and 38% decrease the performance (32/84). Thus, in 62% of the cases entropy performs as well as or better than Lyapunov exponents. Regarding the drawbacks, the JSMA attack and localvar Gaussian show an even deeper negative impact, indicating that this method is much more prone to failure when perturbations are reduced in number and range (similar to what was stated for Lyapunov exponents in previous sections). Following the same pattern, the performance on the other two datasets is weaker, but still remarkable, considering that a single value/feature is used to characterize a legit or adversarial image for such a wide variety of attacks and processings.
Furthermore, we have used the same format to compare the use of both features against the use of Lyapunov exponents alone. The data was extracted from the comparison between Table 3 and Table 17, showing that in 52% (43/84) of cases the performance improved, in 34% (29/84) it was equal and in only 14% (12/84) it decreased. That is, in 86% of the cases the accuracy of the combination was equal to or better than that of Lyapunov exponents alone.
In comparison to MNIST, entropy is not as good at distinguishing between adversarial and legit images in Fashion-MNIST and CIFAR, since in these datasets both kinds of images have similar entropy values. As observed in Table 8 and Table 9, the entropies of the test images and the corresponding attacked ones are very similar, so this feature by itself is not able to classify the two types of images with the same performance as Lyapunov exponents.
The main reason is that backgrounds and objects are more defined in MNIST, so perturbations are better detected by the entropy. Fashion-MNIST and (especially) CIFAR have much more diffuse backgrounds and grey levels around and inside the object.
3.2. Lyapunov and Entropy Combined for Classification
In previous sections, Lyapunov exponents have been confirmed as a reliable method to classify adversarial images for most of the attacks and processing transforms. However, entropy also performed accurately for a significant number of combinations in the three datasets, even though it was used as a single feature for discrimination. Although it did not perform as well in some datasets with specific methods, entropy proves to be a powerful and descriptive metric that provides useful information. For this reason, we decided to test the performance of a combination of both Lyapunov exponents and entropy, leading to a 5-feature classification problem (four Lyapunov exponents plus the Shannon entropy). To perform this task, a medium Gaussian kernel support vector machine classifier is employed, as in the raw exponents classification. The results are summarized in
Table 17,
Table 18 and
Table 19 regarding the MNIST, Fashion-MNIST and CIFAR datasets, respectively.
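The 5-feature setup can be sketched with scikit-learn as follows. This is an approximation under stated assumptions rather than the configuration used in the experiments: the text does not give the kernel scale or regularization of the "medium Gaussian" SVM, so the sketch uses an RBF kernel with `gamma="scale"` and `C=1.0` on standardized features, and the data are synthetic. The helper names `combined_features`, `build_classifier` and `cross_validated_accuracy` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def combined_features(lyapunov, entropy):
    """Stack four Lyapunov exponents and one entropy value per image."""
    lyapunov = np.asarray(lyapunov, dtype=float)
    entropy = np.asarray(entropy, dtype=float).reshape(-1, 1)
    return np.hstack([lyapunov, entropy])

def build_classifier():
    # Assumed hyperparameters; the source does not specify them.
    return make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="scale", C=1.0))

def cross_validated_accuracy(features, labels, folds=10):
    """Mean accuracy over stratified k-fold cross-validation."""
    scores = cross_val_score(build_classifier(), features, labels, cv=folds)
    return float(scores.mean())

# Synthetic stand-in data: 50 legit (label 0) and 50 adversarial (label 1) images.
rng = np.random.default_rng(0)
lyap = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(4, 1, (50, 4))])
ent = np.concatenate([rng.normal(3, 0.5, 50), rng.normal(6, 0.5, 50)])
labels = np.array([0] * 50 + [1] * 50)
print(cross_validated_accuracy(combined_features(lyap, ent), labels))
```

Standardizing the features before the SVM is a design choice of this sketch: the exponents and the entropy live in different ranges, and an RBF kernel is sensitive to feature scale.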
In order to quantify the variation in performance, Figure 12, Figure 13 and Figure 14 show the accuracy variation when Lyapunov exponents, entropy and both are used to classify adversarial images. Each bar represents the mean accuracy of the image processing operations for the 10 different adversarial attacks.
In the MNIST dataset, as observed in Figure 12, an average accuracy increase of 2% is obtained: 91% for Lyapunov, 88% for entropy and 93% for the combination. The overall performance increases significantly for two specific attacks, EAD and Spatial (around +14%).
In the Fashion-MNIST dataset, as observed in Figure 13, an average accuracy increase of 5% is obtained: 81% for Lyapunov, 74% for entropy and 86% for the combination. The overall performance increases significantly for two specific attacks, DeepFool and EAD (around +8%).
In the CIFAR dataset, as observed in Figure 14, an average accuracy increase of 5% is obtained: 69% for Lyapunov, 62% for entropy and 74% for the combination. The overall performance increases significantly for the BIM attack, from 69% accuracy when using Lyapunov exponents to 81% when the combination of both is employed.
To validate the results of the combination (Lyapunov exponents and entropy) against those of Lyapunov exponents alone, a statistical test is performed to check whether there is a significant improvement. For this purpose, a two-sample t-test has been conducted to compare Table 3 and Table 17. The data employed in this case are the performance values for the 7 methods × 12 attacks × 10 cross-validation folds in each of the experiments, giving 840 pairs of data for the statistical evaluation. The null hypothesis is rejected, with a p-value of 0.0036. This means that the two experiments have different means, indicating that the performance improvement exists and is statistically significant.
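A test of this kind can be run with `scipy.stats.ttest_ind`, which implements the two-sample t-test for independent samples (equal variances assumed by default). The sketch below uses synthetic accuracies, not the values from the tables; the helper name `compare_accuracies` and the 0.05 significance level are assumptions of this example.

```python
import numpy as np
from scipy import stats

def compare_accuracies(acc_a, acc_b, alpha=0.05):
    """Two-sample t-test; returns (t statistic, p-value, reject null?)."""
    acc_a = np.asarray(acc_a, dtype=float)
    acc_b = np.asarray(acc_b, dtype=float)
    t_stat, p_value = stats.ttest_ind(acc_a, acc_b)
    return float(t_stat), float(p_value), bool(p_value < alpha)

# Synthetic per-fold accuracies (7 x 12 x 10 = 840 values per experiment).
rng = np.random.default_rng(0)
lyapunov_only = rng.normal(0.91, 0.02, 840)
combined = rng.normal(0.93, 0.02, 840)
print(compare_accuracies(lyapunov_only, combined))
```

If the 840 values are treated as matched pairs (same method, attack and fold in both experiments), a paired test such as `scipy.stats.ttest_rel` would be the alternative to consider; the text describes a two-sample test, so that is what the sketch uses.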
Another aspect that is studied is the distribution of data and the presence of outliers, which could bias the results. As observed in
Figure 10, some of the points in the Lyapunov exponents are far from the main distribution, which could lead to misinterpretations. However, most of them lie in the same cluster, depending on whether they belong to the adversarial set or not. To check this, an outlier detection test is performed, using the “median” method to flag data points that are more than three scaled median absolute deviations from the median. This is performed separately for each Lyapunov exponent dimension, since each spans a different range of values. For the same experiments compared in the previous test, less than 2% of the points are detected as outliers, so the data can be considered reliable enough to support the conclusions provided in this work.
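The outlier criterion can be sketched in NumPy as follows. The scaled median absolute deviation (MAD) is taken here as 1.4826 times the median of the absolute deviations from the median, the usual constant that makes the MAD consistent with the standard deviation for normally distributed data. Whether the original experiments used exactly this constant is an assumption; the function names are hypothetical.

```python
import numpy as np

MAD_SCALE = 1.4826  # consistency constant for normally distributed data

def mad_outliers(values, threshold=3.0):
    """Flag points more than `threshold` scaled MADs away from the median."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    scaled_mad = MAD_SCALE * np.median(np.abs(values - median))
    return np.abs(values - median) > threshold * scaled_mad

def outlier_fraction_per_dimension(exponents):
    """Fraction of flagged points in each column (one column per exponent)."""
    exponents = np.asarray(exponents, dtype=float)
    flags = np.column_stack([mad_outliers(exponents[:, j])
                             for j in range(exponents.shape[1])])
    return flags.mean(axis=0)

# Illustrative values: only the last point lies far from the rest.
print(mad_outliers([1.0, 2.0, 3.0, 4.0, 100.0]))
```

Note that when more than half of the values are identical the MAD is zero, and every point that differs from the median is flagged.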
In summary, the combination of Lyapunov exponents and entropy yields an improvement in classification for all the methods tested in this work. The good results of the combination can be explained as follows. As stated in the introduction, entropy has been related to the information provided by the positive Lyapunov exponents, which in turn are the most discriminant for measuring chaoticity. Adding the entropy to the classifier may therefore have a boosting effect on these most discriminating first two exponents, leading to better overall performance.
4. Conclusions
In this paper we have studied the potential of a chaos theory framework in the problem of adversarial example classification. To study the chaotic (adversarial) inputs given to a neural network, two main approaches are considered. First, Maximal Lyapunov Exponents (MLE) have been tested as a suitable method on a wide range of conditions, such as with different adversarial attacks and using other sources of image noise that may cause the classification to fail. Then, based on a connection between entropy and Lyapunov exponents, we verified that the image processing transforms that reduced accuracy the most were also altering image entropy the most. More specifically, the experiments showed that there is a correlation between the image entropy variation and Lyapunov-based accuracy. Finally, while entropy alone was not useful for this problem, we showed that the combination of Lyapunov exponents and entropy produced better results than using either method alone. We explain this based on the relationship between entropy and positive Lyapunov exponents, which in fact are the most discriminant ones.
The theoretical relationship between Lyapunov exponents and entropy established by chaos theory translates into more powerful features for predicting whether an image is adversarial. Moreover, this holds not only when adversarial perturbations are added, but also when entropy-changing transformations fool the network.
The combination of both methods seems to support the results in a wider range of conditions and could be applied to real-world scenarios where, for example, camera noise can produce imperfections similar to those covered by the different noise methods.
We suggest that future work should consider whether chaos theory concepts can be equally applied to the internal representation of the network (as opposed to the input image, which is the line followed in [10] and also in this work), as initially proposed in [15]. The analogy of deep networks with chaotic systems (in which adversarial examples represent a tiny variation in the input that produces wildly different outputs) is very promising.