Article

Comparative Study of Adversarial Defenses: Adversarial Training and Regularization in Vision Transformers and CNNs

Department of Computer Engineering, Dongguk University, Seoul 04620, Republic of Korea
* Author to whom correspondence should be addressed.
Electronics 2024, 13(13), 2534; https://doi.org/10.3390/electronics13132534
Submission received: 13 May 2024 / Revised: 21 June 2024 / Accepted: 25 June 2024 / Published: 27 June 2024
(This article belongs to the Special Issue Recent Trends in Image Processing and Pattern Recognition)

Abstract

Transformer-based models are currently driving a significant shift in machine learning. Among these innovations, vision transformers (ViTs) stand out for their application of transformer architectures to vision-related tasks. By demonstrating performance as good as, if not better than, traditional convolutional neural networks (CNNs), ViTs have captured considerable interest in the field. This study focuses on the resilience of ViTs and CNNs in the face of adversarial attacks. Such attacks, which introduce noise into the input of machine learning models to produce incorrect outputs, pose significant challenges to model reliability. Our analysis evaluated the adversarial robustness of CNNs and ViTs by using regularization techniques and adversarial training methods. Adversarial training, in particular, represents a traditional approach to boosting defenses against these attacks. Despite its prominent use, our findings reveal that regularization techniques enable vision transformers and, in most cases, CNNs to strengthen adversarial defenses more effectively. Through tests on datasets such as CIFAR-10 and CIFAR-100, we show that vision transformers, especially when combined with effective regularization strategies, exhibit adversarial robustness even without adversarial training. Two main inferences can be drawn from our findings. First, they emphasize how effectively vision transformers can strengthen artificial intelligence defenses against adversarial attacks. Second, they show that regularization, which requires far fewer computational resources and covers a wide range of adversarial attacks, can be an effective adversarial defense. Understanding and improving a model’s resilience to adversarial attacks is crucial for developing secure, dependable systems that can handle the complexity of real-world applications as artificial intelligence and machine learning technologies advance.

1. Introduction

The modern renaissance of neural networks, particularly in computer vision, can be attributed to groundbreaking works such as Krizhevsky et al.’s introduction of AlexNet [1]. This deep architecture, equipped with multiple convolutional layers, significantly outperformed its competitors in the ImageNet Large Scale Visual Recognition Challenge. The success of AlexNet heralded the era of deep learning, making CNNs the architecture of choice for image-related tasks. Subsequent models, like VGG [2], GoogLeNet [3], and the ResNet family [4], introduced deeper architectures, novel modules, and residual connections, continually pushing the boundaries of accuracy. These advancements were not merely architectural; they were supported by the exponential growth in computational power, particularly the capabilities of GPUs, and the availability of large, annotated datasets [1,2,3,5,6,7]. As neural networks evolved in complexity and depth, so did their capabilities, culminating in models that could outperform human benchmarks in specific classification tasks and setting the stage for the diversification of neural architectures, including the introduction of vision transformers [8].
Convolutional neural networks (CNNs) and vision transformers (ViTs) represent two distinct paradigms in the domain of image classification. CNNs leverage layers of convolutional filters to extract and process spatial features from images hierarchically [9,10,11]. They thrive on local spatial dependencies, gradually aggregating information through a series of convolutions and pooling operations [11]. On the other hand, vision transformers, an adaptation of the transformer architectures designed initially for natural language processing tasks, treat images as sequences of non-overlapping patches and rely heavily on self-attention mechanisms to weigh features based on their contextual relevance [8]. Rather than capitalizing on local spatial hierarchies, ViTs capture both local and long-range dependencies across an image [12]. As the research community delves deeper, early indications suggest that while CNNs remain highly efficient for smaller datasets or when computational resources are constrained, ViTs, especially when pre-trained on massive datasets, are challenging the supremacy of CNNs in large-scale image classification tasks, marking a potential shift in the foundational architectures of computer vision, as shown in previous research [13,14,15].
Adversarial attacks in deep learning present a paradoxical challenge: while neural networks can achieve great performance in various vision tasks, they are also alarmingly vulnerable to perturbations often imperceptible to the human eye. The genesis of these attacks can be traced back to Szegedy et al.’s seminal work in 2014 [16], where they demonstrated that strategically crafted noise added to an image could lead a trained model to misclassify it with high confidence. These perturbed images, termed “adversarial examples,” exposed a previously unexplored fragility in deep learning models. Over the years, a taxonomy of adversarial attacks has emerged, ranging from white-box attacks, where the attacker has complete knowledge of the model, to black-box attacks, where they have no such knowledge. Furthermore, the methods to craft these adversarial examples have evolved in sophistication, from gradient-based techniques like the Fast Gradient Sign Method (FGSM) [17] to optimization-based approaches like the Carlini and Wagner (C&W) attack [18]. The discovery of these vulnerabilities has spurred a vast body of research aimed at better understanding the underlying reasons for these susceptibilities and devising defenses against them.
One of the best-known defense mechanisms against adversarial attacks is adversarial training, first proposed by Goodfellow et al. [17]. It augments the training data with adversarial examples in order to improve adversarial robustness. It has been found effective in several cases but has recently faced challenges in various ways. One of the main challenges we address in this paper is the fact that adversarial training is very resource-intensive and can easily overfit the adversarial samples, limiting its effectiveness [19,20]. We argue that regularization techniques can address both problems, since they require far fewer computational resources and, by definition, discourage overfitting the training dataset.
Considering that adversarial attacks are one of the major issues plaguing modern-day machine learning, robustness against these attacks is a significant challenge that needs to be resolved. Despite the popularity of transformer-based models and their potential benefits, more research is still needed regarding the effects of adversarial attacks and potential defenses against them. Previous research, explored in Section 2, includes different methods to attack vision transformers and test their robustness against adversarial attacks [21,22,23,24,25,26,27,28,29,30].
The main contributions of this paper are summarized below:
  • This paper provides novel insights into the relative strengths and vulnerabilities of convolutional neural networks (CNNs) and vision transformers (ViTs) under adversarial conditions by employing adversarial training and model regularization techniques;
  • We explore the application of regularization-based strategies to enhance the adversarial robustness of vision transformers and convolutional neural networks. Our investigation not only proves the effectiveness of these strategies but also demonstrates their tangible benefits in fortifying ViTs against adversarial attacks;
  • This study offers a detailed explanation of the underlying reasons why regularization techniques particularly benefit vision transformers in the context of adversarial robustness. Through this exploration, we contribute to a deeper understanding of the interaction between model architecture and regularization, highlighting its significance in developing more secure machine learning systems.

2. The Literature Review

Vision transformers (ViTs) and convolutional neural networks (CNNs) represent two fundamentally different paradigms in the field of computer vision, each with its distinct architectural advantages and application domains. CNNs, historically the cornerstone of deep learning in image processing, excel through their use of convolutional layers that locally connect neurons to process pixels in close proximity, thus effectively capturing spatial hierarchies in images. There has been in-depth research on the usage of convolutions in deep learning, exploring their design, applications, benefits, and prospects [31,32]. Several advanced studies still rely on CNNs, including [33,34,35]; their continued prominence in such areas shows that CNNs remain widely used and are here to stay.
On the other hand, vision transformers, inspired by the transformer architecture initially developed for natural language processing tasks [36], apply the self-attention mechanism to model relationships between all parts of the image across global contexts. ViTs do not assume any inherent spatial hierarchy, making them more flexible in learning representations but often requiring more data and computational resources to reach their full potential. Unlike CNNs, which rely on their structured, local processing of information, ViTs’ global processing capability allows them to excel in tasks where understanding the broader context of relationships between distant parts of the image is crucial, as shown by M. Raghu et al. [13]. In this paper, we explore the differences between these two architectures with respect to adversarial robustness using different methodologies introduced in the subsections below.
Since the initial paper on transformers was published in 2017 by A. Vaswani et al. [36], transformers have taken the world of machine learning by storm, impacting fields ranging from natural language processing (NLP) to image and audio generation and classification [37,38]. In this paper, we focus on vision transformers [8], an adaptation of the transformer architecture for image processing tasks. In a ViT, an input image is first divided into a series of fixed-size patches. These patches are then flattened and transformed into a sequence of embeddings, similar to words in a sentence, with a learnable positional encoding to retain positional information. This sequence of patch embeddings is then processed through multiple layers of self-attention and feed-forward neural networks, characteristic of the transformer architecture.
The self-attention mechanism allows the model to weigh the importance of different patches relative to one another, enabling it to capture global dependencies across the entire image. This approach contrasts with convolutional neural networks (CNNs), which process images through local receptive fields. The output of the vision transformer can be tailored for various tasks, such as classification, by using the representation of a special classification token or aggregating the output embeddings. ViTs have demonstrated impressive performance on a range of vision tasks, benefiting from the transformer’s ability to model complex relationships within the data [39,40].
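To make the patch-and-attention pipeline described above concrete, the following minimal sketch (in TensorFlow/Keras, the framework used for our experiments) extracts patches, embeds them with a learnable positional encoding, and applies a single self-attention layer; the batch size and dimensions are illustrative values rather than the exact configuration used later in this paper.

```python
import tensorflow as tf

# Illustrative sizes only: 72x72 images, 6x6 patches -> 144 patches per image.
IMAGE_SIZE, PATCH_SIZE, PROJ_DIM, NUM_HEADS = 72, 6, 64, 4
NUM_PATCHES = (IMAGE_SIZE // PATCH_SIZE) ** 2

images = tf.random.uniform((8, IMAGE_SIZE, IMAGE_SIZE, 3))  # dummy batch

# 1. Split each image into non-overlapping patches and flatten them.
patches = tf.image.extract_patches(
    images=images,
    sizes=[1, PATCH_SIZE, PATCH_SIZE, 1],
    strides=[1, PATCH_SIZE, PATCH_SIZE, 1],
    rates=[1, 1, 1, 1],
    padding="VALID",
)
patches = tf.reshape(patches, (tf.shape(images)[0], NUM_PATCHES, -1))

# 2. Project each patch to an embedding and add a learnable positional encoding.
patch_projection = tf.keras.layers.Dense(PROJ_DIM)
position_embedding = tf.keras.layers.Embedding(input_dim=NUM_PATCHES, output_dim=PROJ_DIM)
positions = tf.range(NUM_PATCHES)
tokens = patch_projection(patches) + position_embedding(positions)

# 3. One self-attention layer weighs every patch against every other patch;
#    a full ViT stacks many such blocks with feed-forward sub-layers.
attention = tf.keras.layers.MultiHeadAttention(num_heads=NUM_HEADS, key_dim=PROJ_DIM)
attended = attention(tokens, tokens)
print(attended.shape)  # (8, 144, 64)
```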
Despite their impressive performance and unique architecture, vision transformers suffer from the same types of adversarial attacks that affect CNNs. Various research has been conducted on adversarial attacks against vision transformers, some proposing and exploring the effects of adversarial samples on vision transformers [25,29,41], some comparing the adversarial robustness of ViTs and CNNs [23,26], and some suggesting different types of adversarial defenses [24,42]. One of the most notable studies comparing the two architectures was performed by Shao et al. [26]. The authors tested various white-box and transfer attacks against both CNNs and ViTs to show that vision transformers had superior robustness. Their analysis reveals that features learned by vision transformers exhibit fewer high-frequency patterns with irrelevant associations, which explains why they are not as affected by adversarial noise as their CNN counterparts. In this paper, we explore a similar concept but expand the comparison to different types of regularization techniques as adversarial defenses and draw conclusions from the results.
First proposed by C. Szegedy et al. [16], adversarial robustness via adversarial training is a direct approach to enhancing the resilience of machine learning models against adversarial attacks. This method involves incorporating adversarial examples into the training process, thereby explicitly teaching the model to recognize and correctly classify inputs that have been intentionally perturbed to induce errors. It has shown resilience in various cases, such as [19,43,44,45]. Adversarial training operates on the premise that exposure to a wide range of adversarial tactics during the training phase equips the model with the necessary defenses to withstand similar attacks once deployed. This process typically involves generating adversarial examples using various attack strategies and then mixing these examples with the original training data. As a result, the model learns to generalize from both clean and adversarially modified inputs, potentially leading to improved robustness against attacks. While effective to some extent, adversarial training can be computationally intensive and may not cover all possible types of attacks, leaving some vulnerabilities unaddressed [46,47].
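As a concrete illustration of the procedure described above, the sketch below mixes on-the-fly adversarial examples with clean inputs in a single training step; it assumes a generic compiled Keras classifier `model` and a labeled batch `(x, y)`, and uses a single-step FGSM perturbation as a stand-in for whichever attack is used to augment the data.

```python
import tensorflow as tf

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
optimizer = tf.keras.optimizers.Adam()

def fgsm_perturb(model, x, y, eps=0.03):
    """Single-step FGSM perturbation used to augment the batch (illustrative eps)."""
    with tf.GradientTape() as tape:
        tape.watch(x)
        loss = loss_fn(y, model(x, training=False))
    grad = tape.gradient(loss, x)
    return tf.clip_by_value(x + eps * tf.sign(grad), 0.0, 1.0)

@tf.function
def adversarial_train_step(model, x, y):
    # Generate adversarial counterparts and train on the mixed batch.
    x_adv = fgsm_perturb(model, x, y)
    x_mix = tf.concat([x, x_adv], axis=0)
    y_mix = tf.concat([y, y], axis=0)
    with tf.GradientTape() as tape:
        loss = loss_fn(y_mix, model(x_mix, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```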
Adversarial robustness via regularization, on the other hand, focuses on modifying the training process to encourage the model to learn more generalizable and stable features that are less sensitive to the addition of small perturbations. Regularization techniques, such as weight decay, dropout, and early stopping, are designed to prevent overfitting the training data, which can inadvertently make the model more susceptible to adversarial attacks. By penalizing complexity and encouraging simplicity in the learned models, regularization can help ensure that small changes in the input data (such as those introduced by adversarial examples) do not lead to significant changes in the output. This approach aims to strengthen the model’s defenses by fostering a more robust representation of the data rather than directly training on adversarial examples [48,49,50].
In our research on adversarial robustness through regularization, our empirical results provide a nuanced comparison between vision transformers (ViTs) and convolutional neural networks (CNNs) in the context of resisting adversarial attacks. Interestingly, our study reveals that ViTs, when enhanced with carefully chosen regularization techniques, exhibit a notable improvement in adversarial robustness compared to their CNN counterparts. This finding demonstrates the potential of ViTs not only to excel in standard vision tasks but also to offer a viable defense mechanism against adversarial manipulation when augmented with regularization strategies. We also consider the relevance of more recent and advanced related schemes to provide a comprehensive evaluation of our proposed methodology. For instance, the TFL-DT trust evaluation scheme for federated learning in digital twins for mobile networks presents a sophisticated approach to trust evaluation in dynamic and decentralized environments, as outlined by Guo et al. [51]. Similarly, the anomaly detection system for in-vehicle networks using a CNN-LSTM architecture with an attention mechanism offers a cutting-edge technique for enhancing security and reliability within vehicular networks [52]. Additionally, TROVE, a context-awareness trust model for vehicular ad hoc networks (VANETs) using reinforcement learning, exemplifies the integration of context awareness and machine learning for improved trust management in highly mobile and variable environments [53]. These contemporary schemes illustrate the ongoing advancements in trust evaluation, anomaly detection, and context-aware models, highlighting the need for integrating such innovative approaches into our comparative analysis. By acknowledging and referencing these advanced methodologies, we aim to situate our work within the broader landscape of state-of-the-art research, ensuring that our contributions are both relevant and impactful in the current technological context.

3. Methodology

As mentioned above, adversarial attacks affect both convolutional neural networks and vision transformers. Hence, formulating a defense that can resist these attacks is of the utmost importance. It has been shown in [23,26] that vision transformers are better at defending against adversarial attacks than convolutional neural networks. The authors attribute the low sensitivity of ViTs to adversarial noise to the fact that their learned features contain less low-level information. In this paper, we take this one step further by exploring different avenues of implementing adversarial defenses in vision transformers and comparing them with CNNs. We used adversarial training and regularization to improve model generalization and adversarial robustness and found regularization to be the more effective method. We explain this using a mathematical formulation and a theoretical explanation backed by different visualizations.
Let $f_\theta(x)$ denote a model, either a vision transformer or a convolutional network, parameterized by $\theta$ for an input $x$. An adversarial attack on the model aims to find a perturbation $\delta$ that maximizes the loss $L$, as shown below:
$\max_{\delta} \; L\big(f_\theta(x + \delta), y\big)$
subject to $\|\delta\|_p \le \varepsilon$, where $y$ is the true label, and $\varepsilon$ is the maximum perturbation allowed, as explained in [17]. Given the model $f_\theta(x)$ and an adversarial perturbation $\delta$, we aim to introduce a regularization term applied to the model parameters $\theta$ to improve the adversarial robustness of the model. Let $R(\theta)$ be the regularization term applied to the model. We use the widely known $L_1$ and $L_2$ regularization, popularized in machine learning by Andrew Ng [54]. The formulation is shown below in Equations (1) to (3):
$R_{L_1}(\theta) = \|\theta\|_1 = \sum_i |\theta_i|$   (1)
$R_{L_2}(\theta) = \|\theta\|_2^2 = \sum_i \theta_i^2$   (2)
$R(\theta) = \lambda_1 \sum_{i=1}^{n} |\theta_i| + \lambda_2 \sum_{i=1}^{n} \theta_i^2$   (3)
The $L_1$ component $\lambda_1 \sum_{i=1}^{n} |\theta_i|$ encourages sparsity in the parameters, pushing the model to rely on fewer, more robust features. This sparsity can make the model less sensitive to adversarial perturbations in the feature space. The $L_2$ component, on the other hand, leads to a smoother decision function, reducing model complexity and overfitting, which in turn reduces sensitivity to small input perturbations. We use both $L_1$ and $L_2$ because their combined effects yield a more adversarially robust model; using either component alone provides good robustness but is not as effective as using the two techniques together.
The objective function for training the model using both $L_1$ and $L_2$ regularization, known as Elastic Net regularization [55], extended with adversarial noise for our specific case, is shown in Equation (4) below:
$J(\theta; x, y, \delta) = L\big(f_\theta(x + \delta), y\big) + \lambda_1 \|\theta\|_1 + \lambda_2 \|\theta\|_2^2$   (4)
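A minimal sketch of this objective in TensorFlow is shown below; `model`, a clean batch `(x, y)`, and a precomputed perturbation `delta` are assumed to exist, and the λ values are illustrative. In practice, the same penalties can also be attached per layer through `tf.keras.regularizers.L1L2`.

```python
import tensorflow as tf

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
lambda_1, lambda_2 = 1e-5, 1e-5  # illustrative regularization factors

def regularized_objective(model, x, y, delta):
    """Equation (4): task loss on the perturbed input plus Elastic Net penalties."""
    task_loss = loss_fn(y, model(x + delta, training=True))
    l1_penalty = tf.add_n([tf.reduce_sum(tf.abs(w)) for w in model.trainable_variables])
    l2_penalty = tf.add_n([tf.reduce_sum(tf.square(w)) for w in model.trainable_variables])
    return task_loss + lambda_1 * l1_penalty + lambda_2 * l2_penalty
```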
By combining $L_1$ and $L_2$ regularization, Elastic Net takes advantage of both sparsity and smoothing, compacting learned intermediate representations and allowing for a smooth model surface, which enhances robustness against adversarial perturbations. Visualizing the learned intermediate representations within the models gives a better understanding of the effect of regularization. Figure 1 and Figure 2 show features plotted on a scatter graph after applying principal component analysis (PCA) to project them into three dimensions. To simplify the graph, we picked the first three digits of the MNIST dataset and trained a vision transformer on them. Figure 1a shows features with a regularization factor of 0.0001, while Figure 1b shows the intermediate representations of a ViT with no regularization.
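The visualization underlying Figures 1 and 2 can be sketched as follows; `features` and `labels` are assumed to hold intermediate-layer activations and class labels for the selected MNIST digits, with random placeholders standing in for the real values here.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Placeholders for intermediate features (N, D) and labels (N,) of digits 1-3.
features = np.random.randn(300, 64)
labels = np.random.randint(0, 3, size=300)

# Project the feature space to three dimensions for plotting.
coords = PCA(n_components=3).fit_transform(features)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
for cls in np.unique(labels):
    pts = coords[labels == cls]
    ax.scatter(pts[:, 0], pts[:, 1], pts[:, 2], s=10, label=f"class {cls}")
ax.legend()
plt.show()
```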
The non-regularized graph shows that the points representing the network’s features are loosely packed. This sparsity within the clusters suggests that the model has learned to create low-density regions in the feature space for each class. The flexibility allows the model to capture intricate details specific to the training dataset. However, such a detailed fit might not be generalized well to new data and can be particularly vulnerable to adversarial attacks that exploit these overfitted aspects, as shown in various research studies [56,57,58,59]. This loose packing means that a small perturbation—often imperceptible to the human eye but easily constructed by an adversarial attack—might be enough to shift a data point across the boundary into a neighboring cluster, resulting in misclassification.
Our research shows that this vulnerability is addressed in regularized models. Here, the points are more closely packed, which can be interpreted as a consequence of the regularization imposing a constraint that discourages the features in the network from becoming too sparse. By enforcing a more tightly packed feature distribution, regularization adds robustness against adversarial perturbations. The increased distance between clusters, together with the tighter packing of features within each cluster, necessitates larger movements for an adversarial example to cross from one class into another within the feature space.
As illustrated in Figure 1, upon the introduction of regularization the clusters become more compact and grow further apart. By definition, regularization serves as a complexity penalty, restricting the model from overly adapting to the training data, typically leading to more distant clusters with denser features and arguably more meaningful representations of the data, as explained by J. Kukačka et al. [56] and many other studies. Such smoothness is intrinsically linked to improved generalization, as the model’s predictions become less variable in response to small input changes, which is highly desirable for adversarial robustness.
Despite the efficacy of regularization regarding adversarial robustness in vision transformers (ViTs), its impact on convolutional neural networks (CNNs) is constrained by the innate architectural properties of these models. This phenomenon is shown in Figure 2 below, where the misclassified adversarial samples, represented by red stars, are found within the clusters themselves, as opposed to ViTs, where these adversarial samples appear as outliers. The comparison between the graphs indicates a clear separation from genuine data points in the case of ViTs, thereby allowing for a more adversarially robust model. We hypothesize that this is because CNNs utilize convolutional layers to extract features, which inherently focus on local information and shared weights across the image, fostering translational invariance and detecting localized patterns [9]. This design is fundamentally efficient for standard image recognition tasks; however, it may not capture global patterns within an image as effectively as the self-attention mechanisms of ViTs [13,60]. Therefore, when adversarial perturbations target the complex interrelationships of features across an image, the localized processing of CNNs, even with regularization, may not provide sufficient robustness.
On the other hand, ViTs, with their self-attention mechanisms, are inherently equipped to integrate features from the entire input field, providing a natural defense against attacks that manipulate spatially localized features to mislead the model [8,36]. In ViTs, we hypothesize that regularization further refines this capacity, reinforcing the model’s ability to prioritize semantically relevant features and ignore deceptive adversarial noise, which also aligns with the findings of R. Shao et al. [26]. Thus, while regularization enhances the adversarial robustness of both CNNs and ViTs by encouraging learning more general features and penalizing complexity, ViTs exhibit a greater improvement in robustness owing to their superior ability to capture globalized feature patterns.
In order to better evaluate the distances between the clusters and further show that regularization contributes to the separation within the feature space, Table 1 reports the distances between cluster centers obtained by applying K-Means clustering to the three clusters at hand. The algorithm provides the center of each cluster, and calculating the mean, median, and standard deviation of the pairwise centroid distances effectively shows how spread apart the clusters are. As shown in the table below, regularization leads to a feature space where clusters are not only more distinct but also positioned further apart. This separation is critical, as it improves the model’s capacity to distinguish between different classes, thereby improving its robustness against adversarial attacks. The increased inter-cluster distances mean that adversarial examples require a greater degree of perturbation to cross from one cluster to another, which is harder to achieve without the perturbation becoming noticeable. This effect is especially pronounced for vision transformers; as the distance between the clusters widens, adversarial robustness is more likely to improve due to the factors mentioned above.
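The measurement behind Table 1 can be sketched as below, assuming `coords` holds the PCA-projected features from the previous step; the summary statistics are computed over the pairwise distances between the three K-Means centroids.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial.distance import pdist

coords = np.random.randn(300, 3)  # placeholder for the PCA-projected features

# Fit K-Means with three clusters and summarize the pairwise centroid distances.
centroids = KMeans(n_clusters=3, n_init=10, random_state=0).fit(coords).cluster_centers_
distances = pdist(centroids)

print("mean:", distances.mean(),
      "median:", np.median(distances),
      "std:", distances.std())
```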

4. Experimental Results

Our experiments were carried out using TensorFlow 2, within which we constructed models based on both the vision transformer (ViT) and convolutional neural network (CNN) architectures. All experiments were run on an NVIDIA Ampere A100 GPU with 80 gigabytes of video RAM. We generated adversarial examples from each model using the Fast Gradient Sign Method (FGSM) [17], Projected Gradient Descent (PGD) [61], and the Momentum Iterative Method (MIM) [62]. Firstly, we employed FGSM, which perturbs the original image by adjusting the input in the direction of the gradient of the loss function with respect to the input, scaled by a factor epsilon, to create an adversarial example. Secondly, we utilized PGD, a more iterative approach in which multiple small perturbations are applied iteratively, each time projecting the perturbed image back into the allowed epsilon-ball around the original image. This method provides a stronger adversarial attack than FGSM. Lastly, we applied MIM, which enhances the basic iterative method by integrating a momentum term. This term accumulates the gradients over iterations, stabilizing the update direction and often resulting in more effective adversarial examples. By comparing these methods, we comprehensively assessed the adversarial robustness of CNNs and vision transformers against various types of attacks. When generating PGD adversarial samples for adversarial training, we used an alpha value ranging from 0.1 to 0.3. During the testing phase, we evaluated the robustness of the models against a variety of adversarial attacks, including FGSM, PGD, and MIM, across a spectrum of alpha values. We used CleverHans by N. Papernot et al. [63] to generate these attacks. Our models were trained on 80% of the complete CIFAR-10 or CIFAR-100 dataset (the two datasets contain 120,000 data samples in total). When performing adversarial training, we used 14,000 data samples modified by PGD, as recommended in [61].
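The attack-generation step can be sketched as follows, assuming the CleverHans TF2 attack modules, a trained and compiled Keras `model` (with an accuracy metric), and a test batch `(x_test, y_test)`; the epsilon settings are illustrative rather than the exact grid used in our evaluation.

```python
import numpy as np
from cleverhans.tf2.attacks.fast_gradient_method import fast_gradient_method
from cleverhans.tf2.attacks.projected_gradient_descent import projected_gradient_descent
from cleverhans.tf2.attacks.momentum_iterative_method import momentum_iterative_method

eps = 0.03  # illustrative perturbation budget

# Craft adversarial test sets with the three attacks used in this study.
x_fgsm = fast_gradient_method(model, x_test, eps, np.inf)
x_pgd = projected_gradient_descent(model, x_test, eps, eps / 10, 40, np.inf)
x_mim = momentum_iterative_method(model, x_test, eps, eps / 10, 40, np.inf)

for name, x_adv in [("FGSM", x_fgsm), ("PGD", x_pgd), ("MIM", x_mim)]:
    _, accuracy = model.evaluate(x_adv, y_test, verbose=0)
    print(f"{name} accuracy: {accuracy:.3f}")
```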
Our CNN model consists of a sequence of layers, starting with three convolutional layers with 16, 32, and 64 filters, each with a kernel size of 3 × 3 and ReLU activation functions. Following these layers, a max-pooling layer reduces the spatial dimensions by half. The convolutional output is then flattened and fed into two dense layers, each with 256 units and ReLU activations. Batch normalization follows each dense layer to aid in stabilizing the learning process, and a dropout layer with a rate of 0.3 is placed between the two dense layers to reduce overfitting. The network concludes with a softmax output layer that classifies the images into 10 categories. This model is compiled with the Adam optimizer, uses sparse categorical cross-entropy as the loss function, and is trained over 100 epochs.
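A sketch of this CNN in Keras is shown below, assuming 32 × 32 × 3 CIFAR-10 inputs; details not stated in the text, such as padding, are left at Keras defaults.

```python
import tensorflow as tf

cnn = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(32, 32, 3)),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(pool_size=2),   # halves the spatial dimensions
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
cnn.compile(optimizer="adam",
            loss="sparse_categorical_crossentropy",
            metrics=["accuracy"])
# cnn.fit(x_train, y_train, epochs=100)  # training call, assuming CIFAR-10 arrays
```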
On the other hand, our vision transformer is designed with an input size of 72 × 72 pixels, where images are subdivided into patches of 6 × 6 pixels, resulting in 144 patches per image. Each patch is encoded to a dimensionality of 64. The transformer architecture comprises 26 layers, each featuring two sub-layers: a multi-head self-attention mechanism with four heads and a feed-forward network (MLP). The MLP in each transformer block has hidden units scaled to double and then back to the size of the projection dimension. Dropout rates of 0.1 are applied within the attention and MLP components to prevent overfitting, while the output from the final transformer block passes through layer normalization and dropout (0.5) before proceeding to the MLP head. This head is structured with dense layers of 2048 and 1024 units. The ViT model is also trained for 100 epochs, optimized with a learning rate of 0.001 and a batch size of 256. We tested various configurations to determine which vision transformer configuration suits this task; the configurations evaluated for our use case are shown in Table 2 below. The best setup for our experiments used a patch size of 6, 6 attention heads, and 26 transformer layers. We determined the experimental setup based on the generally accepted model architecture for either CNN or ViT for the specific dataset; the setup may change for different datasets.
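For reference, the ViT hyperparameters described above are collected in the configuration sketch below; the training code that consumes this dictionary is assumed to exist elsewhere, and the head count follows the architecture description in the preceding paragraph.

```python
# Hyperparameters of the ViT configuration described in the text.
vit_config = {
    "image_size": 72,
    "patch_size": 6,                 # 144 patches per image
    "projection_dim": 64,
    "num_heads": 4,
    "transformer_layers": 26,
    "mlp_head_units": [2048, 1024],
    "attention_dropout": 0.1,
    "mlp_dropout": 0.1,
    "head_dropout": 0.5,
    "learning_rate": 1e-3,
    "batch_size": 256,
    "epochs": 100,
}
```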
Our experiments show that regularization in vision transformers yields promising results even compared to traditional adversarial training, as shown in Figure 3. The figure shows the robustness of CNN and ViT models against the three attacks mentioned at the beginning of this section, as well as against a clean CIFAR-10 dataset. From the accuracies provided, it is clear that regularized CNNs and ViTs, labeled “CNN/ViT Classifier Reg (regularization_factor),” outperform the adversarially trained models, labeled “CNN/ViT Classifier Adv”. We chose a regularization factor between 10⁻⁶ and 10⁻⁴ based on the various experiments we performed. We believe that this range might differ depending on the vision transformer architecture or the dataset.
In our study, we compare the effectiveness of our proposed methodology against both adversarially trained models and state-of-the-art (SOTA) CNN and vision transformer (ViT) models. The results are comprehensively presented in various figures and tables. Specifically, we evaluate the performance of our regularized CNN and ViT models, demonstrating significant improvements in clean data accuracy and adversarial robustness.
We summarize the results of Figure 3 in Table 3. This table shows the average adversarial accuracy of different model configurations against the three attacks mentioned above, along with, for comparison, the accuracy against a clean dataset. The fact that regularized ViTs perform well on both clean and adversarial data samples demonstrates their resilience in both scenarios. This also holds for models trained on CIFAR-100, as shown in Table 4.
These results show that vision transformers, especially when regularization is applied, have improved robustness against adversarial attacks. As explained in the previous section, ViTs have the ability to capture global dependencies within the image, and because of that architectural difference, as well as the use of regularization, they outperform CNN models. The results for CIFAR-100, shown in Table 4, reflect the same properties, indicating that regularization in vision transformers improves adversarial robustness across different types of attacks and making regularization the simplest and most effective defense in our comparison. Compared to adversarial training, regularization offers several advantages: it requires fewer computational resources and is easier to implement. Table 5 compares our proposed methodology against state-of-the-art models under adversarial attacks. The comparative analysis in Table 5 highlights that our proposed methodology exceeds the adversarial robustness of SOTA ViT models on both the CIFAR-10 and CIFAR-100 datasets. For CIFAR-10, our regularized ViT achieves the highest average adversarial accuracy of 82.2%, indicating a substantial improvement over the other models. Similarly, for CIFAR-100, the regularized ViT maintains superior performance metrics, clearly illustrating the efficacy of our approach. These results demonstrate that regularization in vision transformers can significantly improve adversarial robustness without requiring an extra training step, as methods like adversarial training do.

5. Conclusions

In this paper, we present a comprehensive analysis of vision transformers in terms of adversarial robustness compared to convolutional neural networks. Recently, vision transformers have gained considerable attention as a feasible architecture capable of addressing vision-related workloads, in which they often perform better than CNNs. Adversarial attacks remain a key vulnerability for ML model robustness, characterized by adding perturbations to input samples in order to affect predictions. In this context, we examined multiple model configurations with and without regularization and adversarial training. Our study finds that vision transformers are more resilient to adversarial attacks and do not need adversarial training; instead, they can use regularization to surpass both adversarially and non-adversarially trained CNNs.
Our results on established benchmarks such as CIFAR-10 and CIFAR-100 clearly show the adversarial robustness of ViTs when equipped with regularization. These findings have several important implications, as they indicate that regularization can improve model robustness in vision recognition and potentially various other tasks. As a future direction, we plan to apply this methodology to larger datasets beyond CIFAR-10 and CIFAR-100 and to expand it to different domains, such as natural language processing. Hybrid CNN–ViT models with regularization are another research direction for further boosting adversarial robustness. As we move toward increasingly advanced artificial intelligence and machine learning techniques, resilience to adversarial behavior is proving to be critical, and we believe that regularization plays an important part in achieving it.

Author Contributions

Conceptualization, H.D. and J.K.; methodology, H.D.; software, H.D.; formal analysis, H.D. and J.K.; data curation, H.D.; writing—original draft, H.D.; writing—review and editing, J.K.; visualization, H.D.; supervision, J.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2024-2020-0-01789) and the Artificial Intelligence Convergence Innovation Human Resources Development program (IITP-2024-RS-2023-00254592), supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation).

Data Availability Statement

We utilized publicly available CIFAR-10 and CIFAR-100 datasets in order to train our models. The datasets can be accessed using the following link: https://www.cs.toronto.edu/~kriz/cifar.html (accessed last on 27 June 2024).

Acknowledgments

The writing process involved the use of AI language models. We declare that the content generated by these models was reviewed and edited by the authors to ensure accuracy and relevance.

Conflicts of Interest

The authors declare no conflicts of interest.

Correction Statement

This article has been republished with a minor correction to the Funding statement. This change does not affect the scientific content of the article.

References

  1. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2012; Volume 25. [Google Scholar]
  2. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2015, arXiv:1409.1556. [Google Scholar] [CrossRef]
  3. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going Deeper with Convolutions. arXiv 2014, arXiv:1409.4842. [Google Scholar] [CrossRef]
  4. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. arXiv 2015, arXiv:1512.03385. [Google Scholar] [CrossRef]
  5. Raina, R.; Madhavan, A.; Ng, A.Y. Large-scale deep unsupervised learning using graphics processors. In Proceedings of the 26th Annual International Conference on Machine Learning, Montreal, QC, Canada, 14–18 June 2009; Association for Computing Machinery: New York, NY, USA, 2009; pp. 873–880. [Google Scholar]
  6. Chetlur, S.; Woolley, C.; Vandermersch, P.; Cohen, J.; Tran, J.; Catanzaro, B.; Shelhamer, E. cuDNN: Efficient Primitives for Deep Learning. arXiv 2014, arXiv:1410.0759. [Google Scholar] [CrossRef]
  7. Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
  8. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929. [Google Scholar] [CrossRef]
  9. O’Shea, K.; Nash, R. An Introduction to Convolutional Neural Networks. arXiv 2015, arXiv:1511.08458. [Google Scholar] [CrossRef]
  10. Alzubaidi, L.; Zhang, J.; Humaidi, A.J.; Al-Dujaili, A.; Duan, Y.; Al-Shamma, O.; Santamaría, J.; Fadhel, M.A.; Al-Amidie, M.; Farhan, L. Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions. J. Big Data 2021, 8, 53. [Google Scholar] [CrossRef]
  11. Sattarzadeh, S.; Sudhakar, M.; Lem, A.; Mehryar, S.; Plataniotis, K.N.; Jang, J.; Kim, H.; Jeong, Y.; Lee, S.; Bae, K. Explaining Convolutional Neural Networks through Attribution-Based Input Sampling and Block-Wise Feature. arXiv 2020, arXiv:2010.00672. [Google Scholar] [CrossRef]
  12. Lin, H.; Han, G.; Ma, J.; Huang, S.; Lin, X.; Chang, S.-F. Supervised Masked Knowledge Distillation for Few-Shot Transformers. arXiv 2023, arXiv:2303.15466. [Google Scholar] [CrossRef]
  13. Raghu, M.; Unterthiner, T.; Kornblith, S.; Zhang, C.; Dosovitskiy, A. Do Vision Transformers See Like Convolutional Neural Networks? arXiv 2022, arXiv:2108.08810. [Google Scholar] [CrossRef]
  14. Shi, R.; Li, T.; Zhang, L.; Yamaguchi, Y. Visualization Comparison of Vision Transformers and Convolutional Neural Networks. IEEE Trans. Multimed. 2023, 26, 2327–2339. [Google Scholar] [CrossRef]
  15. Sultana, M.; Naseer, M.; Khan, M.H.; Khan, S.; Khan, F.S. Self-Distilled Vision Transformer for Domain Generalization. arXiv 2022, arXiv:2207.12392. [Google Scholar] [CrossRef]
  16. Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I.; Fergus, R. Intriguing properties of neural networks. arXiv 2014, arXiv:1312.6199. [Google Scholar] [CrossRef]
  17. Goodfellow, I.J.; Shlens, J.; Szegedy, C. Explaining and Harnessing Adversarial Examples. arXiv 2015, arXiv:1412.6572. [Google Scholar]
  18. Carlini, N.; Wagner, D. Towards Evaluating the Robustness of Neural Networks. arXiv 2017, arXiv:1608.04644. [Google Scholar]
  19. Bai, T.; Luo, J.; Zhao, J.; Wen, B.; Wang, Q. Recent Advances in Adversarial Training for Adversarial Robustness. arXiv 2021, arXiv:2102.01356. [Google Scholar] [CrossRef]
  20. Wang, Z.; Li, X.; Zhu, H.; Xie, C. Revisiting Adversarial Training at Scale. arXiv 2024, arXiv:2401.04727. [Google Scholar]
  21. Aldahdooh, A.; Hamidouche, W.; Deforges, O. Reveal of Vision Transformers Robustness against Adversarial Attacks. arXiv 2021, arXiv:2106.03734. [Google Scholar] [CrossRef]
  22. Bhojanapalli, S.; Chakrabarti, A.; Glasner, D.; Li, D.; Unterthiner, T.; Veit, A. Understanding Robustness of Transformers for Image Classification. arXiv 2021, arXiv:2103.14586. [Google Scholar] [CrossRef]
  23. Mahmood, K.; Mahmood, R.; van Dijk, M. On the Robustness of Vision Transformers to Adversarial Examples. arXiv 2021, arXiv:2104.02610. [Google Scholar] [CrossRef]
  24. Mo, Y.; Wu, D.; Wang, Y.; Guo, Y.; Wang, Y. When Adversarial Training Meets Vision Transformers: Recipes from Training to Architecture. arXiv 2022, arXiv:2210.07540. [Google Scholar]
  25. Naseer, M.; Ranasinghe, K.; Khan, S.; Khan, F.S.; Porikli, F. On Improving Adversarial Transferability of Vision Transformers. arXiv 2022, arXiv:2106.04169. [Google Scholar] [CrossRef]
  26. Shao, R.; Shi, Z.; Yi, J.; Chen, P.-Y.; Hsieh, C.-J. On the Adversarial Robustness of Vision Transformers. arXiv 2022, arXiv:2103.15670. [Google Scholar] [CrossRef]
  27. Shi, Y.; Han, Y.; Tan, Y.; Kuang, X. Decision-based Black-box Attack Against Vision Transformers via Patch-wise Adversarial Removal. arXiv 2022, arXiv:2112.03492. [Google Scholar] [CrossRef]
  28. Wang, Y.; Wang, J.; Yin, Z.; Gong, R.; Wang, J.; Liu, A.; Liu, X. Generating Transferable Adversarial Examples against Vision Transformers. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; Association for Computing Machinery: New York, NY, USA, 2022; pp. 5181–5190. [Google Scholar]
  29. Wei, Z.; Chen, J.; Goldblum, M.; Wu, Z.; Goldstein, T.; Jiang, Y.-G. Towards Transferable Adversarial Attacks on Vision Transformers. arXiv 2022, arXiv:2109.04176. [Google Scholar] [CrossRef]
  30. Zhang, J.; Huang, Y.; Wu, W.; Lyu, M.R. Transferable Adversarial Attacks on Vision Transformers with Token Gradient Regularization. arXiv 2023, arXiv:2303.15754. [Google Scholar] [CrossRef]
  31. Li, Z.; Yang, W.; Peng, S.; Liu, F. A Survey of Convolutional Neural Networks: Analysis, Applications, and Prospects. arXiv 2020, arXiv:2004.02806. [Google Scholar] [CrossRef]
  32. Younesi, A.; Ansari, M.; Fazli, M.; Ejlali, A.; Shafique, M.; Henkel, J. A Comprehensive Survey of Convolutions in Deep Learning: Applications, Challenges, and Future Trends. arXiv 2024, arXiv:2402.15490. [Google Scholar] [CrossRef]
  33. Mokayed, H.; Quan, T.Z.; Alkhaled, L.; Sivakumar, V. Real-Time Human Detection and Counting System Using Deep Learning Computer Vision Techniques. Artif. Intell. Appl. 2023, 1, 221–229. [Google Scholar] [CrossRef]
  34. Chen, H.; Long, H.; Chen, T.; Song, Y.; Chen, H.; Zhou, X.; Deng, W. M3FuNet: An Unsupervised Multivariate Feature Fusion Network for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5513015. [Google Scholar] [CrossRef]
  35. Bhosle, K.; Musande, V. Evaluation of Deep Learning CNN Model for Recognition of Devanagari Digit. Artif. Intell. Appl. 2023, 1, 114–118. [Google Scholar] [CrossRef]
  36. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762. [Google Scholar] [CrossRef]
  37. Lin, T.; Wang, Y.; Liu, X.; Qiu, X. A Survey of Transformers. arXiv 2021, arXiv:2106.04554. [Google Scholar] [CrossRef]
  38. Islam, S.; Elmekki, H.; Elsebai, A.; Bentahar, J.; Drawel, N.; Rjoub, G.; Pedrycz, W. A Comprehensive Survey on Applications of Transformers for Deep Learning Tasks. arXiv 2023, arXiv:2306.07303. [Google Scholar] [CrossRef]
  39. Papa, L.; Russo, P.; Amerini, I.; Zhou, L. A Survey on Efficient Vision Transformers: Algorithms, Techniques, and Performance Benchmarking. arXiv 2023, arXiv:2309.02031. [Google Scholar] [CrossRef] [PubMed]
  40. Nauen, T.C.; Palacio, S.; Dengel, A. Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers. arXiv 2023, arXiv:2308.09372. [Google Scholar] [CrossRef]
  41. Guo, C.; Sablayrolles, A.; Jégou, H.; Kiela, D. Gradient-based Adversarial Attacks against Text Transformers. arXiv 2021, arXiv:2104.13733. [Google Scholar] [CrossRef]
  42. Wang, X.; Wang, H.; Yang, D. Measure and Improve Robustness in NLP Models: A Survey. arXiv 2022, arXiv:2112.08313. [Google Scholar] [CrossRef]
  43. Chen, G.; Zhao, Z.; Song, F.; Chen, S.; Fan, L.; Wang, F.; Wang, J. Towards Understanding and Mitigating Audio Adversarial Examples for Speaker Recognition. arXiv 2022, arXiv:2206.03393. [Google Scholar] [CrossRef]
  44. Chen, E.-C.; Lee, C.-R. Towards Fast and Robust Adversarial Training for Image Classification. In Computer Vision—ACCV 2020; Ishikawa, H., Liu, C.-L., Pajdla, T., Shi, J., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2021; Volume 12624, pp. 576–591. ISBN 978-3-030-69534-7. [Google Scholar]
  45. Yoo, J.Y.; Qi, Y. Towards Improving Adversarial Training of NLP Models. arXiv 2021, arXiv:2109.00544. [Google Scholar] [CrossRef]
  46. Zhang, H.; Chen, H.; Song, Z.; Boning, D.; Dhillon, I.S.; Hsieh, C.-J. The Limitations of Adversarial Training and the Blind-Spot Attack. arXiv 2019, arXiv:1901.04684. [Google Scholar] [CrossRef]
  47. Gowal, S.; Qin, C.; Uesato, J.; Mann, T.; Kohli, P. Uncovering the Limits of Adversarial Training against Norm-Bounded Adversarial Examples. arXiv 2021, arXiv:2010.03593. [Google Scholar] [CrossRef]
  48. Ma, A.; Faghri, F.; Papernot, N.; Farahmand, A. SOAR: Second-Order Adversarial Regularization. arXiv 2020, arXiv:2004.01832. [Google Scholar] [CrossRef]
  49. Tack, J.; Yu, S.; Jeong, J.; Kim, M.; Hwang, S.J.; Shin, J. Consistency Regularization for Adversarial Robustness. arXiv 2021, arXiv:2103.04623. [Google Scholar] [CrossRef]
  50. Yang, D.; Kong, I.; Kim, Y. Improving Adversarial Robustness by Putting More Regularizations on Less Robust Samples. arXiv 2023, arXiv:2206.03353. [Google Scholar] [CrossRef]
  51. Guo, J.; Liu, Z.; Tian, S.; Huang, F.; Li, J.; Li, X.; Igorevich, K.K.; Ma, J. TFL-DT: A Trust Evaluation Scheme for Federated Learning in Digital Twin for Mobile Networks. IEEE J. Sel. Areas Commun. 2023, 41, 3548–3560. [Google Scholar] [CrossRef]
  52. Sun, H.; Chen, M.; Weng, J.; Liu, Z.; Geng, G. Anomaly Detection for In-Vehicle Network Using CNN-LSTM With Attention Mechanism. IEEE Trans. Veh. Technol. 2021, 70, 10880–10893. [Google Scholar] [CrossRef]
  53. Guo, J.; Li, X.; Liu, Z.; Ma, J.; Yang, C.; Zhang, J.; Wu, D. TROVE: A Context Awareness Trust Model for VANETs Using Reinforcement Learning. IEEE Internet Things J. 2020, 7, 6647–6662. [Google Scholar] [CrossRef]
  54. Ng, A.Y. Feature Selection, L1 vs. L2 Regularization, and Rotational Invariance. In Proceedings of the Twenty-First International Conference on Machine Learning, Banff, AB, Canada, 4–8 July 2004; Association for Computing Machinery: New York, NY, USA, 2004; p. 78. [Google Scholar]
  55. Zou, H.; Hastie, T. Regularization and Variable Selection Via the Elastic Net. J. R. Stat. Soc. Ser. B Stat. Methodol. 2005, 67, 301–320. [Google Scholar] [CrossRef]
  56. Kukačka, J.; Golkov, V.; Cremers, D. Regularization for Deep Learning: A Taxonomy. arXiv 2017, arXiv:1710.10686. [Google Scholar] [CrossRef]
  57. Moradi, R.; Berangi, R.; Minaei, B. A Survey of Regularization Strategies for Deep Models. Artif. Intell. Rev. 2020, 53, 3947–3986. [Google Scholar] [CrossRef]
  58. Kotsilieris, T.; Anagnostopoulos, I.; Livieris, I.E. Special Issue: Regularization Techniques for Machine Learning and Their Applications. Electronics 2022, 11, 521. [Google Scholar] [CrossRef]
  59. Sánchez García, J.; Cruz Rambaud, S. Machine Learning Regularization Methods in High-Dimensional Monetary and Financial VARs. Mathematics 2022, 10, 877. [Google Scholar] [CrossRef]
  60. Maurício, J.; Domingues, I.; Bernardino, J. Comparing Vision Transformers and Convolutional Neural Networks for Image Classification: A Literature Review. Appl. Sci. 2023, 13, 5521. [Google Scholar] [CrossRef]
  61. Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; Vladu, A. Towards Deep Learning Models Resistant to Adversarial Attacks. arXiv 2019, arXiv:1706.06083. [Google Scholar]
  62. Dong, Y.; Liao, F.; Pang, T.; Su, H.; Zhu, J.; Hu, X.; Li, J. Boosting Adversarial Attacks with Momentum. arXiv 2018, arXiv:1710.06081. [Google Scholar] [CrossRef]
  63. Papernot, N.; Faghri, F.; Carlini, N.; Goodfellow, I.; Feinman, R.; Kurakin, A.; Xie, C.; Sharma, Y.; Brown, T.; Roy, A.; et al. Technical Report on the CleverHans v2.1.0 Adversarial Examples Library. arXiv 2018, arXiv:1610.00768. [Google Scholar] [CrossRef]
  64. Yuan, L.; Hou, Q.; Jiang, Z.; Feng, J.; Yan, S. VOLO: Vision Outlooker for Visual Recognition. arXiv 2021, arXiv:2106.13112. [Google Scholar] [CrossRef] [PubMed]
  65. Liu, X.; Peng, H.; Zheng, N.; Yang, Y.; Hu, H.; Yuan, Y. EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention. arXiv 2023, arXiv:2305.07027. [Google Scholar] [CrossRef]
  66. Vasu, P.K.A.; Gabriel, J.; Zhu, J.; Tuzel, O.; Ranjan, A. FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization. arXiv 2023, arXiv:2303.14189. [Google Scholar] [CrossRef]
  67. Wu, K.; Zhang, J.; Peng, H.; Liu, M.; Xiao, B.; Fu, J.; Yuan, L. TinyViT: Fast Pretraining Distillation for Small Vision Transformers. arXiv 2022, arXiv:2207.10666. [Google Scholar] [CrossRef]
Figure 1. ViT model visualization of MNIST dataset [digits 1–3] with (a) and without (b) regularization, including adversarial examples, represented by red and green crosses.
Figure 2. CNN model visualization of MNIST dataset [digits 1–3] with (a) and without (b) regularization, including adversarial examples, represented by red and green crosses.
Figure 3. Adversarial robustness of CNN and ViT models with different configurations trained on the CIFAR-10 dataset.
Table 1. Distance between learned feature clusters and model accuracy for CNNs and ViTs trained on the MNIST dataset [digits 1–3], and adversarial accuracy against FGSM (eps = 0.2).

| Model | Centroid Distance: Mean (x̄) | Centroid Distance: Median (x̃) | Centroid Distance: Std. Dev. (σ) | Adversarial Accuracy |
|---|---|---|---|---|
| CNN | 0.15 | 0.13 | 0.10 | 27.8% |
| Regularized CNN | 0.18 | 0.15 | 0.11 | 66.7% |
| ViT | 1.3 | 1.02 | 1.05 | 66.7% |
| Regularized ViT | 2.4 | 2.2 | 1.4 | 83.3% |
Table 2. Adversarial robustness of different vision transformer configurations on CIFAR-10.

| Patch Size | Number of Heads | Transformer Layers | Average Accuracy |
|---|---|---|---|
| 6 | 4 | 26 | 80.1% |
| 12 | 4 | 26 | 68.8% |
| 24 | 4 | 26 | 77.9% |
| 6 | 8 | 26 | 81.2% |
| 6 | 16 | 26 | 74.6% |
| 6 | 4 | 36 | 80.5% |
| 6 | 4 | 72 | 79.7% |
Table 3. Adversarial robustness of vision transformers and convolutional networks using adversarial training and regularization on CIFAR-10.

| Model Architecture | Adversarial Defense | Clean Data Accuracy ¹ | Average Adversarial Accuracy ² |
|---|---|---|---|
| CNN | Clean Model ³ | 74.2% | 73.0% |
| CNN | Adversarial training (α ≤ 1.4) | 61.2% | 76.5% |
| CNN | Regularized model (α = 10⁻⁶) | 72.5% | 71.0% |
| CNN | Regularized model (α = 10⁻⁵) | 74.7% | 73.8% |
| CNN | Regularized model (α = 10⁻⁴) | 71.1% | 69.3% |
| Vision Transformer | Clean Model ³ | 83.0% | 80.1% |
| Vision Transformer | Adversarial training (α ≤ 1.4) | 83.4% | 81.9% |
| Vision Transformer | Regularized model (α = 10⁻⁶) | 84.0% | 82.2% |
| Vision Transformer | Regularized model (α = 10⁻⁵) | 83.6% | 81.2% |
| Vision Transformer | Regularized model (α = 10⁻⁴) | 81.9% | 79.5% |

¹ Clean data accuracy shows the accuracy of the model against a clean dataset. ² Average adversarial accuracy shows the average accuracy of the model against the aforementioned adversarial attacks. ³ A clean model is a model trained without any regularization or adversarial training.
Table 4. Adversarial robustness of vision transformers and convolutional networks using adversarial training and regularization on CIFAR-100.

| Model Architecture | Adversarial Defense | Clean Data Accuracy ¹ | Average Adversarial Accuracy ² |
|---|---|---|---|
| CNN | Clean Model ³ | 55.0% | 39.5% |
| CNN | Adversarial training (α ≤ 1.4) | 42.8% | 42.2% |
| CNN | Regularized model (α = 10⁻⁶) | 35.6% | 35.6% |
| CNN | Regularized model (α = 10⁻⁵) | 34.9% | 35.4% |
| CNN | Regularized model (α = 10⁻⁴) | 34.0% | 33.9% |
| Vision Transformer | Clean Model ³ | 53.9% | 49.9% |
| Vision Transformer | Adversarial training (α ≤ 1.4) | 45.7% | 51.7% |
| Vision Transformer | Regularized model (α = 10⁻⁶) | 34.9% | 35.3% |
| Vision Transformer | Regularized model (α = 10⁻⁵) | 54.6% | 52.8% |
| Vision Transformer | Regularized model (α = 10⁻⁴) | 49.7% | 47.7% |

¹ Clean data accuracy shows the accuracy of the model against a clean dataset. ² Average adversarial accuracy shows the average accuracy of the model against the aforementioned adversarial attacks. ³ A clean model is a model trained without any regularization or adversarial training.
Table 5. Comparison of adversarial accuracy between our proposed method and SOTA models on the CIFAR-10 and CIFAR-100 datasets.

| Model | CIFAR-10 Clean Data Accuracy ¹ | CIFAR-10 Average Adversarial Accuracy ² | CIFAR-100 Clean Data Accuracy ¹ | CIFAR-100 Average Adversarial Accuracy ² |
|---|---|---|---|---|
| ResNet-50 [4] | 76.5% | 74.0% | 40.4% | 40.4% |
| VOLO [64] | 74.9% | 74.3% | 42.1% | 41.7% |
| EfficientViT [65] | 75.0% | 75.2% | 40.1% | 39.5% |
| FastViT [66] | 80.0% | 80.9% | 52.0% | 51.7% |
| TinyViT [67] | 80.0% | 80.3% | 49.9% | 49.1% |
| Regularized CNN (ours) | 74.7% | 73.8% | 35.6% | 35.6% |
| Regularized ViT (ours) | 84.0% | 82.2% | 54.6% | 52.8% |

¹ Clean data accuracy shows the accuracy of the model against a clean dataset. ² Average adversarial accuracy shows the average accuracy of the model against the aforementioned adversarial attacks.