1. Introduction
Agriculture plays a critical role in contemporary society: a growing global population and climate change both strain the worldwide food supply. Promoting technology that optimizes production processes is therefore essential for enhancing productivity, which remains vulnerable to crop diseases. Many crop supervision tasks are labor-intensive and require considerable expert time to determine the current state of crops.
Recent advancements in computer vision and artificial intelligence techniques offer promising solutions for automating the detection and classification of leaf diseases. Deep-learning methods have shown impressive performance across various crops. However, despite their effectiveness, these models often lack interpretability, making it challenging to understand the reasoning behind their decisions [
1]. Some efforts have focused on enhancing models for detecting plant diseases, with a particular emphasis on the critical role of visualization [
2]. These studies have demonstrated significant success in accurately localizing infected regions in leaf images and highlighting them with heat map overlays. Despite the promising outcomes of using eXplainable Artificial Intelligence (XAI) methods in plant disease detection, it is recognized that the field of XAI is still developing. Current explanation maps remain inadequate for fully supporting decision-making processes in this domain [
1].
Interpretability in Convolutional Neural Networks (CNNs) refers to the ability to understand, explain, and attribute the decisions and predictions that the network makes. It is crucial for analyzing why a model makes certain decisions, which is particularly important in critical applications.
The earliest visualization and interpretation work on CNNs was developed for AlexNet, to explore and identify architectures that perform better on a specific problem [
3] and to provide tools for visualizing the neurons of a CNN in response to specific inputs, as well as for analyzing activation patterns across different types of input [
4]. Local Interpretable Model-agnostic Explanations (LIME) was later introduced as a solution that approximates the original model by identifying a model that is locally faithful to the classifier on an interpretable representation, such as a binary vector indicating the presence or absence of a contiguous zone of similar pixels (a superpixel) [
5]. On the other hand, using game theory, SHapley Additive exPlanations (SHAP) is presented as a unified measure of feature importance that is approximated by several methods and, according to its authors, best matches human intuition [
6].
In order to identify the areas of an image that represent greater importance when classifying with a CNN-based model, recent initiatives have been oriented towards methods based on class activation mapping (CAM) [
7]. One of the most representative methods is Gradient-Weighted CAM (GradCAM), which computes the local importance from the gradient of the output of a given convolutional layer (usually the last one) with respect to the class predicted by the model [
8]. GradCAM++ is presented as an enhancement to GradCAM that uses second-order gradients to improve object localization or multiple occurrences of the same class in an image [
9], while XGradCAM is presented as an improvement to GradCAM in terms of both sensitivity and conservation, where sensitivity is understood as the consistency between the removal of input features and the respective change in output, and conservation refers to the magnitude of the output equivalent to the sum of the explanation responses [
10].
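As a concrete illustration, the core GradCAM computation reduces to a few lines, assuming the activations of the chosen convolutional layer and the gradients of the predicted-class score with respect to them have already been extracted (the array shapes below are illustrative, not taken from the paper):

```python
import numpy as np

def grad_cam(activations: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """GradCAM map from one layer's activations and the gradients of the
    predicted-class score with respect to them.

    activations, gradients: shape (K, H, W) -- K feature maps of size H x W.
    """
    # Channel weights: global-average-pool the gradients over the spatial dims.
    weights = gradients.mean(axis=(1, 2))                      # shape (K,)
    # Weighted sum of feature maps, then ReLU to keep positive evidence only.
    cam = np.maximum((weights[:, None, None] * activations).sum(axis=0), 0.0)
    # Normalize to [0, 1] for display as a heat map.
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam

# Toy example: 4 feature maps of size 7x7.
rng = np.random.default_rng(0)
cam = grad_cam(rng.normal(size=(4, 7, 7)), rng.normal(size=(4, 7, 7)))
```

Variants such as GradCAM++ and XGradCAM differ essentially in how the channel weights are computed from the gradients.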
Other recent examples of class activation mapping methods include HiResCAM, LayerCAM and EigenCAM. HiResCAM computes the gradient with respect to the feature maps and then obtains an attention map from the sum, along the feature dimension, of the element-wise products of the gradient and the feature maps. These element-wise operations reflect the model computations when the gradients indicate that some elements of the feature map should be scaled or have their sign reversed [
11]. A variant of the latter method, GradCAMElementWise, performs an element-wise multiplication of activations with gradients and then applies a ReLU (Rectified Linear Unit) operation before summation [
12]. In LayerCAM, to obtain the class activation map of a layer, the activation value of each location in the feature map is multiplied by a weight that depends on the backward class-specific gradients; for locations with positive gradients, their gradients are used as the weight, while locations with negative gradients are assigned zero [
13]. An alternative that, according to its authors, does not rely on feature weighting such as gradient backpropagation or class relevance scoring, is the so-called EigenCAM method, which uses the principal components of the learned representations of the convolutional layers to create the visual explanations [
14].
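In contrast with the gradient-based methods, EigenCAM needs only the activations. A simplified sketch of the idea (sign handling and normalization details vary between implementations) is to project each spatial location's feature vector onto the first principal component of the layer's activations:

```python
import numpy as np

def eigen_cam(activations: np.ndarray) -> np.ndarray:
    """EigenCAM-style map: project the feature maps onto their first
    principal component (no gradients involved).

    activations: shape (K, H, W).
    """
    k, h, w = activations.shape
    flat = activations.reshape(k, h * w).T        # (H*W, K): one row per pixel
    flat = flat - flat.mean(axis=0)               # center before PCA
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    cam = (flat @ vt[0]).reshape(h, w)            # projection on 1st component
    cam = np.maximum(cam, 0)                      # keep positive projections
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam

cam = eigen_cam(np.random.default_rng(1).normal(size=(8, 7, 7)))
```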
The methods described so far are primarily designed to illustrate which parts of an image influence the decision of a convolutional neural network. They have been validated on different CNN architectures, including Inception (LIME), ResNet (SHAP, GradCAM, XGradCAM) and VGG16 (SHAP, GradCAM, GradCAM++, XGradCAM, LayerCAM, EigenCAM), with VGG16 and ResNet being the most commonly used architectures when validating feature visualization methods based on class maps. In addition, the great diversity of methods for identifying the key regions for a classifier makes the selection of a specific method far from straightforward [
15]. Also, the diversity of CNN architectures and their evolution can hinder the process of selecting and analyzing CAM-type methods.
As is well known, CNNs revolutionized computer vision research by learning feature extraction during training, without the need for manual feature engineering [
16,
17]. In addition to this, Vision Transformers (ViT), derived from the original work on natural language processing tasks, have become a competitive option thanks to their scalability in relation to model and dataset size [
18]. Under this scenario, standard CNNs were also modernized towards the structure of a hierarchical Vision Transformer, without introducing any attention-based modules [
16]. In this way, CNNs can be as scalable as Vision Transformers, which is especially valuable in scenarios requiring low complexity. This family, designed to maintain the simplicity and efficiency of CNNs, has been named ConvNext [
16], and its evolution, ConvNext2, is based on self-supervised learning [
18].
As previous research has pointed out, the quality of heat maps depends both on the methods by which they are calculated and on the performance of the classifier (model+data) [
19]. Consequently, this paper evaluates the interpretability of CNN models in the classification of coffee leaf rust. In particular, three CNN architectures (VGG16, ResNet50 and ConvNext-T) are evaluated via seven CAM techniques, both qualitatively and quantitatively, using the ROAD metric.
The rest of the paper is structured as follows:
Section 2 provides the fundamentals of ConvNext networks, the selected CAM methods, and the ROAD metric.
Section 3 describes the coffee leaf dataset, the pre-trained models and the evaluation method.
Section 4 presents preliminary qualitative results for some example data, followed by quantitative results evaluated on the test dataset. Finally,
Section 5 concludes the paper.
3. Materials and Methods
3.1. Dataset
In this study, several datasets were combined into a comprehensive image repository used for the training, validation and testing stages of CNN model design and evaluation for the detection of coffee leaf rust. These datasets include Rocole [
27], Bracol [
28], Digipathos [
29], D&P [
30] and Licole [
31]. They encompass instances of both healthy leaves and various disease types: rust (R: Hemileia vastatrix), leaf miner (LM: Leucoptera coffeella), phoma (P: Phoma spp.), cercospora (C: Cercospora coffeicola), red spider mite (RSM: Oligonychus yothersi), bacterial blight (BB: Pseudomonas syringae), blister spot (BS: Colletotrichum gloeosporioides) and sooty mold (SM: Capnodium spp.).
The datasets are notably diverse, varying in the presence of distractors, illumination conditions (artificial sources or natural sunlight), backgrounds, and the number of leaves per image. To prepare the dataset for analysis, several preparatory steps were carried out: cropping images to isolate coffee leaves, removing distractors such as the hands of coffee growers, excluding images featuring diseases other than rust, and eliminating artificial backgrounds. The resulting dataset consists of 1686 images classified as healthy, 1723 images depicting rust, and 1928 images portraying other diseases (specifically, LM, P, C, RSM, BB, BS, and SM). The images were distributed among training, validation, and testing subsets in proportions of 70%, 20%, and 10%, respectively. Additionally, all the JPG images were uniformly resized to a common pixel size to facilitate consistent computational processing and analysis in the RGB color space.
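A minimal sketch of such a 70/20/10 split is shown below; the shuffling mechanics and the fixed seed are assumptions for illustration, since the paper does not specify them:

```python
import random

def split_dataset(paths, train=0.7, val=0.2, seed=42):
    """Shuffle the image list and split it into train/validation/test.
    The remaining fraction after `train` and `val` goes to the test set."""
    paths = list(paths)
    random.Random(seed).shuffle(paths)
    n = len(paths)
    n_train = int(n * train)
    n_val = int(n * val)
    return (paths[:n_train],
            paths[n_train:n_train + n_val],
            paths[n_train + n_val:])

# Toy example with the class totals reported above (healthy + rust + other).
images = [f"img_{i}.jpg" for i in range(1686 + 1723 + 1928)]
train_set, val_set, test_set = split_dataset(images)
```

In practice, a stratified split per class would keep the three class proportions balanced across the subsets.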
3.2. Pre-Trained Models
The CAM maps were calculated from three CNN classifiers based on VGG, ResNet, and ConvNext. The first two are the architectures most widely used in the literature for evaluating CAM methods, while the third is one of the most recent architectures designed to be competitive in computer vision applications and is widely available for transfer learning and/or fine-tuning tasks (e.g., in frameworks such as TensorFlow or PyTorch).
The selected architectures were instantiated from the pre-trained models and weights available in PyTorch. The versions selected were VGG16 for VGG, ResNet50 for ResNet, and the lightest ConvNext variant (ConvNext-T). In all three cases, the classification stage of the architecture was adjusted to three classes.
A fine-tuning process was applied to the three pre-trained models using the training and validation subsets of the dataset described above. Fine-tuning was performed on all model layers for 50 epochs, with the Adam optimizer and learning-rate decay. A checkpoint was used to save the model only upon an improvement in validation accuracy. The performance achieved during model fine-tuning is shown in
Table 3 and
Table 4.
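The fine-tuning loop just described can be sketched as follows; the learning rate, decay schedule, and the in-memory checkpoint are illustrative assumptions, since the paper does not report exact hyperparameters:

```python
import torch
import torch.nn as nn

def fine_tune(model, train_loader, val_loader, epochs=50, lr=1e-4):
    """Fine-tune all layers with Adam and learning-rate decay, keeping a
    checkpoint only when validation accuracy improves."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=10, gamma=0.5)
    loss_fn = nn.CrossEntropyLoss()
    best_acc, best_state = -1.0, None          # -1 so the first epoch checkpoints
    for _ in range(epochs):
        model.train()
        for x, y in train_loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
        sched.step()                            # learning-rate decay
        model.eval()
        correct = total = 0
        with torch.no_grad():
            for x, y in val_loader:
                correct += (model(x).argmax(1) == y).sum().item()
                total += y.numel()
        acc = correct / total
        if acc > best_acc:                      # checkpoint on improvement only
            best_acc = acc
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
    return best_acc, best_state
```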
3.3. Selected CAM Methods
To identify the regions of importance in the decision-making of the selected classification models, class activation mapping methods were used, starting from the pioneering GradCAM scheme and moving towards methods proposed to improve different aspects of interpreting the model output. The selected CAM schemes are listed below:
GradCAM;
XGradCAM;
HiResCAM;
LayerCAM;
GradCAM++;
GradCAMElementWise;
EigenCAM.
3.4. Evaluation Using ROAD
As discussed so far, feature attribution methods such as those based on CAM aim to identify the importance of input features for a model’s decision, within what is known as XAI. As the supply of such methods has increased, so has the need to establish techniques to evaluate these methods [
26].
This type of strategy for evaluating attribution methods is generally based on perturbing the input features that the method has identified as important or unimportant. It is assumed that perturbing the most important input features degrades the performance of the model, while perturbing less important features has little impact [
26]. An example of such features can be the pixels of the regions identified via the attribution method (e.g., CAM).
According to the proposed methodology, the output feature map of each selected CNN architecture was evaluated using a CAM-type method. This CAM map was then evaluated using the ROAD metric, and the results were compared in order to assess the consistency of the selected CNN architectures. The objective was to address the veracity/fidelity of the explanation with respect to the model prediction, i.e., to assess correctness, following a recent classification of quantitative evaluation of Explainable AI methods [
32]. The selected evaluation method consists of removing input features as a function of the explanation (the CAM map) and measuring the change in the predictive model's output [
32].
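The remove-and-measure idea can be sketched as follows. Note that this is a deliberately simplified version: ROAD proper replaces the removed pixels by noisy linear imputation from their neighbors, precisely to avoid mask-shape artifacts, whereas this sketch uses a constant fill. The toy `model` is a stand-in, not one of the paper's classifiers:

```python
import numpy as np

def perturbation_score(model, image, cam, fraction=0.2, fill=0.0):
    """Confidence drop after removing the `fraction` of pixels that the
    CAM map ranks as most important (constant fill instead of ROAD's
    noisy linear imputation)."""
    base = model(image)
    k = int(fraction * cam.size)
    # Indices of the top-k most important pixels according to the CAM map.
    idx = np.unravel_index(np.argsort(cam, axis=None)[-k:], cam.shape)
    perturbed = image.copy()
    perturbed[idx] = fill
    return base - model(perturbed)   # large drop => faithful explanation

# Toy stand-in "model": confidence = mean brightness of the image.
model = lambda img: float(img.mean())
rng = np.random.default_rng(2)
image = rng.uniform(size=(16, 16))
cam = image.copy()                   # this CAM perfectly ranks bright pixels
drop = perturbation_score(model, image, cam)
```

Here the explanation ranks exactly the pixels the toy model depends on, so removing them produces a positive confidence drop.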
4. Results and Discussion
This section presents the results of applying the proposed methodology to the dataset outlined above: first a visual analysis of selected examples, and then the consolidated quantitative results for the test data.
4.1. Preliminary Results (Qualitative)
In order to illustrate the differences between the regions of importance identified by the CAM methods in each of the three architectures, three examples with different levels of rust severity are shown.
Figure 3,
Figure 4 and
Figure 5 first show the original image evaluated and then the seven CAM maps (GradCAM, XGradCAM, HiResCAM, LayerCAM, GradCAM++, GradCAMElementWise, and EigenCAM) obtained with each of the three architectures.
In the case of
Figure 3, it is observed that the CAM maps obtained after applying the VGG16-based model show significant differences among them, mainly associated with the localization of the zones. However, some results show high similarity as in the case of HiResCAM, LayerCAM, and GradCAMElementWise. The results obtained using the ResNet50 architecture show a higher correlation between them, in terms of the localization and intensity of the most important zones (red tones). The differences found are mainly related to the size of the area. On the other hand, the CAM maps obtained with the ConvNext-T architecture present greater consistency among themselves in terms of localization and intensity, particularly for the first six methods shown. In this case, only the last method (EigenCAM) differs significantly from the other six.
The second test image shows a higher number of areas with rust presence (
Figure 4). First, the results for VGG16 show a great diversity of results among the different CAM methods evaluated. For example, there are methods that identify a single zone (GradCAM), methods that identify a large number of regions (HiResCAM, LayerCAM, GradCAMElementWise) or methods that identify regions outside the affected zone (GradCAM++). In the case of ResNet50-based results, from a similarity point of view, it is possible to group the results into two groups: The first identifies a small region of great importance inside the leaf, but also identifies regions outside the leaf (GradCAM, XGradCAM, HiResCAM). The second group highlights a wider area inside the leaf, covering a larger number of affected areas, but also identifies areas outside the leaf (LayerCAM, GradCAM++, GradCAMElementWise, EigenCAM). Something similar happens with the results obtained by ConvNext-T. In this case, the first group shows a small region near the leaf tip with the presence of rust, which is cataloged as being of great importance (GradCAM, XGradCAM, HiResCAM, LayerCAM). The second group identifies much of the leaf, with the leaf tip being of greatest importance, although the heat map includes areas that fall outside the leaf.
The third example (
Figure 5) shows a leaf with little presence of rust in specific areas. Again, some of the results obtained with the VGG16 architecture are similar to each other, but others vary substantially. In the case of ResNet50, the results show a similar trend to the two previous examples, where they can be framed in two groups that vary mainly by the size of the identified area. On the other hand, the results obtained with ConvNext-T show a great consistency, being very similar to each other.
4.2. Consolidated Results (Quantitative)
The proposed methodology was applied to the entire test dataset in order to evaluate the consistency of the results. For each of the three selected CNN architectures (VGG16, ResNet50, and ConvNext-T), tests were performed with each of the seven CAM methods on the 480 test images; that is, each architecture yields one set of ROAD metric results per CAM method.
Since the visual results show considerable similarities between the maps obtained with different CAM methods, especially for the ConvNext-T architecture, scatter plots were produced to show the relationship between the ROAD metrics of two given CAM methods. For comparison purposes, the ROAD metric obtained with the baseline method, GradCAM, was fixed on the X-axis, while the Y-axis varies with each of the other six selected CAM methods.
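The pairwise agreement plotted this way can also be summarized numerically, e.g., with a Pearson correlation between the per-image ROAD scores of two CAM methods; the score vectors below are synthetic placeholders, not the paper's results:

```python
import numpy as np

def road_agreement(scores_a, scores_b):
    """Pearson correlation between per-image ROAD scores of two CAM methods;
    values near 1 indicate a proportional trend in the scatter plot."""
    return float(np.corrcoef(scores_a, scores_b)[0, 1])

rng = np.random.default_rng(3)
gradcam_scores = rng.normal(size=480)                  # one score per test image
similar = gradcam_scores + 0.1 * rng.normal(size=480)  # near-identical method
unrelated = rng.normal(size=480)                       # uncorrelated method
```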
Figure 6 shows the scatter plots after calculating the ROAD metric on the test dataset evaluated with the VGG16 architecture. These results show no correspondence between the results obtained with different CAM methods. For example, for methods such as XGradCAM, GradCAM++ or EigenCAM, in a large number of cases positive metric values for GradCAM correspond to negative metric values for the other CAM method, and vice versa. Another aspect that stands out is the lack of proportionality between the two axes: an increase in one metric does not imply an increase in the other.
Figure 7 shows the scatter plots after calculating the ROAD metric on the test dataset evaluated with the ResNet50 architecture. One fundamental aspect stands out in these scatter plots: the proportionality of the results. Although there is large variability around a hypothetical line of slope 1, there is a clear trend for the Y-axis values to increase as the X-axis values increase. This characteristic is an important difference with respect to the results obtained with VGG16.
Finally,
Figure 8 shows the scatter plots after calculating the ROAD metric on the test dataset evaluated with the ConvNext-T architecture. The first thing that stands out in this plot is that the results of the XGradCAM and HiResCAM methods are highly similar to the results obtained with GradCAM. The values in these two plots, although they look like a perfect straight line, have slight differences as illustrated by the ROAD values shown in
Figure 3,
Figure 4 and
Figure 5d. What is corroborated here is that the regions identified through the GradCAM, XGradCAM, and HiResCAM methods have a nearly equal impact according to the ROAD metric values. Third, the results obtained with the LayerCAM method show a trend proportional to the results obtained with GradCAM, i.e., they articulate around a straight line of slope 1, but with much less variation compared to the results of the ResNet50 architecture. The remaining three methods do not present high similarities. Nevertheless, the results using ConvNext-T are similar for four of the seven CAM methods evaluated, as shown in
Figure 3,
Figure 4 and
Figure 5d.
According to the results shown, the identification of regions of importance for CNN-based classifiers presents more stable results when using the ConvNext-T architecture. As shown in
Figure 3,
Figure 4 and
Figure 5d, the locations of the regions identified by all CAM methods in the ConvNext architecture are very similar to each other, and their importance in the classifier decision process evaluated with the ROAD metric shows the same trend for four of the seven CAM methods evaluated.
These results show a large variability between the regions identified by either method in two of the most commonly used architectures when evaluating CAM methods. In contrast, the regions of importance identified by the methods in the ConvNext-T architecture are highly similar in most cases.
The consistency of the CAM maps delivered by the ConvNext architecture represents a competitive advantage, given the great diversity of evaluation methods that support this type of explanation. The present study focused on evaluating correctness, i.e., the fidelity of the explanation with respect to the model, by masking input features and evaluating the change in the output of the predictive model; however, heat maps can also be used to evaluate other properties of explanation quality, such as completeness, continuity, contrastivity, covariate complexity, compactness or coherence, as described in a recent study on Explainable AI [
32].
In terms of the type of architecture, an aspect that can make a difference is the use of depth-wise convolutions. These ensure that in each operation performed by the network, information is mixed across only one dimension (either spatial or channel), but not both at the same time [
16]. This can be further studied by considering other architectures that use this type of convolution, such as those based on Xception [
33] or MobileNet [
34].
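The separation between spatial-only and channel-only mixing can be illustrated with the two convolution types that ConvNext-style blocks combine (channel counts here are illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 96, 56, 56)    # (batch, channels, H, W)

# Depth-wise 7x7 convolution: groups == channels, so each channel is
# filtered independently -- information mixes only across space.
spatial_mix = nn.Conv2d(96, 96, kernel_size=7, padding=3, groups=96)

# 1x1 (point-wise) convolution: the kernel covers a single position, so
# information mixes only across channels.
channel_mix = nn.Conv2d(96, 96, kernel_size=1)

y = channel_mix(spatial_mix(x))
```

The depth-wise layer's weight tensor has shape `(96, 1, 7, 7)`: one single-channel 7x7 filter per channel, confirming that no channel mixing occurs in that step.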
Another aspect to consider is the type of data or the specific problem on which the proposed methodology has been evaluated. Although the present study focused on coffee leaf rust classification, a first extension would be to apply these results to rust detection on other crops, or to other leaf diseases. Beyond this, similar studies could be applied to other classification problems or even to object detection.