Article

SA-ConvNeXt: A Hybrid Approach for Flower Image Classification Using Selective Attention Mechanism

College of Information Science and Technology, Gansu Agricultural University, Lanzhou 730070, China
* Author to whom correspondence should be addressed.
Mathematics 2024, 12(14), 2151; https://doi.org/10.3390/math12142151
Submission received: 17 June 2024 / Revised: 4 July 2024 / Accepted: 8 July 2024 / Published: 9 July 2024

Abstract

In response to the current lack of annotations for flower images and the insufficient focus on key image features in traditional deep learning-based fine-grained flower image classification, this study proposes the SA-ConvNeXt flower image classification model. Initially, in the image preprocessing stage, a padding algorithm was used to prevent the image deformation and loss of detail caused by scaling. Subsequently, multi-level feature extraction was integrated into the Efficient Channel Attention (ECA) mechanism, forming an M-ECA structure that captures channel features at different levels; a pixel attention mechanism was also introduced to filter out irrelevant or noisy information in the images. Following this, a parameter-free attention module (SimAM) was introduced after the depth-wise convolution in the ConvNeXt Block to reweight the input features. SANet, which combines the M-ECA and pixel attention mechanisms, was employed at the end of the module to further enhance the model’s dynamic extraction of channel and pixel features. To improve the model’s generalization capability, transfer learning was used to transfer the weights of ConvNeXt pretrained on the ImageNet dataset to the SA-ConvNeXt model. During training, the Focal Loss function and the Adam optimizer were used to address sample imbalance and reduce gradient fluctuations, thereby enhancing training stability. Finally, the Grad-CAM++ technique was used to generate heatmaps of the classification predictions, facilitating the visualization of effective features and deepening the understanding of the model’s focus areas. Comparative experiments were conducted on the Oxford Flowers102 flower image dataset. Compared to existing flower image classification techniques, SA-ConvNeXt performed excellently, achieving a high accuracy of 96.7% and a recall rate of 98.2%, improvements of 4.0% and 3.7%, respectively, over the original ConvNeXt. The results demonstrate that SA-ConvNeXt can effectively capture more accurate key features of flower images, providing an effective technical means for flower recognition and classification.

1. Introduction

In the information era, the amount of image data has surged dramatically. As an important domain of fine-grained image classification, the accurate classification of flower images holds significant importance for botanical research and applications. Flower image classification presents unique challenges: different species of flowers may appear similar, while flowers of the same species can exhibit significant variations, such as missing or deformed petals [1]. Past classification methods [2,3,4] relied mainly on manually designed image features, such as color, texture, and shape. However, these methods were not ideal in terms of classification performance, being susceptible to image variations and lacking robustness and expressiveness. Thus, flower image classification is a challenging subject.
Flower image classification can usually be categorized into two types: with segmentation and without segmentation. Methods with segmentation attempt to improve classification accuracy by removing irrelevant background information. For example, Vezhnevets et al. [5] proposed a GrowCut automatic segmentation algorithm based on label extraction, preceded by threshold segmentation for preprocessing, but this method might be limited by the set threshold, affecting the segmentation quality. Zagrouba et al. [6] proposed a flower image segmentation scheme combining saliency detection and GrabCut. This method first trains a classifier for the foreground and background, then successfully separates the main parts of the flower from the background using the GrabCut algorithm. Mabrouk et al. [4] proposed a segmentation method based on color feature extraction, applying the maximum between-class variance (OTSU algorithm) in Lab color space to differentiate the foreground and background of flower images. Additionally, Yang et al. [7] introduced a background prior saliency detection algorithm that treats the image’s edges as prior background and calculates saliency using graph-based ranking algorithms based on the similarity of labeled nodes. This method effectively aids in extracting the main areas of flower images without the need for prior labeling or separate training for each image category. Non-segmented methods attempt classification by finding distinctive features, usually divided into those based on manual design and neural networks. Khah et al. [8] proposed a classification method combining color and shape features, extracting color and shape features at the same points on each flower image, building corresponding feature codewords, and statistically determining the probability of each flower’s color feature occurrence based on the codeword. Fernando et al. [9] proposed a feature fusion algorithm based on logistic regression models (LRFF), extracting HUE color features and SIFT features, creating codewords, and using logistic regression to calculate the most discriminative codeword weights for each flower on different codewords, then concatenating BoW vectors weighted for classification, typically allowing the model to accept flower image inputs and train and predict end-to-end.
Although various flower image classification methods have shown good results, they all rely on manually designed image features. Under different imaging conditions, changes in color, texture, and local information often lead to changes in the features representing this information. The rise of deep learning has brought innovative solutions to the field of image classification. Advancements in artificial intelligence have largely been propelled by enhancements in neural network architectures. Krizhevsky et al. [10] introduced the AlexNet model, which won the ILSVRC image classification competition in 2012 with an error rate reduced to 16.4%. In the following years, a series of network architectures such as VGGNet (Visual Geometry Group Network), GoogLeNet, InceptionV2 (Inception Version 2), InceptionV3 (Inception Version 3), ResNet (Residual Network), and Inception-ResNet-V2 [11,12,13,14,15,16] were proposed, with increasing network layers and continually decreasing error rates, demonstrating the importance of network architecture innovation for improving image classification accuracy. Gogul et al. [17] proposed a method of flower image classification combining deep and manual features, achieving lower feature dimensions and higher classification accuracy. Xia et al. [18] proposed an InceptionV3 network combining saliency detection and transfer learning, effectively improving the accuracy of flower image classification. Zhang et al. [19] introduced a flower image classification algorithm based on an improved ResNet18, significantly enhancing the accuracy and efficiency of fine-grained flower image classification by integrating a fully convolutional structure and a mixed domain attention mechanism. Rahman et al. [20] performed flower identification with deep learning models such as DenseNet201, achieving high accuracy. Li et al. [21] utilized a Swin Transformer enhanced with a weakly supervised object positioning module for wild mushroom classification, focusing on differentiated regions and reducing background disturbances. Lee et al. [22] introduced Plant-CNN-ViT, an ensemble model blending Vision Transformer and DenseNet-201, for plant leaf classification.
In response to the issue that traditional classification methods rely on strong manual intervention for feature collection, making the model less robust in fine-grained classification scenarios, this study focuses on fine-grained flower image classification based on deep learning methods. Although many deep learning-based flower image classification methods have achieved certain success, a key issue remains: the lack of network attention. The lack of network attention means that when extracting features from flower images, the network cannot differentiate between the importance levels of features, leading to the model possibly allocating equal attention and weights to all input features during training and inference. Additionally, sample imbalance is another significant challenge, potentially biasing the model towards majority classes and exacerbating overfitting. Sample imbalance in training data also limits the model’s generalization ability.
In response to the aforementioned issues, this paper introduces a flower image classification model named SA-ConvNeXt, which is based on a Selective Attention Network combined with ConvNeXt. The main contributions and innovations of this paper are as follows: (1) Image Preprocessing and Feature Extraction: to address potential deformation and loss of detail due to image scaling, a padding algorithm is employed to fix the image size by padding around the image. (2) Selective Attention Network: utilizes the Multi-level Efficient Channel Attention (M-ECA) and Pixel Attention (PA) mechanisms, dynamically allocating weights while capturing channel and pixel features at various levels. (3) Dynamic Hierarchical Fusion Attention Mechanism: a novel dynamic hierarchical fusion attention mechanism is proposed, which dynamically learns the importance weights of features and adjusts their weighting accordingly, achieving the precise capture and optimization of key features. (4) Parameter-Free Attention Module: employs SimAM to reweight input features without introducing additional parameters. (5) Model Generalization and Training Optimization: the model’s generalization capability is enhanced through transfer learning by utilizing the pretrained weights of ConvNeXt on the ImageNet dataset. During training, the Focal Loss function is used to address the issue of sample imbalance, while the Adam optimizer is selected to reduce gradient fluctuations caused by the random initialization of initial weights.
Section 2 of the article discusses image preprocessing methods, while Section 3 will provide a detailed explanation of the aforementioned techniques and methods. Section 4 will present the experimental results and discussion on the model, and Section 5 will offer the final conclusions. These sections are designed to give readers a clear understanding and comprehensive evaluation of the results of this study.

2. Image Preprocessing in SA-ConvNeXt

In deep learning image classification tasks, a fixed input image size is required. The direct scaling of images can lead to deformation due to inconsistent aspect ratios, compressing morphological features in the image and resulting in a loss of detail information, which affects classification accuracy. This study, considering the characteristics of floral images, designed a comprehensive image preprocessing workflow to optimize the utilization of information and enhance the model’s generalization ability. The designed workflow is illustrated in Figure 1.

2.1. Image Interpolation and Image Filling

In digital image processing, interpolation is a key technique for calculating new pixel positions when scaling images. The choice of interpolation algorithm significantly impacts image quality and processing performance. Nearest neighbor interpolation is a simple and fast method that uses only the closest pixel values for computation, which is highly efficient but can lead to jagged edges that affect the visual outcome. In contrast, bicubic and Lanczos interpolations consider multiple neighboring pixels and apply complex mathematical formulas to improve image quality, although this makes the computations more complex and time-consuming. Bilinear interpolation, which lies between these two methods, calculates a weighted average of the four nearest neighboring pixels to balance computational efficiency and image quality [23]. Consequently, this study chooses bilinear interpolation for image scaling to optimize both efficiency and effectiveness.
After completing the interpolation of the input image, a padding strategy is employed, predominantly with a gray background, to ensure the image reaches the desired target size. This centers the core content of the image, maintaining its original proportions and integrity. It not only enhances the visual focal point but also effectively avoids potential image deformation or distortion, ensuring the accuracy and reliability of the image processing workflow. The result of the image after applying the padding algorithm is shown in Figure 2.
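As an illustration of this preprocessing step, the sketch below uses OpenCV (the image processing library listed in Section 4.1) to scale an image with bilinear interpolation while preserving its aspect ratio and then pad the borders with a gray background; the target size of 224 and the gray value of 128 are illustrative assumptions rather than values specified in the text.

```python
import cv2

def resize_with_padding(image, target_size=224, pad_value=128):
    """Bilinearly resize an image while keeping its aspect ratio, then pad the
    borders with a gray background so the output is target_size x target_size."""
    h, w = image.shape[:2]
    scale = target_size / max(h, w)                      # shrink the longer side to target_size
    new_w, new_h = int(round(w * scale)), int(round(h * scale))
    resized = cv2.resize(image, (new_w, new_h), interpolation=cv2.INTER_LINEAR)

    # Distribute the remaining pixels evenly so the flower stays centered.
    top = (target_size - new_h) // 2
    bottom = target_size - new_h - top
    left = (target_size - new_w) // 2
    right = target_size - new_w - left
    return cv2.copyMakeBorder(resized, top, bottom, left, right,
                              borderType=cv2.BORDER_CONSTANT,
                              value=(pad_value, pad_value, pad_value))
```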

2.2. Image Normalization

The normalization of image data is a crucial preprocessing step in deep learning and other machine learning algorithms [24] as it addresses the issue of variable data scales that can destabilize the learning process. Data that are not normalized may cause disproportionate updates in network weights, leading to challenges in training convergence.
Min–max normalization, particularly beneficial for image data [25], is a method that rescales data to a specific range, typically [0, 1]. This rescaling is accomplished via a simple linear transformation that adjusts the original data values to a uniform scale, enhancing algorithmic stability and performance. The formula for min–max normalization is depicted in Equation (1), where $x_i$ represents the original pixel values:
$N_{norm} = \frac{x_i - \min(x)}{\max(x) - \min(x)}$ (1)
This transformation ensures that all input features contribute equally to the model’s learning, preventing biases associated with the inherent scale differences in the raw data.
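A minimal sketch of Equation (1) applied to an image array is shown below; the small epsilon term is an added safeguard against division by zero for constant images and is not part of the original formula.

```python
import numpy as np

def min_max_normalize(image):
    """Rescale pixel values to [0, 1] using the min-max transform of Equation (1)."""
    image = image.astype(np.float32)
    x_min, x_max = image.min(), image.max()
    return (image - x_min) / (x_max - x_min + 1e-8)  # epsilon avoids division by zero
```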

3. Materials and Methods

3.1. Model Exploration

In the field of image recognition, traditional Convolutional Neural Networks (CNNs) have long held a dominant position. However, the emergence of the Vision Transformer (ViT) has introduced new perspectives in computer vision, with its exceptional performance in image classification [26] even challenging the leading position of CNNs. In 2022, Facebook AI Research (FAIR) introduced a new model named ConvNeXt, which combines the characteristics of ResNet and ViT and demonstrated performance surpassing other models on the mainstream ImageNet-1K benchmark [27].
Within its architectural depth, the ConvNeXt structure is based on the four stages of ResNet50, adjusting the number of blocks in each stage with a ratio of 1:1:3:1. Unlike traditional ResNeXt, ConvNeXt introduces an inverted bottleneck mechanism similar to MobileNetV2 to enhance feature expression, aiming to retain more feature information during downsampling. At the same time, both the downsampling strategy and the core blocks of ConvNeXt have been meticulously optimized. Inspired by the grouped convolution strategy of ResNeXt, ConvNeXt adopts depth-wise separable convolutions to replace the traditional 3 × 3 convolutions, while increasing the network’s width to achieve a high-quality balance between model parameters and accuracy. Furthermore, influenced by the Transformer’s strategy of using fewer activation functions and normalization layers, ConvNeXt significantly reduces the number of activation functions and normalization layers in its design and replaces BatchNorm with LayerNorm, not only enhancing the model’s accuracy but also optimizing its computational efficiency. The network architecture of ConvNeXt is shown in Figure 3.

3.2. Improved ConvNeXt Block

The ConvNeXt Block is the core unit of the ConvNeXt network, with its unique basic structural unit being an inverted bottleneck structure, dedicated to achieving computational efficiency while maintaining high performance, as shown in Figure 4a. This module first independently processes each channel in every layer with a depth-wise separable convolution of a 7 × 7 kernel size, optimizing the model’s parameter efficiency through fewer channels. Subsequently, layer normalization is applied to the feature maps for stable processing, ensuring the network remains stable during training. To further refine the features, the module undergoes two consecutive 1 × 1 convolutions, initially increasing the number of channels from 96 to 4 × 96, and then reducing it back to 96 under the action of the GELU activation function. This design aims to expand and reshape the feature space, thereby enhancing its expressive capability. Moreover, to enhance training stability and resistance to overfitting, the module introduces Layer Scale and Drop Path strategies to combat the phenomenon of model overfitting. Finally, through residual connection, the original input is added to the processed feature maps, ensuring the stability of feature learning as the network depth increases.
In recent deep learning research, attention mechanisms have brought significant improvements in model performance and interpretability. Especially when dealing with complex data structures, such as sequences and images, attention mechanisms enable models to selectively focus on critical information, thereby improving feature selection capability and computational efficiency. To further enhance the ConvNeXt Block’s extraction of fine-grained flower features, this study inserts the SimAM attention mechanism immediately after the depth-wise separable convolution to focus on spatial feature extraction and reweighting. Furthermore, the SANet (Selective Attention Network) introduced at the end of the module provides the network with the ability to filter and strengthen key features after a series of feature transformations, dynamically adjusting the feature weights of the different attention mechanisms.
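The sketch below illustrates where the two attention modules are inserted in the improved ConvNeXt Block, assuming SimAM and SANet modules as outlined in Sections 3.3 and 3.5; the Drop Path branch is omitted and the exact placement details are simplifications for illustration.

```python
import torch
import torch.nn as nn

class ImprovedConvNeXtBlock(nn.Module):
    """Sketch of the modified ConvNeXt Block: SimAM follows the depth-wise
    convolution, and SANet re-weights the features at the end of the block."""
    def __init__(self, dim, layer_scale_init=1e-6):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.simam = SimAM()                       # parameter-free attention (Section 3.3)
        self.norm = nn.LayerNorm(dim)
        self.pwconv1 = nn.Linear(dim, 4 * dim)     # 1x1 conv realized as a linear layer on NHWC tensors
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)
        self.sanet = SANet(dim)                    # selective attention over M-ECA and PA (Section 3.5)
        self.gamma = nn.Parameter(layer_scale_init * torch.ones(dim))  # Layer Scale

    def forward(self, x):                          # x: (N, C, H, W)
        shortcut = x
        x = self.simam(self.dwconv(x))
        x = x.permute(0, 2, 3, 1)                  # NCHW -> NHWC for LayerNorm and linear layers
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = (self.gamma * x).permute(0, 3, 1, 2)   # back to NCHW
        x = self.sanet(x)
        return shortcut + x                        # residual connection (Drop Path omitted)
```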

3.3. SimAM

Flowers come in a wide variety of species and forms, and the differences between some species can be very subtle, such as the contours of the petals, the size or color of the stamens, etc. Furthermore, flower images are often affected by environmental factors such as lighting, cluttered backgrounds, and occlusions, further increasing the difficulty of classification. Traditional Convolutional Neural Networks (CNNs) struggle to capture these subtle differential features. To explore the associations and importance between different features in images, this study utilizes SimAM to directly infer three-dimensional attention weights in the network layer, fully considering the relevance of spatial and channel dimensions while avoiding the addition of extra parameters. Figure 5 demonstrates the basic principle of SimAM.
In the task of flower image classification, SimAM provides higher weight adjustments for neurons carrying key information. By applying spatial suppression to the neighboring neurons of specific features, SimAM is able to reduce the interference of complex backgrounds on feature recognition, thereby more prominently highlighting the core attributes of that feature. Its mechanism of action is based on an energy function, inspired by the phenomenon of spatial suppression in neuroscience. In this phenomenon, active neurons inhibit the surrounding inactive neurons [28]. Therefore, the importance of neurons can be measured by this energy function. The specific form of the energy function is described by Equation (2).
$e_t(w_t, b_t, y, x_i) = (y_t - \hat{t})^2 + \frac{1}{M-1}\sum_{i=1}^{M-1}(y_0 - \hat{x}_i)^2$ (2)
$\hat{t} = w_t t + b_t$ (3)
$\hat{x}_i = w_t x_i + b_t$ (4)
where $t$ denotes the target neuron; $x_i$ denotes the other neurons; $b_t$ denotes the bias; $M$ denotes the number of neurons; and the minimization of the above equation can be regarded as enhancing the linear separability between neuron $t$ and the other neurons. To simplify the computation, the scalars $y_t$ and $y_0$ are assigned the binary labels 1 and −1, respectively; adding a regularization term, the energy function can finally be written as:
$e_t(w_t, b_t, y, x_i) = \frac{1}{M-1}\sum_{i=1}^{M-1}(-1 - \hat{x}_i)^2 + (1 - \hat{t})^2 + \lambda w_t^2$ (5)
$w_t = -\frac{2(t - \mu_t)}{(t - \mu_t)^2 + 2\sigma_t^2 + 2\lambda}$ (6)
$b_t = -\frac{1}{2}(t + \mu_t)w_t$ (7)
where $\mu_t$ and $\sigma_t$ denote the mean and variance of the image, respectively, so the final simplified minimum energy function is shown in Equation (8):
$e_t^* = \frac{4(\hat{\sigma}^2 + \lambda)}{(t - \hat{\mu})^2 + 2\hat{\sigma}^2 + 2\lambda}$ (8)
$\tilde{X} = \mathrm{sigmoid}\left(\frac{1}{e_t^*}\right) \odot X$ (9)
It can be seen from Equation (8) that when neuron $t$ differs more from the surrounding neurons, $e_t^*$ becomes smaller; since $e_t^*$ appears in the denominator of Equation (9), the neuron is assigned greater importance, and overly large values are suppressed by the Sigmoid function. This satisfies the definition of the attention mechanism and yields the final feature-weighted attention map.
The introduction of SimAM enables the improved ConvNeXt Block to focus on feature regions that are more critical to the final task while ignoring less important information.
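A minimal PyTorch sketch of SimAM following Equations (8) and (9) is given below; the default value of the regularization coefficient λ is an assumption.

```python
import torch
import torch.nn as nn

class SimAM(nn.Module):
    """Parameter-free SimAM attention: each neuron is weighted by the inverse of
    its minimal energy, as in Equations (8) and (9)."""
    def __init__(self, lam=1e-4):
        super().__init__()
        self.lam = lam                                       # regularization coefficient lambda

    def forward(self, x):                                    # x: (N, C, H, W)
        n = x.shape[2] * x.shape[3] - 1                      # M - 1 neurons per channel
        d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)    # (t - mu)^2 at every position
        v = d.sum(dim=(2, 3), keepdim=True) / n              # channel-wise variance estimate
        e_inv = d / (4 * (v + self.lam)) + 0.5               # proportional to 1 / e_t*
        return x * torch.sigmoid(e_inv)                      # Equation (9)
```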

3.4. Multi-Level Efficient Channel Attention Mechanism (M-ECA)

In recent deep learning research, especially in the field of vision tasks, modeling the relationships between channels has been widely regarded as crucial. Efficient Channel Attention (ECA), as a novel channel attention strategy [29], has successfully achieved the goal of capturing long-range channel dependencies in features without introducing significant computational burden or additional parameters. Although ECA has shown convincing performance in multiple tasks, there is still potential for further optimization and expansion in its model design. Specifically, ECA models relationships directly in the channel dimension by introducing local one-dimensional convolutions, which, while enhancing computational and parameter efficiency to some extent, have certain limitations. Considering the actual image data, the relationships between channels may include different local scales and patterns. Relying solely on a single one-dimensional convolution kernel may not be sufficient to fully capture these complex multi-scale and multi-pattern relationships.
To address the above limitations, this study proposes a Multi-level Efficient Channel Attention mechanism (M-ECA). Unlike ECA, M-ECA combines multiple one-dimensional convolution kernels of different scales, enabling a more comprehensive capture of the relationships between channels and thus providing richer and more diverse information; its structure is shown in Figure 6. This multi-level modeling strategy is expected to further improve the ability to allocate attention and optimize the processing of image features across different scales and patterns.
The process of realizing the M-ECA attention mechanism is as follows:
  1. Perform global average pooling (GAP) on the input feature map, changing it from a matrix of dimension $(B, C, H, W)$ to a vector of dimension $(B, 1, 1, C)$, where $B$ denotes the batch size of the input data, $C$ denotes the number of channels of the feature map, and $H$ and $W$ denote the height and width of the feature map.
  2. Calculate the sizes of the one-dimensional convolution kernels based on the number of channels:
$k_1 = \left| \frac{\log_2(C)}{\gamma} + \frac{b}{\gamma} \right|_{odd}$ (10)
$k_n = k_1 + 2(n - 1)$ (11)
where $\gamma$ and $b$ are set to 2 and 1, respectively, and $|\cdot|_{odd}$ denotes the nearest odd number.
  3. Apply one-dimensional convolutions with kernel sizes $k_1, \ldots, k_{n-1}, k_n$ and fuse the resulting weights of the different convolution kernels, as shown in Equation (12), where $C1d_k$ denotes a one-dimensional convolution with kernel size $k$ and $y$ denotes the input feature map; the final rich long- and short-range channel weights are obtained by fusing the multi-level feature extraction.
$w = \sigma\left( C1d_{k_1}(y) + C1d_{k_{n-1}}(y) + C1d_{k_n}(y) \right)$ (12)
  4. Multiply the extracted normalized weights channel-by-channel with the original input feature map to obtain the weighted feature map.
$F_{meca} = w \otimes I_{meca}$ (13)
By fusing convolution kernels of multiple scales, M-ECA avoids the limitation of relying on a single one-dimensional kernel, which may be insufficient to fully capture these complex multi-scale and multi-pattern relationships.
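A possible PyTorch realization of M-ECA following Equations (10)–(13) is sketched below; the number of kernel levels (three) is an assumption, since the text does not fix it.

```python
import math
import torch
import torch.nn as nn

class MECA(nn.Module):
    """Multi-level ECA: one-dimensional convolutions with several kernel sizes are
    fused over the pooled channel descriptor before the sigmoid gate (Equations (10)-(13))."""
    def __init__(self, channels, gamma=2, b=1, levels=3):
        super().__init__()
        k1 = int(abs(math.log2(channels) / gamma + b / gamma))
        k1 = k1 if k1 % 2 else k1 + 1                        # |.|_odd: nearest odd kernel size
        kernel_sizes = [k1 + 2 * n for n in range(levels)]
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.convs = nn.ModuleList(
            nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
            for k in kernel_sizes)

    def forward(self, x):                                    # x: (N, C, H, W)
        y = self.pool(x).squeeze(-1).transpose(1, 2)         # (N, 1, C) channel descriptor
        w = sum(conv(y) for conv in self.convs)              # multi-level fusion, Equation (12)
        w = torch.sigmoid(w).transpose(1, 2).unsqueeze(-1)   # (N, C, 1, 1) channel weights
        return x * w                                         # Equation (13)
```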

3.5. Selective Attention Network (SANet) Mechanism

When humans observe a certain area, they use an attention mechanism that allows the brain to focus on key areas and ignore irrelevant information. In short, the attention mechanism can flexibly capture the connection between global and local information. Its purpose is to enable the module to focus on the target area of interest, that is, by weighting this part, to highlight significant useful features and suppress and ignore irrelevant features.
Traditional attention mechanisms can help models capture key features. However, a single attention strategy often struggles to fully capture the varied feature dimensions in the data. The current mainstream hybrid attention mechanism strategies mostly use parallel or serial structures, but such approaches do not fully consider the complementarity and potential conflicts between different attention mechanisms. Considering the differences in features focused on by different attention strategies and the focus of weight distribution, inspired by the SKNet structure [30], this study proposes the SANet (Selective Attention Network) structure. SANet can dynamically extract different types of key features based on different attention mechanisms and adjust weight distribution in real time. Through this method, SANet can adapt more flexibly to changes in the data, thereby capturing and highlighting important features more accurately. Moreover, the dynamic re-parameterization strategy allows the model to adaptively adjust its focus, thereby better handling challenging, difficult-to-distinguish features.
From Figure 7, it is evident that the SANet structure is mainly divided into three parts: decomposition, fusion, and selection.
In the decomposition stage, the network processes the input features X using a variety of different attention mechanisms in parallel, thus obtaining different feature representations.
In the fusion stage, it first sums all the feature representations obtained from the decomposition stage to get a comprehensive feature representation U, and then performs global average pooling and linear transformation on the fused feature U to obtain a vector Z that represents global information.
In the selection stage, multiple linear layers generate weights for each attention mechanism, and the Softmax function is applied to ensure the sum of the weights equals 1. Finally, these weights are used to weight the various feature representations obtained from the decomposition stage and sum them up to obtain the final output feature V.
In this way, SANet can dynamically extract key features based on different attention mechanisms and adjust weights in real-time, thereby better capturing the various characteristics in the data. The specific process is as follows:
$F_{meca} = MECA(I_{sa})$ (14)
$F_{pa} = PA(I_{sa})$ (15)
$F_{split} = F_{meca} \oplus F_{pa}$ (16)
$F_{fuse} = fc(Avg(F_{split}))$ (17)
$F_{select} = F_{meca} \otimes sm_1(fc_1(F_{fuse})) + F_{pa} \otimes sm_2(fc_2(F_{fuse}))$ (18)
where $I_{sa}$ denotes the input feature map, $MECA$ denotes passing through the M-ECA module, $PA$ denotes passing through the PA module, $fc$ denotes the fully connected layer, $Avg$ denotes the global average pooling operation, and $sm$ denotes the Softmax operation.
The pixel attention mechanism is often applied in image processing tasks, where it helps the model recognize the relationships between different pixels in an image. By calculating the similarity of each pixel to the others, the pixel attention mechanism assigns varying weights, enabling the model to allocate attention differently on the basis of these weights. The structure of the PA network used in SANet in this study is shown in Figure 8.
Through SANet, not only are the feature weights of M-ECA and PA dynamically and efficiently allocated but their respective advantages and characteristics are also fully utilized. It dynamically integrates the multi-level channel context information captured by the M-ECA module with the fine-grained relationships between pixels focused on by the PA module. This enables SANet to make the best feature extraction decisions in different contexts.
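A sketch of the SANet structure following Equations (14)–(18) is given below, reusing the M-ECA sketch above; the pixel attention branch is approximated by a 1 × 1 convolution with a sigmoid gate, since the exact PA structure is only shown in Figure 8, and the reduction ratio of the fully connected layer is an assumption.

```python
import torch
import torch.nn as nn

class PixelAttention(nn.Module):
    """Assumed pixel attention branch: a 1x1 convolution produces a per-pixel gate."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        return x * torch.sigmoid(self.conv(x))

class SANet(nn.Module):
    """Selective Attention Network: decompose into M-ECA and PA branches, fuse a
    global descriptor, then select per-channel branch weights via Softmax
    (Equations (14)-(18))."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.meca = MECA(channels)                           # from the sketch in Section 3.4
        self.pa = PixelAttention(channels)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(nn.Linear(channels, channels // reduction),
                                nn.ReLU(inplace=True))
        self.fc_branches = nn.ModuleList(
            nn.Linear(channels // reduction, channels) for _ in range(2))

    def forward(self, x):                                    # x: (N, C, H, W)
        branches = [self.meca(x), self.pa(x)]                # decomposition stage
        u = branches[0] + branches[1]                        # fusion stage
        z = self.fc(self.pool(u).flatten(1))                 # global descriptor Z
        logits = torch.stack([fc(z) for fc in self.fc_branches], dim=1)  # (N, 2, C)
        weights = torch.softmax(logits, dim=1)               # selection weights sum to 1 per channel
        return sum(w.unsqueeze(-1).unsqueeze(-1) * b         # weighted sum of the branches
                   for w, b in zip(weights.unbind(dim=1), branches))
```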

3.6. Focal Loss Function

In floral image classification, the uneven number of training samples across species often causes the model to become over-biased toward some classes while neglecting others. This sample imbalance may limit the performance of the model in a real environment.
The cross-entropy loss function (CE) is a traditional loss function in classification tasks [31], which is used to quantify the difference in loss between model predictions and true labels, as is shown in Equation (19):
$CE(p, q) = -\sum_{i=1}^{n} p(x_i)\log(q(x_i))$ (19)
where $CE(\cdot)$ denotes the cross-entropy loss function; $n$ denotes the number of classification categories; $p$ denotes the predicted value of the sample, ranging over [0, 1]; and $q$ denotes the true value of the sample, which is 0 or 1.
However, since the cross-entropy of all training samples is directly summed up when processing sample weights, this means that each sample is given the same weight. When there is class imbalance in the data, samples from rare classes may be ignored by the model. Moreover, because CE applies the same penalty to every sample’s mistake, the model is not sufficiently sensitive to samples that are easily misclassified.
To address the issue of sample imbalance, this study introduces the Focal Loss function as the model’s loss function. The main idea of Focal Loss is to increase the weights of samples that are misclassified and decrease the weights of samples that are easily classified, making the model pay more attention to those samples that are misclassified during the training process [32]. This is specified in Equation (20):
$FL(p_t) = -(1 - p_t)^{\gamma}\log(p_t)$ (20)
where $p_t$ denotes the probability predicted by the model and $\gamma$ is a modulating factor that controls the rate at which the weights of easily classified samples are reduced.
The focus of the model is controlled by adjusting the value of γ . When the value of γ is high, the model focuses more on the misclassified samples. When the value of γ is low, the model focuses more evenly on all samples.
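A minimal PyTorch sketch of the multi-class Focal Loss in Equation (20) follows; the default γ = 2 is an assumed value, as the text does not report the final setting.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLoss(nn.Module):
    """Multi-class Focal Loss (Equation (20)): easily classified samples are
    down-weighted by the factor (1 - p_t)^gamma."""
    def __init__(self, gamma=2.0):
        super().__init__()
        self.gamma = gamma

    def forward(self, logits, targets):          # logits: (N, num_classes), targets: (N,)
        log_pt = F.log_softmax(logits, dim=1).gather(1, targets.unsqueeze(1)).squeeze(1)
        pt = log_pt.exp()                        # predicted probability of the true class
        return (-(1.0 - pt) ** self.gamma * log_pt).mean()
```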

4. Model Training and Result Analysis

4.1. Experimental Environment Configuration

In this study, to ensure accuracy and reproducibility, advanced hardware and software platforms were selected. The hardware environment for the experiments comprises an NVIDIA GeForce RTX 3090 Ti graphics card and a computer equipped with a 5.40 GHz Intel i9-13900HQ processor. NVIDIA Corporation, located in Santa Clara, California, USA, manufactures the graphics card, and the processor is produced by Intel Corporation, which is based in Santa Clara, CA, USA. Regarding the software environment, the operating system was Ubuntu 22.04, LTS version, and the deep learning framework used was Pytorch 1.8.1. The experimental programming environment was Pycharm 2021.3, the image processing library chosen was Opencv 3.4.6, and the programming language used was Python 3.8.0. This high-standard experimental setup was designed to provide sufficient computing resources to support the training and testing of deep learning models, thus ensuring the validity of the experimental results.

4.2. Dataset Construction

Currently, the Oxford Flowers102 [33] flower dataset is widely used in the field. It consists of 102 classes of flowers with a total of 8189 images, and each class contains about 40–250 images of different sizes. An overview of the Oxford Flowers102 dataset is shown in Figure 9.
Figure 10 illustrates the images of each category of flowers. For image classification and recognition tasks, the Oxford Flowers102 dataset is a fine-grained problem. The difficulty lies in the similarities between different flowers (different rows) in terms of color, shape, etc., which are difficult to distinguish with the naked eye, i.e., inter-class similarity; and the differences within the same class (same column) of flowers in terms of color, shape, etc., which are also difficult to identify with the naked eye, i.e., intra-class differences.

4.3. Model Training Parameters

In order to demonstrate the performance of SA-ConvNeXt, this study conducts experiments on the Oxford Flowers102 database published by the VGG Laboratory at the University of Oxford. The training data were augmented with random horizontal flipping and random rotation of 30 degrees, and all images were preprocessed by subtracting the average RGB value of the dataset. The model is trained using the Adam [34] optimizer and the Focal Loss function, with the first-order momentum parameter $\beta_1 = 0.9$, the second-order momentum parameter $\beta_2 = 0.999$, the stability constant $\varepsilon = 10^{-8}$, and an input batch size of 32. To address overfitting, a weight decay parameter (L2 regularization) is set to 1 × 10−4, which helps to limit the growth of model weights and thereby enhances the model’s generalization ability. Additionally, an early stopping mechanism is introduced that monitors the loss on the validation set: if there is no significant reduction in loss over 20 consecutive epochs, training is halted to prevent overfitting. The learning rate is initialized to 1 × 10−4, and after several rounds of parameter tuning, the number of training epochs is set to 300.
To further optimize the learning rate adjustment strategy, a cosine annealing schedule is introduced. The learning rate is adjusted within each cycle according to the shape of the cosine function and is reduced to 1 × 10−9 after reaching a predetermined maximum number of cycles (set to 20 in this study). This strategy is designed to help the model converge stably in the later stages of training.
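The training configuration described above could be set up as in the sketch below; `model` and `train_loader` are assumed to be defined elsewhere, and the Focal Loss γ value is again an assumption.

```python
import torch

# Adam with the stated momentum parameters, weight decay, and initial learning rate.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
                             betas=(0.9, 0.999), eps=1e-8, weight_decay=1e-4)
criterion = FocalLoss(gamma=2.0)                 # from the sketch in Section 3.6
# Cosine annealing over 20-epoch cycles, decaying toward 1e-9.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=20, eta_min=1e-9)

for epoch in range(300):                         # 300 training epochs
    for images, labels in train_loader:          # assumed DataLoader with batch_size=32
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()                             # adjust the learning rate once per epoch
```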

4.4. Transfer Learning

Transfer learning has been widely studied and applied in the field of deep learning [35]. This approach relies on utilizing the knowledge learned on the source task to provide an initial learning structure for the target task. The effectiveness of this approach has been widely validated, especially when there is some structural similarity between the source and target tasks.
For the Oxford Flowers102 dataset, a feasible strategy is to use deep learning models pre-trained on a large dataset, ImageNet [36]. The ImageNet dataset has wide coverage of categories and a large sample size, and thus models trained on this dataset can provide initial feature extraction capabilities for flower classification tasks. Specifically, in this study, the weights of the pre-trained ConvNeXt model on the ImageNet dataset are used to initialize the corresponding structure of the newly constructed SA-ConvNeXt. For the evaluation of the model, the following metrics are used:
  1. Precision: the ratio of correctly predicted positive samples to all positively predicted samples.
$P = \frac{TP}{TP + FP}$ (21)
where $TP$ denotes the number of samples correctly predicted as positive and $FP$ denotes the number of samples incorrectly predicted as positive.
  2. Recall: the ratio of correctly predicted positive samples to all actual positive samples.
$R = \frac{TP}{TP + FN}$ (22)
where $FN$ denotes the number of positive samples incorrectly predicted as negative.
  3. F1: the harmonic mean of precision and recall.
$F1 = \frac{2PR}{P + R}$ (23)
When the $F1$ score is high, it indicates that the model performs well in both precision and recall.
  4. Accuracy: the proportion of samples correctly predicted by the model over all samples.
$Acc = \frac{TP + TN}{TP + TN + FP + FN}$ (24)
where $TN$ denotes the number of samples correctly predicted as negative.
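These four metrics can be computed directly from the confusion counts, for example:

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, F1, and accuracy as defined in Equations (21)-(24)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, accuracy
```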
According to Table 1, SA-ConvNeXt with transfer learning improved by at least 2% in the evaluation metrics of accuracy, precision, recall, and F1 on the Oxford Flowers102 dataset. The experiments demonstrate that transfer learning can significantly enhance the performance and recognition accuracy of the model in the scenarios of this study.

4.5. Experimental Results and Analysis

4.5.1. Analysis of the Training Process

To verify the algorithm’s effectiveness, VGG16, ResNet50, ConvNeXt, and the improved ConvNeXt model were compared during training on the Oxford Flowers102 dataset. Detailed performance is in Table 2, while trends in average accuracy, recall, and loss among models are shown in Figure 11, Figure 12 and Figure 13.
In the validation accuracy graph, the SA-ConvNeXt model exhibited good stability and high accuracy during training, with the final accuracy stabilizing at about 0.96. In comparison, the accuracies of the VGG16, ResNet50, and ConvNeXt models stabilized at around 0.88, 0.90, and 0.92, respectively, after 100 rounds of iteration. This indicates that the SA-ConvNeXt model has better predictive performance for various complex fine-grained flower image recognition tasks; in particular, after the introduction of the SimAM attention mechanism and SANet, the model’s feature extraction capability is significantly enhanced. Moreover, because it starts from a pretrained model, SA-ConvNeXt completes learning on the dataset and stabilizes within the first 25 iterations.
In the validation recall graph, SA-ConvNeXt ultimately leads with a recall rate of nearly 0.98, benefiting from its multi-level feature fusion strategy in the M-ECA structure, which provides the model with a more accurate capturing ability for various target features.
In the validation loss plot, although all models showed a reduction in loss with the increase in training rounds, the loss reduction rate of SA-ConvNeXt, trained with the ADAM optimizer and Focal Loss function, was more rapid and stable in later training stages.
From the four algorithm models, it can be seen that SA-ConvNeXt, compared to VGG16, ResNet50, and ConvNeXt, has an average accuracy increase of 8.1%, 6.6%, and 4.0%, respectively, an average recall rate increase of 13.8%, 9.5%, and 3.7%, respectively, and a decrease in loss value of 0.23, 0.14, and 0.06, respectively. In summary, the SA-ConvNeXt algorithm demonstrates superior performance in the task of flower image classification.

4.5.2. Analysis of Test Results

A confusion matrix is a specific table layout commonly used to understand the performance of classification models, especially in multi-class classification problems. It not only shows the accuracy of the model’s predictions for each category but also clearly displays the model’s misclassification situations. Specifically, each row of the confusion matrix represents an actual category, while each column represents a predicted category. Data points on the diagonal represent true positives, i.e., the number of correct classifications by the model, while data points off the diagonal show the model’s misclassifications.
To more intuitively reflect the test results of SA-ConvNeXt on the task of flower image classification, this study uses a confusion matrix as an evaluation tool and tests it on the Oxford Flowers102 dataset.
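For reference, a confusion matrix of this kind can be computed with scikit-learn, as sketched below; `y_true` and `y_pred` are assumed arrays of ground-truth and predicted class indices for the test set.

```python
from sklearn.metrics import confusion_matrix

# Rows index the actual classes and columns the predicted classes.
cm = confusion_matrix(y_true, y_pred)
per_class_accuracy = cm.diagonal() / cm.sum(axis=1)   # correct predictions per actual class
```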
From Figure 14, it can be observed that SA-ConvNeXt exhibits significant predictive performance on the Oxford Flowers102 test set. The deep colors in the diagonal area indicate that most flower categories have been correctly classified by the model, while the light spots in some off-diagonal areas represent prediction errors for some categories. Overall, even when predicting on test data not seen during training, SA-ConvNeXt is still able to classify most flower images accurately.
Figure 15 displays the results of the SA-ConvNeXt model on the Oxford Flowers102 test set, with the correct flower label categories in green font at the top of the images and the top five labels predicted by the model in red. It can be observed that SA-ConvNeXt accurately identifies the correct categories even for flower images with complex features and small inter-class differences, and its prediction results have high confidence, demonstrating good feature extraction and generalization abilities.

4.6. Results of Ablation Experiments

To quantify the contribution of each improvement to the overall performance, ablation experiments were conducted on the Oxford Flowers102 dataset using the following strategies:
  1. Learning using only the baseline ConvNeXt network model.
  2. Introducing the SimAM attention mechanism module in the ConvNeXt Block based on the above baseline strategy.
  3. Based on the baseline strategy, using Focal Loss as the model’s training loss function.
  4. Keeping the baseline strategy unchanged, adding the SANet module in the ConvNeXt Block, removing the M-ECA attention mechanism, and replacing the PA attention mechanism with a standard 3 × 3 convolution.
  5. Based on the fourth strategy, adding the M-ECA attention mechanism to replace one branch’s standard 3 × 3 convolution.
  6. Based on the fourth strategy, adding the PA attention mechanism to replace one branch’s standard 3 × 3 convolution.
  7. Based on the fourth strategy, simultaneously adding both the M-ECA and PA attention mechanisms.
  8. Combining all strategies into the baseline ConvNeXt network model.
Each of these ablation configurations was evaluated, with the results shown in Table 3.
Ablation experiments on the various strategies of the benchmark ConvNeXt network model are described in detail in Table 3. The specific impact of each change on the model’s performance is made explicit by gradually introducing or removing specific modules or strategies.
Initially, using the baseline ConvNeXt model, the average accuracy was 92.7% and the recall rate was 94.5%. The introduction of the SimAM attention mechanism significantly improved model performance, indicating SimAM’s effective feature reweighting. Likewise, using Focal Loss as the training loss function also enhanced performance, addressing the issue of imbalanced training samples. Adding the SANet module also improved model performance. Through the ablation experiment on the dual-branch structure, the M-ECA and PA attention mechanisms demonstrated their powerful effect within the ConvNeXt network, especially in significant increases in accuracy and recall rates. This means the combined use of these two attention mechanisms can effectively and dynamically focus on different types of features and weight them, enhancing important information and weakening unnecessary information. Ultimately, by incorporating the comprehensive strategy into the baseline ConvNeXt flower image classification model, the highest average accuracy of 96.7% and average recall rate of 98.2% were achieved, with a detection speed of 142 FPS, meeting the needs of practical detection scenario applications. These experimental results all demonstrate the significant advantage of the proposed method in flower image classification tasks.

4.7. Performance Analysis of Attention Mechanism

To evaluate the effectiveness of SANet within the ConvNeXt model, this study designed a series of ablation experiments comparing SANet with several other popular attention mechanisms, such as SE, CA, and CBAM. The goal of the experiments is to analyze the performance differences of these attention mechanisms in the task of flower image classification, with a special focus on key indicators such as average accuracy and recall rates.
The experimental results are shown in Table 4; these indicate that the ConvNeXt models that incorporate SE, CA, and CBAM exhibit slight performance improvements compared to the ConvNeXt model alone. Specifically, ConvNeXt + SE, ConvNeXt + CA, and ConvNeXt + CBAM involve the integration of the respective attention mechanisms at the end of the ConvNeXt Block. The improvements in average accuracy and recall rates for these variant models are relatively limited. ConvNeXt + SANet1 and ConvNeXt + SANet2 represent the separate inclusion of SANet at the beginning and end of the ConvNeXt Block, respectively, while ConvNeXt + SANetAll signifies the addition of SANet at both the beginning and end throughout the entire ConvNeXt Block. The results show that ConvNeXt models with SANet perform significantly better than the others, with ConvNeXt + SANet1 achieving an average accuracy of 96.1% and an average recall rate of 97.1%. ConvNeXt + SANet2 further increases the average accuracy to 96.7% and the recall rate to 98.2%, representing the best network configuration. Notably, the effect of adding SANet at the end even surpasses that of adding SANet at both the beginning and end, with the average accuracy and recall rate, respectively, increasing by 0.2 and 0.4. Therefore, the placement and number of attention mechanisms have a significant impact on enhancing model performance, especially in tasks like image classification. Adding SANet at the end of the model can more effectively adjust and optimize feature representation, thereby achieving higher accuracy and recall rates in flower image classification.

4.8. Classification Heat Map Analysis

The visual interpretability of flower image classification lies in deeply understanding the model’s decision-making process. Heatmaps provide us with an intuitive way to display the main areas focused on by the model in an image, directly related to the flower classification decision.
Grad-CAM++ [37] is an improved, gradient-based visualization strategy for deep networks. Unlike Grad-CAM, Grad-CAM++ provides more refined gradient information when estimating the influence weights of channels. These weights, combined with the feature maps, generate heatmaps that are overlaid on the original image; each pixel in the heatmap thus displays its relative importance in the model’s flower classification decision. In this study, the Grad-CAM++ technique was used to analyze the decision-making process of the flower image classification model on the Oxford Flowers102 dataset. The results are shown in Figure 16.
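A sketch of how such heatmaps can be produced is given below, assuming the third-party pytorch-grad-cam package; the choice of target layer and the variable names are illustrative assumptions rather than the exact configuration used in this study.

```python
from pytorch_grad_cam import GradCAMPlusPlus
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget
from pytorch_grad_cam.utils.image import show_cam_on_image

# target_layers points at the last convolutional stage; the attribute name depends
# on how the model is defined and is assumed here.
cam = GradCAMPlusPlus(model=model, target_layers=[model.stages[-1]])
grayscale_cam = cam(input_tensor=input_tensor,             # preprocessed image batch
                    targets=[ClassifierOutputTarget(predicted_class)])[0]
heatmap = show_cam_on_image(rgb_image, grayscale_cam,      # rgb_image: float array in [0, 1]
                            use_rgb=True)
```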
In the heatmap representation, Figure 16b,c show the results of the heatmap analysis by the different models, where darker areas indicate that the corresponding features contribute more certainty to the flower classification prediction. The main floral structures and their features in each image can be seen in the original images in Figure 16a. Comparing the ConvNeXt heatmaps (b) with the SA-ConvNeXt heatmaps (c), it can be seen that the proposed model is more detailed and precise in capturing the features of flower images. For example, in the first image, SA-ConvNeXt not only focuses on the center of the flower but also gives more attention to the petals, concentrating more on the flower as a whole, whereas ConvNeXt pays more attention to the image background, as shown by the red box in the figure. The second image shows that SA-ConvNeXt captures petal edges and morphology more distinctly than ConvNeXt and does not incorrectly focus on the region marked by the red box. In addition, in the third image, SA-ConvNeXt focuses more clearly on characterizing the fruit, highlighting its specific morphology and texture, which is therefore given relatively higher attention.
In summary, compared to ConvNeXt, SA-ConvNeXt captures rich details and key features more robustly in flower image classification, accurately classifying different types of flowers in various scenarios.

5. Conclusions

To address the current lack of annotations in floral images and the insufficient focus on key features in traditional deep learning-based fine-grained floral image classification, this paper presents a floral image classification model, SA-ConvNeXt, which integrates a Selective Attention Network with ConvNeXt. Unlike other experiments on the Oxford Flowers102 dataset, this study initially employed a padding algorithm to fix the dimensions of images to prevent the deformation and loss of detail caused by scaling. It also enhanced the model’s dynamic extraction capability for channel and pixel features by combining M-ECA and pixel attention mechanisms. Additionally, the parameter-free attention module SimAM provided a more efficient method for reweighting input features to capture essential floral characteristics. The model also utilized the Focal Loss function to effectively address sample imbalance issues, further reducing the risk of overfitting and enhancing the model’s generalization capability. The use of the Adam optimizer minimized gradient fluctuations caused by random initial weight settings, ensuring training stability. SA-ConvNeXt demonstrated superior performance on the Oxford Flowers102 dataset, achieving a high accuracy rate of 96.7% and a recall rate of 98.2%, which are improvements of 4.0% and 3.7%, respectively, compared to ConvNeXt. Finally, using Grad-CAM++ technology to generate heatmaps of classification predictions allowed researchers to visually understand the areas of focus of the model and gain deeper insights into its working principles. Overall, the SA-ConvNeXt model exhibited excellent performance in fine-grained floral image classification, providing effective technical means for floral recognition and classification. It promotes the development of intelligent horticulture and plant disease diagnosis technologies, brings innovation to agricultural technology, and enhances the monitoring of biodiversity.

Author Contributions

Conceptualization, H.M. and L.W.; methodology, H.M.; software, H.M.; validation, H.M. and L.W.; formal analysis, H.M.; investigation, H.M.; resources, H.M.; data curation, H.M.; writing—original draft preparation, H.M.; writing—review and editing, H.M.; visualization, H.M.; supervision, H.M.; project administration, H.M.; funding acquisition, L.W. All authors have read and agreed to the published version of the manuscript.

Funding

This project was supported by the Ministry of Science and Technology’s National Foreign Experts Project (G2022042005L); the Gansu Province Higher Education Industry Support Project (2023CYZC-54); the Gansu Province Key R&D Plan (23YFWA0013); the Lanzhou Talent Innovation and Entrepreneurship Project (2021-RC-47); the 2020 Gansu Agricultural University Graduate Education Research Project (2020-19); the 2021 Gansu Agricultural University-level “Three-dimensional Education” Pilot Extension Teaching Research Project (2022-9); and the 2022 Gansu Agricultural University-level Comprehensive Professional Reform Project (2021-4).

Data Availability Statement

The datasets used or analyzed during the current study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Guru, D.S.; Kumar, Y.H.S.; Manjunath, S. Textural features in flower classification. Math. Comput. Model. 2011, 54, 1030–1036. [Google Scholar] [CrossRef]
  2. Baraldi, A.; Parmiggiani, F. An investigation of the textural characteristics associated with gray level cooccurrence matrix statistical parameters. IEEE Trans. Geosci. Remote Sens. 1995, 33, 293–304. [Google Scholar] [CrossRef]
  3. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; IEEE: New York, NY, USA, 2005; Volume 1, pp. 886–893. [Google Scholar]
  4. Stricker, M.A.; Orengo, M. Similarity of color images. In Proceedings of the Storage and Retrieval for Image and Video Databases III, SPIE, San Jose, CA, USA, 20–24 August 1995; Volume 2420, pp. 381–392. [Google Scholar]
  5. Vezhnevets, V.; Konouchine, V. GrowCut: Interactive multi-label ND image segmentation by cellular automata. Proc. Graph. 2005, 1, 150–156. [Google Scholar]
  6. Zagrouba, E.; Gamra, S.B.; Najjar, A. Model-based graph-cut method for automatic flower segmentation with spatial constraints. Image Vis. Comput. 2014, 32, 1007–1020. [Google Scholar] [CrossRef]
  7. Yang, C.; Zhang, L.; Lu, H.; Ruan, X.; Yang, M.H. Saliency detection via graph-based manifold ranking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 3166–3173. [Google Scholar]
  8. Khan, F.S.; Van de Weijer, J.; Vanrell, M. Modulating shape features by color attention for object recognition. Int. J. Comput. Vis. 2012, 98, 49–64. [Google Scholar] [CrossRef]
  9. Fernando, B.; Fromont, E.; Tuytelaars, T. Mining mid-level features for image classification. Int. J. Comput. Vis. 2014, 108, 186–203. [Google Scholar] [CrossRef]
  10. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 84–90. [Google Scholar] [CrossRef]
  11. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  12. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
  13. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pmlr. 2015. pp. 448–456. [Google Scholar]
  14. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826. [Google Scholar]
  15. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A. Inception-v4, inception-resnet and the impact of residual connections on learning. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31. [Google Scholar]
  16. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  17. Gogul, I.; Kumar, V.S. Flower species recognition system using convolution neural networks and transfer learning. In Proceedings of the 2017 Fourth International Conference on Signal Processing, Communication and Networking (ICSCN), Chennai, India, 16–18 March 2017; IEEE: New York, NY, USA, 2017; pp. 1–6. [Google Scholar]
  18. Xia, X.; Xu, C.; Nan, B. Inception-v3 for flower classification. In Proceedings of the 2017 2nd International Conference on Image, Vision and Computing (ICIVC), Chengdu, China, 2–4 June 2017; IEEE: New York, NY, USA, 2017; pp. 783–787. [Google Scholar]
  19. Chang, D.; Zheng, Y.; Ma, Z.; Du, R.; Liang, K. Fine-grained visual classification via simultaneously learning of multi-regional multi-grained features. arXiv 2021, arXiv:2102.00367. [Google Scholar]
  20. Rahman, M.M.; Mojumdar, M.U.; Jamil, M.M.; Chakraborty, N.R.; Hasan, R.; Gupta, V. Flower identification by deep learning approach and computer vision. In Proceedings of the 2024 11th International Conference on Computing for Sustainable Global Development (INDIACom), New Delhi, India, 28 February–1 March 2024; IEEE: New York, NY, USA, 2024; pp. 1355–1360. [Google Scholar]
  21. Li, J.; Zhang, S.; Yang, J.; Yu, P.; Ge, F. Image Classification of Wild Mushroom Using Swin Transformer by Object Positioning. In Proceedings of the 2023 IEEE 3rd International Conference on Information Technology, Big Data and Artificial Intelligence (ICIBA), Chongqing, China, 26–28 May 2023; IEEE: New York, NY, USA, 2023; Volume 3, pp. 7–11. [Google Scholar]
  22. Lee, C.P.; Lim, K.M.; Song, Y.X.; Alqahtani, A. Plant-CNN-ViT: Plant classification with ensemble of convolutional neural networks and vision transformer. Plants 2023, 12, 2642. [Google Scholar] [CrossRef] [PubMed]
  23. Parsania, P.S.; Virparia, P.V. A comparative analysis of image interpolation algorithms. Int. J. Adv. Res. Comput. Commun. Eng. 2016, 5, 29–34. [Google Scholar] [CrossRef]
  24. Pei, S.C.; Lin, C.N. Image normalization for pattern recognition. Image Vis. Comput. 1995, 13, 711–723. [Google Scholar] [CrossRef]
  25. Gorunescu, F. Data Mining: Concepts, Models and Techniques; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2011. [Google Scholar]
  26. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  27. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986. [Google Scholar]
  28. Yang, L.; Zhang, R.Y.; Li, L.; Xie, X. SimAM: A simple, parameter-free attention module for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; PMLR, 2021; pp. 11863–11874. [Google Scholar]
  29. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542. [Google Scholar]
  30. Li, X.; Wang, W.; Hu, X.; Yang, J. Selective kernel networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 510–519. [Google Scholar]
  31. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423. [Google Scholar] [CrossRef]
  32. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  33. Nilsback, M.E.; Zisserman, A. Automated flower classification over a large number of classes. In Proceedings of the 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, Bhubaneswar, India, 16–19 December 2008; IEEE: New York, NY, USA, 2008; pp. 722–729. [Google Scholar]
  34. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  35. Bousmalis, K.; Trigeorgis, G.; Silberman, N.; Krishnan, D.; Erhan, D. Domain separation networks. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2016; Volume 29. [Google Scholar]
  36. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; IEEE: New York, NY, USA, 2009; pp. 248–255. [Google Scholar]
  37. Chattopadhay, A.; Sarkar, A.; Howlader, P.; Balasubramanian, V.N. Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; IEEE: New York, NY, USA, 2018; pp. 839–847. [Google Scholar]
Figure 1. Process of image preprocessing.
Figure 2. Before and after padding on all sides: (a) original image; (b) image after padding on all sides.
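For reference, the padding step illustrated in Figures 1 and 2 can be sketched as follows; this is a minimal illustration assuming a PIL-based pipeline, a black fill color, and a 224 × 224 target size, none of which are stated here as the authors' exact implementation.

    from PIL import Image

    def pad_to_square_and_resize(path, target=224, fill=(0, 0, 0)):
        # Pad the image to a square canvas before resizing, so the flower
        # is not deformed by direct rescaling (cf. Figures 1 and 2).
        img = Image.open(path).convert("RGB")
        w, h = img.size
        side = max(w, h)
        canvas = Image.new("RGB", (side, side), fill)           # square canvas
        canvas.paste(img, ((side - w) // 2, (side - h) // 2))   # center the original image
        return canvas.resize((target, target), Image.BILINEAR)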
Figure 3. The architecture of ConvNeXt.
Figure 4. ConvNeXt Block before and after improvement: (a) original ConvNeXt Block; (b) improved ConvNeXt Block.
Figure 5. The architecture of SA-ConvNeXt.
Figure 6. The architecture of M-ECA.
Figure 7. The architecture of SANet.
Figure 8. The architecture of PA.
Figure 9. Overview of the Oxford Flowers102 dataset.
Figure 10. The Oxford Flowers102 flower dataset.
Figure 11. Average accuracy comparison chart.
Figure 12. Average recall comparison chart.
Figure 13. Average loss rate comparison chart.
Figure 14. Confusion matrix of the SA-ConvNeXt when predicting the Oxford Flowers102 test set.
Figure 15. SA-ConvNeXt model detection results.
Figure 16. Heatmap analysis of detections by different models: (a) original image; (b) ConvNeXt heatmap; (c) SA-ConvNeXt heatmap.
Table 1. Performance comparison of the SA-ConvNeXt model before and after transfer learning.
Model | Accuracy | Precision | Recall | F1
SA-ConvNeXt | 0.9432 | 0.9343 | 0.9355 | 0.9040
SA-ConvNeXt (post-transfer learning) | 0.9672 | 0.9622 | 0.9821 | 0.9325
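As context for the transfer-learning comparison in Table 1, the sketch below shows one way ImageNet-pretrained ConvNeXt weights could be carried over to a modified model; the torchvision variant (convnext_tiny), the strict=False partial loading, the 102-class head, and the learning rate are illustrative assumptions rather than the authors' exact configuration.

    import torch
    from torchvision.models import convnext_tiny, ConvNeXt_Tiny_Weights

    # ConvNeXt pretrained on ImageNet (variant chosen only for illustration).
    pretrained = convnext_tiny(weights=ConvNeXt_Tiny_Weights.IMAGENET1K_V1)

    # Target model with a 102-class head for Oxford Flowers102; in the paper the
    # backbone is further modified with attention modules, which are omitted here.
    model = convnext_tiny(weights=None, num_classes=102)

    # Copy every pretrained tensor whose name and shape still match; newly added
    # layers and the new classification head keep their random initialization.
    target_state = model.state_dict()
    compatible = {k: v for k, v in pretrained.state_dict().items()
                  if k in target_state and v.shape == target_state[k].shape}
    model.load_state_dict(compatible, strict=False)

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # Adam optimizer; the lr is illustrative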
Table 2. Comparison of indicators from different models.
Model | Average Accuracy (%) | Average Recall (%) | Loss | Recognition Time (s)
VGG16 | 88.6 | 84.4 | 0.44 | 0.004
ResNet50 | 90.1 | 88.7 | 0.35 | 0.003
ConvNeXt | 92.7 | 94.5 | 0.27 | 0.005
SA-ConvNeXt | 96.7 | 98.2 | 0.21 | 0.007
Table 3. Ablation experiment results.
Baseline | SimAM | Focal Loss | SANet | M-ECA | PA | Average Accuracy (%) | Average Recall (%) | Recognition Time (s)
92.7 | 94.5 | 0.005
93.6 | 95.8 | 0.005
92.9 | 95.1 | 0.005
93.0 | 94.9 | 0.006
94.1 | 96.1 | 0.007
93.2 | 95.3 | 0.006
95.8 | 96.9 | 0.007
96.7 | 98.2 | 0.007
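Since Focal Loss appears as one of the ablation components in Table 3, a minimal multi-class focal-loss sketch is given below; the gamma and alpha values are the commonly used defaults from Lin et al. [32], not necessarily the settings used in this paper.

    import torch
    import torch.nn.functional as F

    def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
        # Multi-class focal loss: down-weights easy examples so that hard,
        # misclassified samples dominate the gradient.
        log_probs = F.log_softmax(logits, dim=1)
        ce = F.nll_loss(log_probs, targets, reduction="none")  # per-sample cross-entropy
        pt = torch.exp(-ce)                                    # probability of the true class
        return (alpha * (1.0 - pt) ** gamma * ce).mean()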
Table 4. Ablation experimental results.
Model | Average Accuracy (%) | Average Recall (%) | F1 (%)
ConvNeXt | 92.7 | 94.5 | 93.59
ConvNeXt + SE | 93.7 | 95.8 | 94.74
ConvNeXt + CA | 93.3 | 95.3 | 94.29
ConvNeXt + CBAM | 94.5 | 96.4 | 95.44
ConvNeXt + SANet1 | 96.1 | 97.1 | 96.60
ConvNeXt + SANet2 | 96.7 | 98.2 | 97.44
ConvNeXt + SANetAll | 96.5 | 97.8 | 97.15
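For readers reproducing the accuracy, recall, and F1 comparisons in Tables 1–4, the sketch below shows one common way to compute macro-averaged multi-class metrics with scikit-learn; the averaging mode and the toy labels are illustrative assumptions, and the paper's exact aggregation may differ.

    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

    # Toy ground-truth and predicted labels for a three-class problem.
    y_true = [0, 1, 2, 2, 1, 0]
    y_pred = [0, 1, 2, 1, 1, 0]

    acc = accuracy_score(y_true, y_pred)
    prec = precision_score(y_true, y_pred, average="macro")
    rec = recall_score(y_true, y_pred, average="macro")
    f1 = f1_score(y_true, y_pred, average="macro")
    print(f"accuracy={acc:.4f}  precision={prec:.4f}  recall={rec:.4f}  F1={f1:.4f}")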
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
