ConvNeXt-Based Fine-Grained Image Classification and Bilinear Attention Mechanism Model

Li, Zhiheng; Gu, Tongcheng; Li, Bing; Xu, Wubin; He, Xin; Hui, Xiangyu

doi:10.3390/app12189016

by Zhiheng Li ^1,2, Tongcheng Gu ^2,3,*, Bing Li ^2,3, Wubin Xu ^2,3, Xin He ^2,3 and Xiangyu Hui ^2,3

^1 Guangxi Liugong Machinery Co., Ltd., Liuzhou 545006, China

^2 Guangxi Earthmoving Machinery Collaborative Innovation Center, Guangxi Science and Technology University, Liuzhou 545006, China

^3 College of Mechanical and Automotive Engineering, Guangxi Science and Technology University, Liuzhou 545006, China

^* Author to whom correspondence should be addressed.

Appl. Sci. 2022, 12(18), 9016; https://doi.org/10.3390/app12189016

Submission received: 10 August 2022 / Revised: 30 August 2022 / Accepted: 5 September 2022 / Published: 8 September 2022

(This article belongs to the Section Computing and Artificial Intelligence)

Featured Application

This paper studies attention-related optimizations and innovations for the ConvNeXt network proposed in January 2022, providing a reference for subsequent researchers to optimize this network.

Abstract

Thus far, few studies have addressed fine-grained classification with the recent convolutional neural network ConvNeXt, and no effective optimization method has been made available. To achieve more accurate fine-grained classification, this paper proposes two attention embedding methods for the ConvNeXt network and designs a new bilinear CBAM; in addition, a multiscale, multi-perspective, all-around attention framework is proposed and applied to ConvNeXt. Experiments show that the accuracy of the improved ConvNeXt for fine-grained image classification reaches 87.8%, 91.2%, and 93.2% on the fine-grained classification datasets CUB-200-2011, Stanford Cars, and FGVC Aircraft, respectively. These are increases of 2.7, 0.3, and 0.4 percentage points over the original network without optimization, and of 3.7, 8.0, and 2.0 percentage points over the traditional BCNN. In addition, ablation experiments verify the effectiveness of the proposed attention framework.

Keywords:

deep learning; convolutional neural network; image classification; fine grained; attention mechanism

1. Introduction

The main task of fine-grained classification is to distinguish subcategories within a broader category, such as different kinds of birds, airplanes, and cars. Because interclass differences are small, intraclass differences are large, and the position of the target object varies, fine-grained image classification is very hard to achieve with a convolutional neural network (CNN). With the rapid development of deep learning, the classification accuracy of existing deep learning models has surpassed what humans can achieve. The classification accuracies of common image classification algorithms, such as ResNet [1], ResNeXt [2], SENet [3], MobileNet [4], ShuffleNet [5], DenseNet [6], EfficientNet [7] and their variant networks [8,9,10,11,12,13], as well as transformer models that originate from natural language processing (NLP) and are derived from a self-attention mechanism [14], such as the vision transformer [15] and Swin transformer [16], have reached new highs. However, their fine-grained classification accuracy still needs to be improved.

Although the existing fine-grained image classification algorithms are becoming increasingly accurate, they cannot be trained to match nonlinear relationships between the channels of a feature map during the process of network training or realize information interaction between channels. Thus, it is impossible for these network models to accurately identify fine-grained features. Even if the classification accuracy is improved, the final optimization method has some limitations. Moreover, due to the large number of computations required for fine-grained classification tasks, there are more stringent requirements for the hardware equipment used in network training. This means existing algorithms are unlikely to run on equipment with general hardware conditions.

To solve these problems, this paper builds on existing optimization methods and uses ConvNeXt-Tiny to develop an attention framework for the nonlinear relationships in different directions of a feature map. This framework is then embedded into the backbone network using the best-performing embedding technique to accurately locate fine-grained features and to integrate feature maps in an all-around manner, from different aspects and at multiple scales, further improving the accuracy of the network in fine-grained classification. The aim is to provide more research ideas and network optimization methods for the fine-grained image classification field, especially for the recent CNN ConvNeXt, and thereby to promote further development in this field; the optimization method put forward here provides strong technical support for this network in fine-grained classification. Progress in fine-grained research can not only enable highly accurate fine-grained classification of goods and faces in society, but also provide a strong guarantee for the future application of AI technology in medicine and criminal investigation.

2. Related Work

2.1. Analysis of Traditional Methods

At present, there are two families of algorithms for fine-grained image classification. The first is based on strongly supervised learning. Because this approach is costly, time consuming and labor intensive, it is gradually being replaced by the second family, classification algorithms based on weak supervision. The main advantage of weak supervision is that only the class label of the image is used; no additional annotation is required for training. The network’s ability to extract and locate features, and thus to achieve fine-grained image classification, is improved by using special network mechanisms, such as residual, bilinear, and attention methods, together with special processing methods, such as bilinear pooling, data augmentation (such as Mixup and Cutmix), or bilinear cutting technology, through the inductive bias characteristics of CNN convolution.

The use of weakly supervised learning for fine-grained image classification is a general current trend. The most representative technique is the integration of the cascade attention mechanism based on a CNN network proposed by Zhu et al. [17]. The network has three modules: spatial confusion attention, cross network attention and network fusion attention. They complement each other and are jointly trained and optimized. Yan et al. [18] proposed an attribute-guided attention network. They use long short-term memory (LSTM) to gradually accumulate the learned information for judging semantics from the starting node to the deeper ones, providing a feature representation method that can be used for both judgement and learning. The classification accuracy of this algorithm on the fine-grained bird dataset CUB200-2011 reaches 85.1%. The bilinear convolutional neural network (BCNN) is a network model designed by Lin et al. [19] for fine-grained image classification tasks. Figure 1 presents the structure chart of a bilinear network. It shows two feature extractors, i.e., CNN1 and CNN2. After the outputs of CNN1 and CNN2 are processed and pooled by the outer product, the softmax activation function outputs the prediction results. The classification accuracy of this model on the fine-grained bird dataset CUB200-2011 reaches 84.1%.

Yu et al. [20] put forward the hierarchical bilinear pooling (HBP) method, which regards each step of convolution in a CNN as an extraction operation for component features. The convolution output features of different layers are aggregated through bilinear pooling to build an interaction model between feature layers to obtain more subtle target fine-grained features. Li et al. [21] proposed a bilinear network based on an attention mechanism (BAM BCNN) according to the bilinear nature of the BCNN. It uses two-level attention to obtain object regions and component regions, extracts images in different stages, and adds an attention mechanism through a bilinear network.

Referring to relevant research on fine granularity and network-optimization methods, in this paper, ConvNeXt is applied to the fine-grained classification task and further optimized and rectified so that it can accurately locate the fine-grained features and perform highly accurate fine-grained classification.

A sequential block diagram of the research ideas in this paper is shown in Figure 2.

2.2. Introduction to ConvNeXt Network

As research on deep learning progresses, new networks continuously emerge in the image classification field. However, in the coarse-grained classification field, the Swin transformer has gradually replaced CNNs. Later, ConvNeXt [22], designed with reference to the Swin transformer, further increased classification accuracy by means of the Swin transformer’s layer structure, downsampling method, activation function, data-processing method, inverted bottleneck and depthwise convolution. This network restored the importance of the CNN in image classification. With great feature extraction abilities and a small number of parameters, ConvNeXt-Tiny has low requirements for hardware conditions during training. Therefore, the algorithm in this paper is further improved based on ConvNeXt-Tiny. Below are charts showing network structures of ConvNeXt-Tiny and ConvNeXt Block.

Figure 3 can be compared to the layer structure of ResNet50. In addition to the downsampling method and layer normalization added to the network, its cycling method and structure combination are basically similar to those of ResNet50. Figure 4 can be compared to the layer structure of MobileNetV2. Inverted residuals and the MLP structure of Swin transformer are used to obtain ConvNeXt Block. A ConvNeXt designed in this way combines the processing method of the Swin transformer and the characteristics of convolution, which further improves the accuracy of coarse-grained image classification.

3. Proposed Algorithm

This section introduces a series of improvements and optimizations made based on ConvNeXt-Tiny for fine-grained classification tasks that strengthen the network’s ability to accurately locate and extract fine-grained features by improving and innovating the attention mechanism.

3.1. ConvNeXt Model Embedded with Attention Mechanism

In the ConvNeXt-Tiny network structure, there are four types of blocks depending on the number of channels in the output feature map. Each type of block cycles a set number of times: the first type cycles 3 times, outputting a 96-dimensional feature map; the second type cycles 3 times, outputting a 192-dimensional feature map; the third type cycles 9 times, outputting a 384-dimensional feature map; and the fourth type cycles 3 times, outputting a 768-dimensional feature map. Each time the feature map undergoes a downsampling operation before a block, the final output size (W, H) of the block becomes half that of the original, and the output dimension is doubled under the action of convolution. As the number of channels grows in this way, it becomes important to assign weight values to the channels.
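The stage layout can be tabulated with a short sketch. It assumes a 224 × 224 input and the 4× stem downsampling of the ConvNeXt design, neither of which is stated in this section, and it uses nine repeats for the third stage, as noted for the Tiny version in the discussion of SBA. The function name `stage_shapes` is illustrative only.

```python
# Stage layout of ConvNeXt-Tiny. The input size and the 4x stem stride are
# assumptions taken from the ConvNeXt design, not from this section.
DEPTHS = (3, 3, 9, 3)          # block repeats per stage
DIMS = (96, 192, 384, 768)     # output channels per stage


def stage_shapes(input_size=224, stem_stride=4):
    """Return (channels, height, width, repeats) for each of the four stages."""
    size = input_size // stem_stride      # the stem reduces resolution first
    shapes = []
    for i, (depth, dim) in enumerate(zip(DEPTHS, DIMS)):
        if i > 0:
            size //= 2                    # downsampling layer before stages 2-4
        shapes.append((dim, size, size, depth))
    return shapes


for stage, shape in enumerate(stage_shapes(), start=1):
    print(stage, shape)
```

Under these assumptions the spatial size halves and the channel count doubles from one stage to the next, which is the pattern described above.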

In fine-grained classification tasks, accurate localization of targets is necessary for feature extraction. So that the network can learn specific target features during training, two schemes named after the embedding position are proposed: every block attention (EBA) and several block attention (SBA).

Integrating the attention mechanism with each block in the ConvNeXt-Tiny network can enable the attention mechanism to be embedded in the internal cycle of each type of block. This is called EBA. When the attention weight is applied to the feature map with inverted residuals, attention parameters are cyclically trained in each block. Then each block cycling object forms a feature map with attention weight, which realizes the repeated training of attention parameters. For the specific structure of the modified ConvNeXt Block, see Figure 5:

The integration of the attention mechanism with the block after ConvNeXt-Tiny cycling is called SBA. The attention mechanism does not interfere with the block cycling process. Instead, it acts on the feature map after each type of block cycling comes to an end so that the training of attention weight depends on the whole cycled feature map rather than a single block. This can reduce the number of attention parameter trainings. In addition, to enable the network to solve the problems of degradation, exploding gradient, and vanishing gradient of the deep network, the downsampled feature map is added to the output of the attention module in the form of a residual shortcut based on the channel. For the specific structure, see Figure 6.

Due to the residual structure, when the attention mechanism is combined through SBA, the residual connection outside the attention of the third type of block is cancelled (as shown in Figure 6). This is because the third type of block cycles the most in all versions of ConvNeXt, for example nine times in the Tiny version. If an external residual connection were added, a large amount of unprocessed, shallow semantic information would be integrated into the attention feature map, weakening the overall feature extraction ability and learning ability of the network.
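The difference between the two embedding positions can be made concrete by listing where the attention modules sit. The sketch below is only a schematic of the placement described above: the layer names are hypothetical labels, and `DEPTHS` uses the ConvNeXt-Tiny repeat counts with nine repeats in the third stage.

```python
DEPTHS = (3, 3, 9, 3)  # ConvNeXt-Tiny block repeats per stage


def build_layout(mode):
    """List the backbone layers with attention placed according to `mode`.

    mode="EBA": one attention module inside every block.
    mode="SBA": one attention module after each stage has finished cycling.
    """
    layout = []
    for stage, depth in enumerate(DEPTHS, start=1):
        layout.append(f"downsample{stage}")
        for b in range(depth):
            layout.append(f"stage{stage}.block{b}")
            if mode == "EBA":
                layout.append(f"stage{stage}.block{b}.attn")
        if mode == "SBA":
            # The outer residual shortcut is dropped for the third stage.
            shortcut = "no-shortcut" if stage == 3 else "shortcut"
            layout.append(f"stage{stage}.attn({shortcut})")
    return layout


def count_attention(layout):
    """Number of attention modules in a layout."""
    return sum("attn" in name for name in layout)


print(count_attention(build_layout("EBA")))  # one per block
print(count_attention(build_layout("SBA")))  # one per stage
```

With these depths, EBA places 18 attention modules and SBA places 4, which is consistent with the statement that SBA reduces the number of attention parameter trainings.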

3.2. Bilinear Attention Mechanism

In addition to embedding an attention mechanism in ConvNeXt-Tiny for the weight allocation of fine-grained features, in this paper, the convolutional block attention module (CBAM) mechanism [23] is further improved, and a new attention mechanism is proposed to improve the accuracy of ConvNeXt in performing fine-grained classification tasks to the fullest extent. This section introduces the improved CBAM in detail.

Following the idea of a bilinear CNN, this paper improves the CBAM mechanism and proposes a bilinear CBAM (BCBAM). The feature information of the channel and spatial dimensions is integrated to improve the channel attention module (CAM) in the CBAM. Considering the bilinearity of the BCNN, the original parallel maximum pooling and average pooling branches are retained to preserve the feature information of the original map to the fullest extent. The outputs are then concatenated along the channel dimension. The fully connected layer in the original structure is changed to a 1 × 1 convolutional layer, and layer normalization is utilized for batch data processing. Since ConvNeXt uses the same Gaussian error linear unit (GeLU) activation function as the Swin transformer, the rectified linear unit (ReLU) activation function in the CAM is replaced with a GeLU with random regularity. Its structure is shown in Figure 7:

The processing of input data by the network mentioned above is shown in Formula (1):

\begin{aligned} F_{a1} &= \sigma(\mathrm{Conv}(\mathrm{GeLU}(\mathrm{LN}(\mathrm{Conv}(\mathrm{Concat}(\mathrm{Avgpool}(F), \mathrm{Maxpool}(F))))))) \\ &= \sigma(W_{1}(\mathrm{GeLU}(\mathrm{LN}(W_{0}(\mathrm{Concat}(F_{avg}^{c}, F_{max}^{c})))))) \end{aligned}

(1)

where F_a1 represents the feature map output by the CAM mechanism; Conv is the convolution; and σ and GeLU are activation functions. After the GeLU activation function is applied, F ∈ R^(B,2C/r,W,H), and after the sigmoid activation function is applied, F ∈ R^(B,C,W,H); Avgpool and Maxpool are the average pooling and maximum pooling, respectively.
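The channel counts in Formula (1) can be traced step by step. The sketch below is bookkeeping only, not the authors' implementation: it treats r as a reduction ratio, and the value r = 16 in the example is a hypothetical choice, since no value is given here. The weight count ignores biases and layer normalization parameters.

```python
def cam_channel_trace(channels, reduction):
    """Channel count after each step of Formula (1), plus 1x1 conv weights."""
    concat = 2 * channels            # Avgpool and Maxpool outputs concatenated
    squeezed = concat // reduction   # first 1x1 convolution W0, then LN + GeLU
    restored = channels              # second 1x1 convolution W1, then sigmoid
    weights = concat * squeezed + squeezed * restored  # W0 and W1, no biases
    return {
        "concat": concat,
        "after_gelu": squeezed,
        "after_sigmoid": restored,
        "conv_weights": weights,
    }


# Hypothetical example: first-stage width C = 96 with an assumed r = 16.
print(cam_channel_trace(96, 16))
```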

Next, bilinearity is brought into the spatial attention module (SAM). He Kai et al. [24] proposed a spatial attention method that uses 1 × 1 and 3 × 3 convolutions to obtain more abundant and diverse feature information. In this paper, this method is improved on this basis. First, 1 × 1 and 3 × 3 convolution operations are performed for the input feature map, respectively. After obtaining spatial features of different scales, layer normalization and GeLU activation function mappings are carried out for the two features. Then, 1 × 1 convolution is used to reduce the dimensions of the features to single channels. These two single-channel features are multiplied to obtain the multiscale spatial attention features, which are then multiplied by the feature map with channel attention weight to obtain a feature map with both channel weight features and spatial features. Batch processing is performed for the data using layer normalization for two reasons. The first reason is to utilize the research result of ConvNeXt. Since batch processing is performed with layer normalization in transformer models, in this paper, the attention information is batched using layer normalization, which matches the data processing of ConvNeXt more closely. Second, batch normalization is always employed in CNNs. It is associated with the batch size used in training. The larger the batch size is, the better the effect. Nevertheless, the batch size is limited by the hardware but unrelated to the layer normalization. In terms of selecting the size of the convolution kernel, 1 × 1 and 3 × 3 convolutions are used. In addition to obtaining spatial features at different scales, 1 × 1 convolution can reduce the complexity and number of calculations of the entire network. In addition, most GPUs have optimized the calculation of 3 × 3 convolution, which further improves the efficiency.

Optimization and modification are performed based on the considerations mentioned above. Figure 8 below shows the improved structure of the SAM.

The processing of the input data by the network structure mentioned above is indicated in Formulas (2)–(4):

F_{a2} = \mathrm{Sigmoid}(F_{1} \otimes F_{2})

(2)

F_{1} = \mathrm{Conv}_{1 \times 1}(\mathrm{GeLU}(\mathrm{LN}(\mathrm{Conv}_{1 \times 1}(F))))

(3)

F_{2} = \mathrm{Conv}_{1 \times 1}(\mathrm{GeLU}(\mathrm{LN}(\mathrm{Conv}_{3 \times 3}(F))))

(4)

where F_a2 represents the feature map output by the SAM mechanism; Conv_1×1 is convolution with a 1 × 1 convolution kernel; Conv_3×3 is convolution with a 3 × 3 convolution kernel; after GeLU is used, F ∈ R^(B,C/r,W,H); and after sigmoid is used, F ∈ R^(B,1,W,H).
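The gating step in Formula (2) can be illustrated on small single-channel maps. The sketch below covers only the final multiplication and sigmoid; the two inputs stand in for the branch outputs F_1 and F_2 of Formulas (3) and (4), and their values are invented for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def spatial_attention(f1, f2):
    """Formula (2): sigmoid of the element-wise product of two H x W maps."""
    return [[sigmoid(a * b) for a, b in zip(row1, row2)]
            for row1, row2 in zip(f1, f2)]


# Hypothetical single-channel outputs of the 1x1 branch (F1) and 3x3 branch (F2).
f1 = [[2.0, 0.0], [-1.0, 1.0]]
f2 = [[2.0, 3.0], [1.0, -1.0]]
fa2 = spatial_attention(f1, f2)
for row in fa2:
    print([round(v, 3) for v in row])
```

Because the two maps are multiplied before the sigmoid, a position receives a weight above 0.5 only where the two scales respond with the same sign, and exactly 0.5 where either branch is zero.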

Since the ordinary attention mechanism (squeeze and excitation, SE) only allocates weights between channels and does not consider the spatial perspective, a bilinear attention mechanism with two branches is built. The first branch provides channel weights, while the second branch provides spatial weights. The original CBAM mechanism processes the two serially. Because fine-grained classification requires the network to extract feature attention more accurately, serial processing is replaced here by parallel processing so that channel and spatial attention are trained at the same time. This ensures that sufficient semantic information is retained while the multiscale features of the target are obtained. For the specific attention network structure, see Figure 9 below.

The abovementioned bilinear attention mechanism can be summarized by the formula below:

F_{out} = (F_{a1} \otimes F_{in}) \otimes F_{a2} \oplus F_{in}

(5)

where F_a1 is the feature map output by the CAM; F_a2 is the feature map output by the SAM; F_in is the feature map of the input BCBAM; and F_out is the feature map of the output BCBAM.
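Formula (5) can be worked through on a tiny example. The sketch below assumes that ⊗ is element-wise multiplication and ⊕ is element-wise addition, that F_a1 has the same C × H × W shape as F_in, and that the single-channel F_a2 is shared by all channels; the batch dimension is omitted and the numbers are invented.

```python
def bcbam_combine(f_in, f_a1, f_a2):
    """Formula (5): F_out = (F_a1 * F_in) * F_a2 + F_in, element-wise.

    f_in, f_a1: nested lists of shape C x H x W.
    f_a2: nested list of shape H x W, applied to every channel.
    """
    out = []
    for x_c, a1_c in zip(f_in, f_a1):
        out.append([[a1 * x * a2 + x
                     for x, a1, a2 in zip(x_row, a1_row, a2_row)]
                    for x_row, a1_row, a2_row in zip(x_c, a1_c, f_a2)])
    return out


# Two channels, one row, two columns (hypothetical values).
f_in = [[[1.0, 2.0]], [[3.0, 4.0]]]
f_a1 = [[[0.5, 0.5]], [[1.0, 1.0]]]   # channel attention weights
f_a2 = [[0.0, 1.0]]                   # spatial attention weights
print(bcbam_combine(f_in, f_a1, f_a2))
```

Where the spatial weight is 0 the input passes through unchanged via the shortcut, and where both weights are 1 the input is doubled, so the attention branch can only add emphasis on top of the original feature map.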

3.3. BCBAM-Based Attention Framework

In Section 3.1 of this paper, two methods for embedding attention mechanisms into ConvNeXt are proposed to study the full embedding position of the attention mechanism in the entire network. Here, further study is carried out on the specific attention framework for the specific position.

Figure 10 and Figure 11 indicate how the SE and CBAM mechanisms in SEResNet and CBAMResNet are embedded in ResNet. This section proposes a new multi-perspective attention framework. As shown in Figure 10, the shortcut attention branch provides the bilinear attention mechanism proposed in this paper as well as the size of the feature map passing over each module. Considering each channel and space of the feature map and the information interaction between channels, the attention framework shown in Figure 12 is proposed to add to the ECA mechanism, which considers the information interaction between channels [25]. ECA can compensate for the lost interaction information between channels caused by the dimension increase and decrease in convolution. This scheme unifies the relationships in different directions of the feature map and acts on the feature map without network block cycling, which retains the channel interaction information of the original feature map to a large degree. As a result, the attention mechanism can repair deficiencies during the process of training, realizing all-around and multi-perspective attention parameter training.
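The channel interaction that ECA contributes can be sketched as a one-dimensional convolution over the per-channel averages, followed by a sigmoid. This is a simplified illustration rather than the configuration used in this paper: the adaptive kernel-size rule with gamma = 2 and b = 1 follows the ECA paper [25], whether it is used here is an assumption, and the kernel weights in the example are invented (they are learned in a real network).

```python
import math


def eca_kernel_size(channels, gamma=2, b=1):
    """Adaptive odd kernel size, following the rule given in the ECA paper."""
    t = int(abs((math.log2(channels) + b) / gamma))
    return t if t % 2 else t + 1


def eca_weights(channel_means, kernel):
    """1D convolution across neighbouring channels, then sigmoid.

    channel_means: per-channel global average of the feature map.
    kernel: odd-length list of weights (learned in a real network).
    """
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(channel_means) + [0.0] * pad
    weights = []
    for i in range(len(channel_means)):
        s = sum(k * padded[i + j] for j, k in enumerate(kernel))
        weights.append(1.0 / (1.0 + math.exp(-s)))
    return weights


print([eca_kernel_size(c) for c in (96, 192, 384, 768)])
print(eca_weights([0.2, 1.0, -0.5, 0.3], [0.25, 0.5, 0.25]))
```

Each channel weight depends on the channel and its immediate neighbours without any reduction of the channel dimension, which is how ECA keeps the interaction information that a squeeze to fewer channels would discard.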

In addition, when promoting the original ConvNeXt to build a deeper network structure, this framework aggregates the feature map before the input block cycle with the attention output in the form of relationships between channels, which helps the network extract and focus on fine-grained features. Moreover, the ReLU activation function after the addition of the attention feature map and the original feature map is canceled. This can improve the final recognition accuracy of the network model. In the subsequent comparison experiment network, the ReLU activation function is also removed here.

3.4. ConvNeXt Based on Multiscale Bilinear Attention Mechanism

Based on the EBA, SBA, and multi-perspective attention frameworks proposed in this paper, the BCBAM mechanism is now integrated into ConvNeXt-Tiny with the attention framework proposed in this paper through EBA and SBA. Then, we have BCBAM ConvNeXt. During the process of training, we can not only repeatedly obtain multiscale attention features from the feature map, but also accurately extract fine-grained features of the target.

Figure 13 indicates how the BCBAM proposed in this paper is embedded into the ConvNeXt-Tiny network through SBA. The same can be achieved in the ConvNeXt-Small, ConvNeXt-Base, ConvNeXt-Large and ConvNeXt-Xlarge versions. With increasing network size and depth, adding BCBAM theoretically improves the accuracy of fine-grained classification.

4. Experimental Results and Analysis

To verify the feasibility of using ConvNeXt-Tiny for fine-grained image classification and to measure how much BCBAM improves the model’s classification ability, the following experiments are set up. After the network is built, it is tested on internationally recognized standard fine-grained classification datasets and compared with traditional fine-grained classification networks. This provides a baseline for the improved network proposed in this paper and verifies its effectiveness and superiority.

4.1. Dataset Introduction

We select internationally recognized fine-grained image classification datasets for the following classification tests, i.e., CUB200-2011 [26], FGVC-Aircraft [27], and Stanford Cars [28].

CUB200-2011: Bird classification dataset that contains 200 categories of birds, 5994 training set samples, and 5794 test set samples totaling more than 10,000 bird pictures.

Stanford Cars: Car classification dataset that contains 196 categories of cars, 8144 training set samples, and 8041 test set samples, totaling more than 16,000 car pictures.

FGVC-Aircraft: Aircraft classification dataset that contains 100 categories of aircraft, 6667 training set samples, and 3333 test set samples, totaling 10,000 pictures.

The specific information is shown in Table 1 below:

Figure 14 below presents example pictures from each dataset:

4.2. Preconditions of Experiment and Environment Description

To verify the feasibility and reliability of the proposed method, the following experiments were conducted:

(1): Experiment 1: For the ConvNeXt-Tiny network, EBA and SBA were used to embed the traditional attention mechanisms (SE, CBAM) and the proposed BCBAM mechanism, respectively. The applicability and robustness of the proposed method were verified on different fine-grained datasets.
(2): Experiment 2: On the basis of Experiment 1, the CUB200-2011 dataset was used to conduct an ablation experiment on the optimal optimization method to remove the influence of some components in the attention mechanism on the performance of the mechanism so as to better understand the behavior of the frame.
(3): Experiment 3: The traditional fine-grained classification network is compared, and the superiority and practicality of the ConvNeXt-Tiny network embedded into BCBAM in implementing fine-grained classification tasks are verified.

To ensure the better robustness of the network obtained in the experiment, the dataset is further processed. The Cutmix data enhancement algorithm is used to enhance the data of the training set to improve the model’s ability to extract fine-grained features of the target during the process of training. Then, during the process of network training optimization, the exponential moving average (EMA) is used. This enables the network to maintain shadow weights when the backpropagation gradient is decreasing, preventing the jitter of the model parameters at the global optimum and making the network more robust during training. The above training strategy can improve the model accuracy slightly with a stable training process.
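Both training aids can be sketched in a few lines. The sketch follows the usual CutMix box construction and a plain EMA update; the decay of 0.999, the patch centre, and the mixing ratio are hypothetical values chosen for illustration, and plain lists stand in for image and weight tensors.

```python
import math


def cutmix_box(width, height, lam, cx, cy):
    """Return the patch (x1, y1, x2, y2) to paste and the adjusted mix ratio.

    lam is the share of the original image to keep; (cx, cy) is the patch
    centre. In training both would be drawn at random.
    """
    cut_w = int(width * math.sqrt(1.0 - lam))
    cut_h = int(height * math.sqrt(1.0 - lam))
    x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, width)
    y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, height)
    # Recompute lam from the clipped box so the label mix matches the pixels.
    lam_adj = 1.0 - (x2 - x1) * (y2 - y1) / (width * height)
    return (x1, y1, x2, y2), lam_adj


def ema_update(shadow, params, decay=0.999):
    """Move the shadow weights a small step toward the current weights."""
    return [decay * s + (1.0 - decay) * p for s, p in zip(shadow, params)]


box, lam = cutmix_box(224, 224, lam=0.75, cx=112, cy=112)
print(box, lam)

# Weights that jitter around 1.0 leave the shadow copy almost unchanged.
shadow = [1.0]
for step in range(100):
    noisy = [1.0 + (0.1 if step % 2 == 0 else -0.1)]
    shadow = ema_update(shadow, noisy)
print(shadow)
```

The shadow weights are the ones evaluated at test time in a typical EMA setup, which is why jitter in the trained weights has little effect on the reported model.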

The learning rate is decayed by a cosine function according to the training period. The decay period is the epoch of the entire training, and the initial learning rate is set to 0.00008. The network is trained according to the cross-entropy loss function.
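The schedule can be written as a one-line function. The sketch uses the initial learning rate of 0.00008 given above; the total of 100 epochs and the final learning rate of zero are assumptions for illustration.

```python
import math


def cosine_lr(epoch, total_epochs, base_lr=8e-5, min_lr=0.0):
    """Learning rate after `epoch` epochs of a single cosine decay period."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cosine)


# Hypothetical 100-epoch run: start, midpoint, and end of training.
for epoch in (0, 50, 100):
    print(epoch, cosine_lr(epoch, 100))
```

The rate starts at the initial value, passes through half of it at the midpoint, and reaches the minimum at the last epoch, since the decay period spans the whole training run.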

The performance comparisons between individual models were carried out on three datasets, CUB200-2011, FGVC-Aircraft, and Stanford Cars, each with a large number of categories, as shown in Table 1. Due to space limitations, it is not possible to show the confusion matrix or per-category evaluation parameters (e.g., precision and recall) for every model, and averaging precision and recall over so many categories is not a convincing way to compare models on fine-grained datasets. Therefore, this paper uses the accuracy rate to compare and validate the performance of each model for fine-grained classification. The accuracy rates of the individual batches are averaged to obtain the classification accuracy of one round of testing (the number of test rounds equals the number of training epochs), and the highest round accuracy is recorded as the final classification accuracy of the model. In summary, the accuracy rate is the most convincing metric to validate the model performance for this study.

The model accuracy is calculated as shown in Equation (6).

\text{Accuracy Rate} = \frac{TP + TN}{TP + TN + FP + FN} \times 100\%

(6)

where TP means true positive; FP means false positive; TN means true negative; and FN means false negative.
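The reporting procedure (accuracy per batch, averaged per test round, with the best round kept) can be sketched as follows. For single-label multi-class prediction, the accuracy rate is taken here as the share of samples whose predicted class equals the label; the function names and the toy data are hypothetical.

```python
def accuracy_rate(predictions, labels):
    """Percentage of samples whose predicted class equals the true class."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)


def best_round_accuracy(rounds):
    """rounds: list of test rounds, each a list of (predictions, labels) batches.

    Batch accuracies are averaged per round; the best round is returned.
    """
    best = 0.0
    for batches in rounds:
        batch_acc = [accuracy_rate(p, y) for p, y in batches]
        best = max(best, sum(batch_acc) / len(batch_acc))
    return best


# Two hypothetical test rounds with two batches of two samples each.
rounds = [
    [([0, 1], [0, 0]), ([1, 1], [1, 1])],   # 50% and 100% -> 75%
    [([0, 1], [0, 0]), ([0, 1], [1, 1])],   # 50% and 50%  -> 50%
]
print(best_round_accuracy(rounds))
```

Averaging per-batch accuracies equals the accuracy over the whole test set only when all batches have the same size; a smaller last batch would be slightly over-weighted.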

The computer configuration used in the experiment is as follows: an Intel Core i7-9000 K CPU and a GTX1060 GPU with 6 GB of video memory, running Windows 10, with a Python programming environment and the PyTorch deep learning framework for fine-grained model training and testing.

4.3. Analysis of Results

Experiment 1 is designed to verify the generality and robustness of the improved network and BCBAM on different datasets.

Experiment 1 uses the two integration methods (EBA and SBA) and the attention framework proposed in this paper to conduct classification experiments on three fine-grained datasets: CUB200-2011, FGVC-Aircraft and Stanford Cars. The details are recorded in Table 2.

Table 2 shows how the process of embedding different attention mechanisms into ConvNeXt-Tiny using different methods affects the fine-grained classification accuracy. According to the analysis of the experimental results, the integration of the BCBAM with the attention framework through SBA can more effectively improve the accuracy of the model’s fine-grained classification. For different datasets, the accuracy is still the highest, which shows that the BCBAM ConvNeXt proposed in this paper has great generalization ability and good robustness.

According to the experimental results indicated in the table above, the BCBAM proposed in this paper significantly improves ConvNeXt’s ability to extract fine-grained features. It can be seen from the comparison of the results of the two integration methods that the SBA method can better utilize the fine-grained feature extraction ability of ConvNeXt. In addition, it takes nearly half of the time used by EBA in training.

Figure 15 shows the accuracy and loss curves of the model during testing. The curves show that the model converges quickly: CUB200-2011 requires only 44 epochs, FGVC-Aircraft requires 80 epochs, and Stanford Cars reaches convergence at 88 epochs. The same effect is reflected in the loss curve, where the loss decreases quickly, so the network reaches the stable stage sooner and the training time is shortened significantly. We conclude that the model in this paper converges quickly and has high classification accuracy.

Next, the model was evaluated in two ways: (1) the confusion matrix, and (2) model evaluation parameters such as precision, recall, and specificity. However, the smallest of the datasets used in this paper (CUB200-2011, FGVC-Aircraft, and Stanford Cars) has 100 classes, so the smallest confusion matrix is 100 × 100 and cannot be displayed properly. The table below therefore reports the results of model evaluation using the precision, recall, and specificity parameters. Per-category values cannot be displayed properly either because there are so many categories; therefore, the precision, recall, and specificity of the model are averaged over the categories of each dataset to reflect the overall classification effect of the model on that dataset. This is shown in Table 3.

According to Table 3, the precision, recall and specificity of the model for classifying birds are 88.2%, 87.9% and 99.9%, respectively. On average over the categories, 88.2% of the samples that the model predicts as a given category are correctly classified; 87.9% of the samples that belong to a given category are recalled and correctly classified; and 99.9% of the samples that do not belong to a given category are assigned by the model to other categories. The evaluation parameters for the remaining two datasets are interpreted in the same way. In general, the evaluation parameters indicate that the robustness and reliability of the model for fine-grained classification are relatively high and meet most fine-grained classification requirements.
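The class-averaged parameters can be computed from a confusion matrix in a one-versus-rest fashion. The sketch below shows that calculation under the assumption of unweighted (macro) averaging; the 3 × 3 matrix is an invented example, not data from Table 3.

```python
def macro_metrics(confusion):
    """Class-averaged precision, recall, and specificity.

    confusion[i][j] is the number of samples of true class i predicted as
    class j. Each class is scored one-versus-rest and the scores are averaged.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    precision = recall = specificity = 0.0
    for k in range(n):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp
        fp = sum(confusion[i][k] for i in range(n)) - tp
        tn = total - tp - fn - fp
        precision += tp / (tp + fp) if tp + fp else 0.0
        recall += tp / (tp + fn) if tp + fn else 0.0
        specificity += tn / (tn + fp) if tn + fp else 0.0
    return precision / n, recall / n, specificity / n


# Hypothetical 3-class confusion matrix with 10 samples per class.
confusion = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 2, 8],
]
print(macro_metrics(confusion))
```

With 100 to 200 classes, the true negatives of each one-versus-rest split vastly outnumber the false positives, which is why the averaged specificity sits close to 100% even when precision and recall are lower.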

To present the optimization result more intuitively, in Figure 16, we show the visualized attention result predicted by the model.

Note: The red area in the image is the area on which the network model focuses its attention when predicting the image type; the darker the color, the more attention is concentrated there.

For Figure 16, the Grad-CAM algorithm is applied during prediction to the output of the last convolutional layer of the last block of the network to produce the visualization. The comparison shows that, for fine-grained classification, the network attention no longer focuses on the overall features of the target. Instead, training teaches it to capture the characteristics that distinguish subclasses. As shown in the car pictures (third column), after optimization with any of the three attention mechanisms, the network focuses on features that are helpful for classification and unique to the subclass, such as the front of the car, and successfully filters out features that are common between subclasses and not conducive to fine-grained classification, such as the wheels. Compared with the other two mechanisms, the BCBAM mechanism additionally attends to the distinguishing features of both the front and the body of the car, so it covers more comprehensive fine-grained features and is more helpful for subclass classification. The same behavior is observed on the other two datasets.
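The core computation of Grad-CAM can be sketched as below. This is a simplified illustration, not the visualization code used in the paper: it assumes the feature maps of the chosen layer and the gradients of the target class score with respect to them have already been extracted (in practice this is done with framework hooks, which are not shown), and the function name `grad_cam` and the tiny 2 × 2 inputs are hypothetical.

```python
def grad_cam(activations, gradients):
    """Compute a Grad-CAM heat map normalized to [0, 1].

    activations[k][i][j]: feature map k of the chosen convolutional layer.
    gradients[k][i][j]:   gradient of the target class score w.r.t. that value.
    """
    height, width = len(activations[0]), len(activations[0][0])
    # Channel weights: global average of each gradient map over all positions.
    weights = [sum(sum(row) for row in g) / (height * width) for g in gradients]
    # Weighted sum of the feature maps, followed by ReLU.
    cam = [
        [max(0.0, sum(w * a[i][j] for w, a in zip(weights, activations)))
         for j in range(width)]
        for i in range(height)
    ]
    # Scale to [0, 1] so the map can be rendered as a color overlay.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


# Two hypothetical 2x2 feature maps and their gradients.
toy_activations = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[0.0, 3.0], [1.0, 0.0]],
]
toy_gradients = [
    [[0.5, 0.5], [0.5, 0.5]],      # channel that supports the target class
    [[-1.0, -1.0], [-1.0, -1.0]],  # channel that argues against it
]
heatmap = grad_cam(toy_activations, toy_gradients)
```

In this toy case the second channel receives a negative weight, so the positions where only it is active are suppressed by the ReLU and the heat map highlights only the regions that support the predicted class.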

To verify the superiority and effectiveness of the proposed attention framework, an ablation experiment is designed as follows. The proposed attention framework differs from the traditional attention framework in three respects: the main network used to train the attention parameters, the presence or absence of the ECA mechanism, and the nonlinear ReLU activation. Ablation conditions based on these differences are evaluated on the CUB200-2011 dataset to study their influence on the performance of the framework and to better understand its behavior. Table 4 lists the experimental results.

The experimental results show that the classification accuracy on CUB200-2011 varies as components of the attention framework are removed. The comparison shows that applying the nonlinear ReLU activation to the output feature map seriously harms the fine-grained classification of the network: with ReLU, the highest accuracy is only 84.9%, below the 85.1% of the unoptimized network. Without ReLU activation, the attention framework improves the network regardless of which mechanism is used. The framework that omits ECA and uses only the BCBAM mechanism reaches an accuracy of 86.0%, while the attention framework that combines ECA and BCBAM (proposed in this article) fully exploits the advantages of each attention mechanism. Its classification accuracy on the CUB200-2011 dataset reaches 87.8%, which exceeds the original unoptimized ConvNeXt-Tiny by 2.7 percentage points. The ablation experiment shows that the proposed attention framework not only improves the ability of the overall network to capture fine-grained feature details, but also combines multiple attention mechanisms to achieve all-around, multi-perspective attention parameter training.
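The ablation comparison can be checked with a few lines of arithmetic. The sketch below simply encodes the Table 4 accuracies and computes each configuration's change relative to the unoptimized ConvNeXt-Tiny; the dictionary layout and variable names are assumptions chosen for the example.

```python
# Table 4 rows: (ReLU, ECA, BCBAM) -> accuracy in % on CUB200-2011.
ablation = {
    (False, False, False): 85.1,  # unoptimized ConvNeXt-Tiny
    (False, True, False): 85.8,
    (False, False, True): 86.0,
    (True, True, False): 84.9,
    (True, False, True): 79.3,
    (False, True, True): 87.8,    # framework proposed in the paper
    (True, True, True): 84.2,
}

baseline = ablation[(False, False, False)]
# Change in percentage points for every configuration that adds a component.
deltas = {
    config: round(accuracy - baseline, 1)
    for config, accuracy in ablation.items()
    if any(config)
}
best_config = max(ablation, key=ablation.get)

for (relu, eca, bcbam), delta in deltas.items():
    print(f"ReLU={relu!s:5} ECA={eca!s:5} BCBAM={bcbam!s:5} change={delta:+.1f}")
```

Every configuration without ReLU lands above the baseline and every configuration with ReLU lands below it, which matches the reading of Table 4 given above.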

Experiment 3 is designed to verify the superiority and applicability of ConvNeXt with embedded BCBAM in fine-grained classification tasks. The improved network model with the highest classification accuracy in Experiment 1, BCBAM ConvNeXt (SBA method), is compared with commonly used fine-grained classification algorithms. Table 5 lists the comparison results. It is worth noting that choosing an optimal training strategy for each model is the key to obtaining its best accuracy. The fine-grained classification experiments on ConvNeXt all use the training strategy introduced in Section 4.2, whereas each of the other typical fine-grained classification networks is trained with its own optimal strategy, which differs from that of Section 4.2, so that it reaches its own best accuracy.

The experimental results above show that, for the ConvNeXt-based feature extraction model, embedding the proposed attention framework into the network through SBA allows the final accuracy to exceed that of the traditional fine-grained classification network models, with good results on all three types of fine-grained classification tasks. Compared with the original network without optimization, the classification accuracies on CUB200-2011, FGVC-Aircraft, and Stanford Cars improve by 2.7, 0.3, and 0.4 percentage points, respectively. There are two main reasons why the improvement is largest on the bird dataset: (1) Birds are generally classified less accurately because of their complex backgrounds and smaller target size, so the bird dataset offers the improved model more previously misclassified samples to correct than the aircraft and car datasets, and the accuracy gain is therefore the most significant. (2) The fine-grained features of airplanes and cars are easier for the network to focus on than those of birds, so classification networks are generally more accurate on airplanes and cars; these datasets therefore contain fewer difficult samples, and the accuracy gain on them is smaller than on the bird dataset.
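The reported gains follow directly from the Table 5 accuracies, as the short calculation below illustrates. Only three rows of the table are included, and the helper name `gains` is an assumption introduced for the example.

```python
DATASETS = ("CUB200-2011", "FGVC-Aircraft", "Stanford Cars")

# Accuracy in % per dataset, copied from Table 5.
table5 = {
    "ConvNeXt-Tiny": (85.1, 91.8, 92.9),
    "BCNN": (84.1, 84.1, 91.3),
    "Bilinear CBAM ConvNeXt": (87.8, 92.1, 93.3),
}


def gains(model, reference):
    """Accuracy difference in percentage points, one value per dataset."""
    return tuple(
        round(m - r, 1) for m, r in zip(table5[model], table5[reference])
    )


gain_over_backbone = gains("Bilinear CBAM ConvNeXt", "ConvNeXt-Tiny")
gain_over_bcnn = gains("Bilinear CBAM ConvNeXt", "BCNN")

for name, backbone_gain, bcnn_gain in zip(DATASETS, gain_over_backbone, gain_over_bcnn):
    print(f"{name}: +{backbone_gain} vs ConvNeXt-Tiny, +{bcnn_gain} vs BCNN")
```

The differences are absolute percentage points, not relative improvements.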

The fine-grained classification networks used in the above experiments have clear advantages and disadvantages. The advantage is that nearly all of the network models achieve an accuracy of more than 80% on all three types of datasets, and the accuracy on the car and airplane classes can even exceed 90%, which already meets the classification requirements in many cases. The disadvantage is that there is still considerable room for improving the accuracy of fine-grained classification. In addition, except for the algorithm proposed in this paper, the other fine-grained classification algorithms have high hardware requirements and cannot complete the classification task under general hardware conditions.

In summary, comparing the experimental results of the different network models shows that the improved ConvNeXt network achieves higher classification accuracy than the traditional fine-grained classification networks.

5. Conclusions

Based on the ConvNeXt-Tiny network structure, we improve and optimize the network for fine-grained classification. First, we put forward two attention mechanism embedding methods according to the cyclic block structure of the network, i.e., EBA and SBA. Second, we study the attention structure. On the basis of CBAM, we improve the CAM and SAM, redesign them with the bilinear concept, and propose BCBAM, which is more effective in extracting fine-grained features. Third, by comparing the framework structures of the networks with embedded SE and CBAM mechanisms, we propose an attention framework based on BCBAM. This framework structure addresses the issues of each attention mechanism and effectively combines the advantages of the three types of attention to maximize the role of the attention mechanism, achieving multiscale, all-around, and multi-perspective attention parameter training. After comparing the classification accuracy on the fine-grained classification datasets CUB200-2011, FGVC-Aircraft, and Stanford Cars, we use Grad-CAM to visually analyze the attention distribution maps of the three mechanisms during prediction. The proposed attention framework embedded through SBA performs the best. Fourth, the importance of each component of the attention framework is verified through ablation experiments, which confirm the effectiveness of the framework structure in improving the fine-grained feature extraction ability of the network. Finally, to verify the superiority and robustness of the proposed network structure in fine-grained classification tasks, the accuracy of commonly used fine-grained classification algorithms on the three datasets is compared with the accuracy of this work. The results show that the proposed network structure has higher accuracy and better robustness in fine-grained classification tasks.

In the future, BCBAM may be combined with other network structures for fine-grained classification to further demonstrate its generality. In addition, we use ConvNeXt-Tiny as the basic network; because of hardware limitations, the method was not tested with the other specifications, such as ConvNeXt-Small, Base, Large, and XLarge. The research results of ConvNeXt show that its accuracy increases continuously as the specification is scaled up. Therefore, in theory, if the other network specifications were used in the experiments of this paper, their accuracy would also increase, and the degree of optimization could improve further. Finally, since the proposed algorithm and optimization approach have so far been validated experimentally on existing datasets, the algorithm will subsequently be applied to fine-grained classification tasks in practical applications.

Author Contributions

Conceptualization, Z.L. and T.G.; Data curation, T.G.; Formal analysis, T.G. and B.L.; Investigation, X.H. (Xin He); Project administration, Z.L.; Resources, X.H. (Xiangyu Hui); Supervision, B.L.; Validation, T.G. and W.X.; Visualization, T.G.; Writing – original draft, T.G.; Writing – review and editing, Z.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the following three projects: 1: Guangxi Key Technologies R&D Program (AB22035066); 2: Guangxi Science and Technology Major Project (AA22068064); 3: Guangxi Science and Technology Project (AD22080042).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The research results of this paper are supported by the following three projects: 1: Guangxi Key Technologies R&D Program (AB22035066); 2: Guangxi Science and Technology Major Project (AA22068064); 3: Guangxi Science and Technology Project (AD22080042). Thanks to the above three projects for the technical, equipment and financial support of this research, and to the other five authors for their technical assistance.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. He, K.-M.; Zhang, X.-Y.; Ren, S.-Q. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
2. Xie, S.; Girshick, R.; Dollar, P. Aggregated residual transformations for deep neural networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5987–5995.
3. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
4. Howard, A.-G.; Zhu, M.; Chen, B.; Kalenichenko, D. Mobilenets: Efficient Convolutional Neural Networks for Mobile Vision Applications. Available online: https://arxiv.org/pdf/1704.04861.pdf (accessed on 9 August 2021).
5. Zhang, X.-Y.; Zhou, X.-Y.; Lin, M.-X. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 6848–6856.
6. Huang, G.; Liu, Z. Densely connected convolutional networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2261–2269.
7. Tan, M.-X.; Quoc, V.L. Efficientnet: Rethinking model scaling for convolutional neural networks. arXiv 2019, arXiv:1905.11946.
8. Han, D.; Kim, J.; Kim, J. Deep pyramidal residual networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6307–6315.
9. Yamada, Y.; Iwamura, M.; Kise, K. Deep pyramidal residual networks with separated stochastic depth. arXiv 2016, arXiv:1612.01230.
10. Zhang, K.; Guo, L.-R.; Gao, C. Pyramidal RoR for image classification. Clust. Comput. 2019, 22, 5115–5125.
11. Yang, Y.-B.; Zhong, Z.-S.; Shen, T.-C. Convolutional neural networks with alternately updated clique. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 2413–2422.
12. Huang, G.; Liu, S.-C. CondenseNet: An efficient DenseNet using learned group convolutions. In Proceedings of the 2018 Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 2752–2761.
13. Chen, Y.; Li, J.; Xiao, H.; Jin, X.; Yan, S.; Feng, J. Dual Path Networks. In Proceedings of the International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 4467–4475.
14. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems 17; MIT Press: Cambridge, MA, USA, 2017; pp. 5998–6008.
15. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations, Virtual, 3–7 May 2021.
16. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021.
17. Zhu, Y.-X.; Li, R.C.; Yang, Y. Learning cascade attention for fine-grained image classification. Neural Netw. 2019, 122, 174–182.
18. Yan, Y.C.; Ni, B.B.; Wei, H.-W. Fine-grained image analysis via progressive feature learning. Neurocomputing 2020, 396, 254–265.
19. Lin, T.Y.; Roychowdhury, A.; Maji, S. Bilinear CNN models for fine-grained visual recognition. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Santiago, Chile, 7–13 December 2015; pp. 1449–1457.
20. Yu, C.-J.; Zhao, X.-Y.; Zheng, Q. Hierarchical bilinear pooling for fine-grained visual recognition. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; Springer: Heidelberg, Germany, 2018; pp. 574–589.
21. Li, K.-L.; Wang, Y.-H.; Chen, D.; Wang, J. Fine-grained Image Classification Combining Attention and Bilinear Networks. J. Chin. Comput. Syst. 2021, 42, 1071–1076.
22. Liu, Z.; Mao, H.; Wu, C.-Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A convnet for the 2020s. arXiv 2021, arXiv:2201.03545.
23. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
24. He, K.; Feng, X.; Gao, S.-N. A fine-grained image classification algorithm based on multi-scale feature fusion and repeated attention mechanism. J. Tianjin Univ. (Nat. Sci. Eng. Technol. Ed.) 2020, 53, 1077–1085. (In Chinese)
25. Wang, Q.; Wu, B.; Zhu, P. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11531–11539.
26. Wah, C.; Branson, S.; Welinder, P. The Caltech-UCSD Birds-200-2011 Dataset; Technical Report CNS-TR-2011-001; California Institute of Technology: Pasadena, CA, USA, 2011.
27. Maji, S.; Kannala, J.; Rahtu, E. Fine-Grained Visual Classification of Aircraft. arXiv 2013, arXiv:1306.5151.
28. Krause, J.; Stark, M.; Deng, J. 3D Object Representations for Fine-Grained Categorization. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Columbus, OH, USA, 23–28 June 2014.
29. Tan, M.; Wang, G.-J.; Zhou, J. Fine-grained classification via hierarchical Bilinear pooling with aggregated slack mask. IEEE Access 2019, 7, 117944–117953.
30. Zhang, J.; Zhang, R.-S.; Huang, Y.-P. Unsupervised Part Mining for Fine Grained Image Classification. Available online: https://arxiv.org/abs/1902.09941 (accessed on 22 May 2020).
31. Fu, J.; Zheng, H.; Mei, T. Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4438–4446.
32. Zheng, H.-L.; Fu, J.-L.; Mei, T. Learning multi-attention convolutional neural network for fine-grained image recognition. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; IEEE Computer Society Press: Los Alamitos, CA, USA, 2017; pp. 5219–5227.
33. Wei, X.; Zhang, Y.; Gong, Y. Grassmann pooling as compact homogeneous Bilinear pooling for fine-grained visual classification. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 355–370.

Figure 1. Structural schematic of a BCNN [19].

Figure 2. Sequential block diagram of research ideas.

Figure 3. Network structure chart of ConvNeXt-Tiny.

Figure 4. Structure chart of ConvNeXt Block.

Figure 5. Structure chart of EBA (based on each ConvNeXt block).

Figure 6. Structure chart of SBA (based on each type of block).

Figure 7. Improved CAM mechanism.

Figure 8. Improved SAM mechanism.

Figure 9. BCBAM mechanism.

Figure 10. SE framework [3].

Figure 11. CBAM framework [23].

Figure 12. Attention mechanism framework proposed in this paper.

Figure 13. Structure chart of bilinear CBAM ConvNeXt (SBA) network.

Figure 14. Some pictures from the three types of datasets.

Figure 15. (a) CUB-200-2011 Test Accuracy Curve. (b) FGVC-Aircraft Test Accuracy Curve. (c) Stanford Cars Test Accuracy Curve. (d) CUB-200-2011 Test Loss Curve. (e) FGVC-Aircraft Test Loss Curve. (f) Stanford Cars Test Loss Curve.

Figure 16. Heat maps of the three attention mechanisms.

Table 1. Dataset information.

Dataset	Number of Categories	Training Set/Picture	Test Set/Picture
CUB200-2011	200	5994	5794
FGVC-Aircraft	100	6667	3333
Stanford Cars	196	8144	8041

Table 2. Fine-grained classification with different attention mechanisms based on EBA and SBA.

Backbone	Integration Method	Improvement Method	CUB200-2011 Accuracy/%	FGVC-Aircraft Accuracy/%	Stanford Cars Accuracy/%
ConvNeXt-Tiny	EBA	SE	74.5	89.5	-
ConvNeXt-Tiny	EBA	CBAM	82.9	91.0	-
ConvNeXt-Tiny	EBA	BCBAM	83.7	91.7	-
ConvNeXt-Tiny	SBA	SE	82.7	91.7	92.9
ConvNeXt-Tiny	SBA	CBAM	84.7	92.0	92.7
ConvNeXt-Tiny	SBA	BCBAM	87.8	92.1	93.3

Table 3. Parameters evaluated by the model for different datasets (mean values).

Dataset	Precision/%	Recall/%	Specificity/%
CUB200-2011	88.2	87.9	99.9
FGVC-Aircraft	92.5	92.1	99.9
Stanford Cars	93.5	93.2	99.9

Table 4. Ablation experiments on CUB200-2011 based on our attention framework.

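Backbone	Dataset	ReLU Activation Function	ECA Mechanism	BCBAM Mechanism	Accuracy Rate/%
ConvNeXt-Tiny	CUB200-2011	×	×	×	85.1
ConvNeXt-Tiny	CUB200-2011	×	√	×	85.8
ConvNeXt-Tiny	CUB200-2011	×	×	√	86.0
ConvNeXt-Tiny	CUB200-2011	√	√	×	84.9
ConvNeXt-Tiny	CUB200-2011	√	×	√	79.3
ConvNeXt-Tiny	CUB200-2011	×	√	√	87.8
ConvNeXt-Tiny	CUB200-2011	√	√	√	84.2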

Table 5. Comparison of bilinear CBAM ConvNeXt and traditional network in classification task.

Backbone	CUB200-2011 Accuracy/%	FGVC-Aircraft Accuracy/%	Stanford Cars Accuracy/%
ConvNeXt-Tiny	85.1	91.8	92.9
ResNet50	81.6	88.9	92.0
SE-ResNet50	80.9	88.7	91.7
CBAM-ResNet50	81.2	89.2	92.0
ECA-ResNet50	79.5	88.9	91.5
BCNN [19]	84.1	84.1	91.3
HBP-RNet [29]	85.8	90.2	92.2
UPM [30]	81.9	85.9	89.2
RA-CNN [31]	85.3	88.2	92.5
MA-CNN [32]	86.5	89.9	92.8
GP-256 [33]	85.8	88.1	91.7
Bilinear CBAM ConvNeXt	87.8	92.1	93.3

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Li, Z.; Gu, T.; Li, B.; Xu, W.; He, X.; Hui, X. ConvNeXt-Based Fine-Grained Image Classification and Bilinear Attention Mechanism Model. Appl. Sci. 2022, 12, 9016. https://doi.org/10.3390/app12189016

