2.1. RepVGG
In 2014, the Visual Geometry Group at the University of Oxford introduced the VGG model [26], which exhibited strong performance in various computer vision tasks. The primary attribute of the VGG model is its depth, with networks typically comprising 16 (VGG-16) or 19 (VGG-19) successive convolutional and fully connected layers. While this depth allows the network to learn more complex features, it also exacerbates the vanishing gradient problem during training and leads to higher computational complexity and a larger number of parameters. The fully connected layers in particular require substantial computational resources and training time. Consequently, VGG networks have been progressively superseded by more advanced network architectures.
In 2021, Ding et al. [27] drew inspiration from residual structures and proposed the RepVGG [28] model. The defining characteristic of residual structures is their skip connections, where the input is added directly to the output. This design mitigates the vanishing gradient problem, enabling deeper network training.
RepVGG, in turn, features a multi-branch structure akin to residual structures during the training phase. Specifically, RepVGG's multi-branch structure comprises a 1 × 1 convolutional branch, a 3 × 3 convolutional branch, and an identity mapping branch, as illustrated in Figure 1. For example, given an input feature map X, the 3 × 3 convolutional branch output is A, the 1 × 1 convolutional branch output is B, and the identity mapping branch output is C. The output of the fundamental building block can then be expressed as:

Y = A + B + C = BN(X ∗ W_{3×3}) + BN(X ∗ W_{1×1}) + BN(X) (1)

where W_{3×3} and W_{1×1} denote the 3 × 3 and 1 × 1 convolution weight matrices, respectively, and ∗ denotes the convolution operation.
BN denotes the batch normalization operation. The main role of the BN layer is to normalize the output of the convolutional layer: it calculates the mean and variance of the convolutional layer's output and uses them to normalize that output. Let the mean be µ and the variance be σ². The BN layer is given by:

BN(x) = γ · (x − µ) / √(σ² + ε) + β (2)

where γ and β are learnable parameters that control the scaling and translation of the output features, respectively; x is a sample of the output of the convolution layer; and ε is a small value that maintains numerical stability, usually set to 10⁻⁵.
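For illustration, the BN operation of Equation (2) can be sketched in a few lines of NumPy (a toy example; the parameter values are our own, not from the paper):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Equation (2): normalize x, then scale by gamma and shift by beta
    mu, var = x.mean(), x.var()
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

x = np.random.randn(1000) * 3.0 + 7.0   # toy stand-in for conv-layer outputs
y = batch_norm(x, gamma=2.0, beta=0.5)
# y now has mean ~beta (0.5) and standard deviation ~gamma (2.0)
```

Note how the output statistics are controlled by γ and β regardless of the input's original mean and variance, which is exactly what stabilizes training.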
Batch normalization effectively reduces internal covariate shift, improves training stability, and speeds up convergence while helping to reduce the model's sensitivity to parameter initialization. In the RepVGG block, the BN layer follows the convolutional layer, a design that also improves the model's training effectiveness and inference performance.
Figure 2a below shows the structure of the model during the training process.
Upon completing RepVGG training, we employ a structural reparameterization strategy to transform the original multi-branch structure into a single, sequential convolution operation, enhancing computational efficiency during inference. First, we fuse the convolutional layer with the batch normalization (BN) layer. In Equation (2), x represents the output of the convolutional layer; substituting this output, x = X ∗ W, into Equation (2) transforms it into:

BN(X ∗ W) = X ∗ W′ + b′ (3)

Equation (3) represents a new fused convolutional layer, with the updated convolution kernel weight denoted as W′ and the bias as b′. Specifically:

W′ = (γ / √(σ² + ε)) · W,  b′ = β − γ · µ / √(σ² + ε)
For the 3 × 3 convolutional layer, fusion with the BN layer can be achieved by substituting directly into Equation (3), resulting in a new weight, W′_{3×3}, and a bias, b′_{3×3}. A 1 × 1 convolutional layer must first be transformed into a 3 × 3 convolutional layer by padding zeros around its convolution kernel to create a 3 × 3 kernel; substituting this into Equation (3) provides the weight W′_{1×1} and bias b′_{1×1} for a new 3 × 3 convolution branch. For the Identity branch, a 3 × 3 convolution kernel is established with the center position of each channel's own kernel slice set to 1 and all other positions set to 0, which completes the identity mapping. This yields the branch's new weight, W′_{id}, and new bias, b′_{id}.
Upon transforming all three branches into 3 × 3 convolutional layers, the weights and biases from the branches are summed separately, thus forming a single, fused convolutional operation.
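The fusion steps above can be sketched in NumPy as follows. The function names, shapes, and per-output-channel BN statistics are our own simplification, not the authors' code:

```python
import numpy as np

def fuse_conv_bn(W, gamma, beta, mu, var, eps=1e-5):
    # Fold BN into a bias-free conv:
    # W' = (gamma / sqrt(var + eps)) * W,  b' = beta - gamma * mu / sqrt(var + eps)
    scale = gamma / np.sqrt(var + eps)            # one scale per output channel
    return W * scale[:, None, None, None], beta - mu * scale

def pad_1x1_to_3x3(W1):
    # Zero-pad a (out, in, 1, 1) kernel to (out, in, 3, 3)
    return np.pad(W1, ((0, 0), (0, 0), (1, 1), (1, 1)))

def identity_as_3x3(channels):
    # Identity mapping as a 3x3 kernel: centre weight 1 on the matching channel
    W = np.zeros((channels, channels, 3, 3))
    W[np.arange(channels), np.arange(channels), 1, 1] = 1.0
    return W

# After fusing each branch, the reparameterized block is one 3x3 conv:
C = 8
W3 = np.random.randn(C, C, 3, 3)
W1 = np.random.randn(C, C, 1, 1)
g, b, m, v = np.ones(C), np.zeros(C), np.zeros(C), np.ones(C)  # toy BN stats
W3f, b3f = fuse_conv_bn(W3, g, b, m, v)
W1f, b1f = fuse_conv_bn(pad_1x1_to_3x3(W1), g, b, m, v)
Wid, bid = fuse_conv_bn(identity_as_3x3(C), g, b, m, v)
W_fused, b_fused = W3f + W1f + Wid, b3f + b1f + bid
```

W_fused and b_fused then define the single 3 × 3 convolution used at inference time.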
By integrating the BN layers into the convolutional weights and subsequently performing structural reparameterization, a simplified RepVGG model is obtained. This model retains the representational capacity of the original multi-branch structure while offering enhanced computational efficiency.
Figure 2b illustrates the model’s structure following structural reparameterization.
In their paper, the model authors propose an array of RepVGG networks. For our purposes, we selected the network structure.
2.2. ECA Attention Mechanism
The attention mechanism emulates human visual or cognitive focus in deep learning models, selectively emphasizing specific components of the input data. Rather than uniformly weighting all parts of the input, each part is assigned a weight reflecting the model's current attention allocation. Consequently, the model prioritizes task-relevant information, thereby enhancing classification accuracy. Widely employed attention mechanisms include Squeeze-and-Excitation (SE) [29] and the Convolutional Block Attention Module (CBAM) [30]. In addition, Efficient Channel Attention (ECA) [31], depicted in Figure 3, constitutes an effective channel attention strategy designed to augment convolutional neural networks' feature representation by capturing inter-channel dependencies.
Given a single input image X[C, H, W], where C represents the number of channels and H and W denote the feature map's height and width, respectively, the ECA attention mechanism initially conducts Global Average Pooling (GAP) to capture global contextual information for each channel. Specifically, the GAP procedure can be expressed as:

g_c = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} x_c(i, j)

In this context, x_c(i, j) represents the (i, j)-th element of channel c in the input feature map X. The GAP output g is a C-dimensional vector reflecting the average response of each channel.
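As a quick NumPy illustration of the GAP step (shapes and names are ours):

```python
import numpy as np

C, H, W = 4, 8, 8
X = np.random.rand(C, H, W)   # toy feature map

# GAP: collapse each channel's H x W responses into one scalar
g = X.mean(axis=(1, 2))       # shape (C,)
```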
Subsequently, the ECA attention mechanism captures local dependencies between channels via a one-dimensional convolution layer (1D Convolution Layer). Denoting a one-dimensional convolution operation with a convolution kernel of size k as C1D_k, the ECA attention mechanism dynamically determines the size k of the 1D convolution layer's kernel. This adaptability enables ECA to accommodate varying numbers of channels and more effectively capture local dependencies between channels. The convolution kernel size k can be computed according to the following equation:

k = ψ(C) = | log₂(C) / γ + b / γ |_odd

where |t|_odd denotes the odd number nearest to t and b is an offset hyperparameter.
Here, γ serves as a hyperparameter governing the scope of local dependencies. A smaller γ value yields a larger convolution kernel size, encompassing a broader range of channel dependencies.
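The adaptive kernel-size rule can be written directly in Python; the defaults γ = 2 and b = 1 follow the common setting in the ECA paper:

```python
import math

def eca_kernel_size(C, gamma=2, b=1):
    # Map channel count C to an odd 1-D kernel size
    t = int(abs(math.log2(C) / gamma + b / gamma))
    return t if t % 2 else t + 1   # round up to the nearest odd number
```

For example, 64 channels give k = 3 while 512 channels give k = 5, so deeper stages with more channels attend over a wider channel neighbourhood.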
The output of the one-dimensional convolutional layer can be expressed as:

z = C1D_k(g)

To facilitate adaptive learning of inter-channel correlations within the model, the output of the 1D convolutional layer undergoes a nonlinear transformation via a Sigmoid activation function σ(·), yielding a vector of attention weights:

A = σ(z) = σ(C1D_k(g))
Ultimately, the attention weight vector A is multiplied element-wise by the original input feature map X, effectuating channel attention rescaling. The recalibrated feature map X̃ is computed as follows:

X̃_c = A_c · X_c,  c = 1, 2, …, C

where X_c denotes the c-th channel of X.
The ECA attention mechanism bolsters the feature representation of convolutional neural networks by capturing inter-channel dependencies. Its merits include high computational efficiency, a low parameter count, and seamless compatibility with existing convolutional neural networks. In addition, by incorporating localized adaptive channel attention, the ECA attention mechanism enhances feature representation and elevates model performance.
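Putting the steps together, the whole ECA forward pass can be sketched in NumPy. Random weights stand in for the learned 1-D convolution, and all names are our own:

```python
import numpy as np

def eca_forward(X, k=3):
    # X: (C, H, W) feature map for a single image
    g = X.mean(axis=(1, 2))                  # GAP over each channel -> (C,)
    w = np.random.randn(k)                   # stand-in for learned 1-D conv weights
    g_pad = np.pad(g, k // 2)                # zero-pad so the output length stays C
    z = np.convolve(g_pad, w, mode='valid')  # 1-D conv across neighbouring channels
    A = 1.0 / (1.0 + np.exp(-z))             # Sigmoid -> attention weights in (0, 1)
    return A[:, None, None] * X              # channel-wise rescaling of X

X = np.random.rand(16, 8, 8)
Y = eca_forward(X, k=3)
```

Because each attention weight depends only on a k-sized channel neighbourhood, the module adds only k parameters, which is what makes ECA so lightweight compared with SE or CBAM.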
2.3. Proposed Methods
In deep learning-based image classification, recent research has demonstrated that incorporating attention mechanisms can significantly enhance a model's classification accuracy. The RepVGG model, which has already proven its robust classification performance on ImageNet, is an ideal candidate for addressing the unique challenges associated with rice pest and disease classification. These challenges include the similarity of early-stage symptoms across various pests and diseases, the infrequent appearance of certain pests and diseases, the impact of weather and lighting conditions on image quality, and the cluttered rice background that complicates feature learning. The Efficient Channel Attention (ECA) module adaptively recalibrates channel-wise feature responses and effectively addresses these challenges by explicitly modeling interdependencies between channels, thereby improving the model's generalization capabilities. Moreover, the ECA module's lightweight and efficient design enables performance enhancement without substantially increasing the model's complexity. Consequently, integrating the ECA module into the RepVGG architecture proves to be a logical and beneficial strategy for improving the classification of rice pests and diseases.
Our design combines RepVGG blocks and ECA modules to construct the overall model. As depicted in Figure 4, we integrate the ECA module into two RepVGG blocks, referred to as Block A_ECA and Block B_ECA, and incorporate the ECA module after the Head, where "Head" refers to the first layer of the RepVGG_ECA model architecture. This aims to enhance classification performance by emphasizing crucial features within the input data and guiding the model to concentrate on specific regions or channels. Furthermore, to mitigate overfitting during training, we employ the L2 regularization method. The fourth section of this paper compares our proposed approach against alternative models to validate its superiority.