1. Introduction
Citrus, the most extensively cultivated fruit tree in China, is predominantly grown in the southern regions of the country owing to its high profitability. In 2021, China's annual citrus production reached 57.3216 million tons, roughly one third of the global total output [1,2]. Owing to the excessive application of pesticides, the expansion of citrus cultivation areas, and global climate change, the incidence and spread of citrus pests and diseases are worsening, and China may face new challenges in controlling citrus pests.
Currently, traditional pest identification methods include manual identification [3] and instrument-based identification. The former exhibits strong subjectivity, low efficiency, and high labor costs [4,5,6], whereas instrument-based recognition is susceptible to external environmental factors, hardware devices, and software systems. These shortcomings may result in the delayed detection and treatment of pests, causing crop damage and reduced yields. Deep learning image recognition technology can effectively address these issues, facilitate the diagnosis and prevention of crop pests, and promote rapid advancements in agriculture [7,8,9].
Common deep learning networks can be divided into two categories: large networks (e.g., ResNet [10], VGGNet [11]) and lightweight networks (e.g., SqueezeNet [12], ShuffleNet [13], and MobileNet [14]). In research on large-scale networks, Pan Chenlu et al. [15] introduced a G-ECA Layer structure into the DenseNet201 model to enhance its feature extraction capability; the new model achieved an accuracy of 83.52% in recognizing five categories of rice disease and pest images. In addition, Cen Xiao et al. [16] integrated the SeparableConv architecture with depthwise separable convolutional layers based on the Xception network, achieving complete decoupling of spatial and cross-channel correlations. This model improved effectiveness without increasing network complexity and achieved an identification accuracy of 81.9% on four citrus diseases and pests, such as woodlice and fruit flies. Zeba et al. [17] proposed an ensemble-based model using transfer learning, experimenting with combinations of pretrained models; the ensemble comprising VGG16, VGG19, and ResNet50, followed by a voting classifier, yielded the most promising results, achieving an accuracy of approximately 82.5%. Su Hong et al. [18] used an R-CNN model with a 33-layer ResNet backbone to identify citrus huanglongbing, red spider infestation, and canker disease; despite the relatively small number of convolutional layers, the model still achieved high recognition accuracy.
The studies discussed above encountered two main challenges: model complexity and suboptimal recognition accuracy. Recognized for their minimal computational demands, robust learning capacity, modest memory usage, and adaptability, lightweight networks have become increasingly preferred for pest identification tasks. Ganyu et al. [19] enhanced EfficientNet by integrating a coordinate attention mechanism and implementing a hybrid training approach that combined data augmentation with the Adam optimization algorithm, effectively improving the model's generalization capability. Despite these improvements, the model attained a modest accuracy of 69.45% with 5.38 M parameters. Zhang Pengcheng et al. [20] augmented MobileNet V2 with the efficient channel attention (ECA) mechanism, improving the network's capability for cross-channel interaction and feature extraction; their model achieved a classification accuracy of 93.63% for citrus pests with 3.5 M parameters. Setiawan et al. [21] proposed an efficient training framework tailored to small MobileNetV2 models, leveraging dynamic learning rate adjustment, CutMix augmentation, layer freezing, and sparse regularization; integrating these techniques during training yielded an accuracy of 71.32% with 4.2 M parameters. However, models with reduced parameter counts did not achieve high accuracy. Zhou et al. [22] proposed a GE-ShuffleNet convolutional neural network model for rice disease identification; the model reaches 96.6% accuracy, but its size is 5.879 M. Liu et al. [23] proposed a training model based on bidirectional encoder representations from transformers (BERT); tested on a purpose-built dataset, the model achieved an accuracy of 0.9423. However, none of the above models combine the two key properties of a low parameter count and high accuracy.
Considering the issues highlighted in the aforementioned research, this study focused on optimizing the ShuffleNet V2 lightweight convolutional neural network model. The objective was to enhance its structure to develop a streamlined architecture capable of achieving high recognition rates for pest identification.
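The signature operation of the ShuffleNet family is the channel shuffle, which interleaves channels between groups so that grouped convolutions can exchange information. As a point of reference for the architecture optimized here, it can be sketched in pure Python on a flat channel list (an illustration only; real implementations operate on four-dimensional tensors):

```python
def channel_shuffle(x, groups):
    """Interleave channels across groups: reshape [groups, n/groups],
    transpose, and flatten, so each output group mixes all input groups."""
    n = len(x)
    assert n % groups == 0, "channel count must be divisible by groups"
    per_group = n // groups
    return [x[g * per_group + i] for i in range(per_group) for g in range(groups)]

# channels [0,1,2 | 3,4,5] in 2 groups interleave to [0, 3, 1, 4, 2, 5]
shuffled = channel_shuffle([0, 1, 2, 3, 4, 5], 2)
```

Applying the shuffle with the complementary group count (here, 3) restores the original order, which is why the operation is cheap and information-preserving.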
3. Results and Discussion
3.1. Experimental Setup and Parameters
The performance metrics evaluated in this study for pest identification included accuracy, precision, recall, and the F1 score. Furthermore, the complexity of the model was assessed based on the number of model parameters and the volume of floating-point operations.
Accuracy refers to the proportion of correctly predicted samples among all the samples, as shown in Equation (4):
Accuracy = (TP + TN) / (TP + TN + FP + FN) (4)
Precision refers to the proportion of samples predicted as positive that are actually positive, as shown in Equation (5):
Precision = TP / (TP + FP) (5)
Recall refers to the proportion of actual positive samples that were correctly predicted as positive, as indicated by Equation (6):
Recall = TP / (TP + FN) (6)
The F1 score is the harmonic mean of precision and recall, as shown in Equation (7):
F1 = 2 × Precision × Recall / (Precision + Recall) (7)
where TP represents the number of actual positive samples that were correctly predicted as positive, TN represents the number of actual negative samples correctly predicted as negative, FP represents the number of actual negative samples incorrectly predicted as positive, and FN represents the number of actual positive samples incorrectly predicted as negative.
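The four metrics of Equations (4)–(7) follow directly from these counts; a minimal sketch (the counts in the usage line are hypothetical, not figures from the paper's experiments):

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from raw confusion counts,
    per Equations (4)-(7)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# hypothetical counts for illustration
acc, prec, rec, f1 = classification_metrics(tp=90, tn=85, fp=10, fn=15)
```

For multi-class problems such as the pest dataset here, these quantities are typically computed per class and then averaged.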
3.2. Replacing the ReLU Activation Function
In this study, the ReLU activation function utilized in the basic unit, downsampling unit, Conv1 structure layer, and Conv5 structure layer was substituted with the PReLU activation function. The experimental findings indicated that the PReLU activation function outperformed the ReLU function in our network model.
Figure 9 illustrates the superior performance of the PReLU function in terms of accuracy, demonstrating a consistent and rapid improvement. This suggested that the PReLU function effectively mitigated the issues related to nonupdatable weights caused by inputs in the hard saturation zone, thereby more efficiently preventing “neuron death”. The analysis of the results presented in Table 2 suggested that the PReLU activation function achieved higher accuracy, precision, and recall rates of 92.42%, 92.32%, and 92.25%, respectively. These rates were 1.06%, 0.95%, and 1.11% higher than those achieved by the ReLU activation function, respectively. These findings further validated the suitability of the PReLU activation function in our model.
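The element-wise difference between the two activation functions can be sketched as follows. The slope a is fixed at 0.25 here for illustration (PReLU's common initial value); in the network it is a learnable parameter, which is what allows gradients to keep flowing through negative inputs:

```python
def relu(x):
    """ReLU zeroes all negative inputs, so their gradients vanish."""
    return max(0.0, x)

def prelu(x, a=0.25):
    """PReLU passes positives unchanged but scales negatives by a small
    slope a instead of zeroing them, avoiding 'dead' neurons."""
    return x if x > 0 else a * x

# identical for positive inputs, different for negative ones
pos = (relu(3.0), prelu(3.0))    # both 3.0
neg = (relu(-2.0), prelu(-2.0))  # 0.0 vs. -0.5
```
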
3.3. Impact of the Attention Mechanism on Model Performance
To assess the effect of integrating the BCBAM attention mechanism on the performance of our model, we conducted comparative experiments under identical conditions. We added the SE, CBAM, and BCBAM attention mechanisms to the original ShuffleNet V2 model, labeled Schemes 1, 2, and 3, respectively, as shown in Table 3. The results indicated that Scheme 3, incorporating the BCBAM attention mechanism, achieved superior accuracy, precision, recall, and F1 scores compared with Schemes 1 and 2. Notably, all three schemes had an equal number of parameters. Although the floating-point operations for Schemes 2 and 3 were marginally higher than those for Scheme 1, Scheme 3 demonstrated the best performance across the evaluated metrics, suggesting a slight superiority of the BCBAM attention mechanism over the SE and CBAM attention mechanisms in our model.
3.4. Ablation Study
To evaluate the impact of each enhancement on the experimental outcomes, we conducted an ablation study, the details of which are presented in Table 4. In this table, the symbol “✓” indicates the inclusion of an improvement to the ShuffleNet V2 network, whereas “✗” denotes its absence. The integration of the BCBAM attention mechanism into the original ShuffleNet V2 model increased accuracy by 2.41% and the F1 score by 2.59%. However, this integration also increased the number of parameters, floating-point operations, and model size by 0.2 × 10^6, 1.48 × 10^6, and 0.81 MB, respectively, thereby increasing the complexity of the network. The application of transfer learning did not change the number of parameters, floating-point operations, or model size but improved accuracy by 0.78% and the F1 score by 0.87%. Substituting the ReLU activation function with the PReLU function in the original ShuffleNet V2 model increased accuracy by 1.06% and the F1 score by 1.04%; although the number of parameters and model size remained unchanged, floating-point operations increased by 2 × 10^6. When the BCBAM attention mechanism, transfer learning, and the PReLU activation function were combined with the original ShuffleNet V2 model, accuracy improved by 2.48% and the F1 score by 2.6%. These enhancements, however, also increased the model complexity. Following the architectural adjustments, there was a notable decrease in the number of parameters, floating-point operations, and model size.
The SCHNet recognition model proposed in this study achieved an accuracy of 94.48% and an F1 score of 94.38%, increases of 3.12% and 3.13%, respectively, over the original ShuffleNet V2 model. The model comprised 1.97 × 10^6 parameters, executed 121.11 × 10^6 floating-point operations, and had a size of 3.84 MB. Compared with the original ShuffleNet V2 model, the parameters decreased by 0.31 × 10^6, the floating-point operations by 31.6 × 10^6, and the model size by 1.13 MB, corresponding to reductions of 13.6%, 20.7%, and 22.7%, respectively. In summary, the introduction of the BCBAM attention mechanism notably enhanced model performance, and the architectural adjustments effectively reduced the structural complexity without compromising accuracy. Its high precision and lightweight design make the SCHNet recognition model feasible in practice (annotation: TL: transfer learning; AAd: architectural adjustment; Acc: accuracy; Par: parameters; MS: model size; ShuNet: ShuffleNet V2).
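The reported percentage reductions follow from the absolute figures; in this sketch the baseline values are reconstructed by adding the reported decreases back onto the SCHNet figures (e.g., 1.97 + 0.31 = 2.28 × 10^6 parameters for the original model):

```python
def pct_reduction(before, after):
    """Percentage reduction from `before` to `after`, rounded to one decimal."""
    return round((before - after) / before * 100, 1)

# baselines reconstructed from the reported SCHNet values plus the decreases
params_drop = pct_reduction(1.97 + 0.31, 1.97)        # parameters, x10^6
flops_drop = pct_reduction(121.11 + 31.6, 121.11)     # FLOPs, x10^6
size_drop = pct_reduction(3.84 + 1.13, 3.84)          # model size, MB
# -> 13.6, 20.7, 22.7, matching the reductions reported in the text
```
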
3.5. Heatmaps
To provide a concrete representation of the focal points in our model, we employed Grad-CAM [31] to visualize class activation maps for pest detection, and compared these visualizations with those generated by the original model. The heatmaps in Figure 10 suggest that the ShuffleNet V2 model tends to either shift its focus areas away from the pests or spread attention over an overly wide area. In contrast, the SCHNet model introduced in this study shows a more concentrated focus on the pests, with fewer notable positional shifts.
3.6. Comparative Experiments with Different Network Models
This study comprehensively assessed the efficacy of various models in identifying citrus pests and diseases, selecting high-performance networks, including AlexNet, ResNet50, and EfficientNet_b2, from a pool of eight models for detailed comparison. The experimental results are outlined in Table 5. Although the SqueezeNet1_0 model had fewer parameters than the SCHNet model, its recognition performance was markedly worse: SqueezeNet1_0 recorded an accuracy of 79.97%, a precision of 80.35%, a recall of 78.94%, and an F1 score of 79.64%, figures that are 14.51%, 14.05%, 15.41%, and 14.74% lower, respectively, than those of SCHNet. The other models likewise lagged behind SCHNet on these critical performance metrics. These findings underscore the dual advantages of the SCHNet model in enhancing performance while managing network complexity, establishing its superiority in pest identification tasks.
As can be seen from Table 6, the large networks cited have higher parameter counts, with a maximum accuracy of 95.3%, which is 0.82% higher than that of the SCHNet model; however, their parameter counts are too high for deployment on mobile devices. Among the cited lightweight networks, the model of Zhang Pengcheng et al. [20] has relatively high accuracy and a relatively low parameter count; even so, compared with the model in this article, its accuracy is 0.85% lower and its parameter count is 1.53 M higher. This further illustrates that the SCHNet model is superior to the cited network models in terms of both accuracy and parameter count.
Moreover, to illustrate the classification performance of the SCHNet model, this study employed a visualization of the confusion matrix to demonstrate its ability to distinguish the various types of citrus pest images. Each column of the confusion matrix represents a predicted category, and each row corresponds to an actual category. Examination of Figure 11 reveals the proficiency of the SCHNet model in extracting features from each type of pest image and its efficient classification capability.
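A confusion matrix with the row/column convention just described can be built as follows (the toy labels are hypothetical, purely for illustration; correct predictions accumulate on the diagonal):

```python
def confusion_matrix(actual, predicted, num_classes):
    """Build a matrix where rows are actual classes and columns are
    predicted classes, matching the layout described for Figure 11."""
    m = [[0] * num_classes for _ in range(num_classes)]
    for a, p in zip(actual, predicted):
        m[a][p] += 1
    return m

# toy 3-class example (hypothetical labels, not the paper's data)
cm = confusion_matrix([0, 0, 1, 2, 2, 2], [0, 1, 1, 2, 2, 0], 3)
# overall accuracy is the diagonal sum over the total count
acc = sum(cm[i][i] for i in range(3)) / 6
```
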
4. Conclusions
Deep learning has been widely applied in the field of agriculture. Particularly in the cultivation of citrus, its impact is notable for tasks such as detecting crop diseases, insect pests, and the level of fruit maturity. This study involves adjustments and optimizations to the ShuffleNet network architecture to more effectively identify and classify insect pests during the citrus-growing process. The dataset was also expanded to enhance the model’s ability to recognize insect pests under different environmental conditions. Several key improvements were made to the original ShuffleNet network, including changing the activation functions, incorporating an enhanced attention mechanism module, and adjusting the network structure. These improvements significantly increased the accuracy of the model and achieved a more lightweight design. Moreover, by changing the training strategy through transfer learning, not only was the training time reduced but the costs were also lowered. Experiments with comparison tests, ablation studies, and heatmap validations have proven that these improvements indeed enhance the performance of the network. Ultimately, the improved ShuffleNet network model in this paper has a better overall performance with a recognition accuracy of 94.48% and a model size of 3.84 MB, which can be used as a reference for citrus pest recognition and classification techniques.
While prioritizing lightweight computations, the SCHNet pest identification model achieved a high identification rate. The subsequent phase will focus on deploying the model within a WeChat mini-program or mobile app.