1. Introduction
As the global population continues to grow rapidly, ensuring a sufficient and varied food supply is imperative. Agricultural pests present a significant threat to the yield and quality of crops, making accurate pest recognition vital for effective crop protection and yield enhancement [1,2]. In the realm of precision agriculture, the precise recognition of pest species is fundamental to devising optimal pest management strategies [3]. Traditionally, pest recognition has been manual, dependent heavily on the expertise of the observer, and often lacking consistency in accuracy and efficiency when dealing with large pest populations [4]. Advances in traditional machine learning have greatly enhanced automated pest-recognition technologies by introducing feature extraction and model training techniques that are essential for precise and efficient pest recognition [5,6,7]. However, these methods are vulnerable to variations in image quality, target posture, and environmental conditions such as lighting and background complexity, reducing their effectiveness in diverse and complex scenarios. Moreover, they typically rely on predefined feature extraction techniques that struggle to adapt to the variability of pests in natural settings, posing a significant challenge to achieving the accuracy needed for practical applications [8]. Existing applications of these methods have shown limited success, often achieving high accuracy in controlled environments but suffering high error rates under diverse and complex conditions [9,10,11].
To address challenges in pest recognition, such as detecting tiny pests and managing intra-class variation, researchers have utilized deep learning to develop advanced networks. Among these, residual networks have proven effective, capturing complex features and delivering impressive results. Cheng et al. [2] introduced deep residual learning to reduce training degradation in complex agricultural scenarios. Thenmozhi et al. [12] utilized transfer learning to increase the precision of pre-trained deep learning models. Kong et al. [13] implemented a spatial feature-enhanced attention module that improves the model's capacity to discern semantic relationships between different image regions; combined with an innovative higher-order pooling module, this led to improved recognition accuracy across pest categories. Additionally, Wei et al. [14] incorporated a multi-scale feature extraction (MFE) module into their neural network, capturing detailed pest features at various scales and enhancing the network's capacity to differentiate between diverse pest species. Despite these advances, challenges such as insufficient model generalization and difficulty in capturing fine-grained features persist. To this end, we develop an adaptive multi-task learning approach that exploits task synergy and dynamically adjusts learning, enhancing the robustness and accuracy of pest recognition.
Motivated by the above challenges, we propose a multi-task learning framework consisting of a main task and a subsidiary task. The main task employs the discriminative attention multi-network (DAM-Net), which uses a multi-scale learning approach and attention mechanism to identify fine-grained features. The subsidiary task utilizes the residual network-50 (ResNet-50), a deep network with residual connections [15], to supplement texture details and global contextual information. This framework learns features in a multi-dimensional manner, overcoming the limitations of single-task systems and promoting robust feature learning. Furthermore, we implement an adaptive weighted loss mechanism that optimally prioritizes tasks within the framework, mitigating overfitting and enhancing accuracy in pest recognition. All necessary details on the proposed multi-task learning framework are provided in the Methodology section.
As illustrated in Figure 1, our framework processes the input dataset simultaneously through two task pathways. The results from these two tasks are then combined to collaboratively identify pests. Specifically, the DAM-Net has three branches that learn global features, attention-enhanced precise features, and the most discriminative regions. Concurrently, the ResNet-50 supports the main task by providing texture details and global contextual information. The framework enhances its discriminative accuracy by adaptively adjusting the loss weights for each task based on its uncertainty level, prioritizing the task with higher confidence. We evaluate our framework using the D0 dataset, which comprises around 4500 images across 40 pest categories, and the IP102 dataset, which contains over 75,000 images spanning 102 categories. The framework achieves enhanced discernment and increased robustness by integrating the DAM-Net's precise attention to detail with the ResNet-50's broad feature extraction capabilities. The major contributions of our research are as follows:
- (1)
We present the DAM-Net with three branches to enhance the extraction of complex and fine-grained pest features. We also incorporate a subsidiary task based on the ResNet-50, enriching coarse-grained and global contextual information, improving the generalizability and robustness of pest recognition;
- (2)
We enhance both learning efficiency and recognition accuracy through a shared representation. Task loss weights, determined by homoscedastic uncertainty, are automatically learned from the data. This approach handles weight initialization reliably and balances task significance;
- (3)
An enhanced pest-recognition framework based on multi-task learning achieves accuracies of 99.7% on the D0 dataset and 74.1% on the IP102 dataset.
The remainder of this study is organized as follows: Section 2 reviews the related work; Section 3 presents our multi-task framework; Section 4 details the experimental results and corresponding analysis; Section 5 concludes the study.
2. Related Work
Early pest recognition relied on manual identification by experts, which was labor-intensive, time-consuming, and prone to inaccuracies due to individual and environmental factors [16]. Advances in computer vision have significantly improved efficiency and accuracy in digital and precision agriculture. These methods are categorized into traditional machine learning [10,17,18,19] and deep learning [2,12,13,20].
Traditional machine learning approaches typically involve extracting features using techniques like the histogram of oriented gradients (HOG) or scale-invariant feature transform (SIFT), followed by classification algorithms such as support vector machines (SVM) or artificial neural networks (ANN). For instance, Xie et al. [17] combined sparse-coding histograms with multiple kernel learning (MKL) for feature fusion. Dimililer et al. [18] used backpropagation neural networks with geometric feature extraction. Xie et al. [19] developed a dictionary-based approach for robust pest recognition. Kasinathan et al. [10] employed multiple feature extraction techniques with a majority-voting strategy to enhance classification precision. Despite their effectiveness, traditional methods depend on expert knowledge and manually designed features, which may not capture all variations in pest appearances, leading to unstable outcomes and reduced accuracy.
Deep learning methods, particularly those using CNNs, automatically extract robust features, eliminating the need for manual design and achieving higher recognition efficiency and accuracy [21,22]. Cheng et al. [2] used the ResNet-101 architecture, surpassing traditional methods. Thenmozhi et al. [12] applied transfer learning to improve accuracy and reduce training time. Li et al. [23] fine-tuned GoogLeNet, achieving high accuracy but at a significant computational cost. Bollis et al. [24] proposed weakly supervised learning with multi-instance learning (MIL) for better classification. Kong et al. [13] refined a CSP-based network for precise feature extraction, and Wei et al. [14] introduced innovative modules for multi-scale and deep feature extraction. Chen et al. [20] used an effective feature localization module (EFLM) for pest location identification and an adaptive filtering fusion module (AFFM) to refine features, with soft voting (SV) for final categorization. However, their method relies heavily on diverse pest images to enrich feature input, making it challenging to extract fine-grained and global context information effectively in real-world applications.
Despite the accuracy and flexibility of deep learning in pest recognition, real-world environmental noise remains a challenge due to factors such as varying lighting conditions, occlusion by foliage, and the presence of similar non-target objects [25,26]. Applying data augmentation techniques and optimization algorithms can, however, enhance model accuracy and generalization. For instance, Nanni et al. [27] used saliency methods for data augmentation, achieving high accuracies on pest datasets. Cabrera et al. [28] researched generative models for synthesizing pest images. Nanni et al. [29] combined CNN architectures and introduced the Adam variants Exp and ExpLR, enhancing pest recognition. However, this method is complex and lacks fine-grained feature extraction, making it difficult to handle the high intra-class variability of pests.
While previous methods have enhanced pest recognition efficiency and accuracy, they struggle to capture fine-grained details and diverse contextual features, reducing accuracy in recognizing different life stages and distinguishing similar species. Therefore, we develop a multi-task learning framework to address these challenges.
3. Methodology
This section details our pest-recognition framework. The DAM-Net, serving as the main task model, excels in fine-grained feature recognition with input images at 448 pixels, learning global features, attention-enhanced features, and discriminative regions. Complementing this, the ResNet-50, as the subsidiary task model, operates on 224-pixel images to extract coarse-grained and global contextual information. By leveraging the diverse capabilities of both models, our framework achieves feature complementarity, enhancing robustness under shared feature extraction. Additionally, it dynamically adjusts loss weights based on task uncertainty, optimizing the integration of fine-grained details and global contextual information. The performance enhancement is demonstrated in the ablation study results shown in Section 4.6.
3.1. Discriminative Attention Multi-Network (DAM-Net)
Our main task is executed using the DAM-Net, designed for fine-grained feature recognition of pests.
Figure 2 outlines the composition of the DAM-Net, which includes a base branch, a target branch, and a discriminative branch. The base branch captures global features, providing a holistic image representation. The object focus location module (OFLM) synthesizes feature maps from the base branch to determine the target object's bounding box, enabling cropping of the base image. This cropped image is processed by the target branch to extract structural and fine-grained features. Complementing this, the specific part identification module (SPIM) isolates and crops the most discriminative and least redundant parts of the object. These regions are fed into the discriminative branch, which delves into intricate details across various scales. Equations (1)–(3) give the loss functions for the three branches, respectively:

$$L_{\text{base}} = -\log P_b(c), \quad (1)$$

$$L_{\text{target}} = -\log P_t(c), \quad (2)$$

$$L_{\text{disc}} = -\sum_{n=1}^{N} \log P_d^{(n)}(c). \quad (3)$$

In this model, $c$ represents the actual category of the input image. The probabilities $P_b(c)$ and $P_t(c)$ are the softmax outputs from the base and target branches, respectively. Meanwhile, $P_d^{(n)}(c)$ is the softmax output for the $n$-th image in the discriminative branch, with $N$ being the total number of such discriminative images. The total loss is computed as

$$L_{\text{total}} = L_{\text{base}} + L_{\text{target}} + L_{\text{disc}}. \quad (4)$$
The total loss combines the losses from the three branches, utilizing shared convolutional and classification layers to ensure unified feature extraction across the network. During backpropagation, this cumulative loss function adjusts the entire network by considering the losses from each branch, optimizing both global and local feature detection. To enhance efficiency in practical applications, we exclude the discriminative branch during testing, reducing complexity and accelerating predictions. The final results are achieved by merging the logits from the base and target branches.
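For concreteness, the following is a minimal PyTorch sketch of how the branch losses and the combined objective in Equations (1)–(4) could be computed; the function name and tensor shapes are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def dam_net_loss(logits_base, logits_target, logits_disc, labels):
    # logits_base, logits_target: (B, K) branch logits over K pest classes
    # logits_disc: (B, N, K) logits for the N discriminative crops per image
    # labels: (B,) ground-truth category indices
    loss_base = F.cross_entropy(logits_base, labels)      # Eq. (1)
    loss_target = F.cross_entropy(logits_target, labels)  # Eq. (2)
    b, n, k = logits_disc.shape
    # Eq. (3): every discriminative crop shares the image-level label
    loss_disc = F.cross_entropy(
        logits_disc.reshape(b * n, k),
        labels.repeat_interleave(n),
    )
    return loss_base + loss_target + loss_disc            # Eq. (4)

# At test time the discriminative branch is skipped; a prediction can be
# obtained by merging the remaining logits, e.g.:
# pred = (logits_base + logits_target).argmax(dim=1)
```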
3.1.1. Object Focus Location Module (OFLM)
The OFLM is designed to precisely delineate the bounding boxes of target pests by focusing on salient features. The feature maps from the final convolutional layer for an input image $X$ are represented as $F \in \mathbb{R}^{C \times H \times W}$, where $C$ is the number of channels and $H \times W$ is the spatial size. The $i$-th feature map, corresponding to the $i$-th channel, is denoted as $F_i \in \mathbb{R}^{H \times W}$. By aggregating these feature maps $F$, we obtain an activation map $A$, as defined in Equation (5):

$$A = \sum_{i=1}^{C} F_i, \quad (5)$$
where the activation map $A$ indicates the focus areas of deep neural networks for recognition and accurately locates target regions, as illustrated in Figure 3. The mean value of $A$, denoted as $\bar{a}$, is calculated using Equation (6):

$$\bar{a} = \frac{1}{H \times W} \sum_{x=1}^{H} \sum_{y=1}^{W} A(x, y), \quad (6)$$

where $A(x, y)$ represents the pixel intensity value at position $(x, y)$ of the activation map. $\bar{a}$ is used as the threshold to generate a coarse mask map $M_1$ from the final convolutional layer according to Equation (7):

$$M_1(x, y) = \begin{cases} 1, & A(x, y) > \bar{a} \\ 0, & \text{otherwise.} \end{cases} \quad (7)$$
A bounding box that encloses the largest connected region is used to locate the target. Using the same method, we compute an additional activation map from the preceding convolutional block and a corresponding mask $M_2$. We then refine the target mask $M$ by intersecting $M_1$ and $M_2$, as indicated in Equation (8):

$$M = M_1 \cap M_2. \quad (8)$$

A threshold is applied to the final target mask to highlight the most significant regions of interest within the image. A binary mask $M_b$ is computed using a predefined threshold value $\theta$, such that

$$M_b(x, y) = \begin{cases} 1, & M(x, y) \ge \theta \\ 0, & \text{otherwise,} \end{cases} \quad (9)$$

where $\theta$ is set to 0.5 for this analysis. The final target mask $M_f$ is then obtained using

$$M_f = M \odot M_b. \quad (10)$$
This method treats pixels in the activation map with intensity values above a threshold as significant regions. As a result, the final mask effectively distinguishes between the foreground, which includes features of interest like pests, and the background.
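A compact sketch of these masking steps is given below, assuming both feature maps are upsampled to a common resolution; the upsampling step and the 448-pixel target size are our assumptions (upsampled binary masks take fractional values, which is what makes the $\theta = 0.5$ threshold meaningful). The bounding box of the largest connected region of the returned mask (e.g., via `scipy.ndimage.label`) then provides the crop for the target branch.

```python
import torch
import torch.nn.functional as F

def oflm_mask(feats_last, feats_prev, theta=0.5, size=(448, 448)):
    # feats_last: (C, H, W) feature maps from the final convolutional layer
    # feats_prev: (C', H', W') feature maps from the preceding convolutional block
    def coarse_mask(feats):
        a = feats.sum(dim=0, keepdim=True)   # Eq. (5): channel-wise aggregation
        a_bar = a.mean()                     # Eq. (6): mean activation threshold
        m = (a > a_bar).float()              # Eq. (7): coarse binary mask
        return F.interpolate(m.unsqueeze(0), size=size,
                             mode="bilinear", align_corners=False)[0, 0]

    m = coarse_mask(feats_last) * coarse_mask(feats_prev)  # Eq. (8): intersection
    m_b = (m >= theta).float()                             # Eq. (9): theta = 0.5
    return m * m_b                                         # Eq. (10): final mask
```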
3.1.2. Specific Part Identification Module (SPIM)
To overcome the OFLM's tendency to detect incomplete sections of a target, we enhance our model's robustness by introducing the SPIM and a discriminative branch. Analysis of the activation map $A$ reveals that regions with higher activation values are often associated with crucial parts of the pest, such as its head or tail. To capture these informative areas, we employ a sliding-window technique that isolates windows containing critical information as discriminative images. This process is refined by adopting a fully convolutional computation, reducing effort in a manner similar to how OverFeat [30] processes feature maps. Subsequently, we perform channel-wise aggregation of the activation maps $A_w$ for each window and determine the mean activation value $\bar{a}_w$, following Equation (11):

$$\bar{a}_w = \frac{1}{H_w \times W_w} \sum_{x=1}^{H_w} \sum_{y=1}^{W_w} A_w(x, y), \quad (11)$$

where $H_w$ and $W_w$ represent the height and width of the feature map corresponding to a particular window. We rank the windows by their $\bar{a}_w$ values, which denote the informativeness of the regions they encompass; a higher $\bar{a}_w$ value signifies greater significance, as shown in Figure 4. To select the most representative windows while avoiding redundancy, we employ non-maximum suppression (NMS). This technique allows us to choose a diverse set of windows, varying in scale, as discriminative images, ensuring more robust and comprehensive localization of target parts.
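The window scoring and selection can be sketched as follows; the box format, the top-$k$ count, and the IoU threshold are illustrative assumptions.

```python
import torch
from torchvision.ops import nms

def spim_select(activation, windows, k=4, iou_thresh=0.25):
    # activation: (H, W) channel-aggregated activation map
    # windows: (N, 4) candidate boxes as (x1, y1, x2, y2) on the same grid
    scores = torch.stack([
        activation[int(y1):int(y2), int(x1):int(x2)].mean()  # Eq. (11): a_w
        for x1, y1, x2, y2 in windows.tolist()
    ])
    keep = nms(windows.float(), scores, iou_thresh)  # suppress redundant overlaps
    return windows[keep[:k]]                         # most informative windows
```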
3.2. Residual Network-50 (ResNet-50)
The ResNet-50, a deep 50-layer CNN, leverages residual learning to simplify the training of deeper networks [15]. Figure 5 illustrates a residual block with skip connections that facilitate uninterrupted information flow during forward propagation. This architecture focuses on learning the residual $F(x) = H(x) - x$ between the input $x$ and the desired output $H(x)$, promoting identity mapping and enabling some layers to be bypassed. During backpropagation, skip connections provide a pathway for gradients to flow directly through the network, mitigating the vanishing-gradient issue. This architecture allows for the construction of deeper networks to learn more complex features, significantly enhancing deep learning capabilities.
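As an illustration, a simplified bottleneck block in PyTorch is shown below, for the identity-dimension case only; the real ResNet-50 also uses strided and projection variants.

```python
import torch.nn as nn

class Bottleneck(nn.Module):
    """Residual block: the layers learn F(x) = H(x) - x, and the skip
    connection restores the output H(x) = F(x) + x."""
    def __init__(self, channels):
        super().__init__()
        mid = channels // 4
        self.body = nn.Sequential(
            nn.Conv2d(channels, mid, 1, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)  # skip connection bypasses the block
```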
Our ResNet-50 model acts as a subsidiary task, enriching the DAM-Net through a shared representation in a multi-task learning framework. It complements the framework’s coarse-grained and global contextual information, enhancing the generalization and robustness of the main task, and ultimately increasing overall recognition accuracy.
3.3. Adaptive Weighted Loss Mechanism
In our multi-task framework for pest recognition, we use the uncertainty of each task as the basis for weighting the various loss functions. Specifically, we utilize homoscedastic uncertainty, a form of aleatoric uncertainty that stays constant for a specific task, regardless of input variations. This allows us to measure a task’s inherent noise level or reliability. Our framework utilizes an adaptive weighted loss mechanism, adjusting the contribution of each loss based on its associated homoscedastic uncertainty. This prioritizes tasks with higher confidence, enhancing the overall performance of the framework.
In our pest-recognition task, the neural network's output for an input $x$ is denoted by $f^W(x)$, where $W$ represents the network's weights. Then, the likelihood of recognition is adjusted by introducing a noise parameter $\sigma$. This allows for a calibrated confidence in the output, accounting for data uncertainty:

$$p(y \mid f^W(x), \sigma) = \mathrm{Softmax}\left(\frac{1}{\sigma^2} f^W(x)\right). \quad (12)$$

In addition, the loss function is defined as the negative logarithm of the conditional probability of the actual class $c$. This definition aligns with the principles of maximum-likelihood estimation, as illustrated in the following equation:

$$L(W, \sigma) = -\log p(y = c \mid f^W(x), \sigma). \quad (13)$$

Within this model, the log likelihood can be formulated as

$$\log p(y = c \mid f^W(x), \sigma) = \frac{1}{\sigma^2} f_c^W(x) - \log \sum_{c'} \exp\left(\frac{1}{\sigma^2} f_{c'}^W(x)\right), \quad (14)$$

where $f_c^W(x)$ is the $c$-th element of the output vector $f^W(x)$. Finally, by defining $L(W) = -\log \mathrm{Softmax}(c, f^W(x))$ as the standard cross-entropy loss of the unscaled output, we can derive the following formula:

$$L(W, \sigma) \approx \frac{1}{\sigma^2} L(W) + \log \sigma. \quad (15)$$

This introduces a simplification for computational tractability: the softmax normalization constant is approximated as $\frac{1}{\sigma^2} \sum_{c'} \exp\left(\frac{1}{\sigma^2} f_{c'}^W(x)\right) \approx \left(\sum_{c'} \exp\left(f_{c'}^W(x)\right)\right)^{1/\sigma^2}$, which becomes an equality as $\sigma \to 1$. In the multi-task pest-recognition framework, we can then derive the following loss function:

$$L_{\text{total}}(W, \sigma_1, \sigma_2) = \frac{1}{\sigma_1^2} L_1(W) + \frac{1}{\sigma_2^2} L_2(W) + \log \sigma_1 + \log \sigma_2. \quad (16)$$
Our objective is to minimize a composite loss function comprising two subtask losses: $L_1(W)$ for the DAM-Net and $L_2(W)$ for the ResNet-50. Each loss is modulated by a noise parameter $\sigma_i$, which adjusts the contribution of its task based on data reliability. Specifically, an increase in the noise parameter $\sigma_1$ reduces the influence of $L_1(W)$, thus decreasing reliance on potentially unreliable data; conversely, a decrease in noise increases the weight of the corresponding loss, prioritizing more reliable data. The last term in the function, $\log \sigma_1 + \log \sigma_2$, regularizes the noise parameters, preventing them from becoming excessively large and causing the framework to ignore task-specific data.
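A minimal sketch of this mechanism is given below, learning $s_i = \log \sigma_i^2$ for numerical stability; this parameterization is a standard trick and our assumption, not necessarily the authors' exact implementation.

```python
import torch
import torch.nn as nn

class AdaptiveWeightedLoss(nn.Module):
    def __init__(self, num_tasks=2):
        super().__init__()
        # s_i = log(sigma_i^2), initialized to 0 (i.e., sigma_i = 1)
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, losses):
        total = 0.0
        for loss, s in zip(losses, self.log_vars):
            # (1 / sigma_i^2) * L_i + log(sigma_i), with log(sigma_i) = s / 2
            total = total + torch.exp(-s) * loss + 0.5 * s
        return total

# usage sketch:
# criterion = AdaptiveWeightedLoss(num_tasks=2)
# loss = criterion([loss_dam_net, loss_resnet50])  # Eq. (16)
# loss.backward()  # updates both networks and the noise parameters
```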
4. Experiment
4.1. Experimental Setup
Our experiments ran on a platform with an Intel Xeon 8255C CPU (12 vCPUs at 2.50 GHz, Intel Corporation, Santa Clara, CA, USA), an NVIDIA Tesla V100-SXM2 GPU (32 GB VRAM, NVIDIA Corporation, Santa Clara, CA, USA), running Ubuntu 18.04. We used CUDA 11.3 and PyTorch 1.12.1 to develop and test our framework.
4.2. Datasets
In our deep learning experiments, we employ the D0 dataset [19], which comprises around 4500 images across 40 distinct pest categories derived from various field crops, including corn, soybean, wheat, and canola, in authentic agricultural settings. The D0 dataset is available at https://www.dlearningapp.com/web/DLFautoinsects.htm (accessed on 24 June 2024). Moreover, for a comprehensive evaluation of the framework's robustness and generalizability, we utilize a second dataset, IP102 [31], which contains 75,222 images across 102 classes. This dataset presents considerable challenges due to its large intra-class variation and small inter-class variation, encompassing various life stages and exhibiting a pronounced class imbalance, with image counts per class ranging from 71 to 5740. A notable hurdle is the visual similarity between pests and their backgrounds, which necessitates sophisticated feature extraction for accurate recognition. Figure 6 displays a selection of pest images from IP102, showcasing the visual challenges posed by the varied categories. The IP102 dataset, featuring pest species from diverse crops under varied agricultural conditions, facilitates the evaluation of the framework's adaptability across different contexts. These two benchmark datasets are detailed in Table 1.
4.3. Experimental Settings
In our multi-task framework, we propose the DAM-Net as the main task model and the ResNet-50 as the subsidiary model. During training, images for the DAM-Net are resized to 480 pixels and randomly cropped to 448 pixels, while images for the ResNet-50 are resized to 256 pixels and cropped to 224 pixels. For testing, center cropping is used for both tasks. We intentionally avoid additional data augmentation like rotation and color jittering to focus on assessing the framework’s inherent performance.
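The preprocessing described above corresponds to torchvision pipelines along the following lines; normalization statistics are omitted, and the exact transform choices are our assumptions.

```python
from torchvision import transforms

# Main task (DAM-Net): resize to 480 pixels, random-crop to 448 for training.
train_tf_dam = transforms.Compose([
    transforms.Resize(480),
    transforms.RandomCrop(448),
    transforms.ToTensor(),
])

# Subsidiary task (ResNet-50): resize to 256 pixels, random-crop to 224.
train_tf_res = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(224),
    transforms.ToTensor(),
])

# Testing replaces the random crops with deterministic center crops.
test_tf_dam = transforms.Compose([
    transforms.Resize(480),
    transforms.CenterCrop(448),
    transforms.ToTensor(),
])
```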
Optimizing hyperparameters is essential for deep learning classification tasks. For the DAM-Net, we employ an SGD optimizer with an initial learning rate of 0.001, reduced by a factor of 0.1 at the 15th and 30th epochs through a step scheduler. For the ResNet-50, we utilize an Adam optimizer with an initial learning rate of 0.0001, decayed each epoch via an exponential scheduler. Both tasks incorporate L2 regularization to prevent overfitting. Training is halted by an early-stopping mechanism if validation accuracy does not improve for 25 consecutive epochs, optimizing both generalization and resource efficiency. Detailed hyperparameter settings, carefully calibrated to enhance pest-recognition accuracy, are provided in Table 2.
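This configuration can be reproduced roughly as sketched below; the weight-decay coefficient and exponential decay factor are placeholders, as the exact values appear in Table 2, and the two `nn.Linear` modules merely stand in for the actual networks.

```python
import torch
import torch.nn as nn

dam_net, resnet50 = nn.Linear(8, 2), nn.Linear(8, 2)  # stand-ins for the two models

# DAM-Net: SGD, lr = 0.001, decayed by 0.1 at epochs 15 and 30.
opt_dam = torch.optim.SGD(dam_net.parameters(), lr=1e-3, weight_decay=5e-4)
sched_dam = torch.optim.lr_scheduler.MultiStepLR(opt_dam, milestones=[15, 30], gamma=0.1)

# ResNet-50: Adam, lr = 0.0001, exponential decay applied every epoch.
opt_res = torch.optim.Adam(resnet50.parameters(), lr=1e-4, weight_decay=5e-4)
sched_res = torch.optim.lr_scheduler.ExponentialLR(opt_res, gamma=0.95)

# Early stopping: halt if validation accuracy stalls for 25 consecutive epochs.
best_acc, wait, patience = 0.0, 0, 25
for epoch in range(100):
    # ... train one epoch, then evaluate val_acc on the validation split ...
    val_acc = 0.0  # placeholder for the measured validation accuracy
    sched_dam.step(); sched_res.step()
    if val_acc > best_acc:
        best_acc, wait = val_acc, 0
    else:
        wait += 1
        if wait >= patience:
            break
```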
4.4. Evaluation Metrics
To comprehensively assess our framework's performance, particularly in addressing class imbalances in the D0 and IP102 datasets, we adopt the same evaluation metrics used in existing research [29,32,33,34]. Specifically, we utilize five reliable evaluation metrics: accuracy (Acc), macro-average precision (MPre), macro-average recall (MRec), macro-average F1-score (MF1), and geometric mean (GM). The Acc provides an overall performance measure, while the macro-average metrics offer insights into class-wise performance, and the GM balances sensitivity and specificity. These metrics ensure a robust and comparable evaluation of our framework's effectiveness.
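These metrics can be computed with scikit-learn as sketched below; note that GM is taken here as the geometric mean of per-class recalls, one common definition, which is our assumption.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate(y_true, y_pred):
    per_class_rec = recall_score(y_true, y_pred, average=None, zero_division=0)
    return {
        "Acc":  accuracy_score(y_true, y_pred),
        "MPre": precision_score(y_true, y_pred, average="macro", zero_division=0),
        "MRec": recall_score(y_true, y_pred, average="macro", zero_division=0),
        "MF1":  f1_score(y_true, y_pred, average="macro", zero_division=0),
        # geometric mean of per-class recalls (clipped to avoid log(0))
        "GM":   float(np.exp(np.log(np.clip(per_class_rec, 1e-12, None)).mean())),
    }
```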
4.5. Experimental Results and Analysis
To evaluate our proposed method, we conduct comparative experiments against established benchmarks on the D0 and IP102 datasets. Table 3 presents the comparison results. For the D0 dataset, the multi-level classification framework with unsupervised learning and a multiple kernel boosting algorithm [19] achieved the lowest Acc at 89.3%. Enhanced pre-trained models using transfer learning [12] reached an Acc of 96.0%. Furthermore, a two-stream network combining ConvNeXt and Swin Transformer models [32] achieved an Acc of 98.5%. The GAEnsemble method [33], which uses a genetic algorithm-based weighted voting mechanism, achieved an Acc of 98.8%; however, it struggles to capture fine-grained pest features because it lacks detailed feature extraction. In contrast, our framework achieves an Acc of 99.7% with a shorter inference time of 74 ms versus GAEnsemble's 106 ms, demonstrating superior efficiency and performance.
Regarding the IP102 dataset, the ResNet-50 model [31] had the lowest performance. The DMF-ResNet model [35], which integrated multi-scale representation learning from three branches and an SFR module to refine channel-wise feature responses, achieved an Acc of 59.1%. A feature fusion network synthesizing CNN and Vision Transformer (ViT) models [36] achieved an Acc of 65.6%, and an enhanced pre-trained Inception-v3 model [37] achieved an Acc of 67.9%. Additionally, the saliency-guided discriminative learning network (SGDL-Net) [38], utilizing a multi-task learning framework with a dual-branch architecture for global and fine-grained feature extraction, achieved an Acc of 71.2% and an MF1 of 60.6%; our multi-task framework, integrating the DAM-Net for the main task and the ResNet-50 for the subsidiary task, outperforms the SGDL-Net with Acc and MF1 improvements of 2.9% and 2.0%, respectively. A model using a genetic algorithm-based hyperparameter optimization strategy [34] achieved an Acc of 71.8%. An ensemble method [29] combining multiple CNN architectures and variant Adam optimization algorithms achieved an Acc of 73.6%. A model [20] using an EFLM and an AFFM coupled with soft voting achieved an MF1 of 73.6% and an Acc of 73.9%, lower than our 74.1%; despite its attention mechanisms, it struggles with diverse contextual features and varying pest appearances. In contrast, our framework uses multi-task learning and an adaptive weighted loss mechanism, capturing a broader range of features and adapting better to real-world conditions for superior performance.
Our multi-task framework outperformed previous methods in several key metrics. On the D0 dataset, our method achieved the highest performance across all metrics, highlighting its effectiveness in capturing both fine-grained details and diverse contextual information. On the IP102 dataset, our framework achieved an Acc of 74.1% and an MRec of 69.9%, surpassing existing methods. The inference time was 74 ms, further demonstrating the framework’s efficiency.
To further assess our framework's effectiveness, Figure 7 presents a visual analysis of the accuracy and loss curves on the D0 and IP102 datasets. This comparison provides a comprehensive overview of the performance across these distinct datasets. During training, our framework shows a consistent improvement in accuracy for both datasets. Initially, both training and validation losses decrease, indicating effective learning. Over time, while the validation loss exhibits a minor increase, the validation accuracy continues to rise slightly. This suggests that the framework maintains its generalization ability.
4.6. Ablation Study
In this section, we perform ablation studies to analyze the effects of different components in our proposed framework for pest recognition.
To assess the performance of the DAM-Net when isolated from the ResNet-50 model, we conduct experiments on the D0 and IP102 datasets. The DAM-Net achieves an Acc of 99.6%, an MF1 score of 99.5%, and a GM of 99.5% on the D0 dataset. On the IP102 dataset, the results are an Acc of 71.4%, an MF1 of 63.7%, and a GM of 58.0%. These results underscore the discriminative power and robustness of the DAM-Net, confirming its suitability as the main task. The DAM-Net excels in learning global, attention-enhanced, and distinctive regional features, facilitating the recognition of subtle details and discriminative features for classification.
We also train the ResNet-50 independently to evaluate its standalone performance. On the D0 dataset, the ResNet-50 achieves an Acc of 99.3%, an MF1 of 99.1%, and a GM of 99.1%. On the IP102 dataset, it achieves an Acc of 70.9%, an MF1 of 63.5%, and a GM of 59.5%. These results highlight the competent classification performance of the ResNet-50. By providing extensive global contextual information and an enriched feature representation, the ResNet-50 complements the main task, enhancing the overall performance within the multi-task learning framework.
In our multi-task learning framework, we integrate the DAM-Net as the main task and the ResNet-50 as the subsidiary task, using an adaptive weighted loss mechanism. As shown in Table 4, this configuration improves performance compared to each model operating independently. On the D0 dataset, Acc increases to 99.7%, MF1 to 99.6%, and GM to 99.5%. On the IP102 dataset, Acc improves to 74.1%, MF1 to 65.9%, and GM to 59.5%. These enhancements confirm that combining the DAM-Net and the ResNet-50 leverages complementary features, boosting classification performance and enhancing generalization through task knowledge sharing. The adaptive weighted loss mechanism adjusts the importance of each task based on uncertainties. This approach benefits complex real-world scenarios by focusing on the most confident and informative features and reducing the impact of noisy data. It optimizes the integration of fine-grained details and global contextual information, thereby enhancing overall performance.
The ablation study validates the critical role of each component within our multi-task learning framework and demonstrates the superiority of the multi-task learning approach over single-task models. The results emphasize the pivotal role of the adaptive weighted loss mechanism in achieving further enhancements in multi-task recognition efforts.
4.7. Qualitative Analysis
Our framework demonstrates superior performance on the D0 dataset but faces challenges on the IP102 dataset due to the varied appearances of pests at different growth stages. This variability complicates classification, as species within the same genus can exhibit significant differences across stages, and unrelated species can appear similar at specific stages. To evaluate the framework's robustness, Figure 8 displays instances of misclassification, highlighting difficulties in pest recognition during the egg, larval, and pupal stages. These stages often lack distinct morphological features, leading to a homogenized appearance that confuses the framework. In contrast, adult pests develop unique features that aid correct recognition. However, the framework's performance is also impacted by issues such as mislabeled categories and absent targets in the IP102 dataset images [20,39], as shown in Figure 6c.
To further illustrate the strengths and limitations of our framework, Table 5 compares our multi-task framework with several models on the IP102 dataset. MobileNetV3, ideal for resource-constrained environments due to its small parameter count and low FLOPs, performs poorly in terms of Acc and MF1. ConvNeXt base outperforms standard CNNs like GoogLeNet and ResNet-50, showing superior overall performance. While the Swin Transformer base achieves high performance, its complex hierarchical architecture entails 88 million parameters and 15.7 GFLOPs, and it lacks specialized fine-grained feature recognition to handle variations in pest appearances effectively. In contrast, our multi-task learning framework integrates the DAM-Net and the ResNet-50 with an adaptive weighted loss mechanism, enhancing feature extraction and reducing overfitting. This approach achieves a superior accuracy of 74.1% with only 49 million parameters and 20.3 GFLOPs, demonstrating the effectiveness and efficiency of our framework.
Our method incurs higher FLOPs because it processes 448 × 448 inputs, compared with the 224 × 224 inputs used by other methods. This increased computational load, however, allows the framework to capture finer details and richer contextual information, enhancing pest-recognition accuracy. Our framework is designed for cloud server deployment, where computational resources are less constrained; although such deployment introduces network transmission delays, it also raises the tolerance for computational demands. Thus, this study focuses on achieving superior pest-recognition performance rather than minimizing computational cost.
5. Conclusion and Future Work
We develop a multi-task framework for enhanced pest recognition, leveraging the strengths of the DAM-Net and the ResNet-50 for the main and subsidiary tasks, respectively. We confirm that the DAM-Net effectively identifies fine-grained features, with the ResNet-50 enhancing this by enriching texture details and global context, thus improving the framework's generalizability. The framework incorporates an adaptive weighted loss mechanism to dynamically adjust loss weights based on task uncertainty, aiming to improve overall accuracy. Tests on the D0 and IP102 datasets demonstrate the robustness of our method, which reaches accuracies of 99.7% and 74.1%, respectively, outperforming existing methods. Despite the framework's efficacy in pest recognition, it shows limitations when dealing with imbalanced datasets. Addressing this requires assembling a comprehensive collection of high-quality images of insects at various stages of development. Future work will leverage diffusion models to augment scarce categories by synthesizing high-fidelity images, enhancing the framework's generalizability across diverse pest scenarios and improving its performance in pest control and crop protection amid data scarcity.