1. Introduction
Forest pests pose a significant threat to both forest ecosystems and timber resources, making timely and precise pest detection and management essential for the preservation of forest health [1]. However, conventional methods for pest detection suffer from limited accuracy and slow detection speeds, restricting their practical applicability in the face of ever-evolving threats.
In recent years, numerous scholars have conducted methodological research to achieve efficient and precise detection of forestry pests, yielding promising results. Target detection based on deep learning has emerged as a prominent research focus in computer vision, demonstrating robust capabilities in automatically identifying the positions and categories of objects within images or videos, and it has been empirically validated as an effective approach for forestry pest detection.
Target detection networks are generally categorized into two main types: one-stage and two-stage networks. One-stage target detection networks integrate object localization and classification into a single network to enhance detection speed; representative models include YOLO (You Only Look Once) [2,3,4,5,6,7,8,9] and SSD (Single Shot MultiBox Detector) [10,11,12]. Two-stage target detection networks, on the other hand, first generate candidate regions and then perform classification and localization on these regions; representative models include Fast R-CNN and Faster R-CNN [13,14,15]. The advancement of these deep learning target detection networks has elevated detection accuracy and efficiency, rendering them powerful tools widely applied across multiple domains. Pest detection requires rapid and effective monitoring of forests within a short timeframe so that corresponding measures can be implemented in time. Moreover, a one-stage object detector lends itself to model lightweighting: reducing the model's parameter count and complexity diminishes storage and loading costs, and simplifying the model structure decreases computational resource requirements, thereby enhancing overall efficiency. Therefore, this study opts for the one-stage object detector YOLO for pest detection.
However, despite the enormous potential of deep learning in pest recognition, its practical application still faces several challenges. For instance, deep learning models require a large amount of annotated data, which are often difficult to acquire in the field of pest recognition. Moreover, the computational cost of deep learning models is high, which may limit their application in resource-constrained regions. Therefore, future research needs to further explore how to improve the efficiency and accuracy of deep learning in pest recognition while reducing its computational cost and data requirements.
For the detection of forest pests, Sun Haiyan et al. [16] proposed a forestry pest detection method based on an attention model and lightweight YOLOv4. They achieved the detection of seven types of forest pests by improving the network structure, optimizing the loss function, and introducing an attention mechanism. However, there is significant variation in accuracy among different pest classes, and the overall average precision requires further enhancement. Hou Ruihuan et al. [17] presented a real-time forestry pest detection method based on YOLOv4-TIA. By incorporating a three-branch attention mechanism [18], they improved the backbone network of YOLOv4 and optimized the loss function, enabling the detection of seven types of forest pests. Nevertheless, this model exhibits increased complexity, slower detection speed, and an average precision of only 85.9%.
In response to the aforementioned issues, this paper introduces a lightweight forestry pest image recognition model based on an improved YOLOv8. This model not only enhances the performance of small object detection, but also ensures minimal resource consumption. The main contributions of this paper are as follows:
- (1) Integrating the GSConv module [19] and employing the Slim-Neck design philosophy to refine the Neck layer of YOLOv8n, thereby achieving a lightweight architecture. This optimization reduces the network's parameter count, resulting in enhanced detection speed.
- (2) Incorporating the attention module CBAM [20] into the backbone network to augment the network's focus on small objects. This enhancement significantly improves detection accuracy without introducing a substantial increase in computational complexity.
- (3) Incorporating WIoU v3 [21] into the bounding box regression loss function and implementing a dynamic non-monotonic mechanism to devise a more judicious strategy for gradient gain allocation. WIoU v3 effectively mitigates gradient gain discrepancies between high-quality and low-quality samples, thereby fortifying the model's localization proficiency and generalization capabilities.
3. Methods
Currently, the YOLOv8 algorithm has achieved significant success in the field of object detection. However, its application to forestry pest detection is challenged by the increased computational cost and reduced detection speed that result from its large number of parameters. In practical applications, rapid model deployment is crucial to meet the real-time requirements of forestry pest detection, and lightweight models generally offer higher inference speeds; we therefore lighten the Neck network in YOLOv8 to reduce computational costs and accelerate detection while maintaining robust feature representation capabilities. Forestry pest detection also features a widespread presence of small targets (such as the acuminatus, coleoptera, armandi, and linnaeus pests in the dataset), for which YOLOv8 exhibits poor detection performance, with low detection rates and increased false positives. To address this, we introduce attention modules that focus the model on key information in the input features, enhancing its attention to small targets and thus improving small-target detection performance. Simultaneously, we explore the design principles of Wise-IoU (WIoU). WIoU v3 employs a dynamic non-monotonic mechanism to evaluate anchor box quality, directing the model's attention toward anchor boxes of ordinary quality and thereby improving its localization capability. Because the high proportion of small targets in forestry pest detection increases detection difficulty, WIoU v3 further enhances detection performance by dynamically optimizing the loss weights for small targets. The specific details of these optimization strategies are outlined below:
Firstly, in terms of the Neck network, we employed the Slim-Neck design approach. We replaced traditional convolution operations with lightweight GSConv convolutions and introduced the VoV-GSCSP module, which incorporates lightweight bottleneck layers (GS bottleneck), to replace the original C2f module. The purpose of this adjustment is to reduce computational costs and accelerate the model's detection speed without compromising feature representation capability, which is crucial to meeting the real-time requirements of forestry pest detection. These modifications yield a lighter model that is easier to deploy while maintaining the integrity of feature representation.
Additionally, in the backbone network, we introduced the CBAM (Convolutional Block Attention Module) attention mechanism, which combines channel attention and spatial attention. CBAM dynamically adjusts the weights of feature maps, directing the network's focus toward regions containing small targets. By increasing the network's perceptual range, this mechanism aids in capturing a more extensive context; this is particularly valuable when parts of the target are obscured, as a larger perceptual range enables a better understanding of the contextual information surrounding the target. Overall, incorporating CBAM promises to improve small-target detection performance in forestry pest detection by enhancing the network's attention to crucial target information, thereby improving adaptability to complex scenes and small targets.
Finally, we adopted WIoU v3 as a replacement for the original CIoU bounding box regression loss in YOLOv8. WIoU v3 integrates a dynamic non-monotonic mechanism and introduces a gradient gain allocation strategy to mitigate large or harmful gradients arising from extreme samples. This version of WIoU places greater emphasis on samples of ordinary quality, enhancing the model's generalization capability and overall performance, and it dynamically adjusts the loss weights for small targets, further improving detection performance. In summary, WIoU adapts flexibly to targets of different sizes, shapes, and qualities, enhancing the accuracy, robustness, and generalization of detection and demonstrating its effectiveness in addressing the challenges posed by diverse pest targets in forestry environments.
3.1. GSConv and VoV-GSCSP Modules
As the practical applications of deep learning models continue to expand, there is an urgent demand for algorithm lightweighting, driven primarily by the resource-constrained, computationally limited scenarios prevalent in forestry pest detection. Despite YOLO's outstanding performance in object detection tasks, its relatively large model size hampers its efficiency on the lightweight devices commonly used in this setting. Optimizing YOLO by reducing model size and enhancing computational efficiency is therefore essential: lightweighting not only enables real-time detection on embedded devices but also reduces cost, enhancing the practical usability of the system. In this process, model accuracy, speed, and power consumption must be weighed holistically to achieve a balance across diverse scenarios; employing more efficient network architectures, streamlining parameters, and optimizing for specific hardware platforms are pivotal strategies. Through these efforts, pest detection technology becomes better adapted to the intricate and dynamic natural environment, providing more reliable and efficient support for forest conservation.
In the realm of lightweight models such as Xception [27], MobileNets [28], and ShuffleNets [29], the use of Depthwise Separable Convolution (DSC) operations has significantly improved detector speed, but these models suffer from accuracy loss. To address the accuracy loss associated with DSC, Li Hulin proposed the GSConv lightweight convolution module, whose main structure is depicted in Figure 2. Through a shuffle operation, GSConv fuses the information generated by a traditional (dense) convolution with that generated by Depthwise Separable Convolution. With an input channel number of C1 and an output channel number of C2, the following steps are taken: first, a standard convolution maps the input to C2/2 channels; then, Depthwise Separable Convolution is applied with the channel number unchanged. The result of the first convolution is then concatenated with the output of the Depthwise Separable Convolution and shuffled. In the final shuffle operation, channel information is uniformly interleaved to ensure the effective retention of multi-channel information. This process enhances the extraction of semantic information and strengthens the fusion of feature information, thereby improving the expressive capability of image features; the shuffle achieves an orderly fusion of information, providing an effective mechanism for enhancing model performance.
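To make the data flow concrete, the following is a minimal PyTorch sketch of a GSConv block matching the description above. It reflects our reading of [19] and Figure 2 rather than the authors' exact implementation; choices such as the 5 × 5 depthwise kernel and the SiLU activation are assumptions.

```python
import torch
import torch.nn as nn

class GSConv(nn.Module):
    """Sketch of GSConv: a dense convolution to C2/2 channels, a depthwise
    convolution on that result, concatenation, then a channel shuffle."""
    def __init__(self, c1, c2, k=1, s=1):
        super().__init__()
        c_ = c2 // 2
        # standard (dense) convolution: C1 -> C2/2
        self.cv1 = nn.Sequential(
            nn.Conv2d(c1, c_, k, s, k // 2, bias=False),
            nn.BatchNorm2d(c_), nn.SiLU())
        # depthwise convolution: C2/2 -> C2/2 (kernel size assumed)
        self.cv2 = nn.Sequential(
            nn.Conv2d(c_, c_, 5, 1, 2, groups=c_, bias=False),
            nn.BatchNorm2d(c_), nn.SiLU())

    def forward(self, x):
        x1 = self.cv1(x)
        x2 = torch.cat((x1, self.cv2(x1)), dim=1)  # C2 channels total
        # channel shuffle: uniformly interleave the dense and depthwise halves
        b, c, h, w = x2.shape
        return x2.view(b, 2, c // 2, h, w).transpose(1, 2).reshape(b, c, h, w)
```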
GSConv's computational cost is approximately 50% of that of standard convolution (SC), while its contribution to the model's learning capability is comparable. Building upon GSConv, a GS bottleneck module is designed in the literature; Figure 3a illustrates its structure. The VoV-GSCSP module is then crafted through a one-shot aggregation method. Figure 3b–d depicts three design options provided for VoV-GSCSP, where (b) is structurally simple with faster inference speed, and (c) and (d) exhibit higher feature reuse rates. In practice, the structurally simpler module is more convenient to employ due to its hardware-friendly nature.
Therefore, in optimizing the Neck network layer, this study employs a design approach based on Slim-Neck. Standard convolutions are substituted with GSConv lightweight convolutions, and the original C2f module is replaced with the VoV-GSCSP module, which incorporates the lightweight bottleneck layer (GS bottleneck). This implementation achieves a lightweight Neck layer, leading to a significant reduction in computational cost and, consequently, an acceleration in inference speed. The application of the Slim-Neck design to the YOLOv8 model structure is depicted in Figure 4.
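As a companion sketch under the same assumptions, the GS bottleneck and VoV-GSCSP variant (b) might be composed from GSConv as follows; the 1 × 1 shortcut and the channel split are our reading of Figure 3, not the authors' exact design.

```python
import torch
import torch.nn as nn

class GSBottleneck(nn.Module):
    """GS bottleneck (Figure 3a): two stacked GSConv layers plus a 1x1 shortcut."""
    def __init__(self, c1, c2):
        super().__init__()
        c_ = c2 // 2
        self.branch = nn.Sequential(
            GSConv(c1, c_, k=1, s=1),  # GSConv as defined in the previous sketch
            GSConv(c_, c2, k=3, s=1))
        self.shortcut = nn.Conv2d(c1, c2, 1, 1, bias=False)

    def forward(self, x):
        return self.branch(x) + self.shortcut(x)

class VoVGSCSP(nn.Module):
    """VoV-GSCSP variant (b) (Figure 3b): split into two 1x1 branches, pass one
    through n GS bottlenecks, then aggregate both branches in one shot."""
    def __init__(self, c1, c2, n=1):
        super().__init__()
        c_ = c2 // 2
        self.cv1 = nn.Conv2d(c1, c_, 1, 1, bias=False)
        self.cv2 = nn.Conv2d(c1, c_, 1, 1, bias=False)
        self.m = nn.Sequential(*(GSBottleneck(c_, c_) for _ in range(n)))
        self.cv3 = nn.Conv2d(2 * c_, c2, 1, 1, bias=False)

    def forward(self, x):
        return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))
```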
3.2. CBAM Attention Module
Given that the background occupies a substantial portion of the images in the utilized dataset, and the predominant targets for detection are small-sized pests, the detection performance of the YOLOv8n algorithm hinges predominantly on the efficiency of the backbone network. To enhance the backbone network's capacity for extracting critical information, we introduce a Convolutional Block Attention Module (CBAM) into the YOLOv8 backbone, as depicted in Figure 5. The CBAM module comprises a Channel Attention Module (CAM) and a Spatial Attention Module (SAM), dedicated to extracting channel and spatial attention, respectively. Through adaptive feature refinement facilitated by these two attention mechanisms, the model identifies attention regions within densely populated pest scenarios, improving its ability to discern salient features against prevalent background clutter and for small-sized targets.
Given an input feature map $F \in \mathbb{R}^{C \times H \times W}$, where $C$, $H$, and $W$ represent the channel number, height, and width of the feature map, respectively, the Channel Attention Module (CAM) first produces the channel attention map $M_c(F) \in \mathbb{R}^{C \times 1 \times 1}$, as shown in Equation (1):

$$M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big) \quad (1)$$

$M_c(F)$ is then multiplied element-wise with the feature map $F$, resulting in the channel-refined feature $F'$. Following the Spatial Attention Module (SAM), the spatial attention map $M_s(F') \in \mathbb{R}^{1 \times H \times W}$ is obtained, as expressed in Equation (2):

$$M_s(F') = \sigma\big(f^{7 \times 7}([\mathrm{AvgPool}(F');\, \mathrm{MaxPool}(F')])\big) \quad (2)$$

where $\sigma$ denotes the sigmoid function and $f^{7 \times 7}$ a convolution with a $7 \times 7$ kernel. $M_s(F')$ is subsequently multiplied with the feature $F'$, yielding the refined output feature $F'' = M_s(F') \otimes F'$.
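A minimal PyTorch sketch of CBAM implementing Equations (1) and (2) is shown below; the reduction ratio r = 16 and the 7 × 7 kernel follow the common CBAM defaults and are assumptions here.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """CAM (Eq. (1)): avg- and max-pool over space, shared MLP, sum, sigmoid."""
    def __init__(self, c, r=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(c, c // r, 1, bias=False), nn.ReLU(),
            nn.Conv2d(c // r, c, 1, bias=False))

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        return torch.sigmoid(avg + mx)          # (B, C, 1, 1)

class SpatialAttention(nn.Module):
    """SAM (Eq. (2)): pool over channels, concatenate, 7x7 conv, sigmoid."""
    def __init__(self, k=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, k, padding=k // 2, bias=False)

    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)
        mx = torch.amax(x, dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat((avg, mx), dim=1)))  # (B, 1, H, W)

class CBAM(nn.Module):
    """Sequential refinement: F' = Mc(F) * F, then F'' = Ms(F') * F'."""
    def __init__(self, c, r=16, k=7):
        super().__init__()
        self.ca, self.sa = ChannelAttention(c, r), SpatialAttention(k)

    def forward(self, x):
        x = self.ca(x) * x
        return self.sa(x) * x
```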
In this investigation, the CBAM attention module is introduced following the C2f module in the YOLOv8 backbone network, incurring negligible computational overhead. This enhancement provides the deep network with more accurate feature information, contributing to the reduction in loss values and ensuring precise identification and localization of small targets in forestry pest detection. Intuitively, during forward propagation, crucial channels and spatial locations in the feature map receive greater emphasis; this refinement is evident in the final output, effectively accentuating regions of interest for the detection model and enhancing its ability to discern target objects accurately. The improved approach thus elevates the model's performance in object detection tasks and its capacity to handle intricate image scenarios.
3.3. Improved Loss Function
In the task of detecting small objects in forestry pest images, where the proportion of small objects is relatively high, a rationally designed loss function can significantly enhance the detection performance of the model. The loss function in YOLOv8 consists of multiple components: the classification loss (VFL Loss) and the regression loss in the form of CIoU Loss + DFL. The VFL Loss function [30] is given by Equation (3):

$$\mathrm{VFL}(p, q) = \begin{cases} -q\big(q \log p + (1 - q)\log(1 - p)\big), & q > 0 \\ -\alpha p^{\gamma}\log(1 - p), & q = 0 \end{cases} \quad (3)$$

In this formula, $q$ represents the Intersection over Union (IoU) between the predicted box and the ground-truth box, calculated by dividing the intersection of the two boxes by their union, and $p$ represents the predicted score (probability). If the two boxes intersect ($q > 0$), the sample is treated as positive; if there is no intersection, $q$ is set to 0, indicating a negative sample.
The definition of CIoU, as given in Equation (4), incorporates an additional penalty term on top of DIoU:

$$\mathcal{L}_{CIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v \quad (4)$$

Here, $\alpha$ serves as a weight function, defined in Equation (5):

$$\alpha = \frac{v}{(1 - IoU) + v} \quad (5)$$

and $v$ gauges the similarity in the aspect ratio between the predicted box and the ground truth, as outlined in Equation (6):

$$v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2 \quad (6)$$

The terms $w^{gt}/h^{gt}$ and $w/h$ denote the aspect ratios of the ground-truth box and the predicted box. $b$ and $b^{gt}$ represent the center points of the predicted box and the ground-truth box, respectively, $\rho$ denotes the Euclidean distance between the two center points, and $c$ signifies the diagonal length of the smallest enclosing region covering the two boxes.
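For concreteness, a minimal PyTorch sketch of the CIoU loss of Equations (4)–(6) is given below; detaching the weight $\alpha$ from the gradient follows the original CIoU formulation, while the (x1, y1, x2, y2) box format is an assumption.

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """Per-box CIoU loss (Eqs. (4)-(6)) for boxes in (x1, y1, x2, y2) format."""
    # intersection and union -> IoU
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    w1, h1 = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    w2, h2 = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    iou = inter / (w1 * h1 + w2 * h2 - inter + eps)
    # squared center distance rho^2 and enclosing-box diagonal c^2
    cx1, cy1 = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cx2, cy2 = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    rho2 = (cx1 - cx2) ** 2 + (cy1 - cy2) ** 2
    c_wh = torch.max(pred[:, 2:], target[:, 2:]) - torch.min(pred[:, :2], target[:, :2])
    c2 = (c_wh ** 2).sum(dim=1) + eps
    # aspect-ratio consistency v (Eq. (6)) and weight alpha (Eq. (5))
    v = (4 / math.pi ** 2) * (torch.atan(w2 / (h2 + eps)) - torch.atan(w1 / (h1 + eps))) ** 2
    with torch.no_grad():  # alpha is treated as a constant during backprop
        alpha = v / ((1 - iou) + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v  # reduce with .mean() in training
```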
DFL (Distribution Focal Loss) extends the focal loss idea to bounding box regression: it models each box boundary as a discrete distribution and focuses learning on the values nearest the continuous target location, providing improved handling of ambiguous boundaries. The formula for DFL is expressed by Equation (7):

$$\mathrm{DFL}(S_i, S_{i+1}) = -\big((y_{i+1} - y)\log S_i + (y - y_i)\log S_{i+1}\big) \quad (7)$$

where $y$ is the continuous regression target bracketed by the neighboring discrete values $y_i \le y \le y_{i+1}$, and $S_i$ and $S_{i+1}$ are the predicted probabilities of those two values.
However, CIoU has its drawbacks. Firstly, its computation involves inverse trigonometric functions, which increases the computational cost of the model, particularly in large-scale object detection tasks. Secondly, CIoU does not account for the balance between hard and easy samples. Thirdly, CIoU uses the aspect ratio as a penalty term: when the ground-truth box and the predicted box share the same aspect ratio but differ in width and height, the penalty term fails to reflect the true difference between the two bounding boxes.
Therefore, this paper introduces Wise-IoU (WIoU). In terms of computational speed, the additional cost of WIoU lies mainly in calculating the focusing coefficient and the mean statistics of the IoU loss. Under the same experimental conditions, WIoU is faster than CIoU because it involves no aspect ratio calculation, with WIoU's computation time being 87.2% of CIoU's. In terms of performance, WIoU considers not only the overlap area and center-point distance but also introduces a dynamic non-monotonic focusing mechanism; when the annotation quality of the dataset is poor, WIoU performs better than other bounding box losses. The weight calculation of WIoU better reflects differences in the appearance and structure of targets, providing better target distinctiveness, which is advantageous when dealing with targets that have similar features. Specific information about WIoU is as follows:
- (1) Wise-IoU v1: As it is challenging to avoid including low-quality examples in the training data, geometric metrics such as distance and aspect ratio exacerbate the penalty on low-quality examples, leading to a decrease in the model's generalization performance. A good loss function should weaken the penalty on geometric metrics when the anchor box and target box overlap well, intervening in training as little as possible to enhance the model's generalization ability. In WIoU v1, distance attention is constructed based on the distance metric. WIoU v1 is defined by Formulas (8) and (9):

$$\mathcal{L}_{WIoUv1} = \mathcal{R}_{WIoU}\,\mathcal{L}_{IoU} \quad (8)$$

$$\mathcal{R}_{WIoU} = \exp\!\left(\frac{(x - x_{gt})^2 + (y - y_{gt})^2}{\left(W_g^2 + H_g^2\right)^{*}}\right) \quad (9)$$

Here, $(x, y)$ and $(x_{gt}, y_{gt})$ are the center points of the anchor box and the target box, $W_g$ and $H_g$ are the width and height of the smallest enclosing box, and the superscript $*$ indicates detachment from the computational graph. $\mathcal{R}_{WIoU} \in [1, e)$ significantly amplifies the $\mathcal{L}_{IoU}$ of ordinary-quality anchor boxes, while $\mathcal{L}_{IoU} \in [0, 1]$ markedly reduces the $\mathcal{R}_{WIoU}$ of high-quality anchor boxes, notably decreasing their attention to the center-point distance in cases where the anchor box overlaps well with the target box.
- (2) Wise-IoU v2: Focal loss introduces a monotonic focusing mechanism tailored for cross-entropy, effectively reducing the contribution of easy examples to the loss value. This allows the model to focus on challenging examples, enhancing classification performance. Similarly, WIoU v2 constructs a monotonic focusing coefficient $\left(\mathcal{L}_{IoU}^{*}\right)^{\gamma}$ for $\mathcal{L}_{WIoUv1}$, as defined in Formula (10):

$$\mathcal{L}_{WIoUv2} = \left(\mathcal{L}_{IoU}^{*}\right)^{\gamma}\mathcal{L}_{WIoUv1}, \quad \gamma > 0 \quad (10)$$

During training, the gradient gain $\left(\mathcal{L}_{IoU}^{*}\right)^{\gamma}$ decreases as $\mathcal{L}_{IoU}$ decreases, resulting in a slow convergence speed in the later stages of training. Therefore, the mean $\overline{\mathcal{L}_{IoU}}$ is introduced as a normalization factor, as shown in Formula (11):

$$\mathcal{L}_{WIoUv2} = \left(\frac{\mathcal{L}_{IoU}^{*}}{\overline{\mathcal{L}_{IoU}}}\right)^{\gamma}\mathcal{L}_{WIoUv1} \quad (11)$$

Here, $\overline{\mathcal{L}_{IoU}}$ is the moving average with momentum $m$; dynamically updating this normalization factor keeps the overall gradient gain at a high level, addressing the issue of slow convergence in the later stages of training.
- (3) Wise-IoU v3: The concept of outlierness $\beta$ is introduced to characterize the quality of anchor boxes, defined in Equation (12):

$$\beta = \frac{\mathcal{L}_{IoU}^{*}}{\overline{\mathcal{L}_{IoU}}} \in [0, +\infty) \quad (12)$$

Building upon Wise-IoU v1, Wise-IoU v3 introduces a non-monotonic focusing coefficient $r$ based on $\beta$, defined in Equation (13):

$$\mathcal{L}_{WIoUv3} = r\,\mathcal{L}_{WIoUv1}, \quad r = \frac{\beta}{\delta\,\alpha^{\beta - \delta}} \quad (13)$$

A smaller outlierness implies a higher-quality anchor box, which is assigned a smaller gradient gain, allowing bounding box regression to focus on anchor boxes of ordinary quality. Anchor boxes with larger outlierness are likewise assigned a smaller gradient gain, effectively preventing harmful gradients from arising from low-quality examples. When $\beta = \delta$, $r = 1$. When the outlierness of an anchor box satisfies $\beta = C$ (C is a constant value), the anchor box receives the highest gradient gain. Because $\overline{\mathcal{L}_{IoU}}$ is dynamic, the quality criterion for anchor boxes is also dynamic, enabling Wise-IoU v3 to allocate gradient gains according to the current situation at any given moment.
Through the comparative analysis above, this study achieved a significant performance improvement by replacing the traditional CIoU with Wise-IoU v3 in YOLOv8. Wise-IoU v3 uses a dynamic non-monotonic mechanism to evaluate anchor box quality, making the model focus more on anchor boxes of ordinary quality and thus improving its object localization capability. In forestry pest detection, where small targets account for a high proportion and increase detection difficulty, Wise-IoU v3 dynamically optimizes the loss weights for small targets, enhancing the model's detection performance.
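To tie Equations (8)–(13) together, the following is a minimal PyTorch sketch of a WIoU v3 loss under our reading of [21]; the hyper-parameter values (alpha = 1.9, delta = 3) and the convention that the caller maintains the running mean of the IoU loss are assumptions, not the exact implementation used in this paper.

```python
import torch

def wiou_v3_loss(pred, target, iou_mean, alpha=1.9, delta=3.0, eps=1e-7):
    """Sketch of WIoU v3 (Eqs. (8)-(13)) for boxes in (x1, y1, x2, y2) format.
    `iou_mean` is the running mean of L_IoU, maintained by the caller."""
    # IoU and its loss L_IoU = 1 - IoU
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)
    l_iou = 1.0 - iou

    # distance attention R_WIoU (Eq. (9)): squared center distance over the
    # squared diagonal of the smallest enclosing box, the latter detached (*)
    px, py = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    tx, ty = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    r_wiou = torch.exp(((px - tx) ** 2 + (py - ty) ** 2) /
                       (cw ** 2 + ch ** 2 + eps).detach())

    # outlierness beta (Eq. (12)) and non-monotonic gradient gain r (Eq. (13))
    beta = l_iou.detach() / iou_mean
    r = beta / (delta * alpha ** (beta - delta))
    loss = (r * r_wiou * l_iou).mean()
    # the caller updates the running mean, e.g. with momentum m:
    # iou_mean = (1 - m) * iou_mean + m * float(l_iou.detach().mean())
    return loss
```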