1. Introduction
Pests, including caterpillars, threaten crop yields, increase production costs, and damage the environment, ultimately impacting food security and agricultural sustainability. Understanding the economic implications and implementing effective pest control strategies are crucial for mitigating these challenges. Studies such as those by Savary et al. [1] highlight the global economic burden of pests and underscore the urgent need for effective management. Modern pest control has evolved to incorporate various innovative strategies, with integrated pest management (IPM) gaining prominence, as discussed by Auclair et al. [2]. Additionally, McCullough [3] illustrates the importance of targeted approaches, such as those employed against the emerald ash borer, to effectively address specific pest threats.
In recent years, advances in deep learning and computer vision have driven transformative changes across many sectors, particularly precision agriculture, in areas such as automated disease detection and pest identification and control. Andrew et al. [4] and Mohanty et al. [5] used publicly available datasets such as PlantVillage to study the feasibility of deep learning methods for plant disease identification, achieving high accuracy and precision in identifying leaf diseases. Ramcharan et al. [6] utilized a dataset from Tanzania and employed deep learning methods to identify five leaf diseases of the cassava plant with a correct detection rate of at least 95%. In another example, Caldeira et al. [7] employed GoogleNet and ResNet50 models to identify cotton leaf lesions, achieving a detection accuracy above 85%. Another critical challenge in orchard management is the effective detection of pests, especially caterpillars, which can cause significant crop damage. Deep learning has emerged as a powerful tool for pest identification, with researchers developing models that can accurately detect and classify pests across various crops. For instance, Selvaraj et al. [8] employed deep learning techniques to recognize banana diseases and pests, achieving high detection accuracy. Similarly, Yang et al. [9] and Hassan et al. [10] utilized object detection algorithms, such as Faster R-CNN, to identify pests in maize and rice crops, respectively. The application of deep learning for pest detection extends beyond traditional crop fields. Kasinathan et al. [11] successfully applied deep learning to detect and recognize pest insects in open-field crops, while Mamdouh and Khattab [12] developed a real-time pest recognition system specifically for olive trees. Additionally, Tetila et al. [13] analyzed the YOLO (You Only Look Once) detector for the real-time detection of soybean pests. Li et al. [14] employed an improved version of the YOLOv5 detector for the accurate detection and counting of aphids in pepper plants, while Wang et al. [15] utilized a revamped Faster R-CNN based on the ResNet101 feature extractor to identify common apple pests in orchards. These examples underscore the versatility and effectiveness of deep learning in pest detection across diverse agricultural environments.
This study focuses on the development of a deep learning-based system designed for the detection and tracking of caterpillars in orchard environments, with the potential for subsequent interventions, such as laser-based elimination. The approach integrates object detection algorithms with fast tracking techniques to create an efficient and adaptable tool for real-time pest detection and tracking. The primary objective is to achieve precise identification and continuous tracking of caterpillars, even under challenging conditions, such as occlusions, wind interference, and the variable lighting typical of natural orchard settings. Despite significant advancements in object detection and tracking technologies, several critical research gaps persist in the domain of agricultural pest monitoring. First, specialized datasets for caterpillar detection in real orchard environments that adequately represent varying illumination conditions, occlusion scenarios, and weather variations remain scarce. Second, existing systems have paid insufficient attention to tracking persistence through occlusion events, which is crucial for continuous monitoring in complex agricultural settings where foliage frequently obscures target pests. Third, there remains a lack of integrated approaches that effectively combine high-accuracy detection with real-time tracking capabilities specifically optimized for precision pest management applications. Our work addresses these gaps through several novel contributions. First, we have developed a robust detection and tracking system optimized for small, camouflaged targets in complex natural environments, overcoming the challenges of detecting cryptic pests against visually similar backgrounds. Second, we introduce corner-tracking functionality specifically designed for precision targeting applications, providing the accurate positional data needed by potential automated intervention systems. Finally, our comprehensive evaluation under real-world agricultural conditions provides valuable insights into system performance across diverse environmental variables, informing future deployments in practical pest management scenarios.
A key aspect of our research is the evaluation of recent YOLO models, including YOLOv9 [16], YOLO-NAS [17], YOLOv11 [18], and YOLOv12 [19]. YOLO models are renowned for their exceptional speed and accuracy in object detection, and they continue to evolve, improving detection accuracy, processing speed, and the ability to handle small objects, as succinctly reviewed by Terven et al. [20] in their recent work. YOLO-NAS introduces advanced features, such as neural architecture search and quantization modules, which enhance detection speed and accuracy. YOLOv9, on the other hand, incorporates programmable gradient information, which uses reversible functions to address the information bottlenecks that arise as data traverse deep networks. YOLOv9 also introduces the generalized efficient layer aggregation network architecture, which utilizes conventional convolution operators to achieve better parameter utilization than methods based on depth-wise convolution, making it highly efficient for small object detection tasks. Further evolution in the YOLO family includes YOLOv11, which features enhanced backbone and neck architectures for more precise feature extraction while optimizing for efficiency and speed. Notably, YOLOv11m achieves higher mean average precision (mAP) on the COCO dataset with 22% fewer parameters than YOLOv8m, demonstrating significant improvements in computational efficiency without compromising accuracy. With its adaptability across various environments and support for diverse computer vision tasks, including object detection, instance segmentation, and pose estimation, YOLOv11 presents a compelling option for our caterpillar detection system. The latest iteration, YOLOv12, takes a different approach by incorporating attention mechanisms while maintaining real-time performance. It introduces an innovative “area attention” module that strategically partitions feature maps to reduce the computational complexity typically associated with self-attention operations. YOLOv12 also adopts residual efficient layer aggregation networks to enhance feature aggregation and training stability, which is particularly beneficial for larger models. These advancements allow YOLOv12 to better model global context without sacrificing the speed critical for real-time pest monitoring applications.
In parallel with YOLO developments, transformer-based detection systems have emerged as powerful alternatives for object detection tasks. DETR (DEtection TRansformer) by Carion et al. [21] pioneered the application of transformers to object detection, eliminating the need for many hand-designed components, such as anchor boxes and non-maximum suppression, through its direct set prediction approach. Deformable DETR [22] addressed DETR’s slow convergence by introducing deformable attention mechanisms that attend only to a small set of key sampling points around a reference point, significantly improving training efficiency while maintaining accuracy. Real-time DETR [23] further optimized transformer-based detection for real-time applications by incorporating an IoU-aware query selection mechanism and a hybrid encoder design, achieving state-of-the-art performance–speed trade-offs particularly beneficial for edge device deployment. Efficient DETR [24] reduces computational complexity through progressive decoder designs and adaptive feature selection, making transformer-based detection more viable for the resource-constrained systems common in agricultural settings. Given the small size of caterpillars, which makes them difficult to detect with less refined methods, our study places significant emphasis on the small object detection capabilities of these models. This focus is critical for ensuring that the system can accurately identify and track caterpillars in real orchard environments. The advances in both YOLO architectures and transformer-based detection systems provide promising approaches for addressing the challenges associated with detecting and tracking these small, often camouflaged pests in complex natural settings.
Effective tracking mechanisms are crucial to our system’s functionality, extending beyond mere object detection. Trackers are essential for maintaining the identity and trajectory of detected objects across successive frames, which is vital for continuous monitoring and timely intervention. For fast, reliable tracking in our application, we selected the SORT (Simple Online and Realtime Tracking) algorithm [25]. While more advanced trackers, such as DeepSORT, BYTETrack, OC-SORT, BoT-SORT, and StrongSORT [26,27,28,29,30], offer various enhancements over the original SORT algorithm, we selected SORT for its effective balance of computational efficiency and tracking accuracy in real-time applications [27,30], making it particularly suited to our needs. The integration of YOLO-NAS with the SORT tracker results in a powerful combination: YOLO-NAS excels at detecting small objects with high precision, while SORT provides efficient, low-overhead tracking, which is essential when deploying in real-time on embedded devices mounted on agricultural vehicles. This synergy enables precise real-time detection and tracking, ensuring seamless deployment in real-world orchard environments. Additionally, our research enhances the system’s functionality with features such as selective corner tracking, which enables more advanced applications such as laser-based elimination. By precisely identifying the head and tail of caterpillars, this approach supports targeted pest control measures, further improving the effectiveness of our system.
2. Materials and Methods
2.1. Caterpillar Rearing and Dataset Collection
In this study, a series of experiments was conducted using live caterpillars reared in a laboratory setting to simulate the conditions found in orchards. The caterpillars, nourished with fresh leaves sourced from an orchard (22.92826° N, 120.29395° E) in Tainan City, Taiwan, were selected for imaging and training during their 3rd and 4th instar growth stages, ranging in length from 2 cm to 4.5 cm. To replicate natural orchard conditions, the caterpillars were placed on leaves and branches in an orchard environment. The species Orgyia postica and Porthesia taiwana, both common pests in jujube orchards in Taiwan, were used in these experiments.
To capture images of the caterpillars, an Intel RealSense D405 camera (Intel Corporation, Santa Clara, CA, USA) was employed. This stereo camera, known for its high depth accuracy of 0.1 mm, was set to capture images at 30 frames per second (fps) within its recommended working range of approximately 7 cm to 50 cm. Images were captured at 30 min intervals twice weekly from 6:00 AM to 6:00 PM over six months (April to October 2024), ensuring that the dataset represents the full spectrum of daily light conditions. Our data collection protocol deliberately included various lighting scenarios: direct sunlight, partial cloud cover, heavy overcast (1000–5000 lux), and dawn/dusk transitions. To ensure proper exposure control, the region of interest (ROI) feature was utilized, focusing on the leaves rather than the sky.
During our experimental period, we recorded a temperature range of 16–28 °C (60.8–82.4 °F), with daily fluctuations of approximately 5–8 °C, and relative humidity of 65–85%, with higher humidity during the early morning hours. According to the entomological literature, both species observed in our study (Orgyia postica and Porthesia taiwana) exhibit optimal feeding activity at temperatures of 20–25 °C and relative humidity of 70–80%. These environmental parameters largely fell within the optimal range for active feeding behavior, which enhanced detection opportunities [31,32]. Weather condition data were obtained from the Central Weather Bureau website, specifically for the Guiren District in Tainan, and matched to the times of data collection.
A total of 1130 images were captured at distances from the camera ranging from 10 cm to 55 cm, under different lighting conditions at various times of the day, with a resolution of 1280 × 720 pixels. This resolution was chosen to balance image quality with the capture frame rate. Data augmentation was applied to introduce variety into the dataset, with images rotated at angles of 45, 90, 135, and 180 degrees, enhancing the representation of the caterpillars in different orientations. Our approach to data collection and augmentation was intentionally rigorous and context-specific, prioritizing real-world diversity over artificial enhancements. Images were systematically captured under varying environmental conditions, spanning different times of the day over a six-month period. This extended collection period ensured comprehensive coverage of local orchard environments, capturing a wide range of naturally occurring variations. Rather than relying on synthetic augmentation techniques such as Gaussian noise, blurring, or contrast adjustments, we focused on maintaining the authenticity of our dataset; artificial modifications often fail to accurately replicate real-world variations in lighting, occlusion, and the motion blur caused by wind and foliage movement. By emphasizing naturalistic data collection, we ensured that the dataset truly represents the complexities encountered in orchard settings, improving the robustness of our model for practical deployment. This methodological choice also enhances the long-term value of the dataset, as it remains representative of real-world conditions without relying on artificially induced distortions.
This process resulted in a total of 5650 images, which were then cropped to 640 × 640 pixels for training and testing. The caterpillars were labeled using LabelImg software (v. 1.8.6) to create the dataset required for training the network. The dataset was partitioned into training, validation, and test sets in a ratio of approximately 0.89:0.05:0.06, consisting of 5040, 200, and 250 images, respectively. The resulting dataset is notably diverse, capturing a range of environmental conditions, such as low light and occlusions caused by leaf movement due to wind. This aspect of our study is particularly significant, as previous deep learning research on pest detection has often relied on datasets that may not adequately represent real-world orchard conditions.
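To make the augmentation and partitioning steps concrete, the following Python sketch (using OpenCV; the folder layout and rotate_image helper are hypothetical, and bounding-box labels would need to be recomputed for rotated images) illustrates the four-angle rotation and the train/validation/test split described above:

```python
import cv2
import glob
import os
import random

ANGLES = [45, 90, 135, 180]  # rotation angles used for augmentation

def rotate_image(img, angle):
    # Rotate about the image center, keeping the original frame size.
    h, w = img.shape[:2]
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(img, m, (w, h))

os.makedirs("dataset/augmented", exist_ok=True)
for path in sorted(glob.glob("dataset/raw/*.png")):  # 1130 source images
    img = cv2.imread(path)
    stem = os.path.splitext(os.path.basename(path))[0]
    cv2.imwrite(f"dataset/augmented/{stem}_0.png", img)  # keep the original
    for angle in ANGLES:
        cv2.imwrite(f"dataset/augmented/{stem}_{angle}.png",
                    rotate_image(img, angle))

# Shuffle once and split into train/val/test (5040/200/250, ~0.89:0.05:0.06).
pool = sorted(glob.glob("dataset/augmented/*.png"))
random.seed(42)
random.shuffle(pool)
train, val, test = pool[:5040], pool[5040:5240], pool[5240:5490]
```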
2.2. Manual Camera Parameter Selection
In real-world applications, variable weather conditions, such as alternating cloudy and sunny periods, can significantly affect image quality. To ensure the optimal performance of our detection and tracking system under such conditions, precise calibration of the camera parameters was essential. The Intel RealSense D405 camera was employed in this study, and specific settings were adjusted to achieve the best results across varying environmental conditions. These settings included white balance, brightness, contrast, sharpness, and exposure time.
The white balance was fixed at its maximum value of 6500 K to account for different lighting conditions, ensuring color consistency in the captured images. Brightness and contrast were left on automatic adjustment to enhance image clarity and detail, making it easier for the detection algorithm to identify caterpillars. The sharpness parameter was set to its maximum value of 100 to keep edges well defined. The exposure time (shutter speed) was set manually to 1 millisecond to prevent overexposure in bright conditions and underexposure in low light, ensuring consistent image quality across different times of the day. This careful adjustment of the exposure time was crucial, as automatic settings often produced images that were either too dark or too bright, which could hinder the detection process.
Additionally, the ROI feature available on the RealSense camera was utilized to focus on the leaves where the caterpillars were likely to be found, rather than on the sky or background elements that could skew the exposure and white balance settings. This focus ensured that the captured images were of high quality and relevant to the detection task. By optimizing these parameters, we ensured that the camera could capture high-quality images under varying weather conditions, which is critical for accurate caterpillar detection and tracking. This parameter selection process is an important step in developing a reliable pest management system that can operate effectively in real orchard environments. The resulting images from these varying conditions are shown in Figure 1a–e.
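A minimal Python sketch of this configuration, using the pyrealsense2 SDK, is shown below. Option names follow the librealsense API, but value units are sensor-specific (the D400-series color sensor takes exposure in microseconds, so 1 ms corresponds to 1000), and the ROI coordinates are hypothetical placeholders that depend on camera placement:

```python
import pyrealsense2 as rs

pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.color, 1280, 720, rs.format.bgr8, 30)  # 30 fps RGB
profile = pipeline.start(config)
color_sensor = profile.get_device().first_color_sensor()

# White balance fixed at its maximum of 6500 K for color consistency.
color_sensor.set_option(rs.option.enable_auto_white_balance, 0)
color_sensor.set_option(rs.option.white_balance, 6500)

# Maximum sharpness keeps caterpillar edges well defined.
color_sensor.set_option(rs.option.sharpness, 100)

# Manual 1 ms exposure to avoid over/underexposure across the day;
# brightness and contrast are left at their automatic defaults.
color_sensor.set_option(rs.option.enable_auto_exposure, 0)
color_sensor.set_option(rs.option.exposure, 1000)

# Restrict the automatic control loops' ROI to the foliage region so the
# sky does not skew exposure/white balance (coordinates are placeholders).
roi_sensor = color_sensor.as_roi_sensor()
roi = rs.region_of_interest()
roi.min_x, roi.min_y, roi.max_x, roi.max_y = 200, 300, 1080, 700
roi_sensor.set_region_of_interest(roi)
```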
2.3. YOLO-NAS and Recent YOLOs for Small Object Detection
The YOLO object detection algorithm has become widely popular in computer vision due to its impressive speed and accuracy. YOLO-NAS, introduced by Deci AI in 2023, has emerged as one of the top-performing models in the YOLO family. Given our focus on effective small object detection, we compared the performance of YOLO-NAS, YOLOv9, YOLOv11, and YOLOv12 specifically for this objective. As described by Tong et al. [33], any object of 32 × 32 pixels or smaller is considered a small object in object detection and classification problems. For a fair comparison, we selected the YOLO-NAS-L, YOLOv9-E, YOLOv11x, and YOLOv12x models for our study. The models were evaluated on a Windows 10 system equipped with a 12th Gen Intel Core i5-12400 processor (2.50 GHz), 32 GB of RAM, and an Nvidia GeForce RTX 3060 GPU with 12 GB of VRAM, using CUDA 11.7, PyTorch 1.13.1, and Python 3.8.10. Training was conducted with a batch size of 4 over 30 epochs.
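As a reference for reproducing the YOLO-NAS-L run, the sketch below uses Deci’s super-gradients library (v3.x); the dataset paths are hypothetical, and the loss, optimizer, and metric settings follow the library’s standard YOLO-NAS fine-tuning recipe rather than values reported here:

```python
from super_gradients.training import Trainer, models
from super_gradients.training.dataloaders.dataloaders import (
    coco_detection_yolo_format_train, coco_detection_yolo_format_val)
from super_gradients.training.losses import PPYoloELoss
from super_gradients.training.metrics import DetectionMetrics_050
from super_gradients.training.models.detection_models.pp_yolo_e import (
    PPYoloEPostPredictionCallback)

trainer = Trainer(experiment_name="caterpillar_yolo_nas_l",
                  ckpt_root_dir="checkpoints")

base_params = {"data_dir": "dataset", "classes": ["caterpillar"]}
train_loader = coco_detection_yolo_format_train(
    dataset_params={**base_params, "images_dir": "images/train",
                    "labels_dir": "labels/train"},
    dataloader_params={"batch_size": 4})
val_loader = coco_detection_yolo_format_val(
    dataset_params={**base_params, "images_dir": "images/val",
                    "labels_dir": "labels/val"},
    dataloader_params={"batch_size": 4})

# Single-class model, initialized from COCO-pretrained weights.
model = models.get("yolo_nas_l", num_classes=1, pretrained_weights="coco")

trainer.train(model=model,
              training_params={
                  "max_epochs": 30,
                  "loss": PPYoloELoss(use_static_assigner=False,
                                      num_classes=1, reg_max=16),
                  "optimizer": "AdamW",
                  "initial_lr": 5e-4,
                  "metric_to_watch": "mAP@0.50",
                  "valid_metrics_list": [DetectionMetrics_050(
                      score_thres=0.1, top_k_predictions=300, num_cls=1,
                      normalize_targets=True,
                      post_prediction_callback=PPYoloEPostPredictionCallback(
                          score_threshold=0.01, nms_top_k=1000,
                          max_predictions=300, nms_threshold=0.7))]},
              train_loader=train_loader, valid_loader=val_loader)
```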
In our study, we identified the optimal model based on three key criteria: small object detection, recall, and mAP at IoU thresholds of 50% and 50–95%. Recall measures the proportion of correctly identified positive cases (true positives) out of the total number of actual positive cases (true positives + false negatives). We prioritized recall over precision, which calculates the proportion of true positives among all positive predictions (true positives + false positives), because our primary goal was to ensure the successful detection of caterpillars, even at the cost of some false positives. In addition, to test the robustness of the YOLO versions for general caterpillar detection, the detections across 30 frames of test videos taken at distances of 20–25 cm and 30–35 cm were compared in terms of the numbers of true positive and false positive detections and tabulated.
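In standard notation, with TP, FP, and FN denoting true positives, false positives, and false negatives, these metrics are

Recall = TP / (TP + FN),  Precision = TP / (TP + FP),

and mAP@50–95 averages the mean average precision over IoU thresholds from 0.50 to 0.95 in steps of 0.05.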
2.4. YOLO-NAS Plus SORT for Selective Corner Tracking for Head Detection
Our study also highlighted that all detectors are prone to missed detections due to occlusions or changes in lighting caused by external factors such as cloudy weather and wind. Nguyen et al. [34] discussed in their study how adding a Kalman filter to a deep learning-based CamShift human tracking system improved overall accuracy, precisely because it can cope with the occlusions and missed detections mentioned above. To address these challenges and enhance the overall detection rate, we integrated the detectors with the SORT tracking algorithm [25], which provides an optimal balance between computational speed and tracking accuracy. While more advanced trackers are available, we selected SORT for its simplicity and efficiency. SORT is particularly known for its speed, leveraging the Kalman filter for motion prediction and the Hungarian algorithm for data association. Although it is less robust to occlusions and complex motion patterns, its efficiency makes SORT especially well suited to real-time applications where minimizing computational overhead is crucial. In addition, standalone YOLO-NAS and YOLO-NAS plus SORT were compared in terms of true positive and false positive detections of caterpillars in test videos taken at distances of 20–25 cm and 30–35 cm. Figure 2 presents a schematic diagram of YOLO-NAS combined with the SORT algorithm on the UGV.
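A condensed sketch of this detector–tracker loop is given below, assuming the reference SORT implementation (github.com/abewley/sort) and a trained YOLO-NAS model loaded through super-gradients; the confidence threshold, tracker parameters, and video path are illustrative assumptions:

```python
import cv2
import numpy as np
from sort import Sort  # reference SORT implementation (abewley/sort)
from super_gradients.training import models

model = models.get("yolo_nas_l", num_classes=1,
                   checkpoint_path="checkpoints/caterpillar_best.pth")
# SORT wraps a Kalman filter (motion prediction) and the Hungarian
# algorithm (detection-to-track association).
tracker = Sort(max_age=5, min_hits=2, iou_threshold=0.3)

cap = cv2.VideoCapture("orchard_test.mp4")
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # Depending on the super-gradients version, predict() returns either a
    # single prediction or an iterable of per-image predictions.
    pred = next(iter(model.predict(frame, conf=0.35)))
    boxes = pred.prediction.bboxes_xyxy          # N x 4, pixel coordinates
    scores = pred.prediction.confidence.reshape(-1, 1)
    dets = np.hstack([boxes, scores]) if len(boxes) else np.empty((0, 5))
    tracks = tracker.update(dets)                # rows: [x1, y1, x2, y2, id]
    for x1, y1, x2, y2, tid in tracks:
        cv2.rectangle(frame, (int(x1), int(y1)), (int(x2), int(y2)),
                      (0, 255, 0), 2)
        cv2.putText(frame, f"id {int(tid)}", (int(x1), int(y1) - 5),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
cap.release()
```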
To enable precise laser-based interventions targeting the heads of caterpillars, selective corner tracking was incorporated. This method identifies the head and tail of a caterpillar when it is positioned diagonally within the bounding box: two corners of the caterpillar align with two corners of the bounding box, while the other two corners fall on the leaf or background. Owing to the significant color contrast between the caterpillars and the background, the head and tail corners can be tracked by specifying the caterpillars’ combined hue, saturation, and value (HSV) color range, and a component was added to selectively identify these corners based on their color. The system was developed with the potential application of lasers on unmanned agricultural ground vehicles for caterpillar elimination. Targeting the caterpillar’s head with a precise laser beam requires significantly less exposure time than targeting the body; as Elgar et al. [35] describe, the head carries the antennae and other critical structures essential for environmental perception and sensory processing. By obtaining the x, y, and z coordinates of the two corners of the bounding box representing the head and tail, a laser strike at these two points ensures at least one hit on the head, effectively neutralizing the caterpillar. The overall software architecture of YOLO-NAS plus SORT with selective tracking is illustrated in Figure 3.
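The corner-selection logic can be sketched as follows: small patches around the four corners of each tracked box are converted to HSV and scored against the caterpillar’s color range, and the two best-matching (diagonally opposite) corners are retained as head/tail candidates. The HSV bounds and patch size below are hypothetical placeholders that would be measured from labeled images of the target species:

```python
import cv2
import numpy as np

# Placeholder HSV range for the caterpillar body color (species-specific).
HSV_LO = np.array([10, 80, 60])
HSV_HI = np.array([35, 255, 255])
PATCH = 8  # half-size (pixels) of the square patch inspected at each corner

def head_tail_corners(frame_bgr, box):
    # Score each bounding-box corner by the fraction of nearby pixels whose
    # HSV values fall inside the caterpillar's color range, then keep the
    # two strongest matches (the corners lying on the body).
    x1, y1, x2, y2 = map(int, box)
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    scored = []
    for cx, cy in [(x1, y1), (x2, y1), (x1, y2), (x2, y2)]:
        patch = hsv[max(cy - PATCH, 0):cy + PATCH,
                    max(cx - PATCH, 0):cx + PATCH]
        if patch.size == 0:
            continue
        mask = cv2.inRange(patch, HSV_LO, HSV_HI)
        scored.append(((cx, cy), float(mask.mean())))
    scored.sort(key=lambda s: s[1], reverse=True)
    return [s[0] for s in scored[:2]]  # head and tail candidates
```

The depth stream of the D405 then supplies the z value at these two pixels, completing the 3D aim points for the laser.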
4. Practical Applications and Perspectives
4.1. Performance and Applications of YOLO-NAS and SORT for Pest Management
Our integrated approach using YOLO-NAS and SORT with selective corner tracking demonstrated superior performance for the real-time detection and tracking of caterpillars in orchards. YOLO-NAS achieved a high recall value (0.99) that outperformed the other YOLO versions, with a particularly impressive detection rate of 91.7% for small caterpillars (2–2.5 cm). This early detection capability is crucial for timely intervention before crop damage occurs. The integration of SORT tracking enhanced the system by reducing false positives by up to 8.4% at 30–35 cm distances. Our camera parameter optimization methodology also allowed robust performance across variable environmental conditions, addressing a common limitation of previous studies that operated under controlled lighting.
The system provides significant advantages over traditional pest monitoring methods, which are typically labor-intensive, time-consuming, and often provide delayed detection. In contrast, our automated approach enables continuous monitoring with real-time detection capabilities across multiple orchard locations, providing comprehensive spatial coverage rather than limited sampling. The selective corner tracking feature adds precision by identifying specific body parts of caterpillars, which is valuable for targeted interventions. This automated system provides objective, consistent detection and counting, potentially improving threshold-based decision making in pest management strategies.
4.2. Economic and Practical Benefits
Implementation costs for our detection system include approximately USD 3500 for hardware (computing unit, camera, mounting hardware, protective enclosures) plus optional unmanned ground vehicle platforms (USD 4000–6000) for mobile applications. Operational requirements are modest: 150–250 W power consumption (compatible with solar integration), 5–10% annual maintenance costs, and significantly reduced labor requirements (1–2 person-hours per hectare weekly versus 8–12 for conventional methods). The economic benefits stem primarily from 15–25% yield improvement through early detection and targeted interventions (USD 4500–20,000 per hectare for high-value jujube orchards). This results in an estimated ROI period of 1.2–1.8 growing seasons for stationary systems and 2–3 seasons for mobile platforms.
4.3. Laser-Based Pest Control Integration
The integration of our detection and tracking system with laser-based pest control offers a promising alternative to chemical pesticides. Targeting specific body parts of caterpillars with lasers, made possible by our selective corner tracking feature, can reduce energy requirements by 30–40% compared to untargeted applications. For a typical 2–2.5 cm caterpillar, approximately 0.2–0.3 J per pest is sufficient when targeting vulnerable regions. A 5 W continuous wave laser with 150 millisecond pulses would provide sufficient energy while maintaining reasonable power consumption compatible with solar-powered field installations or mobile battery platforms.
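As a quick sanity check on these figures, a single pulse delivers

E = P × t = 5 W × 0.15 s = 0.75 J,

which is two to three times the estimated 0.2–0.3 J required per pest, leaving headroom for optical losses and imperfect targeting.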
4.4. Environmental and Safety Considerations
Laser-based systems offer significant environmental advantages by leaving no chemical residues while achieving high mortality rates in target pests. Unlike pesticides that affect both target and non-target organisms, laser systems can specifically target identified pests at the individual level with a 96–98% reduction in non-target impacts. Safety features including optical isolation, proximity sensors, physical shielding, and emergency shutdown systems would prevent accidental human exposure, while species-specific detection algorithms would protect beneficial insects. Regulatory pathways for laser pest control differ from chemical pesticides, avoiding extensive toxicological testing but requiring a demonstration of safety and efficacy through field trials. Despite promising potential and compelling economic advantages over conventional methods (25–40% reduction in total pest management costs over a 10-year lifespan), challenges remain including all-weather reliability, targeting precision for early instar larvae, throughput limitations, and dense canopy penetration that will require continued engineering refinement and comprehensive field validation.
4.5. Comparative Analysis of Precision Agriculture Technologies
Our approach offers distinct advantages over existing AI-driven pest management technologies. Unlike the autonomous spraying platforms developed by Qin et al. [39] and Zhang et al. [40], which primarily focus on chemical pesticide application, our deep learning-based pest identification method, combined with a future laser-based system, provides a non-chemical intervention that minimizes environmental impact. While systems such as those by Selvaraj et al. [8] and Mohanty et al. [5] have demonstrated high accuracy in pest detection using deep learning, our methodology extends beyond mere identification to offer a targeted, energy-efficient intervention strategy.
Compared to remote sensing-based pest detection approaches, such as that proposed by Hassan et al. [10], our system delivers higher spatial resolution and real-time tracking capabilities. The integration of YOLO-NAS with selective corner tracking enhances pest identification and localization accuracy, addressing a key limitation found in many existing AI-driven agricultural technologies. By focusing on precise tracking at the individual level, our system ensures that interventions are more effective and adaptable to real-world orchard conditions.
Furthermore, our approach differentiates itself from traditional CNN-informed precision application tools by providing a non-chemical pest control method, enabling individual pest targeting, reducing the impact on non-target organisms, and offering potentially lower long-term operational costs. While deep learning-based systems such as those by Ramcharan et al. [6] and Caldeira et al. [7] have achieved impressive detection accuracies, our work takes a more holistic approach by integrating advanced detection (YOLO-NAS), robust tracking (SORT), and a targeted intervention mechanism (laser targeting). This comprehensive framework positions our system as a highly precise and environmentally sustainable alternative to conventional pest management technologies.
5. Conclusions
The integration of YOLO-NAS with the SORT algorithm has significantly improved caterpillar detection and tracking in orchard environments, particularly under challenging conditions such as partial occlusion by leaves and wind interference. This approach, combined with selective corner tracking, enables precise head and tail identification, facilitating accurate laser targeting for efficient pest control and optimized energy use. The SORT algorithm has proven effective in maintaining reliable caterpillar tracking, overcoming environmental challenges to ensure consistent identification—an essential factor for precision pest management. Additionally, the system’s processing speed remains practical for real-world deployment in orchards. Compared to contemporary YOLO versions, YOLO-NAS demonstrates superior performance, detecting the smallest caterpillars (21 × 6 and 25 × 6 pixels, or ~2.5 cm) in 55 out of 60 instances, while YOLO-NAS + SORT achieves detection in 38 instances. The high recall and efficient tracking performance of YOLO-NAS + SORT highlight its robustness and adaptability, making it a promising tool for precision agriculture and sustainable pest management.
Future work will focus on optimizing YOLO-NAS + SORT with TensorRT for deployment on embedded devices. Expanding the dataset to include additional economically significant caterpillar species, such as Spodoptera litura, Helicoverpa armigera, and Plutella xylostella, and incorporating geographical, environmental, and seasonal variations will also improve the model’s generalizability. A key challenge observed was frequent detection switches between the two caterpillar species; however, since the approach prioritizes maximum detection over species differentiation, the impact was minimal. Another limitation was the occasional misidentification of foliage as caterpillar corners under specific lighting conditions, which could be addressed with a more precise tracking algorithm such as BYTETrack or StrongSORT. Collaborating with agricultural experts through field trials will further validate and refine the system. With continued development, the YOLO-NAS + SORT integration, coupled with selective corner tracking and performance enhancements, holds significant potential for revolutionizing pest management by providing an efficient, reliable, and environmentally sustainable solution for agriculture.