1. Introduction
The global challenges of climate change, population growth, and food security have driven significant advancements in modern agriculture. Researchers have introduced several mechanization technologies, including advanced farm machinery, autonomous navigation systems, artificial intelligence, sensing technologies, and communication tools, to enhance productivity and sustainability [1,2]. These innovations improve land utilization, labor efficiency, and resource management, and increase farmers’ profitability. Among these advancements, integrating autonomous navigation systems with machine vision and image processing has become crucial for intelligent farm machinery [3]. This technology enables precise movement, real-time crop monitoring, and efficient execution of field tasks, reducing labor dependency and operational costs [4,5].
Mobile robots, such as Unmanned Ground Vehicles (UGVs), are increasingly used in complex and unstructured environments, extending beyond traditional indoor settings. With increasing environmental complexity, fast and robust environment perception and recognition have become critical for autonomous robotic navigation [6,7]. These environments pose unique challenges for autonomous navigation due to dynamic conditions, irregular terrain, and dense vegetation. Adequate perception and recognition of surroundings are critical for safe and efficient movement [8]. These robots must distinguish between “Real” obstacles (e.g., tree trunks, humans) that require avoidance and “Fake” obstacles (e.g., branches, tall grass) that do not impede movement [9,10]. This distinction enhances decision-making, optimizes navigation, and improves safety and efficiency [11,12]. Existing research underscores the need for robust scene understanding in agricultural robotics, particularly in environments with varying illumination, unstructured boundaries, and unpredictable features [13,14].
An orchard navigation system comprises three key components: environment detection, path planning, and navigation control. Presently, the primary sensors used for environmental sensing include the Global Navigation Satellite System (GNSS), machine vision, Light Detection and Ranging (LiDAR), and multi-sensor fusion [15]. Machine learning and artificial intelligence models have recently been widely applied in real-world situations, such as feature extraction and target detection in field environments [16]. Similarly, artificial intelligence has garnered significant attention in traditional agriculture, enabling various activities and missions to be planned effectively using limited resources with minimal human interference [17].
Furthermore, deep learning has significantly enhanced obstacle detection in agricultural robotics, primarily through convolutional neural networks (CNNs). Models such as YOLO (You Only Look Once) have gained prominence for real-time obstacle detection. YOLOv3 balanced speed and accuracy, and subsequent optimized versions such as YOLOv4 and YOLOv5 are better suited to agricultural robotics [18,19,20,21,22,23,24]. Wang and Wei [6] stated that machine-learning approaches require large amounts of labeled training data, computation, and storage. Therefore, using deep learning algorithms necessitates support from powerful computational resources, such as high-performance graphics processors (GPUs) and high-capacity storage devices. However, despite these advancements, orchard environments remain semi-structured, requiring robots to promptly detect and avoid obstacles [25]. Dynamic conditions and environmental variability compound the challenge of detecting obstacles in such settings. To enhance robustness, several modifications to standard models have been proposed [26,27,28,29]. Recent studies have explored the integration of multimodal sensors (e.g., LiDAR, radar, and cameras) to mitigate environmental noise and occlusions, thereby improving detection accuracy, reducing false positives, and facilitating real-time decision-making [30,31,32,33,34,35]. Advanced obstacle avoidance algorithms, such as the Obstacle-Dependent Gaussian Potential Field (ODG-PF) model, have also been introduced to refine collision avoidance by evaluating obstacle proximity and collision probabilities [36].
Han et al. [37] combined 3D CNNs with Long Short-Term Memory (LSTM) networks to enhance obstacle recognition by integrating spatial and temporal features. While CNN-based architectures, such as Mask R-CNN and Faster R-CNN, have been explored for dynamic and multi-class detection tasks, these models and traditional deep learning approaches focus on object detection rather than classifying obstacles into functional categories. Although YOLOv5 and YOLOv8 offer high detection accuracy, they do not inherently differentiate between ‘Real’ and ‘Fake’ obstacles, making them suboptimal for autonomous navigation in semi-structured orchard environments [38,39].
Moreover, some recent works incorporate advanced sensor fusion techniques, integrating multiple sensing modalities, such as LIDAR, cameras, and radar, to enhance obstacle detection accuracy. However, these approaches often increase computational complexity and hardware costs, limiting their deployment on lightweight, mobile agricultural robots. Thus, a computationally efficient yet robust classification mechanism is required to enhance real-time obstacle differentiation without significantly increasing processing overhead.
Machine vision solutions often rely on depth cameras and deep learning algorithms for obstacle detection [18,19,20,40]. Feng et al. [41] developed a segmentation network with dual semantic-feature complementary fusion to segment diverse fake obstacles, while Liu et al. [42] proposed the YOLO-SCG model for faster and more precise target recognition. Koch et al. [43] evaluated the geometric features of potholes, and Matthies et al. [44] introduced a method for detecting fake obstacles using thermal infrared imaging. However, existing methods primarily focus on identifying obstacles rather than differentiating their impact on navigation.
In real-world orchard environments, a binary approach to obstacle detection is insufficient, as not all detected obstacles require the same level of avoidance. For instance, a tree trunk should trigger a different response than a flexible branch, which an autonomous robot can navigate without requiring a detour. However, current research lacks a dedicated real-time classification mechanism to separate obstacles into actionable categories [45,46]. This limitation can lead to unnecessary path deviations, reduced efficiency, and increased operational complexity for orchard robots.
This study enhances the YOLOv8 detection framework to address these challenges by integrating a lightweight CNN classifier with Ghost Modules and Squeeze-and-Excitation (SE) blocks. These modifications improve feature extraction while reducing computational overhead, ensuring efficient real-time deployment in orchard environments. By distinguishing between “Real” obstacles (e.g., tree trunks, humans) and “Fake” obstacles (e.g., branches, tall grass), our model minimizes unnecessary stops and detours, thereby improving navigation efficiency. Additionally, Hyperband optimization balances accuracy and speed for real-world applications.
Building on our previous work [47], which enhanced YOLOv8-based obstacle detection but lacked classification capabilities, we introduce a dedicated classification mechanism that distinguishes between obstacles that require avoidance and those that do not. To validate the effectiveness of our model, we evaluate it on multiple datasets, including orchard-based and campus datasets, as well as an external dataset [48] to ensure generalizability.
The key contributions include the following:
Utilizing Ghost Modules and SE blocks to enhance feature extraction while reducing computational overhead, facilitating deployment on energy-limited robotic platforms.
Ensuring model robustness by training on diverse datasets (orchard and campus environments) to improve adaptability to varying operational conditions.
Evaluating model generalization on an open dataset to ensure applicability to previously unseen obstacles [48].
Employing Hyperband optimization for efficient hyperparameter tuning, increasing accuracy while reducing training time for real-time applications.
Classifying obstacles into actionable categories (“Real” and “Fake”) to facilitate faster decision-making for collision avoidance and navigation.
Demonstrating superior performance compared to conventional classifiers and state-of-the-art models in terms of accuracy, efficiency, and reliability within practical orchard environments.
With these advancements, our work aims to significantly improve the reliability and efficiency of autonomous robots in orchard settings. By providing a robust solution for real-time obstacle classification, our model enhances robots’ ability to make informed decisions regarding obstacle avoidance and trajectory adjustments, ultimately improving safety and optimizing navigation in dynamic agricultural environments. Furthermore, this paper is structured as follows:
Section 1 introduces the research objectives, highlights the significance of obstacle classification, and identifies the gap addressed by our approach.
Section 2 presents the system description, navigation framework, experimental setup, dataset preparation, and model architecture design.
Section 3 details the experimental results, analyzes model performance compared to traditional classifiers, and evaluates its effectiveness in real-world orchard and campus environments.
Section 4 discusses the results, elaborates on strengths and limitations, explores practical implications, and suggests future research directions. Finally,
Section 5 concludes the paper with key insights and recommendations.
2. Materials and Methods
2.1. System Description and Vision System Overview
The teleoperated orchard management robot developed by our team integrates multiple high-precision sensors for accurate environmental monitoring. These sensors include the RS-LIDAR-16 for three-dimensional mapping, a high-resolution camera for visual data acquisition, a GNSS unit for precise positioning, and an IMU for motion sensing, as illustrated in
Figure 1. The RS-LIDAR-16 (Robosense, Shenzhen, China) was chosen for its cost-effectiveness and sufficient performance for navigation and mapping, serving a complementary role within the sensor suite. Its compatibility with other sensors facilitated real-time data processing. Although its contribution was limited in this study, it remains a valuable component for future research where its capabilities can be fully utilized.
The robot was manually operated via a wireless control interface, with sensor data recorded at a rate of 10 Hz for immediate processing and decision-making. The Intel RealSense D455 RGB-D camera (Santa Clara, CA, USA), positioned at a height of 1.2 m, captured high-resolution RGB images (1280 × 800 pixels) and depth data for accurate distance measurement. The high resolution enables better detail in obstacle features, which is crucial for accurate classification. Additionally, the camera’s frame rate of up to 90 fps ensures rapid data capture, supporting real-time processing and minimizing the likelihood of missed obstacles in dynamic environments.
The embedded computing platform comprised an Advantech MIC770Q industrial computer. This MIC770Q (Taipei, Taiwan) model is powered by an Intel 8-core i7-9700E processor, which features a built-in Intel UHD Graphics 630 card and has two 16GB DDR4 memory modules. After initial evaluations for real-time validation, we deployed the classification model on this local machine. This deployment required the configuration of essential libraries, including Python 3.8.20, PyTorch 1.13.1, YOLOv8, a CNN-based classification framework, and OpenCV for video processing. An Intel RealSense D455 camera captured real-time RGB images, significantly enhancing the system’s perception capabilities.
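For illustration, a minimal capture loop of the kind this deployment implies is sketched below using pyrealsense2 and OpenCV; the stream settings (1280 × 800 RGB at 30 fps) are assumptions for the example, and this is not the authors’ code.

```python
# Minimal sketch (assumed settings): grab RGB frames from an Intel RealSense D455
# with pyrealsense2 and hand them to OpenCV for display/processing.
import pyrealsense2 as rs
import numpy as np
import cv2

pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.color, 1280, 800, rs.format.bgr8, 30)  # RGB stream (assumed 30 fps)
pipeline.start(config)

try:
    while True:
        frames = pipeline.wait_for_frames()
        color_frame = frames.get_color_frame()
        if not color_frame:
            continue
        image = np.asanyarray(color_frame.get_data())  # H x W x 3 BGR array
        cv2.imshow("D455 RGB", image)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
finally:
    pipeline.stop()
    cv2.destroyAllWindows()
```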
The selection of the RGB-D camera underwent a rigorous evaluation, comparing potential alternatives such as the D415, D435i, and Azure Kinect. Key specifications—including depth range, resolution, and frame rate—were analyzed to determine the most suitable option.
Table 1 presents a comparative analysis, highlighting the advantages of the Intel RealSense D455 in complex orchard environments.
2.2. Subsystem Integration and Coordination
Subsystem integration was accomplished using the Robot Operating System (ROS), which enabled seamless communication among the robot’s sensors, navigation systems, and control components. ROS facilitated real-time data exchange, ensuring synchronized processing of sensor inputs from the Intel RealSense D455 camera.
Environmental data were continuously monitored via ROS topics and nodes for obstacle detection, with processed results feeding into the robot’s decision-making system. This integration allowed dynamic trajectory adjustments based on real-time environmental changes, ensuring efficient and safe navigation in complex orchard environments.
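The ROS wiring can be pictured with the following minimal rospy node; the topic names (/camera/color/image_raw, /nav/decision) and the classify_frame() stub are hypothetical, chosen only to illustrate the subscriber/publisher pattern described above.

```python
# Illustrative ROS node: subscribe to camera images, classify obstacles,
# and publish a navigation command. Topic names and the classifier are placeholders.
import rospy
from sensor_msgs.msg import Image
from std_msgs.msg import String
from cv_bridge import CvBridge

bridge = CvBridge()

def classify_frame(frame):
    """Placeholder for the YOLOv8 detection + CNN classification pipeline."""
    return "Fake"

def image_callback(msg, cmd_pub):
    frame = bridge.imgmsg_to_cv2(msg, desired_encoding="bgr8")  # ROS image -> OpenCV array
    label = classify_frame(frame)                               # "Real" or "Fake"
    cmd_pub.publish("stop" if label == "Real" else "continue")  # decision for the controller

def main():
    rospy.init_node("obstacle_classifier")
    cmd_pub = rospy.Publisher("/nav/decision", String, queue_size=1)
    rospy.Subscriber("/camera/color/image_raw", Image, image_callback, callback_args=cmd_pub)
    rospy.spin()

if __name__ == "__main__":
    main()
```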
2.3. Obstacle Classification and Navigation Framework
The robot’s navigation relied on a real-time obstacle classification model, distinguishing between Real and Fake obstacles. Real obstacles were physical obstructions requiring the robot to stop or alter its path, while Fake obstacles were negligible and did not affect navigation. This binary classification approach streamlined decision-making while minimizing computational demands, essential for real-time autonomous navigation. The classification procedures outlined in Algorithms 1 and 2 were integrated into the navigation system to enhance efficiency and responsiveness. The model underwent rigorous testing in both orchard and campus environments to ensure robust performance.
Algorithm 1: Obstacle Classification and Navigation Decision
Input: Detected Obstacles (O), Navigation Status
Output: Stop Signal, Continue Signal
1. Initialize Stop Signal ← false
2. Initialize Continue Signal ← true
3. For each obstacle o ∈ O (detected obstacles) do
4.   Classify(o) → obstacle_class  # classify the obstacle as Real or Fake
5.   If obstacle_class = “Real” then
6.     Set Stop Signal ← true  # real obstacle detected, stop the robot
7.     Break the loop, stop further processing
8.   Else if obstacle_class = “Fake” then
9.     Set Continue Signal ← true  # fake obstacle, continue robot movement
End for

Algorithm 2: Navigation Control Based on Obstacle Classification
Input: Detected Obstacles (O), Current Position (Poscurrent)
Output: Status (Moving or Stopped)
1. Initialize Status ← “Moving”
2. While Status = “Moving” do
3.   Detected ← Detect_Obstacles(Poscurrent)  # detect obstacles at the current position
4.   Stop_Signal, Continue_Signal ← Obstacle_Classification_and_Navigation_Decision(Detected)  # classify and decide navigation
5.   If Stop_Signal = true then
6.     Update Status ← “Stopped”  # stop the robot if a real obstacle is detected
7.   Else if Continue_Signal = true then
8.     Update Status ← “Moving”  # continue moving if obstacles are fake or absent
End while
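A minimal Python sketch of the decision logic in Algorithms 1 and 2 is given below; classify() and detect_obstacles() are hypothetical stand-ins for the YOLOv8 detector and the CNN classifier described later, not the authors’ actual implementation.

```python
def obstacle_decision(obstacles, classify):
    """Algorithm 1: return (stop_signal, continue_signal) for a set of detections."""
    stop_signal, continue_signal = False, True
    for obstacle in obstacles:
        if classify(obstacle) == "Real":
            stop_signal = True          # real obstacle: stop and skip further processing
            break
        # a "Fake" obstacle leaves continue_signal at True and the loop proceeds
    return stop_signal, continue_signal


def navigation_loop(detect_obstacles, classify, position):
    """Algorithm 2: keep the robot moving until a real obstacle is detected."""
    status = "Moving"
    while status == "Moving":
        detected = detect_obstacles(position)                   # detections at the current position
        stop_signal, _ = obstacle_decision(detected, classify)  # classify and decide
        status = "Stopped" if stop_signal else "Moving"
    return status
```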
2.4. Data Collection and Dataset Preparation
The teleoperated robot was deployed in two environments: Nanjing Agricultural University campus, which is heavily forested, and a sweet apple orchard in Yanghe Town, Suqian City, where fruit trees are approximately 3 m in height, with canopy widths of 1 to 1.5 m and row spacing of around 3 m. Data were collected in 2024 across different lighting conditions and seasons. Approximately 30,000 RGB images were acquired per environment to ensure robustness. These images captured diverse obstacles, including barriers, dead vegetation, fencing, fruit baskets, grass, live vegetation, people, stones, debris, tree trunks, and branches, as shown in
Figure 2.
Initial data processing involved a YOLOv8-based detection algorithm (Figure 3) to streamline classification by generating bounding boxes around potential obstacles [47]. These were manually verified for accuracy, particularly for partially occluded objects. A balanced dataset of 24,000 labeled bounding boxes, equally distributed between Real and Fake obstacles, was created for each environment. The dataset was split into 85% for training and 15% for unbiased testing. Further generalization testing utilized 2400 images from both environments.
To ensure dataset reliability, a comprehensive data-cleaning process was implemented. Images with severe motion blur, overexposed regions, or excessive occlusions were removed to maintain dataset integrity. Duplicate images captured under similar conditions were filtered out to prevent overfitting. Additionally, any mislabeled data points identified during manual verification were corrected. If an image contained significant missing or incomplete visual information due to sensor noise or transmission errors, it was either removed or reconstructed using adjacent frames when possible. To further enhance dataset quality, outlier detection was performed by identifying bounding boxes with extreme aspect ratios or inconsistent obstacle labeling, ensuring that objects were accurately represented in the dataset.
Data augmentation techniques were applied to improve generalization, including random rotations (±30°), brightness adjustments, horizontal flipping, and 20% shifts (Figure 4). All images were resized to a fixed resolution to maintain consistency. Additionally, pixel values were normalized between 0 and 1 to standardize network input.
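As one possible realization of these augmentations, a torchvision pipeline is sketched below; the jitter strength and flip probability are illustrative assumptions rather than the exact values used in the study.

```python
# Illustrative augmentation pipeline matching the operations described above:
# rotation within ±30°, brightness adjustment, horizontal flipping, 20% shifts,
# fixed-size resizing, and [0, 1] pixel normalization via ToTensor().
from torchvision import transforms

train_transforms = transforms.Compose([
    transforms.Resize((224, 224)),                               # fixed network input size
    transforms.RandomRotation(degrees=30),                       # random rotations within ±30°
    transforms.ColorJitter(brightness=0.3),                      # brightness adjustment (assumed strength)
    transforms.RandomHorizontalFlip(p=0.5),                      # horizontal flipping
    transforms.RandomAffine(degrees=0, translate=(0.2, 0.2)),    # up to 20% shifts
    transforms.ToTensor(),                                       # scales pixel values to [0, 1]
])
```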
2.5. Model Architecture Design
The proposed system follows a two-stage process: first, object detection using a YOLOv8n model to identify regions of interest [47], and second, a CNN-based classification model to categorize detected objects as Real or Fake obstacles. The CNN architecture integrates Ghost Modules and SE blocks to enhance computational efficiency and feature extraction. This integration is intentionally designed to balance lightweight processing and enhanced feature discrimination, which is essential for real-time operation in orchard environments.
Ghost Modules generate lightweight feature maps to reduce computational overhead, making them ideal for real-time applications in resource-constrained environments, as Shen et al. [49] demonstrated, while SE blocks recalibrate feature maps through channel-wise attention, improving interpretability and performance, as Hu et al. [12] discussed. The cooperation between the Ghost and SE modules allows the model to operate efficiently while maintaining high classification accuracy: the Ghost Module reduces computational complexity by generating simplified yet informative feature maps, whereas the SE Module enhances the network’s ability to focus on the most critical features through channel-wise attention. This combination balances accuracy and speed, which is crucial for real-time obstacle classification in orchard environments.
As depicted in
Figure 5, the CNN architecture begins with an input layer that processes RGB images resized to 224 × 224 pixels. The input image, with dimensions H × W × C (where H is the height, W is the width, and C represents the three color channels), undergoes standard preprocessing steps, such as normalization and resizing, before being fed into the network.
The overall architecture consists of convolutional layers with ReLU activation functions interspersed with max-pooling layers for dimensionality reduction. SE blocks and Ghost Modules follow these layers, strategically incorporated to optimize computation and enhance feature representation, particularly for the challenging task of obstacle detection in orchard environments. The coordinated use of these modules ensures robustness by improving feature extraction while reducing computational costs, making the model well-suited for deployment on energy-limited robotic platforms.
Fully connected layers were regularized with dropout to mitigate overfitting, while the final layer uses a sigmoid activation function for binary classification to distinguish between Real and Fake obstacles. By combining the computational efficiency of Ghost Modules with the feature-enhancing capabilities of SE blocks, the model ensures both accuracy and speed, making it well-suited for real-time applications in agricultural robotics.
2.5.1. Integration of Ghost Module
Convolutional layers serve as the primary feature extractors in the CNN architecture. Ghost Modules were integrated into these layers to optimize computational efficiency. Unlike traditional convolutions, which generate C output feature maps, Ghost Modules create fewer primary feature maps via standard convolutions and then augment these maps with “Ghost” feature maps produced by lightweight depth-wise convolutions. This strategy reduces computational load while preserving feature richness. Mathematically, the primary feature maps are expressed as follows:

$$Y' = X * f', \qquad Y' \in \mathbb{R}^{h' \times w' \times m}, \qquad m = \frac{C}{s}$$

where $s$ is the reduction ratio. The ghost feature maps are computed as follows:

$$y_{ij} = \Phi_{i,j}\left(y'_i\right), \qquad i = 1, \dots, m, \quad j = 1, \dots, s$$

The combined output is as follows:

$$Y = \left[y_{11}, y_{12}, \dots, y_{ms}\right]$$

Here, $Y'$ represents the primary feature maps, $y_{ij}$ represents the ghost feature maps, and $Y$ is the combined output. This approach reduces computation while preserving feature richness.
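A compact PyTorch sketch of a Ghost Module of this kind is shown below; the layer sizes and the default reduction ratio are illustrative, and this is not the exact module used in the proposed network.

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Primary convolution plus cheap depth-wise 'ghost' maps (reduction ratio s)."""
    def __init__(self, in_channels, out_channels, s=2, kernel_size=1, dw_size=3):
        super().__init__()
        primary_channels = out_channels // s               # m = C / s primary maps
        ghost_channels = out_channels - primary_channels   # remaining maps are generated cheaply
        self.primary = nn.Sequential(
            nn.Conv2d(in_channels, primary_channels, kernel_size,
                      padding=kernel_size // 2, bias=False),
            nn.BatchNorm2d(primary_channels),
            nn.ReLU(inplace=True),
        )
        self.cheap = nn.Sequential(
            nn.Conv2d(primary_channels, ghost_channels, dw_size,
                      padding=dw_size // 2, groups=primary_channels, bias=False),
            nn.BatchNorm2d(ghost_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        y_primary = self.primary(x)                    # standard convolution -> primary maps
        y_ghost = self.cheap(y_primary)                # depth-wise 'ghost' feature maps
        return torch.cat([y_primary, y_ghost], dim=1)  # combined output Y
```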
2.5.2. Integration of Squeeze-and-Excitation (SE) Module
The SE blocks are integrated following the Ghost Modules to enhance the model’s representational power by focusing on informative channels. The SE blocks recalibrate the feature maps through squeeze and excitation.
Squeeze Step: Global Average Pooling (GAP) is applied to each channel to compress the spatial dimensions into a squeezed vector $z$:

$$z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_c(i, j)$$

where $x_c(i, j)$ represents the feature value at position $(i, j)$ in channel $c$.

Excitation Step: The squeezed vector $z$ is passed through fully connected layers to produce channel weights $s$:

$$s = \sigma\left(W_2 \, \delta\left(W_1 z\right)\right)$$

where $W_1$ and $W_2$ are learnable weight matrices, $\delta$ denotes the ReLU activation, and $\sigma$ denotes the sigmoid activation function. The recalibrated feature maps are obtained via channel-wise multiplication:

$$\tilde{x}_c = s_c \odot x_c$$

where $\odot$ denotes element-wise multiplication.
The SE block helps the model focus on the most informative features by recalibrating feature maps based on their importance, thus improving the model’s interpretability and performance.
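The squeeze-and-excitation recalibration above maps directly onto a few lines of PyTorch; the reduction ratio of 16 below is the common default from Hu et al., assumed here for illustration rather than taken from the paper.

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Channel-wise recalibration: squeeze (GAP) then excitation (FC-ReLU-FC-sigmoid)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)           # global average pooling -> z
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # W1
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),  # W2
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        z = self.squeeze(x).view(b, c)        # squeezed vector z
        s = self.excite(z).view(b, c, 1, 1)   # channel weights s
        return x * s                          # channel-wise recalibration of feature maps
```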
2.5.3. Hyperband Optimization
The Hyperband algorithm was used for hyperparameter optimization to further enhance the model’s performance. This method efficiently explores the hyperparameter space to identify optimal values for key parameters such as learning rate, batch size, and dropout rate. As suggested in the original work [50], this technique dynamically allocates resources to the most promising configurations, ensuring the best possible model performance.
The optimization process involves running several configurations of the model with different hyperparameter values. These configurations were evaluated, and the top-performing ones were allocated more resources for further evaluation. The best combination of hyperparameters was then selected for training the final model, resulting in improved accuracy and efficiency.
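To make the resource-allocation idea concrete, a simplified successive-halving sketch (the core mechanism inside Hyperband) is given below; the train_and_score() function, search space, and budget values are hypothetical, and real tuning would typically rely on a library implementation.

```python
import random

def sample_config():
    """Draw one hyperparameter configuration from an illustrative search space."""
    return {
        "learning_rate": 10 ** random.uniform(-5, -3),
        "batch_size": random.choice([16, 32, 64]),
        "dropout": random.uniform(0.1, 0.5),
    }

def successive_halving(train_and_score, n_configs=27, min_epochs=2, eta=3, rounds=3):
    """Keep the best 1/eta of configurations each round, giving survivors more epochs."""
    configs = [sample_config() for _ in range(n_configs)]
    budget = min_epochs
    for _ in range(rounds):
        scored = [(train_and_score(cfg, epochs=budget), cfg) for cfg in configs]
        scored.sort(key=lambda pair: pair[0], reverse=True)      # higher validation accuracy first
        configs = [cfg for _, cfg in scored[: max(1, len(scored) // eta)]]
        budget *= eta                                            # promising configs get more resources
    return configs[0]
```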
2.6. Model Training and Evaluation Metric
The model was trained using the Adam optimizer with a learning rate of 0.0001, a batch size of 32, and multiple epochs. Regularization techniques were implemented to prevent overfitting, including L2 weight decay and dropout at a rate of 0.2. Early stopping was employed based on validation loss to optimize training efficiency.
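A condensed PyTorch training loop consistent with these settings might look as follows; the model, data loaders, weight-decay coefficient, and patience value are placeholders for illustration, not the study’s exact configuration.

```python
import copy
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, epochs=50, patience=5, device="cuda"):
    model.to(device)
    criterion = nn.BCELoss()                                   # binary Real/Fake target (sigmoid output)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
                                 weight_decay=1e-4)            # Adam + L2 weight decay (assumed value)
    best_loss, best_state, wait = float("inf"), None, 0
    for epoch in range(epochs):
        model.train()
        for images, labels in train_loader:
            images, labels = images.to(device), labels.float().to(device)
            optimizer.zero_grad()
            loss = criterion(model(images).squeeze(1), labels)
            loss.backward()
            optimizer.step()
        model.eval()
        val_loss, n = 0.0, 0
        with torch.no_grad():
            for images, labels in val_loader:
                images, labels = images.to(device), labels.float().to(device)
                val_loss += criterion(model(images).squeeze(1), labels).item() * len(labels)
                n += len(labels)
        val_loss /= n
        if val_loss < best_loss:                               # early stopping on validation loss
            best_loss, best_state, wait = val_loss, copy.deepcopy(model.state_dict()), 0
        else:
            wait += 1
            if wait >= patience:
                break
    if best_state is not None:
        model.load_state_dict(best_state)
    return model
```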
The performance evaluation encompassed key metrics, including accuracy, precision, recall, and F1-score, as outlined in Equations (7)–(10):

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (7)$$

$$P = \frac{TP}{TP + FP} \quad (8)$$

$$R = \frac{TP}{TP + FN} \quad (9)$$

$$F1 = \frac{2 \times P \times R}{P + R} \quad (10)$$

where $P$ is precision, $R$ is recall, true positives (TP) are correctly identified Real obstacles, false positives (FP) are Fake obstacles incorrectly classified as Real, true negatives (TN) are correctly identified Fake obstacles, and false negatives (FN) are Real obstacles incorrectly classified as Fake.

The training was conducted on a Google Colab GPU (https://colab.research.google.com/, accessed on 20 December 2024). Various visualization techniques were employed to analyze generalization performance, including bar graphs, radar charts, confusion matrices, and heat maps. A comparative study was performed against state-of-the-art models under identical conditions. Analysis of Variance (ANOVA) tests were used to analyze performance variations, followed by Tukey’s Honestly Significant Difference (HSD) post hoc analysis to validate statistical significance. Additionally, the overall workflow of obstacle classification and navigation is illustrated in
Figure 6, which outlines the integration of classification and decision-making steps for autonomous navigation in orchard environments.
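For reference, the metrics in Equations (7)–(10) can be computed from confusion-matrix counts as in the short sketch below (pure Python; the example counts are illustrative only).

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall, and F1-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Illustrative counts (not results from the paper): 940 Real correctly detected,
# 60 Fake mistaken for Real, 990 Fake correctly identified, 10 Real missed.
print(classification_metrics(tp=940, fp=60, tn=990, fn=10))
```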
4. Discussion
4.1. Performance Analysis of Model Generalization Across Different Datasets
The model demonstrated strong adaptability across individual datasets, achieving a precision, recall, and F1-score of 0.96 on the orchard dataset, which included complex conditions. For Fake obstacles, it recorded a precision of 0.94, recall of 0.99, and F1-score of 0.96. For Real obstacles, the precision was 0.99, recall 0.94, and F1-score 0.96.
In contrast, the campus dataset showed slightly lower performance, with all metrics averaging 0.91. Specifically, for Fake obstacles, precision was 0.87, recall 0.94, and F1-score 0.91, while for Real obstacles, precision was 0.94, recall 0.88, and F1-score 0.91. The varied conditions in the orchard dataset contributed to improved model generalization. In contrast, the more uniform data in the campus dataset led to a higher misclassification rate for Fake obstacles. Expanding the training set to include more diverse environments is recommended to improve performance further.
When tested on the combined dataset, the orchard-trained model outperformed the campus-trained model by 2% for Fake obstacles (precision: 0.91 vs. 0.87, recall: 0.94 vs. 0.90, F1-score: 0.94 vs. 0.89) and by 4% for Real obstacles (precision: 0.95 vs. 0.92, recall: 0.96 vs. 0.91, F1-score: 0.96 vs. 0.92). This indicates the benefit of training on a more diverse dataset. The orchard-trained model excelled under challenging conditions such as varying lighting and occlusions, resulting in fewer false positives and negatives during low-light scenarios. This analysis reinforces the model’s suitability for real-time agricultural monitoring across varying illumination conditions. This ensures safer and faster autonomous navigation in dynamic environments like orchards. Furthermore, the computational efficiency analysis discussed in
Section 3.2.4 shows that while the orchard-trained model requires more computational resources, it significantly enhances both accuracy and generalization. Future research should focus on optimizing these resource needs to improve real-time performance on low-power devices. Additionally, the evaluation of the open test set [48] illustrated the model’s capacity to generalize across new obstacle categories, including tractors and overlapping objects such as tree trunks and support poles, underscoring its robustness and potential applicability in various agricultural settings.
4.2. Comparison with State-of-the-Art Models
A comparative analysis of the orchard-combined trained model against VGG16, ResNet50, MobileNetV3, DenseNet121, EfficientNetB0, and InceptionV3 shows that the proposed model outperforms these state-of-the-art architectures. It achieves an accuracy of 0.90 and excels in Fake obstacle detection with a precision of 0.91, recall of 0.94, and F1-score of 0.94. For Real obstacle detection, the model achieves a precision of 0.95, recall of 0.96, and F1-score of 0.96, which is crucial for minimizing false alarms in autonomous navigation. These results highlight the proposed model’s superior adaptability across diverse environments, reinforcing its suitability for real-time obstacle detection in agricultural robotics.
Furthermore, the orchard-combined model achieved 2.31 FPS, outperforming DenseNet121 (1.92 FPS) and EfficientNetB0 (2.01 FPS), demonstrating its efficiency for real-time agricultural applications. It also maintained reasonable memory usage (3191.65 MB), making it suitable for lightweight autonomous platforms. While MobileNetV3 offers a slightly higher FPS, it comes with lower accuracy. The orchard-combined model effectively balances accuracy and efficiency, reinforcing its suitability for real-time deployment in agricultural robotics.
4.3. Key Strengths and Limitations of the Proposed Method
The method presented in this study offers real-time obstacle detection and classification into Real or Fake obstacles. It demonstrates several key strengths, including high accuracy, strong generalization, and a competitive advantage over existing models. These qualities make it well-suited for deployment on autonomous robots, enabling faster navigation in agricultural settings while minimizing delays and errors. Additionally, the open test set evaluation revealed that the proposed model can recognize previously unseen obstacles, such as tractors, across various agricultural scenarios, suggesting strong generalization capabilities and robustness. The model achieved 92.0% accuracy on the campus dataset and 95.0% in the orchard, confirming its reliability across diverse agricultural environments. The comparative analysis under different lighting conditions further validates its adaptability, particularly in orchard environments with low-light scenarios. The model’s inference speed of 2.31 FPS surpasses several well-known architectures, including InceptionV3, DenseNet121, and MobileNetV3, striking an optimal balance between accuracy and efficiency for real-time deployment. However, occasional misclassifications indicate that additional training data or architecture refinement could further enhance its performance. Addressing this limitation would improve its applicability for open-set recognition tasks in dynamic agricultural environments.
Despite these strengths, the approach has certain limitations. The training phase exhibited fluctuations in validation loss, indicating potential optimization challenges and occasional instability. While the complex orchard dataset contributed to enhanced generalization, the more straightforward campus dataset may have restricted the model’s adaptability to less complex environments. Addressing these issues by fine-tuning hyperparameters and introducing additional diverse training data could further improve robustness. The dataset was collected under varying lighting conditions—early morning, afternoon, and evening—which adds to its robustness. However, it does not take into account factors such as temperature and adverse weather conditions (e.g., rain and fog). Expanding the dataset to include these elements could significantly improve the model’s applicability in real-world situations. Additionally, handling occluded or overlapping objects remains a challenge, and future refinements could focus on improving object recognition in dense foliage or cluttered environments. Misclassifications between Real and Fake obstacles in such scenarios may lead to operational inefficiencies. For example, falsely identifying benign elements like overhanging branches as Real obstacles can trigger unnecessary avoidance behavior, while failing to detect actual obstacles could pose safety risks. Addressing this edge case remains essential for ensuring both smooth navigation and reliable decision-making in densely vegetated environments. Future efforts will focus on optimizing the model for lightweight embedded systems, thereby enhancing its utility across diverse agricultural scenarios.
4.4. Practical Implications
This research underscores the significance of advanced CNN architectures for autonomous obstacle classification. Integrating Ghost Modules and SE Blocks enhances computational efficiency, making the model ideal for real-world deployment on autonomous robots. Additionally, this study highlights the practical advantages of autonomous robots in agriculture, showcasing high precision and recall for obstacle detection, which is crucial for minimizing disruptions and collisions.
For instance, during orchard harvesting, the model effectively distinguishes between Real obstacles—physical objects that must be avoided, such as tree trunks, humans, or machinery—and “Fake” obstacles—visual artifacts or benign elements like overhanging branches or tall weeds—thereby optimizing autonomous navigation and reducing operational downtime. Furthermore, the model’s robust performance under varying lighting conditions makes it suitable for low-light environments, which are common in agricultural operations. Its strong generalization capability suggests potential applications in automated pesticide spraying, fruit harvesting, and field monitoring, further advancing precision agriculture. To explore multi-modal data fusion, our future work will utilize depth data from the Intel RealSense D455 camera for close-range obstacle distance detection experiments. This approach is economically efficient compared to LiDAR systems. Additionally, we plan to integrate LiDAR and camera fusion for long-range scenarios to further enhance classification accuracy and robustness.
Beyond technical performance, the proposed system has broader implications for sustainable agriculture. Automating obstacle detection and navigation can significantly reduce reliance on manual labor during time-intensive tasks like harvesting and monitoring, ultimately lowering labor costs and improving operational efficiency in orchards. Furthermore, since the system operates entirely on visual data captured in open environments, it minimizes concerns around privacy or ethical data use, making it a responsible choice for deploying AI in agricultural fields.
5. Conclusions
This study contributes to autonomous robotics by developing a CNN model that accurately classifies Real and Fake obstacles, enhancing navigation safety. The system effectively differentiates between Real and Fake obstacles, enabling faster autonomous decision-making and seamless navigation. Its onboard framework optimizes computational efficiency, ensuring reliable deployment in agricultural environments. Building on our previous work with YOLOv8-based detection, this research introduces an efficient architecture tailored for real-time applications.
The findings demonstrate that incorporating advanced CNN components and hyperparameter optimization significantly improves performance. In real-time evaluations, the model achieved high accuracy (92.0% on campus, 95.0% in the orchard) across different environments, confirming its reliability for real-world deployment. The comparative analysis under different lighting conditions validates the model’s adaptability, particularly in orchard environments with common low-light scenarios. The proposed model effectively addresses the challenges of varying illumination, ensuring reliable performance in real-time agricultural applications. The evaluation of an open test set further confirmed its robustness and adaptability for real-world deployment.
The orchard-combined model achieved 2.31 FPS, demonstrating superior inference speed compared to all other comparative models, including InceptionV3 (2.18 FPS), DenseNet121 (1.92 FPS), VGG16 (1.61 FPS), ResNet50 (2.07 FPS), MobileNetV3 (1.91 FPS), and EfficientNetB0 (2.01 FPS). While lightweight models like MobileNetV3 offer slightly lower FPS, the orchard-combined model strikes an optimal balance between accuracy and efficiency, making it a strong candidate for real-time agricultural robotics.
Moreover, the analysis of computational efficiency (
Section 3.2.4) underscores the significance of inference speed and memory consumption in selecting models for deployment on low-power devices. By ensuring that the system remains efficient in practical agricultural settings, the orchard-combined model stands out for its ability to meet the demands of real-time applications. Future work will focus on enhancing the model’s ability to handle occluded or overlapping objects by incorporating more diverse training data and refining the architecture. In particular, incorporating weather variability, such as rain, fog, and temperature changes, will be essential for achieving reliable operation in unpredictable field conditions, further improving the model’s adaptability in real-world agricultural settings. Future improvements will also optimize inference speed and evaluate performance in additional agricultural environments. The implications for sustainable agriculture are notable, particularly in improving the efficiency and safety of agricultural robots for tasks such as fruit harvesting, weeding, spraying, planting, and field monitoring. While minor misclassifications occurred in dense foliage, future improvements will focus on enhancing robustness. This research advances autonomous navigation, providing a practical solution for real-time obstacle classification in agricultural environments.