1. Introduction
The fruit-growing industry encounters significant challenges during the annual orchard harvest season, striving to ensure timely delivery of fresh produce to the market. This challenge is particularly pronounced in apple harvesting, a labor-intensive and time-consuming endeavor [1]. Moreover, escalating labor costs exacerbate the inefficiency, expense, and risk associated with manual harvesting [2,3]. Consequently, to enhance agricultural productivity and supplant manual apple picking in orchards, the development of robots capable of autonomous and intelligent operation in orchard environments is crucial [4,5].
Apple-picking robots consist primarily of two key subsystems: the vision system and the picking execution system. Vision serves as the cornerstone of information perception, and accurate target recognition by the vision system is a prerequisite for grasping and picking apple targets [6]. The accuracy and efficiency of the vision system, encompassing both recognition and localization, profoundly influence the picking execution system and directly determine picking efficiency; vision has therefore emerged as the pivotal means by which robots perceive their surroundings [7]. In recent decades, propelled by the continual advancement of precision agriculture technology, orchard robots have been increasingly applied in the agricultural domain. However, the practical deployment of highly efficient apple-picking robots remains limited [8]. Although diverse, exemplary target detection models capable of recognizing various target types are available, their applicability in specific apple-picking scenarios remains constrained [9]. This constraint is primarily attributable to the complex natural scenes characteristic of unstructured apple orchards: fluctuating lighting conditions (e.g., sunny, cloudy, backlit), occlusion of light by foliage and fruit that casts shadows and light spots on apple surfaces, and overlapping fruit all significantly impede accurate identification of apple targets, compromising precise target positioning and execution of the picking task [10]. Enhancing recognition accuracy in complex scenes thus represents the crux and challenge of vision technology for picking robots.
Currently, two primary approaches dominate computer vision: recognition methods grounded in prior knowledge models and data-driven deep learning methods, as illustrated in Figure 1. Model-based recognition algorithms entail direct target feature engineering, constructing features such as the Scale-Invariant Feature Transform (SIFT), Histogram of Oriented Gradients (HOG), and Speeded-Up Robust Features (SURF). These engineered features are then combined with machine learning techniques such as the K-means clustering algorithm [11], the region-growing method [12], and Support Vector Machines (SVMs) [13] to perform classification and semantic segmentation, enabling target recognition amidst complex backgrounds. Such algorithms are easy to explain and comprehend, owing to their transparent treatment of the data and underlying algorithms, which facilitates straightforward parameter tuning and model design adjustments [10].
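As a concrete illustration of this model-based style, the sketch below pairs a HOG descriptor with an SVM classifier. The window size, HOG parameters, and stand-in data are illustrative assumptions, not settings taken from the cited works.

```python
# Minimal sketch of a model-based pipeline: hand-engineered HOG features
# fed to an SVM classifier (illustrative window size, parameters, and data).
import numpy as np
from skimage.feature import hog
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Stand-in data: 20 grayscale 64x64 windows with binary apple/background labels
patches = rng.random((20, 64, 64))
labels = rng.integers(0, 2, size=20)

def hog_descriptor(patch):
    # 9-bin HOG over 8x8 cells, normalized in 2x2 cell blocks
    return hog(patch, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), feature_vector=True)

X = np.array([hog_descriptor(p) for p in patches])
clf = SVC(kernel="rbf").fit(X, labels)
print(clf.predict(X[:3]))  # classify the first three windows
```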
Manickam and Chithra [14] converted RGB color images of apple fruits into grayscale representations. They then delineated the region of interest within these apple images by applying five distinct fuzzy clustering bins with overlapping pixel ranges; selecting the cluster with the maximal pixel count enabled computation of the threshold value. Li and Jing [15] proposed an apple target recognition method grounded in K-means feature clustering. Operating within the L*a*b* color space, this method harnessed K-means clustering to partition the image, leveraging the a* component to discern the target from the background and thereby enhance segmentation accuracy. Gill et al. [16] designed an objective function predicated on cross entropy and employed a teacher-learner-based optimization algorithm to minimize it; this optimization yielded optimal thresholds at varying levels, which were then used to segment red, green, and golden apple images. Zou et al. [17] introduced a color-index-based approach to apple image segmentation that automatically determines a color index suited to the segmentation task. Wang et al. [18] introduced a kernel density clustering algorithm (KDC) that employs simple linear iterative clustering to partition the apple image into irregular blocks and then merges similar pixels within confined regions into superpixel regions. In scenarios with notable color disparities between fruit and background, color information is the most direct discriminator for distinguishing ripe fruit from the background. Nevertheless, as in the aforementioned studies, despite achieving high apple recognition rates, the intricate natural environment of orchards poses significant challenges: when light is obstructed by branches, leaves, or other fruits, shadows and light spots form on the apple surface, perturbing the features that can be extracted or learned from the image. Algorithms must therefore account for the complex characteristics of orchards to further refine recognition performance [19].
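For concreteness, the sketch below reproduces the general shape of such a color-clustering segmenter, in the spirit of (though not identical to) the method of Li and Jing [15]: K-means on the a* channel of CIELAB, with the redder cluster taken as fruit. The input path is a placeholder.

```python
# Sketch of color-clustering segmentation: K-means on the a* channel of
# CIELAB, where red fruit separates from green foliage (illustrative only).
import cv2
import numpy as np

img = cv2.imread("apple.jpg")                      # placeholder image path
lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)
a_channel = lab[:, :, 1].reshape(-1, 1).astype(np.float32)

# Two clusters: fruit vs. background
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
_, labels, centers = cv2.kmeans(a_channel, 2, None, criteria, 5,
                                cv2.KMEANS_PP_CENTERS)

# The cluster with the larger a* center is "redder" and taken as fruit
fruit_cluster = int(np.argmax(centers))
mask = (labels.reshape(img.shape[:2]) == fruit_cluster).astype(np.uint8) * 255
cv2.imwrite("apple_mask.png", mask)
```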
Existing CNN-based fruit recognition techniques rest on the principle of anchor-box detection and fall into two-stage and one-stage modalities. Two-stage methodologies divide the fruit recognition problem into two discrete steps: the network first generates candidate regions potentially containing fruit, and then classifies these candidate regions. Networks under this paradigm typically exhibit protracted recognition times. Representative algorithms include RCNN (Regions with CNN features), Fast RCNN, and Faster RCNN [20,21,22]. Rahnemoonfar and Sheppard [23] modified the Inception-ResNet architecture, counting distinct fruits with an average accuracy approaching 91%; however, accuracy diminishes markedly when fruits overlap or occlude one another. Tong et al. [2] integrated Swin-Transformer and ResNet50 as backbone models with Mask R-CNN and Cascade Mask R-CNN to detect and segment trunks, major branches, and supporting structures within apple trees. Xiong et al. [24] trained the Faster R-CNN network model to recognize citrus fruits, achieving a test accuracy surpassing 85% on the validation set; identification errors and omissions occur primarily because of the model's inadequate generalization capability and insensitivity to smaller fruits.
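The two-stage pipeline is conveniently illustrated with an off-the-shelf detector; the sketch below runs torchvision's pretrained Faster R-CNN on a stand-in tensor, whereas a working orchard system would be fine-tuned on annotated apple images.

```python
# Sketch of two-stage detection with an off-the-shelf Faster R-CNN from
# torchvision (pretrained on COCO, not on apples).
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
image = torch.rand(3, 480, 640)     # stand-in for a normalized RGB frame

with torch.no_grad():
    # Stage 1 proposes candidate regions; stage 2 classifies and refines them
    out = model([image])[0]

keep = out["scores"] > 0.5          # keep confident detections only
print(out["boxes"][keep], out["labels"][keep])
```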
One-stage recognition directly employs the CNN to predict fruit confidence and location, facilitating faster recognition speeds. Representative algorithms include YOLO and SSD [25,26]. In [27], Kang and Chen achieved apple recognition by devising a lightweight backbone network, enabling a recognition time of approximately 0.028 s per image. Although this method enhances recognition speed, the anchor box size must be preset, and the model's Average Precision (AP) for apple recognition stands at 85.3%. In [23], Zhao et al. improved the YOLO convolutional neural network to recognize apples, achieving a single-image recognition time of about 0.017 s; however, the recognition accuracy for occluded and overlapping apples is 75.15%. Similarly, Tian et al. [28] enhanced the YOLO convolutional neural network, also attaining a recognition time of around 0.017 s, yet encountered a comparable accuracy challenge with occluded and overlapping apples. Tian et al. further improved the YOLOv3 algorithm by replacing the feature extraction backbone with DenseNet to recognize apples across different growth stages, although recognition in denser scenes remains untested. In [29], Wang et al. proposed a coordinated control strategy for long and short distances: in the long-distance stage, YOLOv5 is combined with the DBSCAN point cloud clustering method to determine the target position; in the short-distance stage, Mask RCNN is used to segment bifurcated branches, achieving a positioning success rate of 88.46% for the branches but showing poor learning ability for small bifurcated branches. In [30], Wang et al. used the YOLOv8-Seg model to segment lychee fruits and branches, achieving a success rate of 88% in identifying picking points, with an average positioning error of 2.8511 mm and an average recognition time of 0.082 s.
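The one-stage pattern can be sketched with the Ultralytics YOLO interface, on which several of the cited works build; the weight file and image path below are placeholders, not artifacts of those studies.

```python
# Sketch of one-stage detection: a single forward pass yields class,
# confidence, and box simultaneously (placeholder weights and image).
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                    # lightweight nano variant
results = model("orchard_scene.jpg", conf=0.5)

for r in results:
    for box in r.boxes:
        print(int(box.cls), float(box.conf), box.xyxy.tolist())
```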
Network architectures must be designed for lightweight, real-time operation to permit deployment on robots. Researchers have therefore favored lightweight networks such as the YOLO family and GhostNet, enhancing detection accuracy while mitigating computational overhead. Zhang et al. [29] proposed an enhanced YOLOv4 that amalgamates GhostNet with a coordinate attention module and depth-separable convolution; this optimization of the neck and YOLO head enhances detection accuracy while reducing computational demands. Sekharamantry et al. [30] introduced a YOLOv5 architecture incorporating an adaptive pooling scheme and an attribute enhancement model, elevating feature quality for apple detection amidst complex backgrounds; the introduction of a bounding box loss function additionally ensures precise bounding boxes, maximizing detection accuracy. In [31], Chen et al. proposed a series of vision algorithms for picking robots aimed at motion target estimation, real-time self-localization, and dynamic picking, enabling the robot to operate continuously and autonomously; however, it performs poorly at night. In [32], Li et al. introduced the YOLOv7-Litchi algorithm, which integrates channel and spatial attention along with multi-head self-attention from transformers, achieving recall and accuracy rates of 95.9% and 94.6%, respectively; the algorithm is characterized by fast detection speed and strong robustness. In [33], Shu et al. proposed an algorithm that corrects image colors according to lighting conditions by computing the Hue color layer and analyzing the external features and position data of lychees, effectively addressing the occlusion problem of neighboring lychees.
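A key ingredient of the GhostNet family referenced above is the Ghost module, which synthesizes part of an expensive convolution's output with cheap depthwise operations. The following is a minimal sketch of that idea, not the exact block used in the cited detectors.

```python
# Minimal Ghost module in the spirit of GhostNet: a cheap depthwise conv
# synthesizes half the output channels from the other half, cutting FLOPs.
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    def __init__(self, in_ch, out_ch, ratio=2):
        super().__init__()
        intrinsic = out_ch // ratio
        self.primary = nn.Sequential(            # ordinary pointwise conv
            nn.Conv2d(in_ch, intrinsic, 1, bias=False),
            nn.BatchNorm2d(intrinsic), nn.ReLU(inplace=True))
        self.cheap = nn.Sequential(              # depthwise: one filter per channel
            nn.Conv2d(intrinsic, intrinsic, 3, padding=1,
                      groups=intrinsic, bias=False),
            nn.BatchNorm2d(intrinsic), nn.ReLU(inplace=True))

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)

print(GhostModule(16, 32)(torch.rand(1, 16, 56, 56)).shape)  # (1, 32, 56, 56)
```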
Real-time performance constitutes another pivotal metric. By refining network structures (for instance, substituting standard convolutional modules with lightweight counterparts) and incorporating mechanisms such as attention modules and bounding box loss functions, both detection speed and accuracy have been enhanced. These methods enable robots to identify fruit targets swiftly and accurately in complex environments. Ji et al. [34] devised an apple detection method based on Shufflenetv2-YOLOX, utilizing YOLOX-Tiny with lightweight Shufflenetv2 as the backbone, alongside a convolutional block attention module (CBAM) and an adaptive spatial feature fusion (ASFF) module to bolster detection accuracy. Lee et al. [35] proposed leveraging RandAugment (RA) to elevate crop detection performance through geometric, photometric, and partial-occlusion transformations, optimizing computational efficiency to alleviate the burden on mobile platforms. In conclusion, the research and design of intelligent fruit-picking robotic vision systems necessitate the integration of lightweight, real-time, and accurate methodologies to address the demands of modern agriculture, thereby enhancing picking efficiency and reducing costs.
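As a minimal sketch of the channel-attention half of CBAM, of the kind Ji et al. [34] attach to a lightweight backbone, the module below reweights feature channels using a shared MLP over average- and max-pooled descriptors; the tensor sizes are illustrative.

```python
# Channel attention in the CBAM style: average- and max-pooled channel
# descriptors pass through a shared MLP, and their sum gates the channels.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))       # global average pooling branch
        mx = self.mlp(x.amax(dim=(2, 3)))        # global max pooling branch
        w = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        return x * w                             # reweight informative channels

feat = torch.rand(2, 64, 40, 40)
print(ChannelAttention(64)(feat).shape)          # (2, 64, 40, 40)
```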
In our prior study, we examined the color attributes and local variations within apple images, constructing feature engineering based on the intrinsic characteristics of the recognition target. Accordingly, a color prior model-based algorithm was developed for apple image recognition, particularly adept at handling variations in lighting and shading, thereby rendering apple recognition more interpretable and comprehensible [10]. However, segmentation is weak when the target is obscured by branches, leaves, or overlapping fruit. Conventional model-based apple recognition methods often fail to segment apples under such conditions, highlighting a common deficiency in these approaches. In our earlier work, we also proposed a framework for further exploration, advocating the fusion of prior knowledge-based approaches with the strengths of deep learning methods for more reasoned explanation and broader applicability. Accordingly, our team leveraged an improved CenterNet neural network model, achieving an average recognition accuracy of 96.26% on one class of the test set (dense scenes, i.e., long-distance scenes) and 92.47% on another class (scenes characterized by smooth lighting, backlighting, occlusion, and overlapping phenomena), a decrease of 3.79% [36]. While transfer learning can potentially address this issue, extreme lighting conditions, whether excessively strong or weak, may adversely affect recognition outcomes [37]. Furthermore, an improved backbone feature extraction network prototype, known as Hourglass, was adopted. Originally devised by the CenterNet designers, the Hourglass network possesses a symmetric structure designed to leverage multi-scale features for recognizing complex target poses [38]. Notably, upsampling and delayed downsampling are each followed by convolution, resulting in a substantially increased computational load. This network, characterized by its substantial size even with a reduced number of layers, remains computationally intensive. Consequently, its utility in orchard detection hinges significantly on hardware capabilities, with discernible trade-offs in loading and operational speed. The suitability of such architectures warrants careful consideration, because apple recognition involves a single category and does not necessitate an overly powerful backbone network.
Addressing the challenges posed by complex natural scenes in apple orchards (e.g., variations in lighting, backlighting, occlusions, and overlapping phenomena) to the vision system of picking robots, we propose an apple recognition method based on a color prior model, building upon our previous research. Specifically, we introduce gray-centered color space vertical decomposition maps, leveraging the prior knowledge that these maps can mitigate the impact of light and shadows. By splicing these maps with the original image, we aim to bolster the picking robots' recognition rates, enabling them to better withstand the challenges posed by complex scenes characterized by light variations, shadows, and occlusions.
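One plausible reading of such a decomposition, offered here purely as an illustrative assumption (the exact construction in this paper may differ), is to shift RGB so that mid-gray is the origin, keep the component of each pixel vector orthogonal ("vertical") to the achromatic axis, and splice it with the original image:

```python
# Illustrative sketch only: a gray-centered decomposition in which each RGB
# vector (shifted so mid-gray is the origin) is split into a brightness
# component along the achromatic axis and the orthogonal "vertical"
# component, which is less sensitive to illumination changes.
import numpy as np

def gray_centered_vertical(img):                   # img: HxWx3 RGB in [0, 255]
    v = img.astype(np.float32) - 128.0             # move mid-gray to the origin
    axis = np.ones(3, np.float32) / np.sqrt(3.0)   # achromatic (gray) axis
    along = (v @ axis)[..., None] * axis           # brightness component
    return v - along                               # orthogonal chromatic part

img = np.random.randint(0, 256, (480, 640, 3))     # stand-in orchard frame
vertical = gray_centered_vertical(img)
spliced = np.concatenate([img, vertical], axis=-1) # 6-channel network input
print(spliced.shape)                               # (480, 640, 6)
```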
Furthermore, to address issues such as slow system response, bulky model size, challenging deployment, and slow loading encountered during the operation of picking robots in orchard settings, we introduce Light-Weight Net, a lightweight feature extraction network. Drawing inspiration from the "Objects as Points" concept in human pose detection, as well as from grouped convolution and depth-separable convolution, our objective is to develop a recognition model characterized by high detection rates, low computational demands, and a compact footprint, thereby facilitating its deployment within robotic systems.
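Of the ingredients just listed, depth-separable convolution is the main source of computational savings; the sketch below shows the standard factorization into a grouped (per-channel) spatial filter followed by a pointwise mix, with an illustrative parameter count. The channel sizes are examples, not the configuration of Light-Weight Net.

```python
# Depth-separable convolution: a per-channel (grouped) 3x3 spatial filter
# followed by a 1x1 pointwise channel mix, replacing one dense 3x3 conv.
import torch
import torch.nn as nn

def depthwise_separable(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False),  # depthwise
        nn.Conv2d(in_ch, out_ch, 1, bias=False),                          # pointwise
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

block = depthwise_separable(32, 64)
print(block(torch.rand(1, 32, 112, 112)).shape)    # (1, 64, 112, 112)
# Conv weights: 3*3*32 + 32*64 = 2336 parameters, versus 3*3*32*64 = 18432
# for a standard 3x3 convolution; roughly an 8x reduction.
```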
The subsequent sections of this paper unfold as follows:
Section 2 describes the network design process, encompassing dataset acquisition, generation, and augmentation, alongside the conceptualization of the network design and its implementation steps.
Section 3 elucidates the experimental findings, including analysis and discussion of the apple localization method leveraging the RealSense D435 depth camera.
Section 4 conducts a comparative analysis of the recognition performance of various convolutional neural networks for apple targets, culminating in the establishment of a platform for evaluating the accuracy of the localization method proposed herein in orchard settings. Finally, our conclusions are articulated.