1. Introduction
Yellow peaches are favored by consumers for their nutritional composition and flavor [1]. As living standards rise, people are paying more attention to their health, and demand for yellow peaches has shifted from quantity to quality. Consequently, a growing number of farmers are adopting new technologies to manage modern orchards, and accurately counting immature yellow peaches has become crucial [2]. In practice, statistics on immature yellow peaches allow growers to optimize decisions about purchasing bagging materials and hiring workers [3]. Furthermore, such data can significantly enhance both the yield and quality of yellow peaches while improving orchard management.
However, because of the complexity of the orchard environment, manual estimation remains the primary method of counting peaches [4]. This method is inaccurate, inefficient, and costly, making it poorly suited to the intelligent management of large yellow-peach orchards. In response to these challenges, researchers have conducted in-depth studies. In particular, with the continuous advancement of computer vision, an increasing number of researchers have applied visual-detection technology to in situ fruit counting [5]. The YOLO family of algorithms is especially favored by researchers and engineers for its rapid detection speed [6]. In 2021, Song et al. [7] replaced the last three down-sampling layers of the Darknet53 feature-extraction network in YOLOv3 with the dense connection mechanism of DenseNet to enhance feature propagation and feature reuse; the mean accuracy of green-citrus identification reached 80.98%. In 2022, Hao et al. [8] introduced a hybrid data-augmentation method for detecting green walnuts and replaced the backbone of YOLOv3 with MobileNet-v3, achieving a mean accuracy of 94.52%. Song et al. [9] improved YOLOv5 to recognize oil fruits in natural scenes, achieving a mean accuracy of 98.71%. Zhang et al. [10] added a transformer module with an attention mechanism, yielding a 3.77% increase in mAP. Lv et al. [11] added a stripe attention module to the backbone of YOLOv5, enabling the model to attend more closely to the stripe-shaped sleeves of citrus fruits and to branches; they also adopted a semi-supervised teacher-student training scheme, enabling the detector to exploit unlabeled samples, which improved performance and reduced reliance on labeled data. The improved algorithm reached mean accuracies of 77.4% and 53.5% for the detection of sleeved citrus and branches, respectively. Xie et al. [12] added an attention module and a small-target-detection layer to YOLOv5 and modified its loss function, yielding a 12.9% improvement in mean accuracy on a litchi dataset. Zhang et al. [13] developed a fruit-counting algorithm based on YOLOX that uses scene-specific sample augmentation. Tang [14] developed YOLOv7-Plum to detect plums in natural environments. In 2023, our team proposed an improved YOLOv7-based scheme for yellow-peach detection; by incorporating a CA attention-mechanism module and modifying the loss function, it achieved a mean detection accuracy of 80.4% [15].
The above studies demonstrated the effectiveness of the YOLO algorithm for fruit detection in orchards. Despite this, in immature peach orchards, detecting targets using the original YOLO network is challenging due to factors such as complex and varied backgrounds, small fruit sizes, and frequent occlusions. Thus, further improvements are needed to enhance its target-detection ability.
The YOLOv8 algorithm, released in January 2023, established a new state of the art (SOTA) for target-detection models. To further improve the recognition rate of small, immature yellow peaches, a novel detection algorithm based on YOLOv8, called EMA-YOLO, is proposed. The main contributions of this study are as follows:
① Introduction of the EMA (Efficient Multi-Scale Attention) module to encode global information and further aggregate pixel-level features through cross-dimensional interaction.
② Combination with a 160 × 160-scale detection head to enhance small-target-detection capability.
③ Employment of EIoU (Efficient Intersection over Union) as the loss function to reduce the rates of missed and false detections of small yellow-peach targets in dense environments (a standard formulation of this loss is given after this list).
These improvements are tailored to address the specific challenges posed by detecting immature small yellow peaches in natural environments.
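For reference, the EIoU loss named in contribution ③ is commonly written as follows; this formulation follows the original EIoU literature rather than an equation reproduced from this paper. Here ρ(·) denotes Euclidean distance, b and b^{gt} the predicted and ground-truth box centers, w and h the box width and height, and c, C_w, C_h the diagonal, width, and height of the smallest box enclosing both boxes:

```latex
\mathcal{L}_{\mathrm{EIoU}}
  = 1 - \mathrm{IoU}
  + \frac{\rho^{2}\left(\mathbf{b}, \mathbf{b}^{gt}\right)}{c^{2}}
  + \frac{\rho^{2}\left(w, w^{gt}\right)}{C_{w}^{2}}
  + \frac{\rho^{2}\left(h, h^{gt}\right)}{C_{h}^{2}}
```

The three penalty terms separately target overlap, center distance, and width/height mismatch, which is what makes the loss well suited to tightly packed, heavily overlapping boxes.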
4. Experimental Results and Analysis
4.1. Experimental Results
The experimental results demonstrate that the improved EMA-YOLO model achieved a precision (P) of 0.836 and a recall (R) of 0.744, corresponding to an F1 score of 0.787. The precision, recall, mAP, and loss curves of YOLOv8 and EMA-YOLO are compared in Figure 7, which shows that precision, recall, and mAP all improved and that the loss converges faster. Sample detection results from the orchard are shown in Figure 8.
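These three values are mutually consistent under the standard F1 definition:

```latex
F_{1} = \frac{2PR}{P + R}
      = \frac{2 \times 0.836 \times 0.744}{0.836 + 0.744}
      \approx 0.787
```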
4.2. Ablation Experiment
Ablation experiments were conducted on EMA-YOLO to evaluate the impact of each improvement; the results are shown in Table 1. The findings suggest that data augmentation alone yielded a 1.6% improvement in mAP, demonstrating its effectiveness in expanding the sample space and improving detection accuracy through greater sample diversity. Additionally, integrating the EMA attention-mechanism module yielded a further 1.1% improvement in mAP, highlighting its ability to strengthen feature extraction and overall network accuracy.
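For illustration, the sketch below follows the published EMA design (efficient multi-scale attention with cross-spatial learning). The grouping factor and layer choices mirror the public reference implementation and are assumptions, not a reproduction of this paper's exact code:

```python
import torch
import torch.nn as nn

class EMA(nn.Module):
    """Efficient Multi-Scale Attention with cross-spatial learning (sketch)."""
    def __init__(self, channels, factor=8):
        super().__init__()
        self.groups = factor
        g = channels // factor                            # channels per group
        self.softmax = nn.Softmax(dim=-1)
        self.agp = nn.AdaptiveAvgPool2d((1, 1))           # global pooling
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))     # pool along width
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))     # pool along height
        self.gn = nn.GroupNorm(g, g)
        self.conv1x1 = nn.Conv2d(g, g, kernel_size=1)
        self.conv3x3 = nn.Conv2d(g, g, kernel_size=3, padding=1)

    def forward(self, x):
        b, c, h, w = x.size()
        gx = x.reshape(b * self.groups, -1, h, w)         # split channels into groups
        # 1-D directional pooling encodes global context along H and W
        x_h = self.pool_h(gx)                             # (bg, c/g, h, 1)
        x_w = self.pool_w(gx).permute(0, 1, 3, 2)         # (bg, c/g, w, 1)
        hw = self.conv1x1(torch.cat([x_h, x_w], dim=2))
        x_h, x_w = torch.split(hw, [h, w], dim=2)
        # 1x1 branch: re-weight the group by both directional attentions
        x1 = self.gn(gx * x_h.sigmoid() * x_w.permute(0, 1, 3, 2).sigmoid())
        # 3x3 branch captures local multi-scale context
        x2 = self.conv3x3(gx)
        # Cross-spatial learning: each branch attends over the other's pixels
        x11 = self.softmax(self.agp(x1).reshape(b * self.groups, -1, 1).permute(0, 2, 1))
        x12 = x2.reshape(b * self.groups, c // self.groups, -1)
        x21 = self.softmax(self.agp(x2).reshape(b * self.groups, -1, 1).permute(0, 2, 1))
        x22 = x1.reshape(b * self.groups, c // self.groups, -1)
        weights = (x11 @ x12 + x21 @ x22).reshape(b * self.groups, 1, h, w)
        return (gx * weights.sigmoid()).reshape(b, c, h, w)

# Usage: attn = EMA(256); y = attn(torch.randn(2, 256, 40, 40))  # y has x's shape
```

The module is shape-preserving, so it can be dropped after any backbone or neck stage without altering downstream layers.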
Furthermore, appending the detection head yielded a 0.9% improvement in mAP, since the added P2 detection head is better suited to small-object detection and mitigates the loss of small-object information as network depth increases.
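As an illustration of how such a head is typically enabled, the Ultralytics framework ships P2-head model configurations; the sketch below assumes that framework, and the dataset YAML name and training arguments are hypothetical:

```python
from ultralytics import YOLO

# Load a YOLOv8 variant whose head includes the high-resolution P2 level
# (a 160 x 160 feature map on a 640 x 640 input), which favors small objects.
model = YOLO("yolov8s-p2.yaml")

# Train on a custom dataset described by a YAML file (path is illustrative).
model.train(data="yellow_peach.yaml", imgsz=640, epochs=100, batch=16)
```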
Moreover, introducing the EIoU loss function yielded a 0.6% improvement in mAP. In addition, incorporating focal loss effectively addressed sample imbalance in the bounding-box regression task by prioritizing high-quality anchor boxes over those with minimal overlap with the target boxes.
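A minimal PyTorch sketch of the EIoU term with the focal weighting described above, following the public Focal-EIoU formulation; the gamma value and the (x1, y1, x2, y2) box format are assumptions:

```python
import torch

def focal_eiou_loss(pred, target, gamma=0.5, eps=1e-7):
    """Focal-EIoU loss for boxes in (x1, y1, x2, y2) format, shape (N, 4)."""
    # Intersection area
    ix1 = torch.max(pred[:, 0], target[:, 0])
    iy1 = torch.max(pred[:, 1], target[:, 1])
    ix2 = torch.min(pred[:, 2], target[:, 2])
    iy2 = torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)

    # Union area and IoU
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Smallest enclosing box and its squared diagonal
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # Squared distance between box centers
    px = (pred[:, 0] + pred[:, 2]) / 2
    py = (pred[:, 1] + pred[:, 3]) / 2
    tx = (target[:, 0] + target[:, 2]) / 2
    ty = (target[:, 1] + target[:, 3]) / 2
    rho2 = (px - tx) ** 2 + (py - ty) ** 2

    # Width/height penalties normalized by the enclosing box dimensions
    w_p, h_p = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    w_t, h_t = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    eiou = (1 - iou) + rho2 / c2 \
         + (w_p - w_t) ** 2 / (cw ** 2 + eps) \
         + (h_p - h_t) ** 2 / (ch ** 2 + eps)

    # Focal weighting: down-weight low-IoU (low-quality) boxes
    return (iou.detach() ** gamma * eiou).mean()
```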
The ablation results demonstrate a significant improvement in the model's mean average precision (mAP). When the shallow features extracted in the neck were fused with the contextual information extracted by the EMA attention-mechanism module and fed into the small-target detection head, mAP increased by 2%. This improvement stems mainly from the fact that our homemade yellow-peach dataset contains a large number of heavily occluded yellow peaches; when EMA is combined with the small-target detection head, the network detects more small targets and accuracy rises. On this basis, the loss function was replaced with EIoU to further improve the model, allowing it to locate each yellow peach accurately in dense environments and reducing the rates of missed and false detections under severe occlusion. Together, data augmentation, the EIoU loss function, the attention module, and the additional detection head raised the mAP of the YOLOv8 model for yellow-peach detection from 79.9% to 84.1%, thereby achieving superior performance.
4.3. Comparison of Different Networks
In order to demonstrate the advantages of the EMA-YOLO model, we compared its performance with that of other common object-detection models: the classical regression-based Single Shot MultiBox Detector (SSD) [29] with a VGG-16 backbone, ObjectBox [30], and the YOLO series [31], including YOLOv7-Peach [15]. The results of the comparison are presented in Table 2.
The table shows that the EMA-YOLO model achieves superior precision and recall compared with the other models, with an mAP of 84.1%. Specifically, EMA-YOLO improves precision by 1.5% and recall by 3.6% relative to YOLOv8, indicating fewer missed detections and better overall accuracy. The orchard environment presents complexities such as background leaves similar in color to immature peaches, occlusions from densely distributed fruit, and numerous small targets; the improved model addresses these issues through the EMA attention mechanism and the additional small-target detection head, producing a marked improvement in recall. While SSD attains high precision on this dataset, its low recall limits its mAP to only 54.0%. ObjectBox, a recently proposed anchor-free detector, achieves a precision of 83.8% and a recall of 61.4%, yet its mAP is only 69.9%.
In summary, the EMA-YOLO model successfully balances high precision and high recall, meeting the project's requirements.
4.4. Comparison at Different Shooting Distances
When capturing images in a natural environment, it is nearly impossible to maintain constant camera angles and shooting distances, so objects of different sizes must be detected effectively. Moreover, manual counting may overlook some small yellow peaches. It is therefore essential to verify the model's performance on yellow-peach images taken at different shooting distances.
Figure 9 shows the detection results of YOLOv8 and EMA-YOLO, and Table 3 summarizes the comparison. At short distances there is minimal disparity between the two models: in scenarios (a) and (b), EMA-YOLO achieved a perfect result with no peaches missed, while YOLOv8 missed only one peach in each scenario. At a moderate distance, YOLOv8 missed three peaches and EMA-YOLO two in scenario (a); in scenario (b), YOLOv8 missed three yellow peaches compared with EMA-YOLO's single miss. For long-distance images, missed detections become more severe: YOLOv8 overlooked seven yellow peaches in scenario (a) and ten in scenario (b), whereas EMA-YOLO overlooked only one or two. In summary, although EMA-YOLO still misses some detections, its miss rate is lower than that of YOLOv8, demonstrating its superior performance in detecting small, immature yellow peaches.
4.5. Comparison of Different Light Intensities
When working in an orchard, weather conditions must be considered. Images captured under strong light typically display higher contrast, with more pronounced shadows and highlights, and allow clearer capture of detail, whereas images acquired under low light may show reduced contrast and suffer from noise or blurring that obscures object details.
To assess the robustness of the EMA-YOLO model, we conducted tests under varying light intensities; the results are depicted in Figure 10.
Table 4 summarizes the ground truth and the numbers of peaches detected by each model. While YOLOv8 missed at most two yellow peaches under strong or moderate light, it failed to detect as many as seven or even ten under low-light conditions. In contrast, EMA-YOLO performed significantly better, missing at most three instances across all scenarios. These findings highlight the superior detection capability of EMA-YOLO.
Our analysis also revealed that images captured under weak illumination often suffer from noise or blurring, which hinders accurate target identification. By integrating the EMA attention-mechanism module into the backbone together with a dedicated small-target detection head, EMA-YOLO strengthens feature extraction across diverse types of information while prioritizing crucial features over interfering ones. This combination also reduces the loss of original information as features propagate through the network, thereby sharpening the model's focus on small targets.
4.6. Comparison of Different Densities
During image capture, it was observed that yellow peaches tend to grow in dense distributions in the natural environment, causing occlusion between fruits and between fruits and leaves. Such occlusion hinders the extraction of certain characteristics of the fruit during target detection, leading to missed identification of partially occluded peaches. To validate the superiority of the EMA-YOLO model, we therefore compared its detection capability with that of YOLOv8 for yellow-peach targets at different densities, as shown in Figure 11.
The results are summarized in Table 5. At sparse densities, neither model missed any peaches in scenario (a). In scenario (b), YOLOv8 missed two objects while EMA-YOLO missed none, so EMA-YOLO outperforms YOLOv8 in this case. With a moderately dense distribution, YOLOv8 missed five and six yellow peaches in the two figures, whereas EMA-YOLO missed only two in each.
In the case of extremely dense distributions, the gap between YOLOv8 and EMA-YOLO is substantial. In scenario (a), YOLOv8 detected only 145 yellow peaches; EMA-YOLO also missed five, but still detected 17 more than YOLOv8. In scenario (b), YOLOv8 missed 14 yellow peaches, 10 more than EMA-YOLO. These results show that with very dense distributions, YOLOv8 suffers serious missed detections owing to severe occlusion and the small size of the fruit, and they again indicate EMA-YOLO's stronger detection capability.
Furthermore, a convolutional neural network loses information as features pass through successive layers, so occluded fruit in dense distributions is prone to inaccurate or missed detection. The EMA attention-mechanism module enhances EMA-YOLO's ability to extract information from occluded yellow peaches, prioritizing the retention of information that would otherwise be lost during layer-wise transmission. This allows more accurate detection of severely occluded yellow peaches in densely distributed areas and significantly improves performance under these conditions.
4.7. Comparison of Computational Load
The number of model parameters (Params) measures a model's spatial complexity and scale, so a low parameter count is an important indicator of a lightweight model. Model computation (GFLOPs) is the number of floating-point operations performed in one forward pass, expressed in billions of operations, and is used to evaluate the computing resources a model consumes. Lower computational requirements make a model more applicable to devices or scenarios with limited memory or computing power.
Table 6 summarizes the Params and GFLOPs of our model.
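For reproducibility, both metrics can be estimated with a few lines of PyTorch; the snippet below uses the third-party thop package for the operation count (its use here is an assumption, not the paper's stated tooling):

```python
import torch
from thop import profile  # pip install thop

def model_complexity(model, imgsz=640):
    """Report Params (millions) and GFLOPs for one forward pass."""
    params = sum(p.numel() for p in model.parameters())
    dummy = torch.zeros(1, 3, imgsz, imgsz)
    ops, _ = profile(model, inputs=(dummy,), verbose=False)
    # Note: thop counts multiply-accumulates; conventions for "FLOPs" vary by x2.
    print(f"Params: {params / 1e6:.2f} M, GFLOPs: {ops / 1e9:.2f}")
```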
Combining EMA with the small-target detection head enables the network to detect more small targets and improves accuracy. However, introducing the EMA module and the extra detection head inevitably increases the parameter count and the computational cost, and changing the loss function does nothing to lighten the model. Although some computation is sacrificed, the visualization results show clear improvements in small-target detection and in performance under severe occlusion. Given increasingly abundant storage and computing resources, model accuracy should be the first consideration.
4.8. Discussion
After the EMA attention-mechanism module was integrated and a small-target detection head was added to the YOLOv8 model, the experimental comparison clearly demonstrated improved small-target detection, as well as performance gains under different lighting conditions and fruit densities. This reduced missed detections and improved the overall accuracy of the object-detection algorithm. Although the EMA module and the additional detection head inevitably increased the number of parameters and model calculations, the model performs markedly better on small targets and under severe occlusion. The Grad-CAM method [32] is commonly used to improve the interpretability of neural networks by generating heat maps from weight features extracted at different layers.
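A minimal hook-based Grad-CAM sketch for a convolutional layer follows; the choice of target layer and the scalar scoring function are illustrative assumptions (a real detector such as YOLOv8 returns structured outputs, so a specific box or class score would be selected instead):

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer):
    """Compute a Grad-CAM heat map for `image` (1, 3, H, W) at `target_layer`."""
    acts, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))

    score = model(image).sum()  # illustrative scalar; pick a box/class score in practice
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()

    weights = grads[0].mean(dim=(2, 3), keepdim=True)          # pooled gradients per channel
    cam = F.relu((weights * acts[0]).sum(dim=1, keepdim=True)) # weighted activation map
    cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear", align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]
```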
As shown in Figure 12, the improved EMA-YOLO model exhibits stronger, more clearly delineated red activation over detected targets than the original YOLOv8 model, particularly for small yellow-peach targets against similarly colored backgrounds.
According to the above three groups of comparative experiments (Figure 9, Figure 10 and Figure 11), for yellow-peach images taken at short and moderate distances, under strong and moderate light, and with sparse distributions, the EMA-YOLO model misses fewer targets. The EMA-YOLO model therefore has clear advantages for yellow-peach detection in orchards and can broadly meet the needs of agricultural detection. However, when yellow peaches are densely distributed and the targets are small, EMA-YOLO's detection performance is less satisfactory, with more missed detections (see Figure 11 and Table 5). This may be because occluded yellow peaches in dense, low-resolution regions carry little feature information, so some features cannot be extracted. In view of this, future work should strengthen feature extraction from the input image to reduce the information lost as network depth increases and thereby further improve accuracy.
The YOLOv7-Peach method proposed in reference [15] has a precision of 79.3%, a recall of 73%, and a mean average precision of 80.4%. Compared with YOLOv7-Peach, our method achieves 4.3% higher precision, 1.4% higher recall, and 3.7% higher mAP on the same dataset; the higher recall in particular makes it better suited to counting, since it reduces missed detections. Reference [15] does not report computational-load metrics, so no further comparison in terms of model lightness is possible.