*3.5. Comparison of ShufflenetV2-YOLOX with Existing Apple Target Recognition Methods*

Table 5 compares the ShufflenetV2-YOLOX model proposed in this paper with existing apple detection approaches. In the FPS column, PC and Edge indicate the speeds at which each method runs on a computer and on edge devices, respectively.

As can be seen from Table 5, the ShufflenetV2-YOLOX model proposed in this paper does not achieve the highest detection accuracy, with an AP 1.4 percentage points lower than the best of the other methods reported in the literature. Two possible reasons are considered. On the one hand, the dataset used in this paper is complex, covering three scenarios, and each image contains an average of 12 apple targets, which increases the difficulty of apple detection. On the other hand, the network designed in this paper is a lightweight network that prioritizes operating speed, so it sacrifices a small amount of detection accuracy. Compared with the methods in [12,13], the improved network in this paper is more lightweight and improves the detection speed by 62 FPS and 61 FPS, respectively. The study in [10] achieves a detection speed of 30 FPS on edge devices; however, the Jetson AGX Xavier it uses is eight times more expensive than the Jetson Nano used in this paper and is therefore not cost-effective, and its AP of 83.64% is well below the 96.76% achieved here.

Compared with the methods in the literature, the ShufflenetV2-YOLOX model proposed in this paper offers clear advantages: real-time detection can be achieved while maintaining detection accuracy.


**Table 5.** Comparison between ShufflenetV2-YOLOX and existing detection methods.

#### **4. Conclusions**

To solve the problems associated with apple object detection in natural environments, this paper presented ShufflenetV2-YOLOX, an improved apple object detection method based on YOLOX-Tiny. The method was trained using a dataset of apples under daytime, bagged, and nighttime conditions. By replacing the backbone network, adding an attention mechanism, introducing adaptive feature fusion, and reducing the number of feature extraction layers, both the detection speed and the detection accuracy of the model were improved.

The AP, precision, recall, F1, and detection speed of the trained model were 96.76%, 95.62%, 93.75%, 0.95, and 65 FPS, respectively. A 6.24% improvement in AP and a 10 FPS improvement in detection speed were achieved compared with the original YOLOX-Tiny network. In addition, compared with the advanced lightweight networks YOLOv5-s, Efficientdet-d0, YOLOv4-Tiny, and Mobilenet-YOLOv4-Lite, the AP increased by 1.32%, 3.87%, 7.62%, and 3.77%, respectively, and the detection speed increased by 47 FPS, 44 FPS, 11 FPS, and 43 FPS, respectively. This shows that the feature fusion mechanism and the attention mechanism can improve the accuracy of apple detection in natural environments at a small additional computational cost. The anchor-free detector overcame the drawbacks of earlier anchor-based detectors, which are computationally intensive, and reduced the number of hyperparameters to be set and the post-processing required. At the same time, the lightweight backbone network and the use of only two feature extraction layers reduced the size of the model and increased the detection speed. On embedded devices with low computational power, such as the NVIDIA Jetson Nano, the detection speed reached 11.5 FPS, while with TensorRT acceleration, the inference speed of the TensorRT FP16 model reached 26.3 FPS at the expense of only 0.88% AP.
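As a quick consistency check, the reported F1 score of 0.95 follows directly from the precision and recall figures above, since F1 is the harmonic mean of the two (the numeric values are taken from the text; the snippet itself is only an illustrative sketch):

```python
# Consistency check: F1 as the harmonic mean of precision and recall,
# using the values reported for the trained ShufflenetV2-YOLOX model.
precision = 0.9562  # 95.62%
recall = 0.9375     # 93.75%

f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # → 0.95, matching the reported F1
```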

In summary, the proposed model offers significant advantages over other current lightweight networks in terms of detection speed and detection accuracy, and significantly improves recall for nighttime and bagged apples. It can meet the requirements of real-time, high-precision detection on embedded devices and can provide an effective solution for the vision systems of apple-picking robots.

**Author Contributions:** Conceptualization, W.J. and Y.P.; methodology, W.J. and Y.P.; software, Y.P. and J.W.; validation, J.W.; formal analysis, Y.P.; investigation, B.X.; data curation, Y.P.; resources, B.X.; writing—original draft preparation, Y.P.; writing—review and editing, W.J. and J.W.; visualization, B.X.; supervision, W.J.; project administration, W.J.; funding acquisition, W.J. and B.X. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the National Natural Science Foundation of China (No. 61973141), the Jiangsu agriculture science and technology innovation fund (No. CX(20)3059), and A Project Funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions (No. PAPD-2018-87).

**Institutional Review Board Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.
