#### *3.3. Apple Detection Effects Contrast Experiment of Different Models*

To verify the superiority of the ShufflenetV2-YOLOX model proposed in this paper, it is compared with YOLOv5-s, YOLOv4-Tiny, Efficientdet-d0, Mobilenetv2-YOLOv4-lite, and YOLOX-Tiny [7,26,27]. Figure 9 shows the apple detection results of ShufflenetV2-YOLOX and the other models in the natural environment. ShufflenetV2-YOLOX, YOLOv4-Tiny, YOLOX-Tiny, Mobilenetv2-YOLOv4-lite, and YOLOv5-s use an image input size of 416 × 416, while Efficientdet-d0 uses a fixed input size of 512 × 512 determined by its network settings. To give each model a clearer contrast, this paper takes the apples detected by all models as the total set and marks the detection effect diagram of each model: white circles indicate missed detections, and blue circles indicate false detections. The more white and blue circles, the worse the model performs. As can be seen from Figure 9, apple targets during the day are bright in color and distinct in shape, so most models perform best on unbagged apples during the day. On the other hand, the plastic bags on the surface of the apples blur their color and shape characteristics, making the target and background too similar, so bagged apples are very susceptible to missed detection. At night, apples under both strong light and low light are difficult to detect because of illumination problems. However, the ShufflenetV2-YOLOX model proposed in this paper has the fewest white and blue circles in the detection images, indicating that it has the highest recall rate. For bagged apples and nighttime images in particular, although not all targets in the image are detected, the model shows a significant advantage over the other lightweight networks. This shows that the model can effectively solve the problem of the low recall rate of apple detection networks under bagging and nighttime conditions.

**Figure 9.** Comparison of ShufflenetV2-YOLOX with other advanced networks for apple detection effects.

Figure 10 shows a comparison of the PR curves of the different models for apple detection, and Table 4 compares the AP, precision, recall, F1, parameters, and FPS of the different models. In terms of detection accuracy, YOLOv4-Tiny, a lightweight network simplified from YOLOv4, achieves an AP of 89.14%, which is close to the performance of YOLOX-Tiny. YOLOv5-s achieves one of the best detection results among current lightweight networks, with relatively high recall and precision; its AP and F1 reach 95.44% and 0.94, respectively. Mobilenetv2-YOLOv4-lite achieves an AP of 92.99%. It has the highest precision of the tested models at 95.96%, but its recall of 83.59% is low and does not meet the requirements of apple target detection. The performance of Efficientdet-d0 is similar to that of Mobilenetv2-YOLOv4-lite. The ShufflenetV2-YOLOX model proposed in this paper achieves high recognition accuracy, with an AP of 96.76% and a precision of 95.62%. In particular, its recall is the highest among all the lightweight networks, reaching 93.75%. Compared to the other models, our model can effectively detect bagged and nighttime apple targets in low-resolution images, which accounts for its high recall rate.
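The precision, recall, F1, and AP values compared above follow their standard definitions. As a minimal sketch (the counts below are made-up examples, not the paper's data), they can be computed as:

```python
# Standard detection metrics, computed from true-positive (TP),
# false-positive (FP), and false-negative (FN) counts at a fixed
# confidence threshold.

def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)

def f1(p: float, r: float) -> float:
    return 2 * p * r / (p + r)

def average_precision(recalls, precisions):
    """AP as the area under the PR curve, approximated by the
    trapezoidal rule over sampled (recall, precision) points."""
    ap = 0.0
    for i in range(1, len(recalls)):
        ap += (recalls[i] - recalls[i - 1]) * (precisions[i] + precisions[i - 1]) / 2
    return ap

if __name__ == "__main__":
    p = precision(tp=90, fp=5)   # 90/95 ≈ 0.9474
    r = recall(tp=90, fn=6)      # 90/96 = 0.9375
    print(f"precision={p:.4f} recall={r:.4f} F1={f1(p, r):.4f}")
```

The PR curves in Figure 10 plot precision against recall as the confidence threshold is swept, and AP is the area under that curve.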

**Figure 10.** PR curve comparison of ShufflenetV2-YOLOX with other advanced networks.


**Table 4.** Comparison of ShufflenetV2-YOLOX with other lightweight networks.

In terms of detection speed, YOLOv4-Tiny and YOLOX-Tiny have an advantage due to their lightweight network structure design and can reach around 55 FPS. YOLOv5-s is somewhat slower at 18 FPS. Efficientdet-d0 has fewer network parameters but is slow because it uses many depthwise separable convolutions; although its floating-point operations (FLOPs) are few, it spends more time on memory access costs, so its speed is not ideal at 21 FPS. MobilenetV2-YOLOv4-lite replaces the YOLOv4 backbone with MobilenetV2, but its PANet is still large, and it replaces only part of the standard convolutions with depthwise separable convolutions, so its detection speed is also not ideal, at only 22 FPS. Our ShufflenetV2-YOLOX benefits from a lightweight backbone network with a small number of parameters. The anchor-free design and the two feature extraction layers further reduce parameters and computation while still satisfying actual apple orchard detection, resulting in a fast recognition speed of up to 65 FPS.
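The FPS figures above can be obtained with a simple timing loop. The sketch below is a hypothetical harness, not the paper's benchmark code; the `infer` callable stands in for one forward pass of a detection model on a 416 × 416 image:

```python
import time

def measure_fps(infer, n_warmup: int = 10, n_runs: int = 100) -> float:
    """Average frames per second of `infer` over n_runs calls."""
    for _ in range(n_warmup):          # warm-up iterations, excluded from timing
        infer()
    start = time.perf_counter()
    for _ in range(n_runs):
        infer()
    elapsed = time.perf_counter() - start
    return n_runs / elapsed            # frames per second

if __name__ == "__main__":
    # Dummy "model" that sleeps ~15 ms per frame, i.e. roughly 65 FPS.
    fps = measure_fps(lambda: time.sleep(0.0154))
    print(f"{fps:.1f} FPS")
```

Warm-up iterations matter in practice because the first few forward passes on a GPU include kernel compilation and memory allocation overhead.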

With higher detection accuracy and speed, ShufflenetV2-YOLOX enables real-time, accurate, and fast recognition of apples in natural environments, making it more suitable for deployment in apple picking robots.

#### *3.4. Apple Detection Effect in Embedded Devices*

Traditional deep learning algorithms use an Industrial Personal Computer (IPC) as the deployment device, which is not suitable for real-time apple detection in the field due to its weight and power limitations. Edge devices offer strong computing power, small size, light weight, and low power consumption. They can process the collected data locally and are a good alternative to IPCs, and the NVIDIA Jetson Nano is among the most cost-effective edge devices available [10].

The apple picking experimental platform, with a Jetson Nano as the controller, is shown in Figure 11. It mainly consists of a moving part, a gripper, a visual recognition system, and a robot arm. When the apple picking robot starts a picking task, it first detects and selects an apple through the visual recognition system. It then sends the apple's position information to the control system, which drives the robot arm to approach the apple. The gripper is driven to the designated position to grab the apple, and the cutter cuts off the stalk.
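The detect-approach-grab-cut workflow above can be sketched as a simple control cycle. All names here (`Detection`, `pick_cycle`, the logged arm commands) are illustrative stand-ins, not an actual robot API:

```python
from dataclasses import dataclass

@dataclass
class Detection:
    position: tuple      # hypothetical (x, y, z) apple position from the vision system
    confidence: float    # detection confidence score

def pick_cycle(detections, arm_log):
    """One picking cycle: select an apple, approach, grab, cut.
    Returns True if an apple was picked, False if none was detected."""
    if not detections:
        return False
    # Select one apple (here: the highest-confidence detection).
    target = max(detections, key=lambda d: d.confidence)
    arm_log.append(("move_to", target.position))   # drive the arm toward the apple
    arm_log.append(("grab",))                      # close the gripper on the apple
    arm_log.append(("cut_stalk",))                 # cut off the stalk with the cutter
    return True

if __name__ == "__main__":
    log = []
    picked = pick_cycle([Detection((0.4, 0.1, 0.9), 0.92),
                         Detection((0.2, 0.3, 1.1), 0.75)], log)
    print(picked, log[0])   # True ('move_to', (0.4, 0.1, 0.9))
```

The real controller would of course execute these commands on hardware rather than append them to a log; the sketch only shows the ordering of the steps described above.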

**Figure 11.** The apple picking experimental platform.

In this paper, we use the Jetson Nano as the embedded deployment platform, with a software environment of JetPack-4.5.1 and TensorRT-7.1.3 and an image input size of 416 × 416. The PyTorch model is first converted into an ONNX model, and then TensorRT is used to quantize the precision of the model's parameters and to fuse operations in the workflow so that the model stays on the GPU as much as possible, allowing it to run faster. We test the inference speed of the PyTorch single-precision floating-point (FP32) model, the ONNX INT64 model, the TensorRT FP32 model, and the TensorRT FP16 model on the Jetson Nano. In Figure 12, the arrows indicate the increase or decrease in accuracy of each operation compared to the previous stage.
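Assuming the model has already been exported from PyTorch to ONNX (e.g. with `torch.onnx.export`), the TensorRT engine builds described above could be performed with TensorRT's `trtexec` tool; the file names below are illustrative, not the paper's actual artifacts:

```shell
# Build an FP32 TensorRT engine from the ONNX model. trtexec parses the
# network, fuses layers, and serializes an engine optimized for this GPU.
trtexec --onnx=shufflenetv2_yolox.onnx \
        --saveEngine=shufflenetv2_yolox_fp32.engine

# Build an FP16 (half-precision) engine for a further speedup,
# at the cost of a small loss in AP.
trtexec --onnx=shufflenetv2_yolox.onnx --fp16 \
        --saveEngine=shufflenetv2_yolox_fp16.engine
```

`trtexec` also reports per-inference latency, which is how throughput numbers like those in Figure 12 can be collected.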

**Figure 12.** ShufflenetV2-YOLOX models for inference speed and AP accuracy on Jetson Nano.

On the Jetson Nano, the ShufflenetV2-YOLOX model with PyTorch FP32 runs at 11.5 FPS. The ONNX model, by contrast, runs slower because of its 64-bit (INT64) parameter precision. As shown in Figure 12, TensorRT is very effective in accelerating the model. The TensorRT FP32 model detects 47.8% faster with essentially no change in AP accuracy, reaching 17.1 FPS, while the TensorRT FP16 model reaches 26.3 FPS with only a 0.88% loss in AP, a 53.8% improvement over the TensorRT FP32 model and a 128.3% improvement over the original PyTorch FP32 model. ShufflenetV2-YOLOX is thus fully capable of meeting the real-time requirements of picking robots on embedded devices.
