Prune the Feature Layer

Adding modules can improve the network's detection accuracy but also reduces its detection speed. To raise the detection speed and meet real-time requirements, this paper deletes one feature extraction layer in PANet and adjusts the structure, using only two feature layers (TFL) to reduce the amount of computation. Figure 7 shows the PANet part of the YOLOX-Tiny network; the black box marks the reduced network structure and the number of anchors. We keep only the 13 × 13 and 26 × 26 outputs, i.e., out2 and out3, respectively.

**Figure 7.** PANet with two feature layers.

Using only two feature layers reduces not only the number of convolution kernels and the computational complexity of PANet but also the computing power required for prediction. With a 416 × 416 input size and num\_class = 1, deleting one feature-layer head reduces the original 3549 anchors to 845. Although this weakens the detection of small targets, the apple picking robot does not select distant small targets as picking objects during operation, and verification of the actual effect shows that the model's detection performance is not reduced by much.
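The anchor counts above follow directly from the head strides. A minimal sketch, assuming the standard YOLOX anchor-free setup (one prediction per grid cell, strides 8/16/32): dropping the stride-8 head removes the 52 × 52 grid and its anchors.

```python
def anchor_count(input_size, strides):
    """Total predictions for a square input across the given head strides."""
    return sum((input_size // s) ** 2 for s in strides)

# Three heads: 52^2 + 26^2 + 13^2 anchors
three_heads = anchor_count(416, (8, 16, 32))
# Two heads after pruning: 26^2 + 13^2 anchors
two_heads = anchor_count(416, (16, 32))

print(three_heads, two_heads)  # 3549 845
```

This reproduces the figures in the text: pruning the highest-resolution head removes 2704 of the 3549 anchors, roughly a 76% reduction in per-image predictions.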

#### **3. Results and Discussion**

The training equipment used in this paper is a PC running the Windows 10 operating system, equipped with an Intel Xeon E5-2683 processor, 64 GB of memory, and four NVIDIA GTX 1080 Ti graphics cards with 11 GB of video memory each. The algorithm programs are written in Python in PyCharm, and CUDA and cuDNN are used to accelerate network training. The training epoch is set to 150, the batch size to 64, and the input image size to 416 × 416. Input size and detection speed trade off against each other: a smaller input size speeds up detection, so the input image size is set to 416 × 416 to improve the real-time performance of the model. Because input size has a significant impact on network performance, all network models use the same input size in the comparison experiments to ensure a fair comparison. The test equipment runs the Windows 10 operating system with an AMD Ryzen 7 4800H processor, 16 GB of memory, and an NVIDIA GTX 1650 graphics card with 4 GB of graphics memory (Table 1).

**Table 1.** Test System Hardware.


To verify the model's detection of apples in the natural environment, this paper uses 30 complex orchard pictures as the test set: 5 daytime unbagged, 12 daytime bagged, 10 nighttime unbagged, and 3 nighttime bagged apple pictures. Because nighttime and bagged scenes are the focus and difficulty of current research on picking-robot vision, these images account for a higher proportion of the test set, which better reflects the model's detection performance on apples in the natural environment.

In this paper, AP, Precision, Recall, Param, FPS, and F1 are selected as the comparison standards to judge the merits of the models. Param denotes the number of parameters the network contains, and FPS the number of pictures the model can detect per second. With an IoU threshold of 0.5 as the standard, the AP value is the area under the Precision-Recall (PR) curve formed by Precision and Recall. The F1 score is the harmonic mean of precision and recall, so it takes both the accuracy and the recall of the model into account.
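These metrics can be sketched in a few lines. The values below are illustrative placeholders, not the paper's data; AP is approximated here as the trapezoidal area under the PR curve, which matches its definition as the area under that curve.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def average_precision(recalls, precisions):
    """Trapezoidal area under a PR curve given matched (recall, precision) points."""
    ap = 0.0
    for i in range(1, len(recalls)):
        ap += (recalls[i] - recalls[i - 1]) * (precisions[i] + precisions[i - 1]) / 2
    return ap

print(f1_score(0.9, 0.8))                                  # ~0.847
print(average_precision([0.0, 0.5, 1.0], [1.0, 0.9, 0.6]))  # 0.85
```

In practice, detection benchmarks interpolate the PR curve (e.g., the COCO-style evaluation) rather than using raw trapezoids, but the area-under-curve idea is the same.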

#### *3.1. ShufflenetV2-YOLOX Model Performance Verification*

To validate the effectiveness of the network improvement method, we chose to conduct ablation experiments to evaluate each step. AP, Param, and FPS were chosen as the evaluation metrics. The results of the ablation experiment are shown in Table 2.


**Table 2.** Ablation experiment.

The data in Table 2 show that each improvement step is effective, raising either the detection speed or the detection accuracy of the model. The AP value of the ShufflenetV2-YOLOX method is 96.76%, which is 6.24% higher than that of the original YOLOX-Tiny method. Although Param increases by 0.4 M, the detection speed increases by 18% to 65 FPS. Both the CBAM module and the ASFF module effectively improve the network's detection performance, and deleting the feature layer improves the detection speed within a tolerable loss of accuracy. Because ShufflenetV2 uses depthwise separable convolution and channel shuffle operations, replacing CSPDarknet reduces the number of network parameters but does not improve the detection speed.

Different deployment devices suit different network structures. For example, a PC can use a CPU or a GPU for inference; depthwise separable convolutions run more efficiently on CPUs, while standard convolutions are better suited to GPUs. Because of the depthwise convolution and channel shuffle operations used in ShufflenetV2, GPU inference is not its best fit. With ShufflenetV2 as the backbone, the network achieves 15.6 FPS on the Ryzen 7 4800H (CPU), whereas YOLOX-Tiny achieves only 11.5 FPS. In practice, the network structure can be chosen according to the deployment device.
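The deployment rule above can be captured as a small selection helper. This is an illustrative sketch only; the function name and the device strings are assumptions, not the paper's code.

```python
def choose_backbone(device: str) -> str:
    """Pick a backbone family for the given inference device.

    Depthwise-separable backbones favor CPUs; standard-convolution
    backbones parallelize better on GPUs.
    """
    if device == "cpu":
        return "ShufflenetV2"   # depthwise separable conv + channel shuffle
    if device == "gpu":
        return "CSPDarknet"     # standard convolutions suit GPU inference
    raise ValueError(f"unknown device: {device}")

print(choose_backbone("cpu"), choose_backbone("gpu"))
```

Encoding the choice in one place keeps the rest of the detection pipeline independent of the target hardware.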
