**2. Materials and Methods**

#### *2.1. Apple Image Acquisition and Data Augmentation*

This paper takes the Fuji apple, the most widely grown apple variety in China, as the research object, and collects apple images at an apple demonstration base in Feng County, Xuzhou City, Jiangsu Province, China. To cover the natural conditions that arise during orchard picking, images of unbagged apples, bagged apples, and apples under weak light at night are collected.

During image acquisition, to ensure image clarity and match the working conditions of a picking robot, the distance between the camera and the fruit is kept at 0.3–2 m. For night-time acquisition, a single LED lamp provides illumination, and the brightness of the fruit region is varied by changing the illumination angle. A total of 1793 images are captured, covering natural conditions such as front lighting, backlighting, side lighting, overlap, and occlusion: 577 images of unbagged apples in the daytime, 567 images of bagged apples in the daytime, and 649 night-time images (including bagged fruit), as shown in Figure 1.

The appearance of apples in the daytime varies greatly with the angle and intensity of the light. Bagging not only protects the fruit from dust, pests, and pesticide residues, but also keeps the fruit surface smooth and attractive and increases effective yield and income. However, the layer of plastic leaves the apple in an irregular state and disturbs its surface and shape characteristics, so traditional image detection methods based on texture, color difference, or the Hough circle transform cannot detect bagged apples effectively [8]. In addition, water droplets often form inside the plastic bag, which further complicates detection. Because night-time images are captured under a strong light source, a single image may show large contrast: surfaces directly illuminated by the source are overexposed and lose surface feature information, while surfaces outside the beam are dark and difficult to detect. Apple images under the above conditions therefore interfere with image detection to a certain extent [13].

**Figure 1.** Apple image in natural state.

The apple dataset collected in this experiment is small and contains complex cases such as bagging, night-time capture, occlusion, and overlap. Deep learning places certain demands on dataset size: if the original dataset is too small, it cannot support network training well, which degrades model performance. Data augmentation expands the dataset by transforming the original images and can improve model performance to a certain extent. We therefore use the imgaug library for augmentation, combining mirror flipping, brightness changes, up-down flipping, Gaussian noise, dropout, scaling, and other operations with a 10-fold augmentation factor, while keeping the morphological features of the fruit intact. This yields 17,930 images, as shown in Figure 2. Although the augmented images differ slightly from real scenes, the added perturbations are beneficial for model robustness: models trained on the augmented dataset achieve higher accuracy than those trained on the original, unperturbed dataset [14].

**Figure 2.** Image after data augmentation.
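The paper applies these operations through imgaug; as a minimal, self-contained illustration of what the listed transforms do, the sketch below re-implements a few of them (mirror flip, up-down flip, brightness change, Gaussian noise, pixel dropout) directly in NumPy and mixes them at random to produce multiple variants per image. The function names and the 2-ops-per-variant mixing policy are our own illustrative choices, not the paper's exact pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

def mirror_flip(img):
    # Horizontal mirror (left-right flip).
    return img[:, ::-1]

def flip_up_down(img):
    # Vertical flip.
    return img[::-1, :]

def change_brightness(img, factor):
    # Scale pixel intensities and clip to the valid 8-bit range.
    return np.clip(img.astype(np.float32) * factor, 0, 255).astype(np.uint8)

def add_gaussian_noise(img, sigma=10.0):
    # Add zero-mean Gaussian noise, then clip back to [0, 255].
    noise = rng.normal(0.0, sigma, img.shape)
    return np.clip(img.astype(np.float32) + noise, 0, 255).astype(np.uint8)

def dropout(img, p=0.05):
    # Zero out a random fraction p of pixels (coarse pixel dropout).
    mask = rng.random(img.shape[:2]) >= p
    return img * mask[..., None].astype(img.dtype)

def augment(img, n_copies=10):
    # Produce n_copies variants, each from a random pair of operations,
    # mirroring the paper's "mix and enhance" strategy with a 10-fold factor.
    ops = [mirror_flip, flip_up_down,
           lambda im: change_brightness(im, rng.uniform(0.7, 1.3)),
           add_gaussian_noise, dropout]
    variants = []
    for _ in range(n_copies):
        im = img
        for op in rng.choice(ops, size=2, replace=False):
            im = op(im)
        variants.append(im)
    return variants
```

Every operation preserves the image shape and the fruit's overall morphology, which is the constraint the paper imposes on its augmentation mix.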

The annotation tool used in this paper is LabelImg, which produces annotation files in "xml" format. To better compare different networks and training sets, the images are converted to Pascal VOC format. The training and validation sets are generated at a ratio of 9:1, and 30 apple images taken in complex natural environments are selected as the test set to verify the detection performance of the model. All networks used in this paper are pre-trained on the ImageNet dataset; transfer learning is then used to train for 150 epochs on our dataset, and the best-performing weights are selected and loaded into the network for detection.
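The 9:1 train/validation split over the 17,930 augmented images can be sketched as below. The `VOCdevkit/VOC2007` directory layout and the `apple_xxxxx` image IDs are illustrative assumptions (Pascal VOC expects plain-text ID lists under `ImageSets/Main/`); the paper does not specify its exact file naming.

```python
import random
from pathlib import Path

def split_dataset(image_ids, val_ratio=0.1, seed=42):
    # Shuffle the image IDs reproducibly, then split at a 9:1 ratio.
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    n_val = int(len(ids) * val_ratio)
    return ids[n_val:], ids[:n_val]  # (train, val)

def write_voc_imagesets(train_ids, val_ids, root="VOCdevkit/VOC2007"):
    # Pascal VOC convention: one image ID per line in ImageSets/Main/*.txt.
    main = Path(root) / "ImageSets" / "Main"
    main.mkdir(parents=True, exist_ok=True)
    (main / "train.txt").write_text("\n".join(train_ids) + "\n")
    (main / "val.txt").write_text("\n".join(val_ids) + "\n")

# Hypothetical IDs for the 17,930 augmented images.
ids = [f"apple_{i:05d}" for i in range(17930)]
train_ids, val_ids = split_dataset(ids)
```

With 17,930 images, this split yields 16,137 training and 1793 validation images; the 30-image complex-scene test set is kept separate and is not drawn from this split.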
