#### *2.1. Structure and Working Principle of the Test Platform*

The structure of the pineapple eye recognition and positioning test platform is shown in Figure 1. The notebook is an HP Shadow Elf equipped with an Intel i7-10750H CPU @ 2.60 GHz, 16 GB of RAM, and an NVIDIA GeForce GTX 1650 Ti graphics card. It runs the 64-bit Windows 10 operating system, and the software development environment is Visual Studio 2017 + OpenCV 4.0.0. The color camera is an Imaging Source DFK41BU02 with a resolution of 1280 (H) × 960 (V) pixels, a frame rate of 15 fps, and an 8.5 mm Computar lens. A CR-9600-R ring light source is installed directly under the camera lens. A Mitsubishi FX3U-32MT PLC serves as the control core and is connected to the notebook through a serial communication port. The motion platform consists of a clamping cylinder, a servo motor, a linear slide, a probe cylinder, and a probe. The peeled pineapple is clamped by the clamping cylinder and rotated through precise angles by the servo motor so that images of the entire circumference of the pineapple can be acquired. In this paper, the probe is used to evaluate the accuracy of the recognition and positioning algorithm: it is mounted on the probe cylinder and is inserted into the pineapple by the telescopic movement of that cylinder. The probe cylinder itself can be moved accurately and is installed in the direction parallel to the pineapple axis.

**Figure 1.** Structure of the test platform. (**a**) color camera, (**b**) ring light source, (**c**) notebook, (**d**) light source controller, (**e**) PLC controller, (**f**) linear slide, (**g**) probe cylinder, (**h**) probe, (**i**) pineapple eye, (**j**) servo motor, (**k**) clamping cylinder, and (**l**) pineapple.

#### *2.2. Image Acquisition of Pineapple Eyes*

Goodfarmer Philippine pineapples, which were manually peeled and placed on the test platform for image acquisition, were used for the experiments. Before image acquisition, a dot calibration plate was used to correct lens distortion and the perspective distortion caused by the tilt of the camera [30]. To obtain images of all pineapple eyes and provide a sufficient number of images for multiangle stereo matching, images of each pineapple were collected at 60° intervals, giving 6 images per pineapple. Figure 2 shows images of the same pineapple collected from different angles; the shape and size of the pineapple eyes differ noticeably between views.
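The acquisition schedule above is simple enough to sketch in code. The snippet below (the helper name is illustrative, not from the authors' software) generates the rotation angles for one pineapple, assuming evenly spaced views over a full revolution:

```python
def capture_angles(n_views=6):
    """Evenly spaced rotation angles (degrees) covering one full revolution."""
    step = 360 // n_views          # 60 degrees per view when n_views = 6
    return [i * step for i in range(n_views)]

# Six views at 0, 60, 120, 180, 240, and 300 degrees, matching Figure 2a-f.
angles = capture_angles(6)
```

Six views at 60° spacing guarantee that every eye appears in at least two adjacent images, which is what the later multiangle stereo matching relies on.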

**Figure 2.** Images of the same pineapple at different angles. (**a**) 0 degrees; (**b**) 60 degrees; (**c**) 120 degrees; (**d**) 180 degrees; (**e**) 240 degrees; (**f**) 300 degrees.

#### *2.3. Pineapple Eye Recognition Algorithm Based on YOLOv5*

In this paper, YOLOv5 is selected as the target detection network for pineapple eye recognition. Among commonly used object detection networks, YOLOv5 [31] achieves strong detection performance. It uses mosaic data enhancement, adaptive anchor box calculation, and adaptive image scaling at the input end. In the backbone network, Focus and CSPNet (cross-stage partial network) modules quickly extract target features. In the neck network, an FPN (feature pyramid network) and PANet perform multiscale fusion of the extracted features. At the output end, GIoU (generalized intersection over union) loss is used as the loss function of the target detection box, and NMS (nonmaximum suppression) filters out overlapping candidate boxes to obtain the best prediction output. These designs ensure detection accuracy and speed for small targets, and the network has the advantages of a shallow structure, a small weight file, and relatively low hardware requirements for the mounted equipment.
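To make the GIoU loss mentioned above concrete, a minimal pure-Python version is sketched below for axis-aligned boxes in (x1, y1, x2, y2) form (function names are illustrative): GIoU extends IoU with a penalty based on the smallest box enclosing both inputs, so non-overlapping boxes still receive a useful gradient signal.

```python
def giou(a, b):
    """Generalized IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    # Area of the smallest enclosing box C; the (C - union)/C term
    # penalizes predictions that are far from the ground truth.
    c = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (c - union) / c

def giou_loss(a, b):
    # Loss = 1 - GIoU, ranging from 0 (perfect overlap) toward 2 (worst case).
    return 1.0 - giou(a, b)
```

Identical boxes give GIoU = 1 (loss 0), while disjoint boxes give a negative GIoU, unlike plain IoU, which saturates at 0.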

There are 4 versions of YOLOv5 [32]: YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. The width and depth of the YOLOv5s model are the baseline values; this model is small and fast and is suitable for detection on small, simple datasets. The YOLOv5x model has the greatest depth and width and is suitable for detection on large and complex datasets. As the depth of the network increases, the detection accuracy improves while the detection speed decreases. In YOLOv5l, the learning ability of the neural network is improved, the amount of calculation is reduced, and high detection accuracy is maintained. To maximize the detection speed while maintaining sufficient detection accuracy, YOLOv5l is used in this paper as the pineapple eye detection model. The structure of YOLOv5l is shown in Figure 3.

To construct the experimental dataset, 240 pineapple images were obtained from 40 pineapples. The images were then processed with data enhancements such as rotation and horizontal and vertical mirroring to improve the robustness of the recognition model, yielding 600 pineapple images with a total of approximately 18,000 pineapple eyes. The pineapple eyes were manually labeled one by one using labeling software: each eye was marked with a rectangular box named P. The labeling information was stored in the PASCAL VOC (Pattern Analysis, Statistical Modelling and Computational Learning, Visual Object Classes) format [33], which contains the coordinates, label, and serial number of each box. The pineapple eye images, labeled data, and other files were saved according to the PASCAL VOC dataset directory structure to build the pineapple eye dataset.
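PASCAL VOC annotation files are plain XML and can be read with the standard library alone. The sketch below (the sample annotation and helper name follow the usual VOC layout, not the authors' exact files) extracts the label and bounding box of each marked pineapple eye:

```python
import xml.etree.ElementTree as ET

# A minimal VOC-style annotation for one labeled pineapple eye ("P").
SAMPLE = """<annotation>
  <filename>pineapple_001.jpg</filename>
  <object>
    <name>P</name>
    <bndbox><xmin>120</xmin><ymin>85</ymin><xmax>152</xmax><ymax>118</ymax></bndbox>
  </object>
</annotation>"""

def parse_voc(xml_text):
    """Return a list of (label, (xmin, ymin, xmax, ymax)) tuples."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        bb = obj.find("bndbox")
        coords = tuple(int(bb.findtext(k)) for k in ("xmin", "ymin", "xmax", "ymax"))
        boxes.append((obj.findtext("name"), coords))
    return boxes
```

Each `<object>` element corresponds to one labeled eye, so a full image annotation simply repeats the `<object>` block for every rectangle drawn.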

The 600 images of the augmented dataset were divided into a training set, a validation set, and a test set at an 8:1:1 ratio. Because pineapple eye targets are small, the input size was set to 640 × 640 pixels to improve detection accuracy; 32 images were taken as a batch, and the weight parameters were updated once for each batch of training images.
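The paper does not state how the 8:1:1 split was produced; a seeded shuffle-and-slice is one common way to make it reproducible, sketched here under that assumption (function name is illustrative):

```python
import random

def split_811(items, seed=0):
    """Shuffle deterministically, then split into train/val/test at 8:1:1."""
    items = list(items)
    random.Random(seed).shuffle(items)   # fixed seed keeps the split reproducible
    n = len(items)
    n_train, n_val = n * 8 // 10, n * 1 // 10
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

train, val, test = split_811(range(600))
# 480 training images at a batch size of 32 gives 15 weight updates per epoch.
```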

YOLOv5 incorporates the current mainstream FPN detection approach [34] and inherits the grid-generation idea of the YOLO series. The 640 × 640 input feature map is divided into S × S grid cells of equal size (typically 80 × 80, 40 × 40, or 20 × 20). After nonmaximum suppression, the output end of the network produces the prediction information of all grids. The prediction for each grid includes the classification probability and confidence of the target as well as the center coordinates, length, and width of the box surrounding the detected target. The classification probability represents the class of the target predicted in the grid region, and the confidence represents the probability that a target is present in the grid region. The center coordinates and length-and-width information of the box represent the specific size and position of the target predicted by the grid.
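The grid-assignment idea above can be made concrete: at an input size of 640 × 640, an S × S grid implies a stride of 640/S pixels per cell, and a target center falls in the cell whose indices are the integer division of its pixel coordinates by that stride. A minimal sketch (function name is illustrative):

```python
def cell_for_center(cx, cy, img_size=640, grid=80):
    """Grid-cell (column, row) responsible for a target centered at (cx, cy)."""
    stride = img_size / grid   # 8 px per cell for the 80 x 80 grid
    return int(cx // stride), int(cy // stride)

# A pineapple eye centered at (100, 50) lands in cell (12, 6) of the 80 x 80 grid.
```

The three grid sizes correspond to strides of 8, 16, and 32 pixels; the finest 80 × 80 grid is the one chiefly responsible for small targets such as pineapple eyes.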
