*2.2. System Architecture*

The workflow of the proposed field pest 3D locating system is shown in Figure 3. It mainly comprises three parts: (1) pest identification and instance segmentation with Mask R-CNN, (2) locating the laser strike point by extracting the pest skeleton, and (3) 3D localization of the laser strike point, which involves matching-template preprocessing, multi-constraint narrowing of the matching region, subpixel stereo matching, and 3D coordinate extraction.

**Figure 3.** Structure diagram of the 3D locating system for field pests.

2.2.1. Instance Segmentation of *Pieris rapae* Image Area Based on Mask R-CNN

(1) Mask R-CNN Model

The accuracy of pest contour segmentation directly affects the accuracy of the laser strike point and the stereo matching parallax. Based on a self-built NIR field *P. rapae* image dataset, this paper selected a ResNet50-based Mask R-CNN [11] to identify and segment the pest image area. The model structure is shown in Figure 4 and mainly includes the following steps:


(2) Dataset Construction

In total, 1000 images of *P. rapae* larvae in different poses were collected in a *Brassica oleracea* field. The sample count was expanded to 2000 by rotation, magnification, and horizontal and vertical mirroring, which improves the robustness of the recognition model [31]; each image contains at least one *P. rapae* larva. The outline of each larva was then marked with the open-source tool LabelMe, which extracts *P. rapae* masks from the images and outputs a dataset in COCO format. Finally, the dataset was divided into a training set and a validation set at a ratio of 8:2 for model training.
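The augmentation step described above can be sketched with NumPy as follows; the helper name and the particular transforms (180° rotation, flips, 2× central zoom) are illustrative choices, not the authors' code:

```python
import numpy as np

def augment(image: np.ndarray) -> list:
    """Expand one grayscale NIR sample into several variants, mirroring
    the augmentations described in the text: rotation, magnification,
    and horizontal/vertical mirroring (illustrative sketch)."""
    h, w = image.shape[:2]
    variants = []
    # 180-degree rotation (keeps the frame size unchanged)
    variants.append(np.rot90(image, k=2))
    # horizontal and vertical mirroring
    variants.append(image[:, ::-1])   # horizontal flip
    variants.append(image[::-1, :])   # vertical flip
    # 2x magnification of the central region, cropped back to original size
    zoomed = np.kron(image[h // 4: 3 * h // 4, w // 4: 3 * w // 4],
                     np.ones((2, 2), dtype=image.dtype))
    variants.append(zoomed[:h, :w])
    return variants
```

Each input image thus yields four extra samples, consistent with doubling a 1000-image set to 2000 when one variant per image is kept.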

**Figure 4.** Overall Mask R-CNN with the ResNet50 model structure.

(3) Transfer Training

The model training was completed on a PC with the following hardware: 32 GB RAM, two Intel Xeon E5-2623 v3 CPUs @ 3.00 GHz, and an NVIDIA GeForce RTX 2080 GPU. The software environment used the TensorFlow deep learning framework under 64-bit Windows 10, configured with Python 3.6, Anaconda 5.3.1, and CUDA 10.0.

Training adopted a transfer learning approach. The feature extraction network of the Mask R-CNN was initialized with the weights of a pre-trained model, while the object classification, bounding box regression, and FCN parameters were randomly initialized. During training, the initial learning rate was 0.001, the momentum was 0.9, and the batch size was set to 1. In the RPN, the anchor sizes were 32, 64, 128, 256, and 512 pixels, and the anchor aspect ratios were 0.5:1:2.
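The hyperparameters above can be collected in one place as a configuration sketch; this is a plain container for the values stated in the text, and the field names are ours, not taken from the authors' code:

```python
class TrainConfig:
    """Transfer-training hyperparameters reported in the text (sketch)."""
    LEARNING_RATE = 0.001                         # initial learning rate
    MOMENTUM = 0.9                                # SGD momentum
    BATCH_SIZE = 1                                # images per gradient step
    RPN_ANCHOR_SCALES = (32, 64, 128, 256, 512)   # anchor sizes (pixels)
    RPN_ANCHOR_RATIOS = (0.5, 1, 2)               # anchor aspect ratios
    # Backbone weights come from a pre-trained model; the classification,
    # bounding-box regression, and FCN mask heads start from random weights.
    RANDOM_INIT_HEADS = ("class", "bbox", "mask_fcn")

# Anchor shapes per feature-map location, in implementations that sweep
# every scale-ratio combination at a single pyramid level.
n_anchor_shapes = len(TrainConfig.RPN_ANCHOR_SCALES) * len(TrainConfig.RPN_ANCHOR_RATIOS)
```

In FPN-style implementations each pyramid level typically uses one scale with all three ratios instead of the full 15-shape sweep; the paper does not specify which scheme was used.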

The object detection and region segmentation results of the model are shown in Figure 5. The high-quality segmentation masks distinguish pests from the background and can be used directly to calculate the location of the laser strike point.

**Figure 5.** Visualization results of the ResNet50-based Mask R-CNN. (**a**–**d**) *P. rapae* larvae in different positions and postures taken from the collected NIR images. (**a**) Multiple pests, (**b**) curled pests, (**c**) occlusion state, and (**d**) dorsal position of the leaf.

2.2.2. Pest Skeleton Extraction and Strike Point Location

(1) Laser strike point

Laser pest control requires focusing the laser on the middle of the pest's abdomen so that the laser kills the pest with concentrated energy. The body of a *P. rapae* larva is tubular and segmented, as shown in Figure 6. The irradiation position in the middle of the abdomen lies between the 8th and 9th segments, near the midpoint of the skeleton [5,32]. Therefore, this paper set the laser strike point at the midpoint of the skeleton of the pest image area. An improved ZS thinning algorithm was used to extract the pest skeleton, and a skeleton chain code was then established to extract the coordinates of the skeleton midpoint, determining the final strike point.

**Figure 6.** Characteristics of the *Pieris rapae* larvae and locating the laser strike point. The body of *P. rapae* larvae can be divided into the head (I), the thorax (II), and the abdomen (III). The numbers 1–14 denote the different segments of the larvae, separated by blue lines.

(2) Pest Skeleton Extraction Based on the Improved ZS Thinning Algorithm

The skeleton consists of single pixels and provides a reference for extracting the laser strike point coordinates. However, because field pests appear in varied positions and postures and traditional skeleton extraction algorithms are sensitive to boundaries, the extracted pest skeletons exhibit non-single-pixel widths and spurious end branches, as shown in Figure 7.

To solve these problems, this paper introduced an improved ZS thinning algorithm [33] with smoothing iterations to extract pest skeletons. The whole thinning process is divided into three stages: smoothing iteration, global iteration, and two-stage scanning.

In the smoothing iteration, candidate deletion points were first extracted according to the thinning constraints of the traditional ZS algorithm. The smooth pixel points among the candidate deletion points were then preserved, which suppresses branching at the end of the pest skeleton, as shown in Figure 8. A smooth pixel point is defined as one satisfying Equation (1):

$$5 \le N_b(P_0) \le 6 \tag{1}$$

where *Nb*(*P*0) denotes the number of pixels with value 1 in the 8-neighborhood of the scanning point *P*0.

**Figure 8.** Example of the smooth pixel point determination. The numbers 1–6 denote the candidate deletion points extracted by the ZS thinning algorithm, where 2, 4, and 6 denote the smooth pixel points.
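The two per-pixel tests involved here can be sketched as follows. This is a minimal sketch assuming binary images with 1-valued foreground; only the first ZS sub-iteration is shown, the function names are ours, and the bounds (5, 6) in `is_smooth_point` are our reading of Equation (1) and should be checked against [33]:

```python
import numpy as np

def neighbors8(img, r, c):
    """Clockwise 8-neighborhood P2..P9 of the scanning point P0 = img[r, c],
    starting from the pixel directly above P0."""
    return [img[r-1, c], img[r-1, c+1], img[r, c+1], img[r+1, c+1],
            img[r+1, c], img[r+1, c-1], img[r, c-1], img[r-1, c-1]]

def is_zs_candidate(img, r, c):
    """First sub-iteration deletion test of the classical ZS algorithm."""
    p = neighbors8(img, r, c)
    b = sum(p)                                                   # N_b(P0)
    a = sum(p[i] == 0 and p[(i + 1) % 8] == 1 for i in range(8)) # 0->1 transitions
    return (img[r, c] == 1 and 2 <= b <= 6 and a == 1
            and p[0] * p[2] * p[4] == 0 and p[2] * p[4] * p[6] == 0)

def is_smooth_point(img, r, c, lo=5, hi=6):
    """Smooth-pixel test of Equation (1); such candidates are preserved
    (not deleted) to suppress end branches. Bounds are an assumption."""
    return lo <= sum(neighbors8(img, r, c)) <= hi
```

In the smoothing iteration, a pixel is removed only if it passes `is_zs_candidate` and fails `is_smooth_point`.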

In the smoothing and global iterations, retention templates defined over 24-neighborhood subdomains were added. Candidate deletion points that matched a retention template were preserved, which avoids deleting topological structure. Figure 9a–i shows the pixel sets of the retention templates: the 24 neighborhood pixels are divided into 4 × 4 subdomains in four different directions to capture specific structures along each direction. Figure 9a–h are used to preserve diagonal lines of two-pixel width, and Figure 9i is used to preserve the 2 × 2 square structure.

**Figure 9.** The retention templates and the deletion templates. (**a**–**i**) The retention templates in different directions. (**j**–**m**) The deletion templates in different directions. The pixels of scanning points are marked as *P*0, and pixel sets *Px* of 8 neighborhoods and 24 neighborhoods are constructed, where *x* = 1, 2, . . . 24. The pixel *Px* in the gray square can be either 1 or 0.

In the two-stage scanning, deletion templates over the 8-neighborhood were used to eliminate residual non-single-pixel-width pixels that form an included angle of 90°. The deletion templates are defined in Figure 9j–m.

Based on the improved ZS thinning algorithm, the pest skeletons in Figure 7a,d were extracted again. The visualization is shown in Figure 10.

**Figure 10.** Visualization of the improved ZS thinning algorithm. (**a**,**b**) The pest skeleton images extracted from Figure 7a,d.

(3) Strike Point Location

After extracting the single-pixel-wide pest skeleton, the system used Freeman chain code notation [34] to build a linked list of skeleton points. The skeleton pixel length was then calculated from the chain code, and the midpoint coordinates were located by traversing the list. The visualization results of the different processing stages are shown in Figure 11.
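The chain-code step can be sketched as follows, assuming the skeleton is already single-pixel wide and branch-free (exactly two endpoints); the helper names are ours:

```python
import numpy as np

# 8-direction Freeman chain code offsets (code 0 = east, counter-clockwise)
OFFSETS = [(0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1), (1, 0), (1, 1)]

def trace_skeleton(skel):
    """Trace a single-pixel, branch-free skeleton from one endpoint and
    return (chain_code, pixel_path) -- a sketch of the chain-code step."""
    pts = {tuple(p) for p in np.argwhere(skel == 1)}
    def nbrs(p):
        return [(p[0] + dr, p[1] + dc) for dr, dc in OFFSETS
                if (p[0] + dr, p[1] + dc) in pts]
    # an endpoint has exactly one skeleton neighbor
    start = next(p for p in pts if len(nbrs(p)) == 1)
    path, chain = [start], []
    prev, cur = None, start
    while True:
        nxt = [q for q in nbrs(cur) if q != prev]
        if not nxt:
            break  # reached the other endpoint
        step = (nxt[0][0] - cur[0], nxt[0][1] - cur[1])
        chain.append(OFFSETS.index(step))
        prev, cur = cur, nxt[0]
        path.append(cur)
    return chain, path

def strike_point(skel):
    """Laser strike point: the midpoint of the traced skeleton."""
    _, path = trace_skeleton(skel)
    return path[len(path) // 2]
```

The chain code gives the skeleton length (number of steps), and indexing the traced path at half that length yields the midpoint regardless of which endpoint the trace starts from.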

**Figure 11.** Visualization of the pest skeleton extraction and laser strike point location for different stages: (**a**) The identification and segmentation result of an NIR *P. rapae* image, (**b**) extracted segmentation mask image, (**c**) thinning treatment, and (**d**) coordinates of laser strike points.
