2.2.3. The Multi-Constrained Stereo Matching Method

In this study, only the 3D spatial coordinates of the laser strike point need to be calculated; thus, a multi-constraint stereo matching algorithm was proposed. As shown in Figure 12, the algorithm applies two constraints in the matching process.

#### (1) The First Constraint: Row Constraint

After the binocular camera (Figure 12a) was calibrated and the image pair stereo-rectified, the same pest satisfied the row-alignment and ordering consistency constraint in the rectified images [35]. Therefore, using the pest segmentation mask in the left image as the template, template matching was performed along the same row of the right image according to the row constraint.

Assuming that the coordinate of the laser strike point in the left image was *p*1(*x*1, *y*1), the range of the coordinate *p*2(*x*2, *y*2) of the center point of the matching box in the right image can be limited to *y*2 = *y*1, as shown in Figure 12b.
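The row constraint reduces the 2D correspondence search to a 1D search along a single image row. As a minimal sketch (not the authors' code; the function name and the use of plain NumPy arrays are assumptions), a template cut from the left image can be slid along row *y*1 of the rectified right image and scored with normalized cross-correlation:

```python
import numpy as np

def match_along_row(template, right_img, y1, half_h, half_w):
    """Search for the template along row y1 of the rectified right image
    (row constraint: after rectification, y2 = y1). Returns the column
    x2 with the highest normalized cross-correlation score."""
    t = template - template.mean()
    t_norm = np.sqrt((t ** 2).sum())
    best_x, best_score = -1, -np.inf
    h, w = right_img.shape
    for x in range(half_w, w - half_w):
        patch = right_img[y1 - half_h:y1 + half_h + 1,
                          x - half_w:x + half_w + 1]
        p = patch - patch.mean()
        denom = t_norm * np.sqrt((p ** 2).sum())
        if denom == 0:          # flat patch, correlation undefined
            continue
        score = (t * p).sum() / denom
        if score > best_score:
            best_score, best_x = score, x
    return best_x, best_score
```

In practice this scan is further shortened by the column constraint introduced next, so only a small disparity interval on the row is evaluated.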

**Figure 12.** Search range of the multi-constraint stereo matching method. (**a**) The binocular vision locating system. The red frame is the binocular public area, the blue frame is the operation area for locating pests, and the depth range is *H*min ∼ *H*max. *L* is the leaf spreading degree; *h* is the plant height; and *l* is the bottom leaf height of the cabbage. *fx* is the fixed camera parameter. (**b**) The spatial geometric diagram. *O*C1 and *O*C2 are the optical centers of the left and right cameras, *C*L and *C*R are the imaging planes of the binocular cameras, and the image coordinate systems are *X*1*O*1*Y*1 and *X*2*O*2*Y*2, respectively. *p*1(*x*1, *y*1) is the laser strike point in the left image; *p*2(*x*2, *y*2) is the center point of the best matching box in the right image; *P*(*X*, *Y*, *Z*) is the target pest.

#### (2) The Second Constraint: Column Constraint

For the laser pest control robot to effectively identify field pests and facilitate the trajectory planning of its striking equipment, the working area was regarded as a cuboid (Figure 12a). According to the principle of triangulation [35], the coordinate of the target point in the world coordinate system can be calculated by Equation (2):

$$Z = \frac{fB}{(x_1 - x_2)\mu_x} = \frac{f_x B}{x_1 - x_2} \tag{2}$$

where *B* is the baseline distance of the binocular cameras, *f* is the focal length of the cameras, *μx* is the physical size of each pixel along the *X*-axis of the imaging plane, and *fx* = *f*/*μx* is the fixed camera parameter, which is determined during camera calibration.
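Equation (2) reduces depth recovery to a single division once *fx* and *B* are known from calibration. A minimal sketch (the parameter values in the usage note are illustrative, not the system's calibration results):

```python
def depth_from_disparity(x1, x2, fx, baseline):
    """Triangulation along the rectified epipolar line, Equation (2):
    Z = fx * B / (x1 - x2). fx is in pixels and B in millimetres,
    so the returned depth Z is in millimetres."""
    disparity = x1 - x2
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    return fx * baseline / disparity
```

For example, with an assumed *fx* = 1000 px and *B* = 60 mm, a disparity of 50 px gives `depth_from_disparity(500, 450, 1000, 60)` = 1200 mm.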

According to Equation (2), if the depth range of the operation area, the coordinate *p*1(*x*1, *y*1) of the target in the left image, and the fixed camera parameter *fx* are known, the range of the *X*-axis coordinate of the target in the right image can be limited. The specific bounds on *x*2 were as follows.

$$x_1 - \frac{f_x B}{H_{\min}} \le x_2 \le x_1 - \frac{f_x B}{H_{\max}} \tag{3}$$

where *H*min and *H*max are the bounds of the *Z*-axis of the system operation area in the world coordinate system (Figure 12).
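The column constraint of Equation (3) can be sketched as a small helper (illustrative code; the parameter values in the usage note are assumptions, not the system's calibration results):

```python
def x2_search_range(x1, fx, baseline, h_min, h_max):
    """Column constraint, Equation (3): given the depth range
    [h_min, h_max] of the operation area, the matching column x2 in the
    right image is limited to
        x1 - fx*B/h_min <= x2 <= x1 - fx*B/h_max."""
    return x1 - fx * baseline / h_min, x1 - fx * baseline / h_max
```

With the assumed *fx* = 1000 px, *B* = 60 mm, and the 400–600 mm depth range of the operation area, `x2_search_range(500, 1000, 60, 400, 600)` limits the search to columns 350–400, i.e., a 50 px interval instead of the full image row.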

Based on the multiple constraints above, the matching range of the template along the epipolar line in the right image can be further restricted.

In the matching process, the normalized cross-correlation coefficient with linear illumination invariance was selected to measure the match similarity [36]:

$$R(x, y, d) = \frac{\sum_{i=1}^{n} \sum_{j=1}^{m} \left[ T(x+i, y+j) - \overline{T}(x, y) \right] \left[ I(x+i-d, y+j) - \overline{I}(x-d, y) \right]}{\sqrt{\sum_{i=1}^{n} \sum_{j=1}^{m} \left[ T(x+i, y+j) - \overline{T}(x, y) \right]^2} \sqrt{\sum_{i=1}^{n} \sum_{j=1}^{m} \left[ I(x+i-d, y+j) - \overline{I}(x-d, y) \right]^2}}, \quad d \in \left[ \frac{f_x B}{H_{\max}}, \frac{f_x B}{H_{\min}} \right] \tag{4}$$

where *R*(*x*, *y*, *d*) is the normalized correlation when the point (*x*, *y*) is matched at disparity *d* in the matching area of the right camera image. Here, *n* is the width of the template window; *m* is the height of the template window; *T*(*x* + *i*, *y* + *j*) is the pixel value at point (*x* + *i*, *y* + *j*) of the template window; and *T̄*(*x*, *y*) is the average pixel value of the template window. *I*(*x* + *i* − *d*, *y* + *j*) is the pixel value at point (*x* + *i* − *d*, *y* + *j*) of the matching area; and *Ī*(*x* − *d*, *y*) is the average pixel value of the *n* × *m* window anchored at the point (*x* − *d*, *y*).
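Equation (4), evaluated only over the disparity interval of Equation (3), can be sketched as follows (illustrative NumPy code, not the authors' implementation; the template is anchored at its top-left corner, matching the summation indices):

```python
import numpy as np

def ncc_over_disparity(left, right, x, y, n, m, d_range):
    """Normalized cross-correlation R(x, y, d) of Equation (4): the
    n-wide, m-high template anchored at (x, y) in the left image is
    compared against the window shifted by each candidate disparity d
    in the right image. Returns a dict mapping d -> R(x, y, d)."""
    T = left[y:y + m, x:x + n].astype(float)
    T = T - T.mean()                        # T(x+i, y+j) - T_bar(x, y)
    t_norm = np.sqrt((T ** 2).sum())
    scores = {}
    for d in d_range:
        I = right[y:y + m, x - d:x - d + n].astype(float)
        I = I - I.mean()                    # I(x+i-d, y+j) - I_bar(x-d, y)
        denom = t_norm * np.sqrt((I ** 2).sum())
        scores[d] = (T * I).sum() / denom if denom else 0.0
    return scores
```

The best integer disparity is then simply the *d* maximizing the returned score, which the next step refines to subpixel accuracy.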

After obtaining the disparity *d*0 with the maximum similarity (Equation (4)), the algorithm extracted the matching similarities *R*(*x*, *y*, *d*) of the adjacent pixel-level disparities (*d*0 − 2, *d*0 − 1, *d*0 + 1, *d*0 + 2) and constructed a disparity–similarity (*d*-*R*) pointset, as shown in Figure 13. Quadratic, cubic, and quartic polynomial curves were then fitted to the pointset, and the curve with the highest coefficient of determination (*R*2) was retained. The abscissa of the crest of the best fitting curve (Figure 13, Point *S*) was taken as the disparity at subpixel accuracy. Finally, the 3D coordinates of each pest in the world coordinate system were calculated from the subpixel disparity.
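The subpixel refinement step can be sketched as follows (an illustrative reimplementation, not the authors' code; note that with only five points a quartic fit interpolates them exactly, so the *R*² comparison mainly arbitrates between the lower-degree fits):

```python
import numpy as np

def subpixel_disparity(d0, similarities):
    """Fit quadratic, cubic and quartic polynomials to the five (d, R)
    points centered on the integer disparity d0, keep the fit with the
    highest R^2, and return the abscissa of its crest as the subpixel
    disparity. `similarities` holds R at d0-2, d0-1, d0, d0+1, d0+2."""
    d = np.array([d0 - 2, d0 - 1, d0, d0 + 1, d0 + 2], dtype=float)
    r = np.asarray(similarities, dtype=float)
    best_r2, best_coef = -np.inf, None
    for deg in (2, 3, 4):
        coef = np.polyfit(d, r, deg)
        fit = np.polyval(coef, d)
        ss_res = ((r - fit) ** 2).sum()
        ss_tot = ((r - r.mean()) ** 2).sum()
        r2 = 1.0 - ss_res / ss_tot
        if r2 > best_r2:
            best_r2, best_coef = r2, coef
    # evaluate the best fitting curve densely and take its crest (Point S)
    dd = np.linspace(d0 - 2, d0 + 2, 4001)
    return dd[np.argmax(np.polyval(best_coef, dd))]
```

For instance, if the five similarity samples lie on a parabola peaking between two integer disparities, the returned value recovers that fractional peak location rather than snapping to the nearest integer.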

**Figure 13.** Polynomial fitting curves of disparity and similarity.

#### **3. Test and Results**

#### *3.1. Experiments*

To evaluate the recognition and localization accuracy of the laser strike point under the actual operating conditions of the cabbage greenhouse, we further collected *P. rapae* images at different positions in the vegetable field to construct a test set (Experiment 1: *n* = 70; Experiment 2: *n* = 30). The system automatically output and saved the identification and segmentation results of the *P. rapae* pixel area and recorded the 3D coordinates of the laser strike point.

The experiment was conducted in the cabbage field (28.18° N, 113.07° E) of Hunan Agricultural University in Changsha, Hunan Province, as shown in Figure 14. According to the leaf spreading degree (350 ± 46.6 mm), plant height (300 ± 25.6 mm), and bottom leaf height (32 ± 6.7 mm) of the field cabbage, the distance between the origin of the binocular camera and the effective operation area of the laser was set to 400–600 mm. The working area measured 400 mm along the *X*C-axis and 260 mm along the *Y*C-axis.

**Figure 14.** Accuracy test platform site. Key: 1. visual processing platform; 2. binocular camera with an 850 nm filter; 3. linear displacement sensor; 4. fixed support frame; 5. digital display for displacement sensor; 6. cabbage.

#### 3.1.1. Experiment 1: Accuracy Evaluation of Pest Identification and Instance Segmentation Network

Using the test sample images (*n* = 70) of different scenarios, the numbers of *P. rapae* that were manually labeled and that were automatically identified by the model were recorded. Three indicators, precision (Equation (5)), recall (Equation (6)), and *F*1-measure (Equation (7)), were used to evaluate the recognition performance of the Mask R-CNN model on the target.

$$P = \frac{TP}{TP + FP} \tag{5}$$

$$R = \frac{TP}{TP + FN} \tag{6}$$

$$F_1 = \frac{2 \times P \times R}{P + R} \tag{7}$$

where *TP* is the number of positive samples correctly predicted as positive, *FP* is the number of negative samples incorrectly predicted as positive, and *FN* is the number of positive samples incorrectly predicted as negative.
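Equations (5)–(7) can be computed directly from the raw counts (the counts in the usage example are illustrative, not the paper's results):

```python
def prf1(tp, fp, fn):
    """Precision (Eq. 5), recall (Eq. 6) and F1-measure (Eq. 7)
    from raw detection counts."""
    p = tp / (tp + fp)   # fraction of detections that are real pests
    r = tp / (tp + fn)   # fraction of real pests that were detected
    return p, r, 2 * p * r / (p + r)
```

For example, `prf1(8, 2, 2)` returns `(0.8, 0.8, 0.8)`.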

#### 3.1.2. Experiment 2: Performance Evaluation of the 3D Locating System

The image coordinate deviation and the actual depth deviation between the autolocation results of the laser strike point and the manual annotation results were used to evaluate the performance of the 3D locating system.

Given that the same absolute pixel deviation corresponds to different physical distances in images of different scales, it cannot characterize the true locating error quantitatively. In Experiment 2, we collected 30 pairs of binocular images of the same *P. rapae* at different locations in the vegetable field, and it was assumed that the physical width of the *P. rapae* body in the area of the laser strike point was constant, with *d* representing the pixel width of the *P. rapae* body in images of different scales (Figure 6). The *X*-axis and *Y*-axis locating errors in the world coordinate system were therefore represented by the ratios of the pixel deviations (*ex*, *ey*) between the system output and the manually marked point (in the image *x* and *y* coordinates) to *d*.

In Experiment 2, a linear displacement sensor (KPM18-255, Shenzhen Howell Technology Co., Ltd., Shenzhen, China) with a position accuracy of 0.05 mm was used to measure the vertical distance from the pest surface to the camera plane. The displacement sensor was mounted on a base with a magnet; the base can be adsorbed onto the top plate so that the displacement sensor is always perpendicular to the imaging plane and can move horizontally in the plane of the top plate, as shown in Figure 15.

**Figure 15.** Accuracy testing experiment equipment. Key: 1. digital display for displacement sensor; 2. 850 nm diffuse light bar; 3. binocular camera with an 850 nm filter; 4. base with a magnet; 5. linear displacement sensor; 6. *Pieris rapae*.
