### *2.2. Pose Optimization*

Through the CO-HDC algorithm, the object, and in particular its contour, can be accurately separated, so the feature points on dynamic objects are removed while the static feature points are retained. This yields good feature point matching and allows pose estimation to be completed reliably. In visual SLAM, the pose refers to the robot's spatial position and orientation within the environment map; both must be accurately determined in three-dimensional space.

Figure 8 shows the principle of spatial measurement. It is assumed that, in two adjacent frames, the camera has no distortion and the two projection planes are parallel and coplanar. In the figure, *P* is an object, *Z* is its depth, *f* is the focal length of the camera, *T* is the distance between the optical centers of the two adjacent frames, *Ol* and *Or* are the optical centers of the two adjacent frames of the camera, respectively, and *xl* and *xr* are the horizontal coordinates of the projections of object *P* in the two adjacent frames, respectively. The depth of object *P* can then be obtained from the relationship of similar triangles:

$$\frac{T - (\mathbf{x}\_l - \mathbf{x}\_r)}{Z - f} = \frac{T}{Z} \Rightarrow Z = \frac{fT}{\mathbf{x}\_l - \mathbf{x}\_r} \tag{3}$$

**Figure 8.** The principle of spatial measurement.

*d* = *xl* − *xr* is defined as the parallax, so the depth of the target point can be obtained from the parallax together with *f* and *T*. After the parallax map is obtained, the coordinates of the target point in the world coordinate system can be recovered through the re-projection matrix. The re-projection matrix is:

$$Q = \begin{bmatrix} 1 & 0 & 0 & -c\_x \\ 0 & 1 & 0 & -c\_y \\ 0 & 0 & 0 & f \\ 0 & 0 & \frac{-1}{T} & \frac{\left(c\_x - c\_x'\right)}{T} \end{bmatrix} \tag{4}$$

In the above formula, *cx* and *cy* are the *x* and *y* coordinates of the principal point of the first frame, and *cx*′ is the *x* coordinate of the principal point of the second frame. Assuming that the identified coordinate of the target point is (*x*, *y*) and the parallax between the two adjacent frames is *d*, its coordinate in the world coordinate system can be recovered through Formula (5):

$$Q \begin{bmatrix} x \\ y \\ d \\ 1 \end{bmatrix} = \begin{bmatrix} x - c\_x \\ y - c\_y \\ f \\ \frac{-\left[d - \left(c\_x - c\_x'\right)\right]}{T} \end{bmatrix} = \begin{bmatrix} X \\ Y \\ Z \\ W \end{bmatrix} \tag{5}$$
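
Below is a minimal numerical sketch of Equations (3)–(5); the focal length, baseline, principal points and pixel values are illustrative assumptions, not calibrated parameters.

```python
# Illustrative depth-from-parallax and re-projection, following Equations (3)-(5).
import numpy as np

f, T = 718.9, 0.54              # focal length (pixels) and baseline (metres), assumed
cx, cy = 607.2, 185.2           # principal point of the first frame, assumed
cx2 = 607.2                     # principal-point x of the second frame, assumed

# Equation (3): depth from the parallax d = x_l - x_r
x_l, x_r = 350.0, 330.0
d = x_l - x_r
Z = f * T / d

# Equation (4): re-projection matrix Q
Q = np.array([
    [1.0, 0.0, 0.0,       -cx],
    [0.0, 1.0, 0.0,       -cy],
    [0.0, 0.0, 0.0,         f],
    [0.0, 0.0, -1.0 / T, (cx - cx2) / T],
])

# Equation (5): homogeneous back-projection of pixel (x, y) with parallax d
x, y = 400.0, 200.0
X, Y, Zh, W = Q @ np.array([x, y, d, 1.0])
point = np.array([X, Y, Zh]) / W   # world coordinates (sign depends on the convention chosen for T)
print(Z, point)
```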

In the proposed SLAM, the robot's pose is represented by seven parameters, a translation vector and a rotation quaternion, as shown in Equation (6):

$$T = [x, \; y, \; z, \; q\_x, \; q\_y, \; q\_z, \; q\_w] \tag{6}$$

The first three components form the translation vector, and the last four form the rotation quaternion.
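
Below is a minimal sketch of how the seven-parameter pose of Equation (6) can be expanded into a homogeneous transform; the quaternion order (*qx*, *qy*, *qz*, *qw*) and its unit normalization are assumptions consistent with the text.

```python
# Convert the pose [x, y, z, qx, qy, qz, qw] of Equation (6) into a 4x4 transform.
import numpy as np

def pose_to_matrix(x, y, z, qx, qy, qz, qw):
    # Rotation matrix from a unit quaternion
    R = np.array([
        [1 - 2 * (qy * qy + qz * qz), 2 * (qx * qy - qz * qw),     2 * (qx * qz + qy * qw)],
        [2 * (qx * qy + qz * qw),     1 - 2 * (qx * qx + qz * qz), 2 * (qy * qz - qx * qw)],
        [2 * (qx * qz - qy * qw),     2 * (qy * qz + qx * qw),     1 - 2 * (qx * qx + qy * qy)],
    ])
    M = np.eye(4)
    M[:3, :3] = R
    M[:3, 3] = [x, y, z]      # translation vector
    return M

# Identity rotation, 1 m translation along x (illustrative values)
print(pose_to_matrix(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0))
```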

The task of the tracking thread is to calculate the relative pose between two adjacent frames according to the image change. This means that not only the distance moved between the frames but also the angle of rotation must be calculated. The results are then handed over to the back end, which accumulates and optimizes the relative poses between frames.

Let *I*1 and *I*2 be the two images obtained before and after. After feature extraction, feature point *p*1 is obtained in *I*1 and feature point *p*2 is obtained in *I*2. If feature matching identifies *p*1 and *p*2 as the closest point pair, then *p*1 and *p*2 are taken to be the projections of the same 3D point *P* onto the two frames:

$$p\_1 = KP, \; p\_2 = K(TP) \tag{7}$$

where *K* is the camera's intrinsic parameter matrix and *T* is the pose of *I*1 relative to *I*2. When the camera is at different positions, point *P* yields different pixel coordinates through the intrinsic matrix; these are the projections *p*1 and *p*2. Assuming that multiple point pairs can be matched between the two frames, equations can be constructed from these pairs to solve for the relative pose. Specifically, it can be obtained by solving the fundamental matrix and the homography matrix.

However, Equation (7) holds only when point *P* is stationary, i.e., when the whole environment is static. If the points used in pose estimation are moving, the equation no longer holds and errors arise. In the worst case, all the pixels participating in pose estimation move together with the camera, and the estimated pose will always be zero.
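
As an illustration only (the paper does not specify the solver), the following sketch recovers the relative pose from the remaining static keypoint matches using OpenCV's essential-matrix estimation with RANSAC, which also rejects residual dynamic points as outliers; the function and variable names are assumptions.

```python
# Relative pose from matched static keypoints (illustrative, using OpenCV).
import cv2
import numpy as np

def relative_pose(pts1, pts2, K):
    """pts1, pts2: Nx2 float arrays of matched pixel coordinates; K: 3x3 intrinsics."""
    # Essential matrix with RANSAC; moving points tend to be rejected as outliers
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
    # Decompose E into rotation R and unit-scale translation t
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)
    return R, t
```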

### *2.3. Global Optimization Module and Mapping Module*

The tracking module estimates the camera pose through keypoint matching and pose optimization. An instance segmentation function is added to the tracking thread, and the original image is segmented at the same time as feature extraction. The pixel coordinates of humans and animals are then obtained. Finally, the feature points distributed on humans or animals are removed from the original feature point set.

After culling these feature points, feature matching and pose estimation are performed. With the interference of dynamic pixels removed, the instance SLAM shows better anti-interference ability in dynamic scenes, and the accuracy is greatly improved. This module also determines whether to insert a new keyframe. When a frame is considered suitable as a new keyframe, it is sent to the mapping module and the global optimization module.
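
A minimal sketch of the feature-point culling step is given below; the mask format and helper name are assumptions, with the instance-segmentation masks of humans and animals merged into one boolean image.

```python
# Remove feature points that fall on dynamic objects (humans/animals), keeping static ones.
import numpy as np

def filter_static_keypoints(keypoints, dynamic_mask):
    """keypoints: Nx2 array of (x, y) pixel coordinates.
    dynamic_mask: HxW boolean array, True on dynamic-object pixels."""
    xs = np.round(keypoints[:, 0]).astype(int)
    ys = np.round(keypoints[:, 1]).astype(int)
    h, w = dynamic_mask.shape
    inside = (xs >= 0) & (xs < w) & (ys >= 0) & (ys < h)
    on_dynamic = np.zeros(len(keypoints), dtype=bool)
    on_dynamic[inside] = dynamic_mask[ys[inside], xs[inside]]
    return keypoints[inside & ~on_dynamic]
```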

In the mapping module, to eliminate mismatches or inaccurate matches, a new 3D point is triangulated when a keyframe is inserted, the projected points and lines are optimized, and a projection matrix is added. This process is equivalent to minimizing the photometric difference between the pixel block around the projected point *ui* in the current frame and the block around the corresponding point *ur* in the reference frame. The model expression is:

$$u\_i = \operatorname\*{argmin}\_{u\_i} \frac{1}{2} \sum\_{i} ||I\_c(u\_i) - I\_r[A(u\_r)]||\_2^2 \tag{8}$$

where *Ic* and *Ir* are the current frame and the reference frame, respectively, and *A* is the projection matrix. The projection matrix formula is as follows:

$$\begin{bmatrix} x' \\ y' \end{bmatrix} = R \begin{bmatrix} x \\ y \end{bmatrix} + \begin{bmatrix} t\_x \\ t\_y \end{bmatrix} \tag{9}$$

where *R* is the matrix representing rotation and scaling, *x* and *y* are the coordinates before projection, and *tx* and *ty* represent the translation distances.
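
The following sketch evaluates the photometric cost of Equations (8) and (9) for one projected point: the block around *u* in the current frame is compared with the block warped by *R* and (*tx*, *ty*) in the reference frame. The images, block size and transform values are illustrative assumptions, and bilinear sampling via SciPy is used.

```python
# Photometric block cost after a rotation/scale + translation warp (Equations (8)-(9)).
import numpy as np
from scipy.ndimage import map_coordinates

rng = np.random.default_rng(0)
I_c = rng.random((120, 160))            # current frame (toy image)
I_r = I_c.copy()                        # reference frame (toy image)

theta, s = np.deg2rad(2.0), 1.0         # assumed rotation and scale
R = s * np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
t = np.array([1.5, -0.5])               # assumed translation (t_x, t_y)

def photometric_cost(u, half=4):
    """Equation (8) for one point u = (x, y), over a (2*half+1)^2 pixel block."""
    xs, ys = np.meshgrid(np.arange(u[0] - half, u[0] + half + 1),
                         np.arange(u[1] - half, u[1] + half + 1))
    pts = np.stack([xs.ravel(), ys.ravel()])          # (2, N) block coordinates
    warped = R @ pts + t[:, None]                     # Equation (9)
    block_c = map_coordinates(I_c, [pts[1], pts[0]], order=1)
    block_r = map_coordinates(I_r, [warped[1], warped[0]], order=1)
    return 0.5 * np.sum((block_c - block_r) ** 2)

print(photometric_cost(np.array([80.0, 60.0])))
```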

In the process of global optimization, it is necessary to eliminate the accumulated errors caused by the odometry. The matching algorithm we use is an image matching method based on pixel values. Its purpose is to find a rigid geometric transformation that makes each pixel of the local map match the global map as closely as possible. The inverse compositional algorithm solves this image matching problem in three steps, given by the following formulations:

The first step is to calculate the Hessian matrix *H*:

$$H = \sum\_{\mathbf{x}} \left[ \nabla I\_{PM}(\mathbf{x}) \frac{\partial \mathcal{W}}{\partial p} \right]^T \left[ \nabla I\_{PM}(\mathbf{x}) \frac{\partial \mathcal{W}}{\partial p} \right] \tag{10}$$

where *IPM* is the global map image, *x* is the coordinate of a pixel in the image, *p* = [Δ*x*, Δ*y*, *θ*]^*T* represents the translation and rotation vector, and *I*(*W*(*x*; *p*)) represents the Euclidean transformation of vector *p* applied to image *I*(*x*).

The second step is to calculate the new vector Δ *p*:

$$\Delta p = H^{-1} \sum\_{\mathbf{x}} \left[ \nabla I\_{PM}(\mathbf{x}) \frac{\partial \mathcal{W}}{\partial p} \right]^T \left[ I\_{LM}(\mathcal{W}(\mathbf{x}; p)) - I\_{PM}(\mathbf{x}) \right] \tag{11}$$

where *ILM* is the image of the local sub-map.

Step 3: Update vector *p*:

$$p = p + \Delta p \tag{12}$$

The final output *p* of the algorithm represents the translation and rotation between the maps, which eliminates the accumulated errors in global map construction and also alleviates the trajectory drift that often occurs in visual SLAM.
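
Below is a minimal transcription of the three steps in Equations (10)–(12), assuming a 2D Euclidean warp *p* = [Δ*x*, Δ*y*, *θ*], bilinear sampling via SciPy, and the additive update of Equation (12); it is an illustrative sketch, not the authors' implementation.

```python
# Pixel-based map matching following Equations (10)-(12).
import numpy as np
from scipy.ndimage import map_coordinates

def match_maps(I_pm, I_lm, n_iters=30):
    """I_pm: global map image; I_lm: local sub-map image (same size, illustrative)."""
    h, w = I_pm.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    gy, gx = np.gradient(I_pm)                         # gradient of the global map

    # Steepest-descent images grad(I_PM) * dW/dp for p = [dx, dy, theta] (evaluated at p = 0)
    sd = np.stack([gx, gy, -ys * gx + xs * gy], axis=-1).reshape(-1, 3)

    # Step 1, Equation (10): Hessian H, computed once from the global map
    H = sd.T @ sd

    p = np.zeros(3)                                    # [dx, dy, theta]
    for _ in range(n_iters):
        dx, dy, th = p
        # Euclidean warp W(x; p) and sampling of the local map
        wx = np.cos(th) * xs - np.sin(th) * ys + dx
        wy = np.sin(th) * xs + np.cos(th) * ys + dy
        I_lm_w = map_coordinates(I_lm, [wy.ravel(), wx.ravel()], order=1).reshape(h, w)

        # Step 2, Equation (11): parameter increment from the photometric error
        err = (I_lm_w - I_pm).reshape(-1, 1)
        dp = np.linalg.solve(H, sd.T @ err).ravel()

        # Step 3, Equation (12): update p
        p = p + dp
        if np.linalg.norm(dp) < 1e-6:
            break
    return p                                           # translation and rotation between the maps
```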

#### **3. Tests and Results Analysis**

In order to demonstrate the advantages of the CO-HDC instance segmentation algorithm proposed in this paper and test the actual effect of visual SLAM based on CO-HDC instance segmentation, the experiments are divided into two parts. First, we evaluate the performance of the CO-HDC instance segmentation algorithm. Second, we test the performance of the visual SLAM based on the CO-HDC instance segmentation algorithm and judge the effect of feature point matching and real-time modeling.

#### *3.1. Experiment of CO-HDC Instance Segmentation Algorithm*

In order to test the accuracy and efficiency of the proposed contour enhancement instance segmentation algorithm, the following experiments are carried out:


#### 3.1.1. Network Hyperparameter Selection and Controlled Experiments

Instance segmentation can remove dynamic objects, which increases the accuracy of visual SLAM. In order to integrate better with visual SLAM, the instance segmentation network model needs to be optimized. Therefore, ten comparative experiments were conducted with the hybrid dilated CNN to select appropriate network parameters and to observe the effect of transfer learning on training time, accuracy and training data volume. The hyperparameter selections and the corresponding results are shown in Table 3. *mAP* is the mean average precision, and *mIoU* is the mean intersection over union; in this paper, *mAP* and *mIoU* are used to evaluate the quality of the trained network. In order to strictly evaluate the performance of the method, the IoU thresholds for *mAP* are set to 0.5 and 0.7, respectively: detections with IoU greater than or equal to the threshold are counted as true positives, while those below the threshold are false positives. The *mIoU* and *mAP* indicators for each experiment are shown in the last three rows of the table, and the experiment contents and results are analyzed in detail below.
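
For concreteness, the sketch below shows the mask-IoU computation and the thresholded true-positive/false-positive decision described above; the binary-mask representation is an assumption for illustration.

```python
# IoU between instance masks and the TP/FP decision at a given threshold.
import numpy as np

def mask_iou(pred, gt):
    """Intersection over union of two boolean masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 0.0

def classify(pred, gt, threshold=0.5):
    """True positive if IoU >= threshold, otherwise false positive."""
    return "TP" if mask_iou(pred, gt) >= threshold else "FP"

pred = np.zeros((10, 10), dtype=bool); pred[2:8, 2:8] = True
gt   = np.zeros((10, 10), dtype=bool); gt[3:9, 3:9]   = True
print(mask_iou(pred, gt), classify(pred, gt, 0.5), classify(pred, gt, 0.7))
```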


**Table 3.** Hyperparameters selection comparison experiments.

Train obj. and Val obj. are the total numbers of training and validation objects, respectively, and Train imag. and Val imag. are the numbers of training and validation images. Epochs is the number of passes over the whole training set, and Mini-mask Shape is the size of the mini mask. Img. Size is the size of the input image, and RPN Anchor Scales are the scales of the anchors. The Pretrain Model is the 80-class model pre-trained on the COCO data set.

Test 1 and Test 2 use the same Non-Maximum Suppression (NMS) threshold, base learning rate, and other hyperparameters but different numbers of epochs. Feeding all data into the network for one pass is called an epoch, and the number of epochs is set to 100 and 200, respectively. With the increase of epochs, the value of *mAP* (*IoU* > 0.5) increased only slightly, from 0.569 in Test 1 to 0.586 in Test 2. Even with a low number of iterations the network converged easily, indicating that the convergence behavior of the algorithm in this paper is good.

In Test 3 and Test 6, more images were used for training and testing, with the number of epochs unchanged. The results showed a decrease in detection rate, which was later improved in Test 4 by increasing the number of epochs, resulting in an *mAP* (*IoU* > 0.5) of 0.565.

In Test 5, we evaluated the effect of image width and height, increasing the size of the training images from 1024 × 800 to 1920 × 1080 and the learning rate from the default 0.001 to 0.02, with the rest of the parameters as in Test 4. The algorithm performed poorly (*mAP* (*IoU* > 0.5) = 0.395), indicating that the accuracy on high-resolution images is low under the current parameters.

In Test 6, we reduced the size of the mini-mask from 56 × 56 to 28 × 28; compared with Test 4, we found some improvement in network performance.

Therefore, in Test 7, we reduced the RPN anchor scales, raised the input image resolution to 1920 × 1080, and set the mini-mask to 28 × 28. The performance of the network was greatly improved and was close to the network performance in Test 6.

In Test 8, we used the same configuration as Test 7 and further reduced the RPN anchor scales. The performance of the network was again greatly improved, and (8, 16, 32, 64) was considered the best set of RPN anchor scales for the network.

In Test 9, in order to improve the training accuracy, reduce the training time and prevent overfitting, we reduced the amount of training data on the basis of Test 8 and found that the network performance decreased significantly.

In Test 10, we further compressed the training data on the basis of Test 9, kept the other parameters unchanged, and used the 80-class model pre-trained on the COCO data set for transfer learning. The results showed that the network performance was basically the same as that of Test 8, reaching a high level, while the training time was half that of Test 8. The resulting network can accurately detect and segment vehicle images.

From the ten comparative experiments, it can be seen that more training data, higher image resolution, a smaller mini-mask and smaller RPN anchor scales lead to better network performance. The results show that 100 epochs are enough to achieve convergence for target detection. At the same time, adding a pre-trained model reduces the amount of training data required. In conclusion, Test 10 achieves the best balance among training data, image resolution, mask size, epochs, RPN anchor scales and the other parameters. An appropriate data volume and resolution ensure not only high speed but also high precision. At the same time, the transfer learning method reduces the training data and training time while improving detection accuracy. Therefore, we set the parameters of Test 10 as our optimal network parameters and carried out the subsequent experiments and studies with them.
