1. Introduction
Visual odometry (VO), as the core of solving the autonomous positioning problem of robots, has long attracted researchers in the computer vision field. Although binocular vision has advantages in accuracy, the monocular system still holds certain advantages for automobiles [1], UAVs [2], and other industries, owing to the steady decline in the cost of consumer-level monocular cameras in recent years and the lower calibration workload. Therefore, the challenges of visual odometry are both fundamental and practical. At present, typical monocular visual odometry methods include ORB-SLAM3 [3] based on the feature point method, DSO [4] based on the direct method, SVO [5] based on a semi-direct method, and VINS-Mono [6] combined with inertial navigation equipment. At the same time, with the development of neural networks, researchers have carried out many explorations of visual odometry and SLAM methods based on deep learning, such as the SLAM method combining SVO and a CNN [7], the SLAM method based on semantic segmentation [8], the localization method based on planar target features [9], and unsupervised features for visual odometry [10,11].
At the same time, scale drift and scale uncertainty have been another focus and difficulty in monocular vision research. Researchers have done considerable work on reducing the scale drift of monocular visual SLAM, such as the monocular SLAM scale correction method based on Bayesian estimation [12], the low-drift SLAM scheme estimated from geometric information and surface normals [13], and VIO methods that exploit the characteristics of inertial navigation devices [6,14,15]. Among them, the VINS method effectively solves the monocular scale uncertainty problem owing to the combination with inertial navigation devices, and greatly reduces the error accumulation caused by drift in the monocular system. However, schemes combined with inertial navigation devices have three main disadvantages: (1) the sensor equipment is more expensive; (2) the calibration of the equipment, i.e., the synchronization mechanism between inertial data and visual data, is more complex; and (3) the front-end (the visual odometry) and the back-end optimization algorithms required by data fusion have higher complexity.
Based on the above summary, and exploiting the fact that the target in a tracking problem often carries size reference information, we designed a multi-level scale stabilizer that uses feature baselines at different levels to solve the real scale of the camera. Using the size of the target as the top-level feature baseline, the baseline information is transmitted to the features of each level, and the real proportion of the displacement T in the motion of the monocular camera is then solved, so as to resolve the scale uncertainty and reduce drift. We note that feature matching in traditional visual odometry often directly uses feature points or pixel gradient information such as oriented FAST and rotated BRIEF (ORB) [16], and then uses the epipolar constraint [17] and the random sample consensus algorithm [18] to obtain reliable motion matching. At the same time, due to the scale equivalence of the essential matrix E in the epipolar constraint, the resulting displacement T also faces the scale uncertainty problem. To solve this problem, we propose a multi-level abstract feature mechanism for the target tracking problem. The top level is the target in the tracking problem; the second level is the set of feature matching regions obtained by a self-supervised feature matching model based on a Siamese neural network; and the third level is the set of traditional ORB-style feature points. Finally, spatial scale transfer is carried out using the prior information of the top-level target size, so as to obtain the baseline sizes of the feature regions and solve the real motion scale.
This research aims to solve the problem of failed navigation caused by the insufficient autonomous positioning and spatial depth estimation ability of mobile robots (autonomous vehicles, UAVs, etc.) carrying monocular vision sensors when fulfilling target tracking and obstacle avoidance tasks in an unfamiliar environment. The main causes of the problem are the spatial depth estimation error introduced by the uncertain scale in traditional monocular visual odometry and the accumulation of positioning error resulting from scale drift. We combined a target tracking model and monocular visual odometry to obtain better depth estimation and reduce error accumulation, and studied how to use the size information of the target to solve and transfer the scale so as to reduce the drift of the monocular vision system in the indoor target tracking problem. A multi-level scale stabilizer was defined using a self-supervised learning model. Based on the scale information of the target, the feature baseline extracted from the feature regions obtained through self-supervised learning was treated as the carrier of scale information. With the spatial positioning of the target achieved, the scale information was transmitted to the original visual odometry, which was effective in reducing scale drift. According to this method, the extraction and transmission of feature baselines were classified into three levels, and clear transmission relations and confidence weights between the levels were defined. The main contributions of this paper are detailed as follows:
A multi-level scale stabilizer (MLSS-VO) based on monocular VO is proposed. The size of the target in the task of tracking is taken as the prior information of the top level (Level 1), the feature region extracted using the self-supervised network is treated as the second level (Level 2), and the feature point set as obtained by the traditional method is regarded as the third level (Level 3). In particular, priority is given to the feature points in the self-supervised matching region for more reliable matching constraints. Then, the feature points on the third level are used to construct the original VO using the feature point method, thus obtaining the attitude and motion of the camera with scale error. On this basis, the size information of the top level and the feature baseline information of the second and third levels are combined to solve the real trajectory of camera motion;
Based on deep local descriptors intended for instance-level recognition, a Siamese neural network model [19] suitable for matching features across a motion video stream is proposed, and a baseline acquisition mechanism within the feature regions is designed;
Through the combination of MLSS-VO and a target space positioning algorithm, an algorithm framework is designed for autonomous positioning and target tracking based on a monocular mobile platform. During multiple sets of indoor target tracking experiments, the motion capture system is adopted to verify that the root mean squared error (RMSE) of the algorithm is less than 3.87 cm in the indoor test environment, and that the RMSE of the moving target tracking is less than 4.97 cm, which indicates the effectiveness of the algorithm in indoor autonomous positioning and target tracking.
The rest of this paper is organized as follows. Section 2 introduces the self-supervised learning model and the feature baseline extraction method. Section 3 elaborates on the process and framework of the MLSS-VO algorithm based on the multi-level scale stabilizer. Section 4 details how the multi-level feature baselines are used to solve the scale, and proposes a target tracking algorithm framework for a monocular motion platform on the basis of MLSS-VO. Section 5 presents the performance analysis of the MLSS-VO algorithm and the verification of the indoor tracking experiments. Section 6 concludes the paper.
2. Multi-Level Feature Extraction
The target recognition and feature baseline extraction in Level 1 can refer to the authors' previous work [20], and the ORB extraction method used in Level 3 has been described in a large number of related studies. Therefore, the feature extraction methods for Levels 1 and 3 are not described in detail in this paper, and the emphasis is placed on the extraction of the self-supervised feature matching regions in Level 2.
2.1. Self-Supervised Feature Region Learning
We studied the feature matching model based on deep local descriptors for instance-level recognition [19,21] and improved its training scheme to reduce the cost of self-supervised feature learning. The original model focuses on the description of local features. In this paper, a Siamese network is used to extract the feature descriptors, shifting the focus from describing individual features to measuring the gap between features; this gives higher universality and makes the model more suitable for visual applications such as VO. During training, the Siamese network extracts the D-dimensional feature descriptors of the sample images for correlation calculation: if the similarity of the two input images is higher than 80%, the output is 1, otherwise 0. In practical applications, only a small number of reference images of similar environments are needed for training. The training samples are constructed by shifting and scaling the same image to form positive samples, and by pairing unrelated images to form negative samples.
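As a rough illustration of this pair-construction strategy, the following minimal OpenCV/NumPy sketch (not the authors' released code; function names, shift ranges, and scale ranges are assumptions) builds positive pairs by shifting and scaling a reference image and negative pairs from unrelated images:

```python
import random
import cv2
import numpy as np

def random_shift_scale(img, max_shift=0.1, scale_range=(0.8, 1.2)):
    """Create a positive sample by randomly shifting and scaling the source image."""
    h, w = img.shape[:2]
    scale = random.uniform(*scale_range)
    tx = random.uniform(-max_shift, max_shift) * w
    ty = random.uniform(-max_shift, max_shift) * h
    # 2x3 affine matrix: isotropic scaling about the image centre plus a translation
    m = np.float32([[scale, 0, (1 - scale) * w / 2 + tx],
                    [0, scale, (1 - scale) * h / 2 + ty]])
    return cv2.warpAffine(img, m, (w, h), borderMode=cv2.BORDER_REFLECT)

def build_pairs(reference_images, n_pairs=1000):
    """Yield (image_a, image_b, label) tuples: label 1 for positive, 0 for negative.
    Assumes at least two reference images are available."""
    pairs = []
    for _ in range(n_pairs):
        if random.random() < 0.5:                      # positive pair: same scene, warped
            img = random.choice(reference_images)
            pairs.append((img, random_shift_scale(img), 1))
        else:                                          # negative pair: unrelated scenes
            img_a, img_b = random.sample(reference_images, 2)
            pairs.append((img_a, img_b, 0))
    return pairs
```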
The structure of the training model is shown in Figure 1. The training model used in this paper is introduced as follows:
- 2. Feature strength: the feature strength is used to weight the contribution of each feature descriptor in matching, so that the activation can be reduced from D dimensions to 1 dimension. Considering that the influence of weak features during training is limited, the strongest features can be selected as the preferred matches during testing;
- 3. Local smoothing: the activations are spatially smoothed by average pooling in a 3 × 3 neighborhood. The main function of this step is to make the features more dispersed and smoother;
- 4. Descriptor whitening: the local descriptor dimensions are de-correlated by a linear whitening transformation. Following the PCA dimension reduction method [22], the whitening is implemented as a 1 × 1 convolution with bias, which reduces the dimension of the feature tensor;
- 5. Training based on a Siamese network: let $\mathcal{X}_A$ and $\mathcal{X}_B$ denote the sets of dense feature descriptors extracted from images $I_A$ and $I_B$. The image similarity is given by the strength-weighted correlation of the two descriptor sets,
$$ s(I_A, I_B) = \sum_{x \in \mathcal{X}_A} \sum_{y \in \mathcal{X}_B} w(x)\, w(y)\, \langle x, y \rangle, $$
where $w(\cdot)$ is the feature strength; in actual use, these weights are the quantities adjusted during training. For a positive sample pair, the similarity output is trained toward 1; for a negative sample pair, it is trained toward 0.
Figure 1. Training architecture overview.
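For concreteness, the sketch below shows one way the weighted descriptor correlation of item 5 could be implemented and trained against the 0/1 pair labels. It is a hedged PyTorch sketch: the function names, the temperature parameter, and the sigmoid squashing are our assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def siamese_similarity(desc_a, w_a, desc_b, w_b, temperature=0.1):
    """Strength-weighted correlation between two sets of dense descriptors.

    desc_a: (Na, D) descriptors of image A; w_a: (Na,) feature-strength weights.
    desc_b: (Nb, D) descriptors of image B; w_b: (Nb,) feature-strength weights.
    Returns a scalar in (0, 1), so positive pairs can be trained toward 1
    and negative pairs toward 0.
    """
    desc_a = F.normalize(desc_a, dim=1)
    desc_b = F.normalize(desc_b, dim=1)
    corr = desc_a @ desc_b.t()                       # (Na, Nb) cosine correlations
    weights = w_a.unsqueeze(1) * w_b.unsqueeze(0)    # (Na, Nb) pairwise strengths
    score = (weights * corr).sum() / (weights.sum() + 1e-8)
    return torch.sigmoid(score / temperature)        # squash to (0, 1)

def pair_loss(similarity, label):
    """Binary cross-entropy against the 0/1 pair label (label is a float tensor)."""
    return F.binary_cross_entropy(similarity, label)

# usage sketch: loss = pair_loss(siamese_similarity(da, wa, db, wb), torch.tensor(1.0))
```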
Accordingly, at test time, the extracted feature matrix is multiplied by the weight matrix to obtain the key-region feature matching, as shown in Figure 2, which also illustrates the structure of the training and test models. The feature-strength weights are used to select the required N strongest feature matching regions, which provide the reference regions for the feature baseline extraction described below.
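As an illustration of this top-N selection step, the following NumPy sketch (the parameter names and the simple non-maximum suppression are assumptions) picks the N strongest locations in a feature-strength map and returns square candidate regions:

```python
import numpy as np

def select_top_regions(strength_map, n_regions=5, region_size=32):
    """Pick the N strongest locations in a 2D feature-strength map and return
    square regions (x, y, w, h), suppressing overlapping picks so that the
    regions spread over the image."""
    order = np.argsort(strength_map, axis=None)[::-1]      # strongest first
    centres, half = [], region_size // 2
    for idx in order:
        y, x = np.unravel_index(idx, strength_map.shape)
        # simple non-maximum suppression: skip locations too close to a chosen centre
        if any(abs(x - cx) < region_size and abs(y - cy) < region_size
               for cx, cy in centres):
            continue
        centres.append((x, y))
        if len(centres) == n_regions:
            break
    return [(max(0, cx - half), max(0, cy - half), region_size, region_size)
            for cx, cy in centres]
```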
Level 2 has two main advantages: (1) the reliable feature matching regions obtained by self-supervised learning serve as the scale transfer reference of Level 2, with low training cost and fast speed; (2) the ORB feature points inside the Level 2 regions are selected first as the matching points of the original VO to solve the camera motion, making use of the richer information and higher reliability within the self-supervised feature regions.
2.2. Feature Baseline Extraction
For Level 1, the feature baseline is abstracted as a size feature of the target being tracked, such as an edge length or radius; the extraction method follows our previous work [20]. The following focuses on the feature baseline extraction methods for Levels 2 and 3.
Figure 3 shows the extraction method of feature baseline, as follows:
Level 2 feature baseline: the dashed box in Figure 3a is the feature region learned in Level 2, and the dots represent the ORB feature points in the region. One point (red) is randomly selected as one end of the baseline, and the five points farthest from it in 2D pixel distance within the region are selected as candidates (colored feature points). Among the candidates, the point whose ORB descriptor has the largest Hamming distance from the descriptor of the red point (the yellow feature point) is selected to form the feature baseline. For robustness, the candidate with the next-largest Hamming distance from the red point's descriptor (the blue feature point) can be selected to form a spare feature baseline;
Level 3 feature baseline: as shown in Figure 3b, a point (red) is randomly selected as one end of the baseline, and a ring area with given inner and outer diameters around it is selected as the candidate feature area. The subsequent baseline extraction is the same as in Level 2. For the ring radii, at a resolution of 1280 × 720 we recommend values of 10 to 30 pixels: if the distance is too small, the error of the solution is large; if the distance is too large, the baseline only exists for a very short period because of camera movement and is easily lost.
Figure 3. Feature baseline extraction: (a) feature baseline extraction in Level 2; (b) feature baseline extraction in Level 3.
It should be noted that the ORB feature points mentioned above are the feature points filtered by the RANSAC (random sample consensus) algorithm in the VO, that is, static feature points in the environment. Being static is a necessary condition for a feature baseline in Levels 2 and 3.
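To make the Level 2/3 baseline selection concrete, the following OpenCV/NumPy sketch chooses a baseline endpoint by maximum Hamming distance among candidates inside a ring around a randomly chosen ORB anchor point; for a Level 2 region, the inner radius can be set to zero and the candidates restricted to the learned region. The function name and thresholds are assumptions, not the authors' implementation.

```python
import cv2
import numpy as np

def extract_baseline(keypoints, descriptors, anchor_idx, r_min=10, r_max=30):
    """Pick a primary and a spare feature baseline anchored at keypoints[anchor_idx].

    keypoints/descriptors: output of cv2.ORB_create().detectAndCompute (RANSAC-filtered).
    Candidates lie inside a ring [r_min, r_max] pixels around the anchor (Level 3);
    use r_min=0 and region-restricted keypoints for Level 2.
    """
    anchor = np.array(keypoints[anchor_idx].pt)
    candidates = [i for i, kp in enumerate(keypoints)
                  if i != anchor_idx
                  and r_min <= np.linalg.norm(np.array(kp.pt) - anchor) <= r_max]
    if not candidates:
        return None
    # Hamming distance between binary ORB descriptors, largest first
    ham = [cv2.norm(descriptors[anchor_idx], descriptors[i], cv2.NORM_HAMMING)
           for i in candidates]
    order = np.argsort(ham)[::-1]
    primary = (anchor_idx, candidates[order[0]])
    spare = (anchor_idx, candidates[order[1]]) if len(order) > 1 else None
    return primary, spare
```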
4. Multi-Level Scale Stabilizer Intended for Visual Odometry (MLSS-VO) Implementation Based on Target Tracking
4.1. Coordinate System and Typical Target
Figure 7 shows the relationship between the three coordinate systems, where blue, green, and orange represent projections under the world coordinate system, camera coordinate system, and pixel coordinate system, respectively.
The input of the depth estimation module is a known pinhole camera model and the pixel coordinates of the feature points, and the output is a set of depth estimation equations with the depth values of the feature points as the unknowns. In this paper, the target features are first abstracted as geometric shapes such as parallelograms, as shown in Figure 8. The coordinates of the target are defined as $P_w = (X_w, Y_w, Z_w)^T$ in the world coordinate system, $P_c = (X_c, Y_c, Z_c)^T$ in the camera coordinate system, and $p = (u, v)^T$ in the two-dimensional image coordinate system, while the normalized projection point on the normalized plane is defined as $p_n = (x_n, y_n, 1)^T$.
Similarly, for a circular target, we consider the properties of circular projection: the projection of a circle is a circle or an ellipse, and the projection of the circle's center is still the center of the projected figure. Using image processing, we can obtain the center of the projected figure. According to the projection properties, any line through the center intersects the figure at two points, and in space the segment from the center to each intersection corresponds to the radius R, as shown in Figure 9. According to the symmetry of the circle, the shape composed of the four intersections of two such lines must be a parallelogram, so solving the depth problem essentially reduces to the above method of solving the parallelogram.
It should be noted that, in practical engineering, the side lengths of this quadrilateral are difficult to obtain directly, whereas the half-diagonal distance equals the radius Rc. As above, we therefore usually combine the diagonal distance Rc with the parallelogram condition equations to build the equation set F, solve for the depths, and obtain the spatial position of the circular target.
4.2. Scale Solver
Figure 10 shows a schematic diagram of the typical double-motion problem. Here, we abstract the moving target as a rigid body such as a parallelogram. The rotation $R$ of the monocular camera between the two frames has been obtained by the traditional VO method. Because of the scale uncertainty of monocular vision, the translation estimated by the VO is only proportional to the actual spatial motion; what we need to solve is the real translation $T_0$. At the same time, the baseline size of the moving rigid-body target is given: considering that the length of P1P2 in the figure is the Level 1 size, we can solve the actual $T_0$ by solving the motion of the rigid body.
Here we illustrate the solution. First, the pixel coordinates of each point in Figure 10 are defined as $p_i = (u_i, v_i)^T$, and the camera-frame coordinates of each point as $P_i^c = (X_i^c, Y_i^c, Z_i^c)^T$, $i = 1, \dots, 4$. The camera intrinsic matrix $K$ is known, so from the projection relationship:
$$ Z_i^c \begin{pmatrix} u_i \\ v_i \\ 1 \end{pmatrix} = K P_i^c, \quad \text{i.e.,} \quad P_i^c = Z_i^c\, K^{-1} (u_i, v_i, 1)^T. $$
Then, we use the known baseline length $L$ (the Level 1 size of P1P2) to establish the baseline length constraint equation
$$ \left\| P_1^c - P_2^c \right\| = L. $$
Finally, using the properties of the rigid body, the sides P1P2 and P4P3 establish the rigid-body parallel constraint
$$ P_2^c - P_1^c = P_3^c - P_4^c. $$
Considering that the intrinsic matrix $K$ and the baseline length $L$ are known quantities, we can solve the resulting equation sets for the frames before and after the movement in Figure 10, respectively. It is worth noting that a reasonable selection of equations and an appropriate numerical iterative method can improve efficiency.
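Under the assumptions above (pinhole intrinsics $K$, four corner pixels, known side length $L$), a minimal sketch of the residual function F whose roots are the unknown corner depths might look as follows; the vertex ordering and function names are illustrative only:

```python
import numpy as np

def depth_residuals(z, pixels, K, baseline_length):
    """Residual vector F(z) for a parallelogram target P1..P4.

    z: (4,) unknown depths of the four corners in the camera frame
    pixels: (4, 2) pixel coordinates of the corners
    K: (3, 3) camera intrinsic matrix
    baseline_length: known metric length of side P1P2 (the Level 1 baseline)
    """
    z = np.asarray(z, dtype=float)
    pts_h = np.hstack([pixels, np.ones((4, 1))])       # homogeneous pixel coordinates
    rays = (np.linalg.inv(K) @ pts_h.T).T              # normalised viewing directions
    P = rays * z[:, None]                              # camera-frame corner positions

    res = [
        np.linalg.norm(P[1] - P[0]) - baseline_length,  # known side length constraint
        # parallelogram condition: P1P2 parallel and equal to P4P3 (3 scalar equations)
        *(P[1] - P[0] - (P[2] - P[3])),
    ]
    return np.array(res)                               # 4 residuals for 4 unknown depths
```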
Next, we solve the scale. First, the displacement of the baseline endpoint P1 relative to the camera between the two frames is defined from the camera-frame coordinates solved above. At the same time, according to the spatial relationship between the camera motion and the target motion, the real camera translation can be expressed in terms of this relative displacement and the motion of the rigid-body target.
At this point, we obtain the real $T_0$, that is, the scale information, and the scale recovery of the VO using the Level 1 baseline is complete. Next, for the features of Levels 2 and 3, the problem becomes simpler; Figure 10 is still taken as an example. At this time, the target length is replaced by the feature baseline extracted by the algorithm from the image, while the camera motion recovered in Level 1 is known; the problem is transformed into solving the unknown metric lengths of the baselines in Levels 2 and 3. Using the same method, the feature baseline values in Levels 2 and 3 can be obtained.
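A hedged sketch of this scale recovery step: given the up-to-scale translation from the original VO and the metric camera-frame coordinates of a baseline recovered as above, the scale factor is a single ratio of baseline lengths. The variable names are assumptions:

```python
import numpy as np

def recover_scale(t_unit, p1_cam, p2_cam, p1_vo, p2_vo):
    """Recover the metric camera translation T0.

    t_unit:         (3,) translation from the essential matrix, correct only up to scale
    p1_cam, p2_cam: metric camera-frame coordinates of the baseline endpoints
                    obtained from the depth solution (e.g. the Level 1 baseline)
    p1_vo, p2_vo:   the same endpoints triangulated by the original VO, i.e.
                    expressed in the VO's arbitrary scale
    """
    metric_len = np.linalg.norm(np.asarray(p2_cam) - np.asarray(p1_cam))  # true length
    vo_len = np.linalg.norm(np.asarray(p2_vo) - np.asarray(p1_vo))        # VO-scale length
    s = metric_len / vo_len                                               # scale factor
    return s * np.asarray(t_unit)                                         # metric T0
```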
4.3. Scale Weighting and Updating
We continuously check the images to learn and extract new Level 2 and Level 3 features, adding them to the multi-level scale stabilizer and removing features that have left the image. Theoretically, if detection and scale calculation were absolutely accurate, any set of features in the stabilizer could be used to compute the scale. In practice, considering problems such as mismatches and feature learning failures, we add a random sample consensus step to the scale updater to first reject mismatched baselines from self-supervised learning. Because Level 2 depends on deep neural network learning, its extracted feature regions carry a certain uncertainty, so the RANSAC algorithm is used for screening.
Finally, the scale weight vector $w = (w_1, \dots, w_n)^T$ and the camera motion scale vector $s = (s_1, \dots, s_n)^T$ solved from the feature baselines are defined, where $n = n_1 + n_2 + n_3$ and $n_i$ represents the number of scale features of each level. The fused scale $\hat{s} = w^T s / \sum_i w_i$ can then be obtained. It is important that the assignment of $w$ follows the principles and properties below:
Since ideally all values in the scale vector $s$ equal the true scale, the weight vector $w$ only needs to satisfy the normalization property $\sum_i w_i = 1$;
Since the scale obtained in Level 1 is the most credible and stable prior information, when the target is being tracked in the image, priority is given to the prior scale information of Level 1, i.e., the dominant weight is assigned to the Level 1 entries;
When the target disappears, the weight distribution of the feature scale using Levels 2 and 3 can be determined according to the actual application scenario. For example, when there are a large number of self-supervised learning features indoors, the scales obtained in Level 2 are assigned higher weights, and when the self-supervised feature region is unstable, the scales obtained in Level 3 can be assigned higher weights.
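The weighting and fusion described above can be sketched as follows; note that a simple median-absolute-deviation rejection stands in here for the RANSAC screening step described earlier, and all names are illustrative:

```python
import numpy as np

def fuse_scales(scales, weights, outlier_sigma=2.0):
    """Weighted fusion of the per-baseline scale estimates.

    scales:  scale estimates from all feature baselines (Levels 1-3 concatenated)
    weights: confidence weights assigned per level (Level 1 highest while the
             tracked target is visible)
    """
    scales = np.asarray(scales, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # reject gross outliers (e.g. wrong self-supervised matches) before fusing
    med = np.median(scales)
    mad = np.median(np.abs(scales - med)) + 1e-9
    keep = np.abs(scales - med) < outlier_sigma * 1.4826 * mad
    s, w = scales[keep], weights[keep]
    return float(np.sum(w * s) / np.sum(w))            # weighted average scale
```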
4.4. The Advantages and Disadvantages of MLSS-VO
The advantages of a multi-level scale stabilizer based on a target tracking problem are:
(1) During the initialization of the VO, the scale uncertainty in camera motion estimation is solved by using the size information of a target. (2) The multi-level feature baselines can be used to update the scale of the camera translation T in real time to prevent scale drift, especially when tracking restarts after the VO is lost. (3) While solving the scale, the target spatial localization algorithm is realized. (4) The feature regions obtained by the self-supervised method are matched with ORB features, which reduces the possibility of false feature matching. (5) The transfer of the real size of the feature baselines in space is realized by using the target size, which can provide a real scale reference for various applications in the region of interest.
There are also some disadvantages to be improved:
(1) Self-supervised region extraction consumes a large amount of computation, which affects real-time performance. On platforms with limited computing resources, such as small UAV platforms, only the feature baselines of Levels 1 and 3 can be used for scale updating, which greatly reduces the computational complexity. (2) For a looped motion environment, the self-supervised regions of the method have a long feature life cycle; for motion without loop closure, the algorithm needs to continuously extract new self-supervised feature regions, which leads to a large computational burden. Therefore, this method is more suitable for motion scenes with loops.
4.5. Target Location and Tracking Framework
As shown in Figure 11, considering that the Level 1 feature baseline acquisition of the scale stabilizer depends on target recognition in the 3D tracking problem, we propose a 3D positioning and tracking algorithm framework for moving targets based on a monocular motion platform, combined with the target tracking algorithm.
This framework mainly includes two parts: the autonomous positioning of the moving platform based on MLSS-VO, and the spatial positioning and tracking of the moving target.
4.6. Improved Newton Iteration
For the scale transfer equations constructed in Section 4.2, we can synchronously solve the spatial position of the moving target relative to the camera. In the actual test, we use a numerical iteration method to solve the equations. The traditional Newton iteration has second-order convergence and an accurate solution; its iterative equation is
$$ X_{k+1} = X_k - J_F(X_k)^{-1} F(X_k), $$
where $X$ is the vector of unknown depth values to be solved and $J_F$ is the Jacobian matrix of the equation set $F$. However, when posed as an optimization problem, the traditional Newton iteration requires a large amount of computation to invert the Hessian matrix, and the second derivative may cause the iteration to fall into an endless loop at an inflection point.
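For reference, the classical iteration just described can be sketched as a plain multivariate Newton solver with a numerically estimated Jacobian (this is not the fifth-order variant of [23,24]); it could be applied, for example, to a residual function like the parallelogram depth residuals sketched in Section 4.2:

```python
import numpy as np

def newton_solve(residual_fn, z0, tol=1e-8, max_iter=50, eps=1e-6):
    """Classic Newton iteration z_{k+1} = z_k - J(z_k)^{-1} F(z_k).

    residual_fn maps a depth vector to a residual vector of the same length
    (square system assumed); the Jacobian is estimated by forward differences.
    """
    z = np.asarray(z0, dtype=float)
    for _ in range(max_iter):
        f = residual_fn(z)
        if np.linalg.norm(f) < tol:
            break
        # numerical Jacobian: one forward-difference column per unknown depth
        J = np.zeros((len(f), len(z)))
        for j in range(len(z)):
            dz = np.zeros_like(z)
            dz[j] = eps
            J[:, j] = (residual_fn(z + dz) - f) / eps
        z = z - np.linalg.solve(J, f)                  # Newton update step
    return z
```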
Here, an improved Newton–Raphson method with fifth-order convergence is adopted [23,24]. This method can solve higher-order equation systems more stably; each iteration updates the solution vector $X$ through two intermediate variables, and the detailed iterative equations are given in [23,24]. Finally, according to the updated rotation $R$ and translation $T$ obtained by MLSS-VO, we can further solve the spatial position:
$$ P_w = R\, P_c + T, \quad (22) $$
where $R$ is the rotation matrix of the camera relative to the world coordinate system and $T$ is the displacement vector of the camera relative to the world coordinate system. The coordinates $P_w$ of the target feature point in the world coordinate system can be obtained by solving Equation (22).
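A trivial sketch of this last step, under the reconstructed form of Equation (22) above (camera-to-world rotation and camera position as defined in the text; the function name is an assumption):

```python
import numpy as np

def camera_to_world(P_c, R_cw, T_cw):
    """Map a camera-frame point to the world frame: P_w = R_cw @ P_c + T_cw."""
    return R_cw @ np.asarray(P_c, dtype=float) + np.asarray(T_cw, dtype=float)
```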
5. Experiments
We designed multiple sets of experiments in an indoor environment to test and verify the algorithm. In these experiments, a visual motion capture system named ZVR was employed to calibrate the ground truth of the target's trajectory. The motion capture system (MCS) is composed of eight cameras, covers an experimental space of 4.7 m × 3.7 m × 2.6 m, and achieves pose tracking and data recording at a refresh rate of 260 Hz. All images in our experiments were taken by a rolling-shutter monocular pinhole camera with a fixed focus.
Aiming at the target tracking problem, and in line with the monocular motion platform test requirements of this article, we propose a benchmark for object tracking with motion parameters (OTMP). All samples were taken by a monocular fixed-focus pinhole camera. The trajectory information of multiple sets of sample targets in space and the pose information of the camera itself were recorded simultaneously using the indoor motion capture system. In addition, the camera intrinsic parameter matrix, the sample calibration set, the spatial motion trajectory parameters of the samples, and the camera pose calibrated with a checkerboard are provided. This data set can be used for visual research such as visual SLAM in indoor dynamic environments, spatial positioning of moving targets, and dynamic target recognition and classification. The dataset has been uploaded to GitHub: https://github.com/6wa-car/OTMP-DataSet.git (accessed on 5 December 2021).
Figure 12 shows a panoramic view of the entire experimental scene, and the moving targets in OTMP are shown in Figure 13.
5.1. Feature Region Extraction Based on Siamese Neural Network
Experiment 1: The purpose of this experiment is to verify the performance of Level 2 feature extraction matching. After training the self-supervised Siamese neural network model with a small number of indoor environment images, we use a monocular motion camera to track and shoot the indoor target, and finally extract and match the static feature regions in the environment.
Figure 14 and Figure 15 correspond, respectively, to a schematic diagram of randomly extracted ORB feature point matching in Level 3 and a schematic diagram of ORB feature point matching constrained by the self-supervised feature regions of Level 2.
When extracting the self-supervised feature regions, we merge adjacent feature regions on the image to obtain a simpler feature region division. The "Correct Matches" we report represent the number of correctly matched self-supervised regions in Level 2.
Table 1 illustrates the relevant performance of the self-supervised feature matching model. In particular, we tested the performance of the algorithm on the Central Processing Unit (CPU) and Graphics Processing Unit (GPU) platforms:
It can be seen from Figure 14 and Figure 15 that our feature-region-based method prevents the false matches produced by the original ORB matching method. Compared with directly using ORB feature point matching, we can control the distribution of the self-supervised feature regions, for example by selectively extracting the features with the largest local feature weights in each region of the image. Therefore, the feature point matching method constrained by the Level 2 feature regions makes the distribution of matched feature point pairs on the image more uniform, and prevents overly concentrated matches from adversely affecting the motion solution in the VO problem, such as increased motion attitude error or a systematic bias in feature matching (erroneous estimation of the camera motion attitude caused by unknown motion of feature points concentrated in a certain area).
Furthermore, the test results in Table 1 show that, for SLAM research and applications, the improved Siamese neural network model for feature matching relies on GPU-based computing platforms to obtain adequate real-time performance. On CPU-based computing platforms, this matching method is only suitable for non-real-time applications such as structure from motion (SFM).
It can be seen from Table 2 that, in the actual application of MLSS-VO, the number of feature matches in Levels 2 and 3 must be chosen reasonably according to the computing power of the platform. On the GTX1060 platform, when selecting five matching areas in Level 2 and 30 ORB feature points in Level 3, our algorithm runs at about 9.7 fps. The timing performance of MLSS-VO under different numbers of feature matches in Levels 2 and 3 is as follows:
Table 2. Performance of timing for MLSS-VO in experiment 2.
| Method | Matches in Level 2 | Matches of Feature Points in Level 3 | Timing | GPU | Input Resolution |
|---|---|---|---|---|---|
| MLSS-VO | 5 | 30 | 103.2 ms | GTX1060 | 1280 × 720 |
| MLSS-VO | 5 | 60 | 122.1 ms | GTX1060 | 1280 × 720 |
| MLSS-VO | 10 | 60 | 121.1 ms | GTX1060 | 1280 × 720 |
| MLSS-VO | 10 | 90 | 140.7 ms | GTX1060 | 1280 × 720 |
| MLSS-VO | 20 | 90 | 146.2 ms | GTX1060 | 1280 × 720 |
| MLSS-VO | 20 | 120 | 169.6 ms | GTX1060 | 1280 × 720 |
5.2. Performance of MLSS-VO
Experiment 2: the purpose of this experiment is to investigate the spatial positioning performance of MLSS-VO in an indoor environment. In addition to our open-source data set OTMP, we also used the TUM data set, which is commonly used in SLAM research, for verification. Meanwhile, we compared MLSS-VO with two typical monocular visual odometry methods, ORB-SLAM2 [25] and RGBD-SLAM v2 [26], for a more comprehensive analysis.
We first verified the effectiveness of MLSS-VO using the OTMP data set. The experimental results are shown in Figure 16.
The RMSE of MLSS-VO on the x-, y-, and z-axes is shown in Table 3.
It can be seen from the experimental results in Figure 16, Table 2 and Table 3 that: (1) compared with the traditional feature point method (the green data), the scale information in MLSS-VO effectively solves the scale uncertainty in VO initialization and restores the real proportions of the motion trajectory; (2) considering the real-time requirements of the SLAM problem, the computing power of the platform needs to be taken into account when using MLSS-VO: the number of feature areas in Level 2 should not exceed 10, and the number of ORB feature points in Level 3 should not exceed 60; (3) in the experimental tests in the actual indoor environment, the motion estimate of MLSS-VO shows no scale drift, and the root mean square positioning error is kept within 2.73 cm, so the method can be applied to most indoor visual motion platforms such as drones and unmanned vehicles.
Furthermore, we compared MLSS-VO, ORB-SLAM2, and RGBD-SLAM v2 using the TUM data set. It is worth noting that, because the monocular module of ORB-SLAM2 requires scale alignment when testing on TUM data, we used the SLAM evaluation tool evo to correct it. In addition, RGBD-SLAM v2 needs the depth map information of the data set in the TUM experiments. For the MLSS-VO experiment in this paper, we provided an assumed calibrated target during initialization. The experimental results are shown in Figure 17 and Figure 18.
The comparison of the RMSE of these three methods on the x-, y-, and z-axes is shown in Table 4.
As can be seen from Figure 18 and Table 4, the performance of MLSS-VO slightly lags behind the state-of-the-art SLAM frameworks ORB-SLAM2 and RGBD-SLAM v2. However, we manually calibrated the scale of ORB-SLAM2 to make the comparison meaningful, and RGBD-SLAM v2 requires dense depth map information, whereas MLSS-VO only needs the target baseline available in monocular vision and tracking problems. From the perspective of the tracking scene, MLSS-VO therefore has unique advantages for solving the scale alignment problem to some extent, and no additional depth map information is required. In fact, our multi-level feature baselines can also be understood as solving the depth of sparse features in space, including the key-point depth information of dynamic targets in Level 1 as well as the key-point depth information of static features in Levels 2 and 3.
5.3. Target Tracking
Experiment 3: the purpose of this experiment is to verify the target tracking performance of a moving platform using the MLSS-VO positioning method. Considering that the ground truth requires the motion trajectories of both the camera and the target, we use the OTMP open-source data set for verification. The experimental scene is shown in Figure 19, and the experimental results are shown in Figure 20.
Table 5 shows the root mean square error (RMSE) of target tracking.
It can be seen from the experimental results in Figure 20 and Table 5 that the VO with the multi-level scale stabilizer solves the pose estimation of the moving platform itself and realizes the spatial tracking of the target. The proposed monocular moving-platform target tracking framework can effectively track the target, with a tracking error of less than 4.97 cm.
6. Summary
To address the target tracking problem encountered in monocular vision, a VO with a multi-level scale stabilizer, namely MLSS-VO, is proposed in this paper to resolve monocular scale drift and scale uncertainty. On this basis, the target positioning and tracking framework of the monocular motion platform is further described. The core idea of MLSS-VO is that the prior size information of the target and the pose information of the original VO are used to transmit the spatial size information to the feature baselines at all levels, so as to calculate the real motion scale of the camera. In addition, a feature matching model based on a Siamese neural network is proposed, which facilitates the extraction of self-supervised feature matches and provides a reliable reference and constraint for the selection of ORB feature points in the original VO. The proposed algorithm can be applied to various moving platforms fitted with monocular vision sensors, such as UAVs and self-driving cars.
Indoor experiments have revealed the following points. Firstly, the self-supervised feature matching based on the Siamese neural network proposed in this study is effective in determining the matching regions between moving images. Secondly, the scale information in MLSS-VO can be used to resolve the scale uncertainty arising from VO initialization and restore the real proportions of the motion trajectory; with an appropriate number of Level 2 and Level 3 features selected, the real-time speed reaches about 9.7 FPS. Next, we compared MLSS-VO with two state-of-the-art SLAM frameworks, ORB-SLAM2 and RGBD-SLAM v2, and analyzed the advantages of MLSS-VO. Lastly, the root mean square error of the motion estimation of MLSS-VO is restricted to within 3.87 cm, and the root mean square error of the target location based on this method is less than 4.97 cm.
The method proposed in this paper will be further extended to various motion platforms for purposes such as obstacle avoidance, trajectory monitoring, and visual navigation, as well as to the tracking of various small UAVs and autonomous cars.