1. Introduction
Visual odometry (VO), as the core of solving the autonomous positioning problem of robots, has long attracted researchers in the computer vision field. Although binocular vision has advantages in accuracy, the monocular system still holds certain advantages for automobiles [1], UAVs [2], and other industries, owing to the steady decline in the cost of consumer-level monocular cameras in recent years and the lower calibration workload. Therefore, the challenges of visual odometry are both fundamental and practical. At present, typical monocular visual odometry methods include ORB-SLAM3 [3] based on the feature point method, DSO [4] based on the direct method, SVO [5] based on a semi-direct method, and VINS-Mono [6] combined with inertial navigation equipment. At the same time, with the development of neural networks, researchers have carried out many explorations of visual odometry and SLAM methods based on deep learning, such as the SLAM method combining SVO and a CNN [7], the SLAM method based on semantic segmentation [8], the localization method based on planar target features [9], and unsupervised features for visual odometry [10,11].
At the same time, scale drift and scale uncertainty have been another focus and difficulty in monocular vision research. Researchers have done considerable work on reducing the scale drift of monocular visual SLAM, such as the monocular SLAM scale correction method based on Bayesian estimation [12], the low-drift SLAM scheme estimated from geometric information and surface normals [13], and VIO methods that exploit the characteristics of inertial navigation devices [6,14,15]. Among them, the VINS method effectively solves the monocular scale uncertainty problem owing to the combination with inertial navigation devices, and greatly reduces the error accumulation caused by drift in the monocular system. However, schemes combined with inertial navigation devices have three main disadvantages: (1) the sensor equipment is more expensive; (2) the calibration of the equipment, i.e., the synchronization mechanism between inertial data and visual data, is more complex; and (3) the front-end (the visual odometry) and the back-end optimization algorithms required by data fusion have higher complexity.
Based on the above summary, and exploiting the fact that the target in a tracking problem often carries size reference information, we designed a multi-level scale stabilizer that uses feature baselines at different levels to solve the real scale of the camera. Using the size of the target as the top-level feature baseline, the baseline information is transmitted to the features of each level, and the real proportion of the displacement T in the motion of the monocular camera is then solved, so as to resolve the scale uncertainty and reduce drift. We note that feature matching in traditional visual odometry often directly uses feature points or pixel gradient information such as oriented FAST and rotated BRIEF (ORB) [16], and then uses the epipolar constraint [17] and the random sample consensus algorithm [18] to obtain reliable motion matching. At the same time, due to the scale equivalence of the essential matrix E in the epipolar constraint, the resulting displacement T also faces the scale uncertainty problem. To solve this problem, we propose a multi-level abstract feature mechanism for the target tracking problem. The top level is the target in the tracking problem; the second level is the set of feature matching regions obtained by a self-supervised feature matching model based on a Siamese neural network; and the third level is the set of traditional ORB-style feature points. Finally, spatial scale transfer is carried out using the prior information of the top-level target size, so as to obtain the baseline sizes of the feature regions and solve the real motion scale.
This research aims to solve the problem of failed navigation caused by the insufficient autonomous positioning and spatial depth estimation ability of mobile robots (autonomous vehicles, UAVs, etc.) carrying monocular vision sensors when fulfilling target tracking and obstacle avoidance tasks in an unfamiliar environment. The main causes of the problem are the spatial depth estimation error introduced by the uncertain scale in traditional monocular visual odometry and the accumulation of positioning error resulting from scale drift. We combined a target tracking model and monocular visual odometry to obtain better depth estimation and reduce error accumulation, and studied how to use the size information of the target to solve and transfer the scale so as to reduce the drift of the monocular vision system in the indoor target tracking problem. A multi-level scale stabilizer was defined using a self-supervised learning model. Based on the scale information of the target, the feature baseline extracted from the feature regions obtained through self-supervised learning was treated as the carrier of scale information. With the spatial positioning of the target achieved, the scale information was transmitted to the original visual odometry, which was effective in reducing scale drift. According to this method, the extraction and transmission of feature baselines were classified into three levels, and clear transmission relations and confidence weights between the levels were defined. The main contributions of this paper are detailed as follows:
A multi-level scale stabilizer (MLSS-VO) based on monocular VO is proposed. The size of the target in the task of tracking is taken as the prior information of the top level (Level 1), the feature region extracted using the self-supervised network is treated as the second level (Level 2), and the feature point set as obtained by the traditional method is regarded as the third level (Level 3). In particular, priority is given to the feature points in the self-supervised matching region for more reliable matching constraints. Then, the feature points on the third level are used to construct the original VO using the feature point method, thus obtaining the attitude and motion of the camera with scale error. On this basis, the size information of the top level and the feature baseline information of the second and third levels are combined to solve the real trajectory of camera motion;
Based on deep local descriptors intended for instance-level recognition, a Siamese neural network model [19] suitable for matching features across a motion video stream is proposed, and a baseline acquisition mechanism within the feature regions is designed;
Through the combination of MLSS-VO and a target space positioning algorithm, an algorithm framework is designed for autonomous positioning and target tracking based on a monocular mobile platform. During multiple sets of indoor target tracking experiments, the motion capture system is adopted to verify that the root mean squared error (RMSE) of the algorithm is less than 3.87 cm in the indoor test environment, and that the RMSE of the moving target tracking is less than 4.97 cm, which indicates the effectiveness of the algorithm in indoor autonomous positioning and target tracking.
The rest of this paper is organized as follows. Section 2 introduces the self-supervised learning model and the feature baseline extraction method. Section 3 elaborates on the process and framework of the MLSS-VO algorithm based on the multi-level scale stabilizer. Section 4 details how the multi-level feature baselines are used to solve the scale, and proposes a target tracking algorithm framework for a monocular motion platform on the basis of MLSS-VO. Section 5 presents the performance analysis of the MLSS-VO algorithm and the verification of the indoor tracking experiments. Section 6 concludes the paper.
2. Multi-Level Feature Extraction
The target recognition and feature baseline extraction in Level 1 can refer to the authors' previous work [20], and the ORB extraction method used in Level 3 has been described in a large number of related studies. Therefore, the feature extraction methods for Levels 1 and 3 are not described in detail in this paper, and the emphasis is placed on the extraction of the self-supervised feature matching regions in Level 2.
2.1. Self-Supervised Feature Region Learning
We studied the feature matching model based on deep local descriptors for instance-level recognition [19,21] and improved its training scheme to reduce the cost of self-supervised feature learning. The original model focuses on the description of local features. In this paper, a Siamese network is used to extract the feature descriptors, shifting the focus from describing individual features to measuring the gap between features; this gives higher universality and makes the model more suitable for visual applications such as VO. During training, the Siamese network extracts the D-dimensional feature descriptors of the sample images for correlation calculation: if the similarity of the two input images is higher than 80%, the output is 1, otherwise 0. In practical applications, only a small number of reference images of similar environments are needed for training. The training samples are constructed by shifting and scaling the same image to form positive samples, and by pairing unrelated images to form negative samples.
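As a rough illustration of this pair-construction strategy, the following minimal OpenCV/NumPy sketch (not the authors' released code; function names, shift ranges, and scale ranges are assumptions) builds positive pairs by shifting and scaling a reference image and negative pairs from unrelated images:

```python
import random
import cv2
import numpy as np

def random_shift_scale(img, max_shift=0.1, scale_range=(0.8, 1.2)):
    """Create a positive sample by randomly shifting and scaling the source image."""
    h, w = img.shape[:2]
    scale = random.uniform(*scale_range)
    tx = random.uniform(-max_shift, max_shift) * w
    ty = random.uniform(-max_shift, max_shift) * h
    # 2x3 affine matrix: isotropic scaling about the image centre plus a translation
    m = np.float32([[scale, 0, (1 - scale) * w / 2 + tx],
                    [0, scale, (1 - scale) * h / 2 + ty]])
    return cv2.warpAffine(img, m, (w, h), borderMode=cv2.BORDER_REFLECT)

def build_pairs(reference_images, n_pairs=1000):
    """Yield (image_a, image_b, label) tuples: label 1 for positive, 0 for negative.
    Assumes at least two reference images are available."""
    pairs = []
    for _ in range(n_pairs):
        if random.random() < 0.5:                      # positive pair: same scene, warped
            img = random.choice(reference_images)
            pairs.append((img, random_shift_scale(img), 1))
        else:                                          # negative pair: unrelated scenes
            img_a, img_b = random.sample(reference_images, 2)
            pairs.append((img_a, img_b, 0))
    return pairs
```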
The structure of the training model is shown in Figure 1. The training model used in this paper is introduced as follows:
- 2. Feature strength: the feature strength is used to weight the contribution of each feature descriptor in matching, so that the activation can be reduced from D dimensions to 1 dimension. Considering that the influence of weak features during training is limited, the strongest features can be selected as the preferred matches during testing;
- 3. Local smoothing: the activations are spatially smoothed by average pooling in a 3 × 3 neighborhood. The main function of this step is to make the features more dispersed and smoother;
- 4. Descriptor whitening: the local descriptor dimensions are de-correlated by a linear whitening transformation. Following the PCA dimension reduction method [22], the whitening is implemented as a 1 × 1 convolution with bias, which reduces the dimension of the feature tensor;
- 5. Training based on a Siamese network: let $\mathcal{X}_A$ and $\mathcal{X}_B$ denote the sets of dense feature descriptors extracted from images $I_A$ and $I_B$. The image similarity is given by the strength-weighted correlation of the two descriptor sets,
$$ s(I_A, I_B) = \sum_{x \in \mathcal{X}_A} \sum_{y \in \mathcal{X}_B} w(x)\, w(y)\, \langle x, y \rangle, $$
where $w(\cdot)$ is the feature strength; in actual use, these weights are the quantities adjusted during training. For a positive sample pair, the similarity output is trained toward 1; for a negative sample pair, it is trained toward 0.
Figure 1. Training architecture overview.
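For concreteness, the sketch below shows one way the weighted descriptor correlation of item 5 could be implemented and trained against the 0/1 pair labels. It is a hedged PyTorch sketch: the function names, the temperature parameter, and the sigmoid squashing are our assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def siamese_similarity(desc_a, w_a, desc_b, w_b, temperature=0.1):
    """Strength-weighted correlation between two sets of dense descriptors.

    desc_a: (Na, D) descriptors of image A; w_a: (Na,) feature-strength weights.
    desc_b: (Nb, D) descriptors of image B; w_b: (Nb,) feature-strength weights.
    Returns a scalar in (0, 1), so positive pairs can be trained toward 1
    and negative pairs toward 0.
    """
    desc_a = F.normalize(desc_a, dim=1)
    desc_b = F.normalize(desc_b, dim=1)
    corr = desc_a @ desc_b.t()                       # (Na, Nb) cosine correlations
    weights = w_a.unsqueeze(1) * w_b.unsqueeze(0)    # (Na, Nb) pairwise strengths
    score = (weights * corr).sum() / (weights.sum() + 1e-8)
    return torch.sigmoid(score / temperature)        # squash to (0, 1)

def pair_loss(similarity, label):
    """Binary cross-entropy against the 0/1 pair label (label is a float tensor)."""
    return F.binary_cross_entropy(similarity, label)

# usage sketch: loss = pair_loss(siamese_similarity(da, wa, db, wb), torch.tensor(1.0))
```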
Accordingly, at test time, the extracted feature matrix is multiplied by the weight matrix to obtain the key-region feature matching, as shown in Figure 2, which also illustrates the structure of the training and test models. The feature-strength weights are used to select the required N strongest feature matching regions, which provide the reference regions for the feature baseline extraction described below.
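As an illustration of this top-N selection step, the following NumPy sketch (the parameter names and the simple non-maximum suppression are assumptions) picks the N strongest locations in a feature-strength map and returns square candidate regions:

```python
import numpy as np

def select_top_regions(strength_map, n_regions=5, region_size=32):
    """Pick the N strongest locations in a 2D feature-strength map and return
    square regions (x, y, w, h), suppressing overlapping picks so that the
    regions spread over the image."""
    order = np.argsort(strength_map, axis=None)[::-1]      # strongest first
    centres, half = [], region_size // 2
    for idx in order:
        y, x = np.unravel_index(idx, strength_map.shape)
        # simple non-maximum suppression: skip locations too close to a chosen centre
        if any(abs(x - cx) < region_size and abs(y - cy) < region_size
               for cx, cy in centres):
            continue
        centres.append((x, y))
        if len(centres) == n_regions:
            break
    return [(max(0, cx - half), max(0, cy - half), region_size, region_size)
            for cx, cy in centres]
```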
Level 2 has two main advantages: (1) the reliable feature matching regions obtained by self-supervised learning serve as the scale transfer reference of Level 2, with low training cost and fast speed; (2) the ORB feature points inside the Level 2 regions are selected first as the matching points of the original VO to solve the camera motion, making use of the richer information and higher reliability within the self-supervised feature regions.
2.2. Feature Baseline Extraction
For Level 1, the feature baseline is abstracted as a size feature of the target being tracked, such as an edge length or radius; the extraction method follows our previous work [20]. The following focuses on the feature baseline extraction methods for Levels 2 and 3.
Figure 3 shows the extraction method of feature baseline, as follows:
Level 2 feature baseline: the dashed box in Figure 3a is the feature region learned in Level 2, and the dots represent the ORB feature points in the region. One point (red) is randomly selected as one end of the baseline, and the five points farthest from it in 2D pixel distance within the region are selected as candidates (colored feature points). Among the candidates, the point whose ORB descriptor has the largest Hamming distance from the descriptor of the red point (the yellow feature point) is selected to form the feature baseline. For robustness, the candidate with the next-largest Hamming distance from the red point's descriptor (the blue feature point) can be selected to form a spare feature baseline;
Level 3 feature baseline: as shown in Figure 3b, a point (red) is randomly selected as one end of the baseline, and a ring area with given inner and outer diameters around it is selected as the candidate feature area. The subsequent baseline extraction is the same as in Level 2. For the ring radii, at a resolution of 1280 × 720 we recommend values of 10 to 30 pixels: if the distance is too small, the error of the solution is large; if the distance is too large, the baseline only exists for a very short period because of camera movement and is easily lost.
Figure 3. Feature baseline extraction: (a) feature baseline extraction in Level 2; (b) feature baseline extraction in Level 3.
It should be noted that the ORB feature points mentioned above are the feature points filtered by the RANSAC (random sample consensus) algorithm in the VO, that is, static feature points in the environment. Being static is a necessary condition for a feature baseline in Levels 2 and 3.
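To make the Level 2/3 baseline selection concrete, the following OpenCV/NumPy sketch chooses a baseline endpoint by maximum Hamming distance among candidates inside a ring around a randomly chosen ORB anchor point; for a Level 2 region, the inner radius can be set to zero and the candidates restricted to the learned region. The function name and thresholds are assumptions, not the authors' implementation.

```python
import cv2
import numpy as np

def extract_baseline(keypoints, descriptors, anchor_idx, r_min=10, r_max=30):
    """Pick a primary and a spare feature baseline anchored at keypoints[anchor_idx].

    keypoints/descriptors: output of cv2.ORB_create().detectAndCompute (RANSAC-filtered).
    Candidates lie inside a ring [r_min, r_max] pixels around the anchor (Level 3);
    use r_min=0 and region-restricted keypoints for Level 2.
    """
    anchor = np.array(keypoints[anchor_idx].pt)
    candidates = [i for i, kp in enumerate(keypoints)
                  if i != anchor_idx
                  and r_min <= np.linalg.norm(np.array(kp.pt) - anchor) <= r_max]
    if not candidates:
        return None
    # Hamming distance between binary ORB descriptors, largest first
    ham = [cv2.norm(descriptors[anchor_idx], descriptors[i], cv2.NORM_HAMMING)
           for i in candidates]
    order = np.argsort(ham)[::-1]
    primary = (anchor_idx, candidates[order[0]])
    spare = (anchor_idx, candidates[order[1]]) if len(order) > 1 else None
    return primary, spare
```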
4. Multi-Level Scale Stabilizer Intended for Visual Odometry (MLSS-VO) Implementation Based on Target Tracking
4.1. Coordinate System and Typical Target
Figure 7 shows the relationship between the three coordinate systems, where blue, green, and orange represent projections under the world coordinate system, camera coordinate system, and pixel coordinate system, respectively.
The input of the depth estimation module is a known pinhole camera model and the pixel coordinates of the feature points, and the output is a set of depth estimation equations with the depth values of the feature points as the unknowns. In this paper, the target features are first abstracted as geometric shapes such as parallelograms, as shown in Figure 8. The coordinates of the target are defined as $P_w = (X_w, Y_w, Z_w)^T$ in the world coordinate system, $P_c = (X_c, Y_c, Z_c)^T$ in the camera coordinate system, and $p = (u, v)^T$ in the two-dimensional image coordinate system, while the normalized projection point on the normalized plane is defined as $p_n = (x_n, y_n, 1)^T$.
Similarly, for a circular target, we consider the properties of circular projection: the projection of a circle is a circle or an ellipse, and the projection of the circle's center is still the center of the projected figure. Using image processing, we can obtain the center of the projected figure. According to the projection properties, any line through the center intersects the figure at two points, and in space the segment from the center to each intersection corresponds to the radius R, as shown in Figure 9. According to the symmetry of the circle, the shape composed of the four intersections of two such lines must be a parallelogram, so solving the depth problem essentially reduces to the above method of solving the parallelogram.
It should be noted that, in practical engineering, the side lengths of this quadrilateral are difficult to obtain directly, whereas the half-diagonal distance equals the radius Rc. As above, we therefore usually combine the diagonal distance Rc with the parallelogram condition equations to build the equation set F, solve for the depths, and obtain the spatial position of the circular target.
4.2. Scale Solver
Figure 10 shows a schematic diagram of the typical double-motion problem. Here, we abstract the moving target as a rigid body such as a parallelogram. The rotation $R$ of the monocular camera between the two frames has been obtained by the traditional VO method. Because of the scale uncertainty of monocular vision, the translation estimated by the VO is only proportional to the actual spatial motion; what we need to solve is the real translation $T_0$. At the same time, the baseline size of the moving rigid-body target is given: considering that the length of P1P2 in the figure is the Level 1 size, we can solve the actual $T_0$ by solving the motion of the rigid body.
Here we illustrate the solution. First, the pixel coordinates of each point in Figure 10 are defined as $p_i = (u_i, v_i)^T$, and the camera-frame coordinates of each point as $P_i^c = (X_i^c, Y_i^c, Z_i^c)^T$, $i = 1, \dots, 4$. The camera intrinsic matrix $K$ is known, so from the projection relationship:
$$ Z_i^c \begin{pmatrix} u_i \\ v_i \\ 1 \end{pmatrix} = K P_i^c, \quad \text{i.e.,} \quad P_i^c = Z_i^c\, K^{-1} (u_i, v_i, 1)^T. $$
Then, we use the known baseline length $L$ (the Level 1 size of P1P2) to establish the baseline length constraint equation
$$ \left\| P_1^c - P_2^c \right\| = L. $$
Finally, using the properties of the rigid body, the sides P1P2 and P4P3 establish the rigid-body parallel constraint
$$ P_2^c - P_1^c = P_3^c - P_4^c. $$
Considering that the intrinsic matrix $K$ and the baseline length $L$ are known quantities, we can solve the resulting equation sets for the frames before and after the movement in Figure 10, respectively. It is worth noting that a reasonable selection of equations and an appropriate numerical iterative method can improve efficiency.
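Under the assumptions above (pinhole intrinsics $K$, four corner pixels, known side length $L$), a minimal sketch of the residual function F whose roots are the unknown corner depths might look as follows; the vertex ordering and function names are illustrative only:

```python
import numpy as np

def depth_residuals(z, pixels, K, baseline_length):
    """Residual vector F(z) for a parallelogram target P1..P4.

    z: (4,) unknown depths of the four corners in the camera frame
    pixels: (4, 2) pixel coordinates of the corners
    K: (3, 3) camera intrinsic matrix
    baseline_length: known metric length of side P1P2 (the Level 1 baseline)
    """
    z = np.asarray(z, dtype=float)
    pts_h = np.hstack([pixels, np.ones((4, 1))])       # homogeneous pixel coordinates
    rays = (np.linalg.inv(K) @ pts_h.T).T              # normalised viewing directions
    P = rays * z[:, None]                              # camera-frame corner positions

    res = [
        np.linalg.norm(P[1] - P[0]) - baseline_length,  # known side length constraint
        # parallelogram condition: P1P2 parallel and equal to P4P3 (3 scalar equations)
        *(P[1] - P[0] - (P[2] - P[3])),
    ]
    return np.array(res)                               # 4 residuals for 4 unknown depths
```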
Next, we solve the scale. First, the displacement of the baseline endpoint P1 relative to the camera between the two frames is defined from the camera-frame coordinates solved above. At the same time, according to the spatial relationship between the camera motion and the target motion, the real camera translation can be expressed in terms of this relative displacement and the motion of the rigid-body target.
At this point, we obtain the real $T_0$, that is, the scale information, and the scale recovery of the VO using the Level 1 baseline is complete. Next, for the features of Levels 2 and 3, the problem becomes simpler; Figure 10 is still taken as an example. At this time, the target length is replaced by the feature baseline extracted by the algorithm from the image, while the camera motion recovered in Level 1 is known; the problem is transformed into solving the unknown metric lengths of the baselines in Levels 2 and 3. Using the same method, the feature baseline values in Levels 2 and 3 can be obtained.
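A hedged sketch of this scale recovery step: given the up-to-scale translation from the original VO and the metric camera-frame coordinates of a baseline recovered as above, the scale factor is a single ratio of baseline lengths. The variable names are assumptions:

```python
import numpy as np

def recover_scale(t_unit, p1_cam, p2_cam, p1_vo, p2_vo):
    """Recover the metric camera translation T0.

    t_unit:         (3,) translation from the essential matrix, correct only up to scale
    p1_cam, p2_cam: metric camera-frame coordinates of the baseline endpoints
                    obtained from the depth solution (e.g. the Level 1 baseline)
    p1_vo, p2_vo:   the same endpoints triangulated by the original VO, i.e.
                    expressed in the VO's arbitrary scale
    """
    metric_len = np.linalg.norm(np.asarray(p2_cam) - np.asarray(p1_cam))  # true length
    vo_len = np.linalg.norm(np.asarray(p2_vo) - np.asarray(p1_vo))        # VO-scale length
    s = metric_len / vo_len                                               # scale factor
    return s * np.asarray(t_unit)                                         # metric T0
```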
4.3. Scale Weighting and Updating
We continuously check the images to learn and extract new Level 2 and Level 3 features, adding them to the multi-level scale stabilizer and removing features that have left the image. Theoretically, if detection and scale calculation were absolutely accurate, any set of features in the stabilizer could be used to compute the scale. In practice, considering problems such as mismatches and feature learning failures, we add a random sample consensus step to the scale updater to first reject mismatched baselines from self-supervised learning. Because Level 2 depends on deep neural network learning, its extracted feature regions carry a certain uncertainty, so the RANSAC algorithm is used for screening.
Finally, the scale weight vector $w = (w_1, \dots, w_n)^T$ and the camera motion scale vector $s = (s_1, \dots, s_n)^T$ solved from the feature baselines are defined, where $n = n_1 + n_2 + n_3$ and $n_i$ represents the number of scale features of each level. The fused scale $\hat{s} = w^T s / \sum_i w_i$ can then be obtained. It is important that the assignment of $w$ follows the principles and properties below:
Since ideally all values in the scale vector $s$ equal the true scale, the weight vector $w$ only needs to satisfy the normalization property $\sum_i w_i = 1$;
Since the scale obtained in Level 1 is the most credible and stable prior information, when the target is being tracked in the image, priority is given to the prior scale information of Level 1, i.e., the dominant weight is assigned to the Level 1 entries;
When the target disappears, the weight distribution of the feature scale using Levels 2 and 3 can be determined according to the actual application scenario. For example, when there are a large number of self-supervised learning features indoors, the scales obtained in Level 2 are assigned higher weights, and when the self-supervised feature region is unstable, the scales obtained in Level 3 can be assigned higher weights.
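The weighting and fusion described above can be sketched as follows; note that a simple median-absolute-deviation rejection stands in here for the RANSAC screening step described earlier, and all names are illustrative:

```python
import numpy as np

def fuse_scales(scales, weights, outlier_sigma=2.0):
    """Weighted fusion of the per-baseline scale estimates.

    scales:  scale estimates from all feature baselines (Levels 1-3 concatenated)
    weights: confidence weights assigned per level (Level 1 highest while the
             tracked target is visible)
    """
    scales = np.asarray(scales, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # reject gross outliers (e.g. wrong self-supervised matches) before fusing
    med = np.median(scales)
    mad = np.median(np.abs(scales - med)) + 1e-9
    keep = np.abs(scales - med) < outlier_sigma * 1.4826 * mad
    s, w = scales[keep], weights[keep]
    return float(np.sum(w * s) / np.sum(w))            # weighted average scale
```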
4.4. The Advantages and Disadvantages of MLSS-VO
The advantages of a multi-level scale stabilizer based on a target tracking problem are:
(1) During the initialization of the VO, the scale uncertainty in camera motion estimation is solved by using the size information of a target. (2) The multi-level feature baselines can be used to update the scale of the camera translation T in real time to prevent scale drift, especially when tracking restarts after the VO is lost. (3) While solving the scale, the target spatial localization algorithm is realized. (4) The feature regions obtained by the self-supervised method are matched with ORB features, which reduces the possibility of false feature matching. (5) The transfer of the real size of the feature baselines in space is realized by using the target size, which can provide a real scale reference for various applications in the region of interest.
There are also some disadvantages to be improved:
(1) Self-supervised region extraction consumes a large amount of computation, which affects real-time performance. On platforms with limited computing resources, such as small UAV platforms, only the feature baselines of Levels 1 and 3 can be used for scale updating, which greatly reduces the computational complexity. (2) For a looped motion environment, the self-supervised regions of the method have a long feature life cycle; for motion without loop closure, the algorithm needs to continuously extract new self-supervised feature regions, which leads to a large computational burden. Therefore, this method is more suitable for motion scenes with loops.
4.5. Target Location and Tracking Framework
As shown in Figure 11, considering that the Level 1 feature baseline acquisition of the scale stabilizer depends on target recognition in the 3D tracking problem, we propose a 3D positioning and tracking algorithm framework for moving targets based on a monocular motion platform, combined with the target tracking algorithm.
This framework mainly includes two parts: the autonomous positioning of the moving platform based on MLSS-VO, and the spatial positioning and tracking of the moving target.
4.6. Improved Newton Iteration
For the scale transfer equations constructed in Section 4.2, we can synchronously solve the spatial position of the moving target relative to the camera. In the actual test, we use a numerical iteration method to solve the equations. The traditional Newton iteration has second-order convergence and an accurate solution; its iterative equation is
$$ X_{k+1} = X_k - J_F(X_k)^{-1} F(X_k), $$
where $X$ is the vector of unknown depth values to be solved and $J_F$ is the Jacobian matrix of the equation set $F$. However, when posed as an optimization problem, the traditional Newton iteration requires a large amount of computation to invert the Hessian matrix, and the second derivative may cause the iteration to fall into an endless loop at an inflection point.
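For reference, the classical iteration just described can be sketched as a plain multivariate Newton solver with a numerically estimated Jacobian (this is not the fifth-order variant of [23,24]); it could be applied, for example, to a residual function like the parallelogram depth residuals sketched in Section 4.2:

```python
import numpy as np

def newton_solve(residual_fn, z0, tol=1e-8, max_iter=50, eps=1e-6):
    """Classic Newton iteration z_{k+1} = z_k - J(z_k)^{-1} F(z_k).

    residual_fn maps a depth vector to a residual vector of the same length
    (square system assumed); the Jacobian is estimated by forward differences.
    """
    z = np.asarray(z0, dtype=float)
    for _ in range(max_iter):
        f = residual_fn(z)
        if np.linalg.norm(f) < tol:
            break
        # numerical Jacobian: one forward-difference column per unknown depth
        J = np.zeros((len(f), len(z)))
        for j in range(len(z)):
            dz = np.zeros_like(z)
            dz[j] = eps
            J[:, j] = (residual_fn(z + dz) - f) / eps
        z = z - np.linalg.solve(J, f)                  # Newton update step
    return z
```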
Here, an improved Newton–Raphson method with fifth-order convergence is adopted [23,24]. This method can solve higher-order equation systems more stably; each iteration updates the solution vector $X$ through two intermediate variables, and the detailed iterative equations are given in [23,24]. Finally, according to the updated rotation $R$ and translation $T$ obtained by MLSS-VO, we can further solve the spatial position:
$$ P_w = R\, P_c + T, \quad (22) $$
where $R$ is the rotation matrix of the camera relative to the world coordinate system and $T$ is the displacement vector of the camera relative to the world coordinate system. The coordinates $P_w$ of the target feature point in the world coordinate system can be obtained by solving Equation (22).
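A trivial sketch of this last step, under the reconstructed form of Equation (22) above (camera-to-world rotation and camera position as defined in the text; the function name is an assumption):

```python
import numpy as np

def camera_to_world(P_c, R_cw, T_cw):
    """Map a camera-frame point to the world frame: P_w = R_cw @ P_c + T_cw."""
    return R_cw @ np.asarray(P_c, dtype=float) + np.asarray(T_cw, dtype=float)
```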
5. Experiments
We designed multiple sets of experiments in an indoor environment to test and verify the algorithm. In these experiments, a visual motion capture system named ZVR was employed to calibrate the ground truth of the target's trajectory. The motion capture system (MCS) is composed of eight cameras, covers an experimental space of 4.7 m × 3.7 m × 2.6 m, and achieves pose tracking and data recording at a refresh rate of 260 Hz. All images in our experiments were taken by a rolling-shutter monocular pinhole camera with a fixed focus.
Aiming at the target tracking problem, and in line with the monocular motion platform test requirements of this article, we propose a benchmark for object tracking with motion parameters (OTMP). All samples were taken by a monocular fixed-focus pinhole camera. The trajectory information of multiple sets of sample targets in space and the pose information of the camera itself were recorded simultaneously using the indoor motion capture system. In addition, the camera intrinsic parameter matrix, the sample calibration set, the spatial motion trajectory parameters of the samples, and the camera pose calibrated with a checkerboard are provided. This data set can be used for visual research such as visual SLAM in indoor dynamic environments, spatial positioning of moving targets, and dynamic target recognition and classification. The dataset has been uploaded to GitHub: https://github.com/6wa-car/OTMP-DataSet.git (accessed on 5 December 2021).
Figure 12 shows a panoramic view of the entire experimental scene, and the moving targets in OTMP are shown in Figure 13.
5.1. Feature Region Extraction Based on Siamese Neural Network
Experiment 1: The purpose of this experiment is to verify the performance of Level 2 feature extraction matching. After training the self-supervised Siamese neural network model with a small number of indoor environment images, we use a monocular motion camera to track and shoot the indoor target, and finally extract and match the static feature regions in the environment.
Figure 14 and Figure 15 correspond, respectively, to a schematic diagram of randomly extracted ORB feature point matching in Level 3 and a schematic diagram of ORB feature point matching constrained by the self-supervised feature regions of Level 2.
When extracting the self-supervised feature regions, we merge adjacent feature regions on the image to obtain a simpler feature region division. The "Correct Matches" we report represent the number of correctly matched self-supervised regions in Level 2.
Table 1 illustrates the relevant performance of the self-supervised feature matching model. In particular, we tested the performance of the algorithm on the Central Processing Unit (CPU) and Graphics Processing Unit (GPU) platforms:
It can be seen from Figure 14 and Figure 15 that our feature-region-based method prevents the false matches produced by the original ORB matching method. Compared with directly using ORB feature point matching, we can control the distribution of the self-supervised feature regions, for example by selectively extracting the features with the largest local feature weights in each region of the image. Therefore, the feature point matching method constrained by the Level 2 feature regions makes the distribution of matched feature point pairs on the image more uniform, and prevents overly concentrated matches from adversely affecting the motion solution in the VO problem, such as increased motion attitude error or a systematic bias in feature matching (erroneous estimation of the camera motion attitude caused by unknown motion of feature points concentrated in a certain area).
Furthermore, the test results in Table 1 show that, for SLAM research and applications, the improved Siamese neural network model for feature matching relies on GPU-based computing platforms to obtain adequate real-time performance. On CPU-based computing platforms, this matching method is only suitable for non-real-time applications such as structure from motion (SFM).
It can be seen from Table 2 that, in the actual application of MLSS-VO, the number of feature matches in Levels 2 and 3 must be chosen reasonably according to the computing power of the platform. On the GTX1060 platform, when selecting five matching areas in Level 2 and 30 ORB feature points in Level 3, our algorithm runs at about 9.7 fps. The timing performance of MLSS-VO under different numbers of feature matches in Levels 2 and 3 is as follows:
Table 2. Performance of timing for MLSS-VO in experiment 2.
| Method | Matches in Level 2 | Matches of Feature Points in Level 3 | Timing | GPU | Input Resolution |
|---|---|---|---|---|---|
| MLSS-VO | 5 | 30 | 103.2 ms | GTX1060 | 1280 × 720 |
| MLSS-VO | 5 | 60 | 122.1 ms | GTX1060 | 1280 × 720 |
| MLSS-VO | 10 | 60 | 121.1 ms | GTX1060 | 1280 × 720 |
| MLSS-VO | 10 | 90 | 140.7 ms | GTX1060 | 1280 × 720 |
| MLSS-VO | 20 | 90 | 146.2 ms | GTX1060 | 1280 × 720 |
| MLSS-VO | 20 | 120 | 169.6 ms | GTX1060 | 1280 × 720 |
5.2. Performance of MLSS-VO
Experiment 2: the purpose of this experiment is to investigate the spatial positioning performance of MLSS-VO in an indoor environment. In addition to our open-source data set OTMP, we also used the TUM data set, which is commonly used in SLAM research, for verification. Meanwhile, we compared MLSS-VO with two typical monocular visual odometry methods, ORB-SLAM2 [25] and RGBD-SLAM v2 [26], for a more comprehensive analysis.
We first verified the effectiveness of MLSS-VO using the OTMP data set. The experimental results are shown in Figure 16.
The RMSE of MLSS-VO on the x-, y-, and z-axes is shown in Table 3.
It can be seen from the experimental results in Figure 16, Table 2 and Table 3 that: (1) compared with the traditional feature point method (the green data), the scale information in MLSS-VO effectively solves the scale uncertainty in VO initialization and restores the real proportions of the motion trajectory; (2) considering the real-time requirements of the SLAM problem, the computing power of the platform needs to be taken into account when using MLSS-VO: the number of feature areas in Level 2 should not exceed 10, and the number of ORB feature points in Level 3 should not exceed 60; (3) in the experimental tests in the actual indoor environment, the motion estimate of MLSS-VO shows no scale drift, and the root mean square positioning error is kept within 2.73 cm, so the method can be applied to most indoor visual motion platforms such as drones and unmanned vehicles.
Furthermore, we compared MLSS-VO, ORB-SLAM2, and RGBD-SLAM v2 using the TUM data set. It is worth noting that, because the monocular module of ORB-SLAM2 requires scale alignment when testing on TUM data, we used the SLAM evaluation tool evo to correct it. In addition, RGBD-SLAM v2 needs the depth map information of the data set in the TUM experiments. For the MLSS-VO experiment in this paper, we provided an assumed calibrated target during initialization. The experimental results are shown in Figure 17 and Figure 18.
The comparison of the RMSE of these three methods on the x-, y-, and z-axes is shown in Table 4.
As can be seen from Figure 18 and Table 4, the performance of MLSS-VO slightly lags behind the state-of-the-art SLAM frameworks ORB-SLAM2 and RGBD-SLAM v2. However, we manually calibrated the scale of ORB-SLAM2 to make the comparison meaningful, and RGBD-SLAM v2 requires dense depth map information, whereas MLSS-VO only needs the target baseline available in monocular vision and tracking problems. From the perspective of the tracking scene, MLSS-VO therefore has unique advantages for solving the scale alignment problem to some extent, and no additional depth map information is required. In fact, our multi-level feature baselines can also be understood as solving the depth of sparse features in space, including the key-point depth information of dynamic targets in Level 1 as well as the key-point depth information of static features in Levels 2 and 3.
5.3. Target Tracking
Experiment 3: the purpose of this experiment is to verify the target tracking performance of a moving platform using the MLSS-VO positioning method. Considering that the ground truth requires the motion trajectories of both the camera and the target, we use the OTMP open-source data set for verification. The experimental scene is shown in Figure 19, and the experimental results are shown in Figure 20.
Table 5 shows the root mean square error (RMSE) of target tracking.
It can be seen from the experimental results in Figure 20 and Table 5 that the VO with the multi-level scale stabilizer solves the pose estimation of the moving platform itself and realizes the spatial tracking of the target. The proposed monocular moving-platform target tracking framework can effectively track the target, with a tracking error of less than 4.97 cm.
6. Summary
To address the target tracking problem encountered in monocular vision, a VO with a multi-level scale stabilizer, namely MLSS-VO, is proposed in this paper to resolve monocular scale drift and scale uncertainty. On this basis, the target positioning and tracking framework of the monocular motion platform is further described. The core idea of MLSS-VO is that the prior size information of the target and the pose information of the original VO are used to transmit the spatial size information to the feature baselines at all levels, so as to calculate the real motion scale of the camera. In addition, a feature matching model based on a Siamese neural network is proposed, which facilitates the extraction of self-supervised feature matches and provides a reliable reference and constraint for the selection of ORB feature points in the original VO. The proposed algorithm can be applied to various moving platforms fitted with monocular vision sensors, such as UAVs and self-driving cars.
Indoor experiments have revealed the following points. Firstly, the self-supervised feature matching based on the Siamese neural network proposed in this study is effective in determining the matching regions between moving images. Secondly, the scale information in MLSS-VO can be used to resolve the scale uncertainty arising from VO initialization and restore the real proportions of the motion trajectory; with an appropriate number of Level 2 and Level 3 features selected, the real-time speed reaches about 9.7 FPS. Next, we compared MLSS-VO with two state-of-the-art SLAM frameworks, ORB-SLAM2 and RGBD-SLAM v2, and analyzed the advantages of MLSS-VO. Lastly, the root mean square error of the motion estimation of MLSS-VO is restricted to within 3.87 cm, and the root mean square error of the target location based on this method is less than 4.97 cm.
The method proposed in this paper will be further extended to various motion platforms for purposes such as obstacle avoidance, trajectory monitoring, and visual navigation, as well as to the tracking of various small UAVs and autonomous cars.