1. Introduction
Visual odometry (VO) plays an important role in remote sensing and robotics [1,2,3], as many applications rely on visual information, especially in GNSS-denied environments. In general, VO estimates the 6DoF (degrees of freedom) motion of the camera in 3D space. Popular examples of such algorithms are ORB-SLAM [4], SVO [5], LSD-SLAM [6] and DSO [7]. Certain applications of VO only estimate 4DoF motion, since the camera undergoes no roll or pitch. Examples are down-looking cameras on satellites [8], aerial vehicles [9,10] or underwater vehicles [11].
Usually, the Fourier-Mellin transform (FMT) is used to estimate such 4DoF motion for remote sensing [12,13,14]. FMT is based on the Fourier transform, an important tool for image analysis [15,16], which was used to estimate the motion between two images with the phase-only matched filter [17]. Reddy and Chatterji [18] presented the classic Fourier-Mellin transform to calculate the rotation, zoom and translation between images. Please note that the zoom is described as scaling in [18]; it represents the image change when the camera moves along the direction perpendicular to the imaging plane, but we use "zoom" in this paper to distinguish it from the "re-scaling" of visual odometry: monocular cameras cannot recover absolute scale, thus all translations among frames are re-scaled w.r.t. the estimated translation between the first two frames [1]. In [19,20], FMT was improved to speed up the computation and boost the robustness. FMT was also shown to be more accurate and faster than SIFT in certain environments [19]. In addition, visual odometry based on FMT is more accurate and robust than visual odometry based on features such as ORB and AKAZE, especially in feature-deprived environments [21]. Most current VO methods rely on features or pixel brightness: ORB-SLAM uses the ORB feature detector to find correspondences between two images, while SVO, LSD-SLAM and DSO estimate the motion between two frames based on brightness consistency. There are also methods using global appearance descriptors, which are more robust in feature-deprived scenarios; FMT is one of them. Furthermore, [22,23] compare the performance of different holistic descriptors for localization and mapping, such as the discrete Fourier transform, principal component analysis and the histogram of oriented gradients.
Due to its robustness and high accuracy, FMT has been successfully applied in many applications, such as image registration [11,24,25], fingerprint image hashing [26], visual homing [27], point cloud registration [28], 3D modeling [29], remote sensing [12,30], and localization and mapping [31,32]. However, it requires that the capture device does not roll or pitch and that the environment is planar and parallel to the imaging plane. There are already several efforts on relaxing the first restriction. For instance, Lucchese calculated the affine transform via optimization based on the affine FMT analysis [33]. In [34], oversampling and a Dirichlet-based phase filter were used to make FMT robust to some image skew. Moreover, the sub-image extraction strategy [21,35,36,37] is popular for addressing the 3D motion problem. In this paper, we mainly focus on the second restriction, i.e., on relaxing the constraints of equidistant and planar environments. If the depths of objects differ, the pixels' motions differ even though the camera's motion is the same, due to perspective projection. Since FMT only gives the image motion of the dominant depth, the camera's speed cannot be correctly inferred from FMT's results when the dominant plane changes. Thus, an FMT-based visual odometry cannot work in multi-depth scenarios. For example, Figure 1 shows a multi-depth scenario, which contains buildings of different heights, lower ground and a river. If images are collected in such a scenario with a down-looking camera mounted on an Unmanned Aerial Vehicle (UAV), FMT may first track the building roofs and then the lower ground, such that it cannot estimate the camera's motion correctly, because the dominant depth in the camera's view changes.
To overcome this drawback of FMT, this paper presents the extended Fourier-Mellin transform (eFMT). It extends FMT's 3D translation estimation (translation and zoom), while keeping the original rotation estimation, because multiple depths result in multiple zooms and translations, which will be discussed in detail in Section 3. Since FMT has already been used in all kinds of applications, such as remote sensing, image registration, localization and mapping, 3D modeling and visual homing (see above), we see a great potential for eFMT to further enlarge the application scenarios of FMT. In this paper, we proceed to a highly practically relevant application of our proposed eFMT odometry algorithm: motion estimation in the scenario of Figure 1 with a down-looking camera.
As we will also show in this paper, in contrast to FMT, feature-based and direct visual odometry frameworks usually do not perform well in challenging environments, such as low-texture surfaces (e.g., lawn, asphalt), underwater and fog scenarios [20]. The main advantage of FMT over other approaches, its robustness, is preserved in eFMT. To maximize robustness and accuracy, we use the improved FMT implementation of [19,20] as comparison and build eFMT upon it in this paper. Our main contributions are summarized as follows:
To the best of our knowledge, we are the first to apply the zoom and translation estimation in FMT to multi-depth environments. Our method is more general than FMT but maintains its strengths;
We implement an eFMT-based visual odometry (VO) framework for one potential use-case of eFMT;
We provide benchmarks in multi-depth environments between the proposed eFMT, the improved FMT [19,20], and popular VO approaches. The state-of-the-art VO methods ORB-SLAM3 [4], SVO [5] and DSO [7] are chosen for comparison because they are representative of feature-based, semi-direct and direct VO methods, respectively [38].
The rest of this paper is structured as follows: Section 2 recalls the classic FMT algorithm; Section 3 formulates the image registration problem in multi-depth environments; Section 4 proposes the eFMT algorithm for multi-depth visual odometry; Section 5 describes the implementation; experiments and analysis are presented in Section 6; Section 7 discusses the results and limitations; finally, we conclude our work in Section 8.
2. Classical FMT
This section recaps the main idea of classic FMT [18]. Given two image signals $i_1(x, y)$ and $i_2(x, y)$, the relationship between them is

$$i_2(x, y) = i_1\big(z(x\cos\theta_0 + y\sin\theta_0) - \Delta x,\; z(-x\sin\theta_0 + y\cos\theta_0) - \Delta y\big), \quad (1)$$

where $z$ and $\theta_0$ are constant and represent the zoom and rotation, respectively, and $(\Delta x, \Delta y)$ is the translation between $i_1$ and $i_2$. The motion parameters $(z, \theta_0, \Delta x, \Delta y)$ can be estimated by FMT via the following steps:

Fourier transform the image signals on both sides of Equation (1):

$$I_2(u, v) = e^{-j 2\pi (u \Delta x + v \Delta y)}\, z^{-2}\, I_1\big(z^{-1}(u\cos\theta_0 + v\sin\theta_0),\; z^{-1}(-u\sin\theta_0 + v\cos\theta_0)\big) \quad (2)$$

Convert the magnitude $M$ of Equation (2) to polar coordinates, ignoring the coefficients:

$$M_2(\rho, \theta) = M_1(\rho / z,\; \theta - \theta_0) \quad (3)$$

Take the logarithm of $\rho$ in Equation (3):

$$M_2(\xi, \theta) = M_1(\xi - \xi_z,\; \theta - \theta_0), \quad (4)$$

where $\xi = \log\rho$ and $\xi_z = \log z$.

Obtain $z$ and $\theta_0$ from Equation (4) based on the shift property of the Fourier transform. Re-rotate and re-zoom $i_2$ to $i_2'$ so that

$$i_2'(x, y) = i_1(x - \Delta x,\; y - \Delta y). \quad (5)$$

Thus, all the motion parameters $(z, \theta_0, \Delta x, \Delta y)$ can be calculated by conducting phase correlation on Equations (4) and (5). Taking Equation (5) as an example, we first calculate the cross-power spectrum

$$C(u, v) = \frac{I_1(u, v) \circ I_2'^{\,*}(u, v)}{\big| I_1(u, v) \circ I_2'^{\,*}(u, v) \big|}, \quad (6)$$

where $\circ$ is the element-wise product and $^*$ represents the complex conjugate. By applying the inverse Fourier transform, we obtain the normalized cross-correlation

$$q(x, y) = \mathcal{F}^{-1}\big(C(u, v)\big), \quad (7)$$

which is also called the phase shift diagram (PSD) in this paper. The estimated shift then corresponds to the location of the highest peak in the PSD. Applied to Equation (4), this yields the zoom and rotation,

$$(\xi_z, \theta_0) = \operatorname*{arg\,max}_{(\xi,\, \theta)}\; q_{z\theta}(\xi, \theta), \quad (8)$$

and, applied to Equation (5), the translation:

$$(\Delta x, \Delta y) = \operatorname*{arg\,max}_{(x,\, y)}\; q(x, y). \quad (9)$$
In the implementation the PSD is discretized into a grid of cells. Note that there exist partially non-corresponding regions between two frames due to the motion. Instead of contributing to the highest peak, these regions generate noise in the PSD. Since the energy of this noise is distributed over the whole PSD, it does not influence the detection and position of the highest peak as long as the overlap between the frames is big enough.
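For illustration, the phase correlation of Equations (6)–(9) can be sketched in a few lines of NumPy. This is a minimal sketch, not the optimized implementation of [19,20]; the windowing and filtering of a robust implementation are omitted.

```python
import numpy as np

def phase_correlation(img1: np.ndarray, img2: np.ndarray):
    """Minimal phase correlation: returns the PSD q and the peak shift.

    Sketch of Equations (6)-(9); a robust implementation would apply
    windowing and high-pass filtering first (omitted here).
    """
    F1 = np.fft.fft2(img1)
    F2 = np.fft.fft2(img2)
    # Cross-power spectrum, Equation (6): element-wise product with the
    # complex conjugate, normalized by its magnitude.
    cross = F1 * np.conj(F2)
    C = cross / (np.abs(cross) + 1e-12)   # avoid division by zero
    # Normalized cross-correlation, i.e., the PSD, Equation (7).
    q = np.real(np.fft.ifft2(C))
    # Translation = location of the highest peak, Equation (9).
    dy, dx = np.unravel_index(np.argmax(q), q.shape)
    # Map peaks beyond the Nyquist point to negative shifts.
    if dy > img1.shape[0] // 2:
        dy -= img1.shape[0]
    if dx > img1.shape[1] // 2:
        dx -= img1.shape[1]
    return q, (dx, dy)
```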
Classical FMT thus describes the transformation between two images that corresponds to the 4DoF motion of the camera: the 3DoF translation (zoom is caused by the translation perpendicular to the imaging plane) and yaw (assuming the z-axis is perpendicular to the imaging plane). However, as mentioned in Section 1, it is limited to single-depth environments because it assumes that the zoom $z$ and the translation $(\Delta x, \Delta y)$ are consistent and unique, which does not hold in multi-depth environments. In the next section, we formulate the image transformation in multi-depth scenarios, i.e., considering multiple zooms and multiple translations. The solution provided by eFMT to handle this issue is presented in Section 4.
3. Problem Formulation
This section formulates the general image transformation under 4DoF camera motion in multi-depth scenarios.

Given a pixel $p = (u, v)^T$ of $i_1$, it is normalized to $\bar{p} = \big(\frac{u - c_u}{f}, \frac{v - c_v}{f}\big)^T$ with the focal length $f$ and the image center $(c_u, c_v)$. Assume the pixel $p$ corresponds to the 3D point $P$ with depth $\lambda$; then the coordinate of $P$ in $i_1$'s frame is

$$P = \lambda\, \big(\bar{p}^T, 1\big)^T.$$

Suppose the transformation between the camera poses of $i_1$ and $i_2$ is a 4DoF motion consisting of the rotation around the camera principal axis, i.e., the yaw $\theta$, the 2D translation in the imaging plane $(t_x, t_y)^T$, and the translation perpendicular to the imaging plane $t_z$. Then $P$, expressed in $i_2$'s frame, is projected into $i_2$ at the point $p' = (u', v')^T$:

$$p' = \frac{f}{\lambda - t_z}\, \big( \lambda\, R_\theta\, \bar{p} + (t_x, t_y)^T \big) + (c_u, c_v)^T,$$

where $R_\theta$ is the 2D rotation matrix of $\theta$. That is, taking the image center as the origin, we can derive a general equation

$$p' = z_\lambda\, R_\theta\, p + t_\lambda \quad (10)$$

to describe the pixel transformation between $i_1$ and $i_2$, where $t_\lambda = (\Delta u_\lambda, \Delta v_\lambda)^T$ and

$$z_\lambda = \frac{\lambda}{\lambda - t_z}, \quad (11)$$

$$\Delta u_\lambda = \frac{f\, t_x}{\lambda - t_z}, \quad (12)$$

$$\Delta v_\lambda = \frac{f\, t_y}{\lambda - t_z}. \quad (13)$$

It can be seen that the zoom $z_\lambda$ and the translation $t_\lambda$ of a pixel depend on its depth $\lambda$, while the rotation $\theta$ is depth-independent. Equation (1) is a simplification of Equation (10) under the condition that the depth $\lambda$ of every pixel is the same. For $i_1$ and $i_2$ in a multi-depth scenario, there are multiple solutions to Equations (10)–(13), depending on the depths of the individual pixels, so there are multiple zooms and translations. The energy of a cell in the PSD is positively correlated with the number of pixels with depth $\lambda$ for which $t_\lambda$ falls in that cell. Since FMT assumes an equidistant environment, the depth $\lambda$ is considered constant for every pixel, i.e., FMT supposes that the translation $t_\lambda$ and the zoom $z_\lambda$ are the same for all pixels $p$. Thus, for FMT, all $t_\lambda$ fall into a single cell, forming a peak.

In this paper, we propose eFMT, which relaxes the equidistance constraint by solving Equation (10) with different depths $\lambda$ to estimate the camera poses.
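As a quick numeric illustration of Equations (10)–(13), consider two pixels at different depths under the same camera motion. This is a sketch with made-up parameter values, not part of the algorithm:

```python
import numpy as np

# Made-up camera and motion parameters, for illustration only.
f = 500.0                      # focal length in pixels
t_x, t_y, t_z = 1.0, 0.0, 2.0  # camera translation in meters

for lam in (10.0, 40.0):       # two different pixel depths in meters
    z_lam = lam / (lam - t_z)              # zoom, Equation (11)
    du = f * t_x / (lam - t_z)             # Equation (12)
    dv = f * t_y / (lam - t_z)             # Equation (13)
    print(f"depth {lam:5.1f} m: zoom {z_lam:.3f}, "
          f"translation ({du:.1f}, {dv:.1f}) px")

# The near pixel zooms by 1.250 and moves 62.5 px, the far pixel only by
# 1.053 and 13.2 px: the same camera motion, but different image motion.
```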
4. Methods
In this section, we first solve Equation (10) for the translation-only case and the zoom-only case, respectively. Then we present how to handle the general 4DoF motion. Since the absolute magnitude of a monocular camera's poses cannot be recovered, the re-scaling of translation and zoom is also discussed, so that an up-to-scale transformation can be estimated.
Without loss of generality, we use the frame indices 1, 2 and 3 for any three consecutive frames in this paper.
4.1. Translation-Only Case
FMT decouples the translation estimation from the rotation and zoom calculation. Thus, in the translation-only case, we only consider a camera that moves in the $xy$ plane. Then Equation (10) is simplified to

$$p' = p + t_\lambda. \quad (14)$$

As indicated by Equations (12) and (13), due to the multi-depth environment, the translation $t_\lambda$ does not form a single energy peak in the PSD as in Equation (9). Figure 2 shows a translation PSD in a multi-depth environment. It can be seen that there are multiple peaks in the PSD and that these high peaks lie on one line. This collinearity follows from the definition of $\Delta u_\lambda$ and $\Delta v_\lambda$: in the translation-only case, Equations (12) and (13) reduce to

$$\Delta u_\lambda = \frac{f\, t_x}{\lambda}, \qquad \Delta v_\lambda = \frac{f\, t_y}{\lambda}.$$

The direction of each translation $t_\lambda$ is therefore the same, i.e.,

$$\frac{\Delta v_\lambda}{\Delta u_\lambda} = \frac{t_y}{t_x},$$

which is independent of the pixel depth $\lambda$. Also, the translation $t_\lambda$ lies on the line

$$v = \frac{t_y}{t_x}\, u.$$

Thus, the peaks with high values lie on a line across the center of the PSD. Additionally, pixels cannot move in the opposite direction, so the peaks lie on a half-line that starts from the center. The extreme case is a slanted plane in the camera's view: then there are no distinguishable peaks, but a continuous line segment in the PSD. To keep it general and not rely on peak detection, this paper proposes the following way to estimate the translation.
Independent of their depth, for a given camera translation all pixels move with collinear translation vectors; the magnitude of each vector depends on the pixel's depth and on the magnitude of the camera translation. Thus, we can treat the translation estimation in a novel way instead of only finding the highest peak. Concretely, starting from the center of the PSD, which represents the no-translation case, we perform a polar search for the sector that sums up the most energy. This sector represents the direction of the translation vector. Since we have no concrete estimate for the magnitude of the motion, which would anyway be subject to the unknown scale factor, the estimated translation vector $\hat{t}$ is a unit vector, called the unit translation vector in this paper.
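A minimal sketch of this polar search follows, assuming a square PSD `q` that has been centered (e.g., via `fftshift`); the sector count and opening angle are free parameters (cf. Section 5), and the names are illustrative:

```python
import numpy as np

def max_energy_sector(q: np.ndarray, n_sectors: int = 360,
                      opening_deg: float = 6.0) -> np.ndarray:
    """Polar search over a centered PSD: returns the unit translation
    vector of the sector with maximum summed energy (sketch only)."""
    h, w = q.shape
    cy, cx = h // 2, w // 2
    ys, xs = np.mgrid[0:h, 0:w]
    ang = np.arctan2(ys - cy, xs - cx)          # angle of each PSD cell
    half = np.deg2rad(opening_deg) / 2.0
    best_energy, best_dir = -np.inf, 0.0
    for k in range(n_sectors):
        center = -np.pi + 2.0 * np.pi * k / n_sectors
        # Wrapped angular difference to the sector center.
        diff = np.angle(np.exp(1j * (ang - center)))
        energy = q[np.abs(diff) <= half].sum()
        if energy > best_energy:
            best_energy, best_dir = energy, center
    # Unit translation vector of the maximum-energy sector.
    return np.array([np.cos(best_dir), np.sin(best_dir)])
```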
As introduced in Section 1, one weakness of FMT is that it does not consider the scale consistency for visual odometry, where the estimated translation between images $i_2$ and $i_3$ has to be re-scaled to be in the same unit as the one between $i_1$ and $i_2$. To overcome this drawback, eFMT calculates the re-scaling factor on the maximum-energy sector. For that, we sample a translation energy vector $\mathbf{e}$ from this sector of the PSD. For a given camera translation, regions with different depths correspond to different indices in the translation energy vector. The more pixels correspond to a region, the higher the energy. Assume the translation energy vector between $i_1$ and $i_2$ is $\mathbf{e}_{12}$ and that between $i_2$ and $i_3$ is $\mathbf{e}_{23}$. The second image $i_2$ is shared between both translations, thus the depths of the regions are the same for both translations. Any difference between the translation energy vectors $\mathbf{e}_{12}$ and $\mathbf{e}_{23}$ must therefore come from different magnitudes of translation, independently of the direction of that translation. In fact, the vectors are simply scaled by the ratio of the translation magnitudes, which also maintains the correspondence of the regions and their size/energy values in the vectors. Thus, the re-scaling factor $s^*$ can be calculated via pattern matching on $\mathbf{e}_{12}$ and $\mathbf{e}_{23}$ by

$$s^* = \operatorname*{arg\,min}_{s}\; \big\lVert \mathbf{e}_{12} - g(\mathbf{e}_{23}, s) \big\rVert, \quad (15)$$

where $g$ uses $s$ to scale the vector $\mathbf{e}_{23}$ in length and value. Details are presented in Section 5.
Differences in the regions from changing occlusions and fields of view add noise to the PSD but can be ignored in most cases, analogous to the image overlap requirement of classical FMT [39].
4.2. Zoom-Only Case
As implied in Equation (4), rotation and zoom share the same PSD (see Figure 3). Also, the rotation is depth-independent and the same for all pixels, as shown in Equation (10). Thus, eFMT calculates the rotation in the same way as FMT does. In this section, we consider the zoom-only case, i.e., the camera moves perpendicular to the imaging plane. In this case, Equation (10) is simplified to

$$p' = z_\lambda\, p. \quad (16)$$

Meanwhile, Equation (4) becomes, for each depth $\lambda$,

$$M_2(\xi, \theta) = M_1(\xi - \xi_{z_\lambda},\; \theta), \quad (17)$$

where $\xi_{z_\lambda} = \log z_\lambda$. Therefore, the multiple zoom peaks lie in one column of the rotation and zoom PSD, because all zoom peaks correspond to one rotation, i.e., the same column index. Note that these zoom peaks are sometimes continuous in real applications due to continuous depth changes; the zoom peaks then become a run of high values in the PSD. For that reason, we no longer search for multiple peaks. Instead, a set of multi-zoom values $Z = \{z_1, \ldots, z_m\}$ is uniformly sampled between the maximum zoom $z_{\max}$ and the minimum zoom $z_{\min}$ estimated from the column $c^*$ with the maximum summed energy. $c^*$ can be found by

$$c^* = \operatorname*{arg\,max}_{c}\; \sum_{\xi} q_{z\theta}(\xi, c), \quad (18)$$

where $q_{z\theta}$ is the rotation and zoom PSD. Then we find the highest peak value of the column $c^*$. Only the values whose energy is larger than half of this peak are called high values. The maximum zoom $z_{\max}$ and the minimum zoom $z_{\min}$ are determined from these high values by computing the zooms corresponding to their indices. In addition, as derived in Section 3, the zoom $z_\lambda$ is described by Equation (11), which decreases with increasing depth $\lambda$. Thus, the minimum and maximum zooms, estimated from the PSD, indicate the maximum and minimum pixel depths, respectively. Since the energy in the translation PSD also relates to the pixel depths, we can build correspondences between zoom energy and translation energy, which will be discussed in the next section.
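The column search of Equation (18) and the uniform zoom sampling can be sketched as follows. This assumes `q_zt` is the rotation and zoom PSD with the zoom shift along the rows of the log-polar grid; the row-to-zoom mapping via `log_base` and the sample count are assumptions of this sketch:

```python
import numpy as np

def sample_multi_zoom(q_zt: np.ndarray, log_base: float, n_samples: int = 10):
    """Sketch: find the column c* with maximum summed energy (Equation (18)),
    keep the high values (> half the column peak) and sample zooms
    uniformly between the implied minimum and maximum zoom."""
    col = int(np.argmax(q_zt.sum(axis=0)))       # c*: the rotation column
    column = q_zt[:, col]
    peak = column.max()
    high = np.flatnonzero(column > 0.5 * peak)   # indices of high values
    # Row index -> zoom value: rows are shifts in log-zoom space.
    rows = np.arange(len(column)) - len(column) // 2
    zooms = log_base ** rows.astype(float)
    z_min, z_max = zooms[high].min(), zooms[high].max()
    multi_zoom = np.linspace(z_min, z_max, n_samples)
    return col, multi_zoom
```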
Additionally, re-scaling of the zoom is also essential in the zoom-only case for visual odometry. For that, a zoom energy vector $\mathbf{z}$ is extracted from the column $c^*$: $\mathbf{z}$ is the half of the column with the higher energy, based on the prior knowledge that all regions consistently either zoom in or zoom out. Suppose the zoom energy vector between $i_1$ and $i_2$ is $\mathbf{z}_{12}$ and that between $i_2$ and $i_3$ is $\mathbf{z}_{23}$. The best shift $r^*$ between $\mathbf{z}_{12}$ and $\mathbf{z}_{23}$ is found by

$$r^* = \operatorname*{arg\,min}_{r}\; \big\lVert \mathbf{z}_{12} - h(\mathbf{z}_{23}, r) \big\rVert, \quad (19)$$

where $h$ is a function that shifts the vector $\mathbf{z}_{23}$ by $r$; the zoom re-scaling factor then follows from $r^*$. This is a variant of the pattern matching used in the translation re-scaling. The only difference is that, while the translation energy vectors above are matched via scaling, the zoom energy vectors must be matched via shifting, since they live in log-zoom space. Both algorithms will be shown in Section 5.
4.3. General 4DoF Motion
When a general 4DoF motion of the camera happens, the transformation between two poses is estimated following the scheme of FMT. Our eFMT pipeline is shown in Figure 4. Since monocular visual odometry is up-to-scale [1], we use three frames to calculate the up-to-scale transformation.
Similar to the FMT pipeline [18], we first calculate the rotation and zoom between two frames. Instead of searching for the highest peak value in the rotation and zoom PSD, eFMT exploits all the information of half a column of the PSD, yielding the multi-zoom values $Z$ and the zoom energy vector $\mathbf{z}$, as introduced in Section 4.2. The multi-zoom values $Z$ are uniformly sampled between the minimum and maximum high zoom values of the PSD, which takes the energy instead of individual peaks into consideration and is thus robust to continuous energy in the PSD. Afterwards, we obtain a translation PSD for the rotation $\theta$ and each zoom value $z_i \in Z$, by first re-rotating and re-zooming the second image,

$$\tilde{i}_2^{(i)}(x, y) = i_2\big(z_i^{-1}(x\cos\theta - y\sin\theta),\; z_i^{-1}(x\sin\theta + y\cos\theta)\big),$$

and then performing phase correlation on the images $i_1$ and $\tilde{i}_2^{(i)}$ with Equation (7). With the method introduced in Section 4.1, a translation energy vector $\mathbf{e}^{(i)}$ is extracted from each translation PSD. Then these multiple translation energy vectors are combined according to the weights of the zoom energies:

$$\mathbf{e} = \sum_{i} w_i\, \mathbf{e}^{(i)}, \qquad w_i = \frac{E_z(z_i)}{\sum_j E_z(z_j)}, \quad (20)$$

where $E_z$ is the function that finds the energy corresponding to the zoom $z_i$ in the zoom energy vector. The higher the energy of a zoom value is, the more pixels correspond to that zoom, so the corresponding translation energy vector should get a higher weight accordingly. Thus Equation (20) holds.
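A sketch of the combination in Equation (20): here `trans_vecs[i]` is assumed to be the translation energy vector obtained with zoom value `zooms[i]`, and `zoom_energy(z)` is assumed to look up the energy of that zoom in the zoom energy vector (the names are illustrative):

```python
import numpy as np

def combine_translation_vectors(trans_vecs, zooms, zoom_energy):
    """Weighted combination of the per-zoom translation energy vectors,
    Equation (20): the weights are the normalized zoom energies."""
    weights = np.array([zoom_energy(z) for z in zooms], dtype=float)
    weights /= weights.sum()
    combined = np.zeros_like(trans_vecs[0])
    for w, e in zip(weights, trans_vecs):
        combined += w * e
    return combined
```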
4.4. Tidbit on General 4DoF Motion
Classical FMT decouples rotation and zoom from the translation. For eFMT this is not as simple: as the camera moves along the z-axis (perpendicular to the imaging plane), objects of different depths are zoomed (scaled) by different amounts. In a combined zoom and translation case, the apparent motion of a pixel depends on its depth, the zoom and the translation. However, for the pattern matching of the translation energy vectors (Equation (15)) to be based on just a simple scaling, the energy in the pixel motions has to be based on just the pixel depth and the translation speed, so it must be independent of the zoom. As described above, eFMT calculates translation energy vectors $\mathbf{e}^{(i)}$ for the different zoom values. This means that in multi-depth images there will be parts of the image that are zoomed with an incorrect zoom value but are then used as input to Equation (7) and ultimately combined into the translation energy vector of Equation (20).
One could assume that those incorrectly zoomed image parts lead to wrong pixel translation estimates and thus to a compromised translation energy vector. However, this is not the case: the phase correlation (Equation (7)) is sensitive to the zoom! It only notices signals that are at the same zoom (scale); other parts just contribute noise, because with a wrong zoom Equation (14) does not hold.
Figure 5 shows how a wrong zoom influences the translation PSD. It can be seen that a wrong zoom decreases the energy of the correct translation and distributes that energy over the PSD. A slight zoom difference does not change the translation PSD much, whereas a big difference results in a PSD with mostly uniformly distributed noise. To give a better explanation, we also show the signal-to-noise ratio (SNR) of the translation PSD for different fixed zoom values in Figure 6. The SNR value is calculated as the ratio between the mean of the high values from the translation energy vector and the mean of the remaining values in the PSD.
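This SNR can be computed, for example, as follows; `high_mask` is assumed to mark the PSD cells belonging to the translation energy vector (a sketch matching the description above):

```python
import numpy as np

def psd_snr(q: np.ndarray, high_mask: np.ndarray) -> float:
    """Ratio between the mean of the high values (cells of the translation
    energy vector) and the mean of the remaining PSD cells."""
    signal = q[high_mask].mean()
    noise = q[~high_mask].mean()
    return float(signal / noise)
```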
Figure 5 and Figure 6 show that already a small zoom deviation leads to a low SNR, i.e., a very noisy PSD. Thus, when eFMT iterates through the different multi-zoom values in $Z$, each translation energy vector $\mathbf{e}^{(i)}$ only notices the pixels that are correctly zoomed, because the wrongly zoomed values are very small compared to the actual high values at the right zoom. The combined translation energy vector $\mathbf{e}$ is hence independent of the zoom, and therefore the pattern matching of the translation energy vectors for re-scaling is zoom-independent as well.
4.5. Practical Consideration—Visual Odometry
We demonstrate the advantage of eFMT over FMT on camera pose estimation, i.e., visual odometry. The main consideration in visual odometry is how to put translation and zoom into the same metric, i.e., translation and zoom consistency.
For that, we analyze the relationship between image transformation and camera motion again. As shown in Figure 7, assume objects of size $o_i$ at depth $\lambda_i$ are in the field of view (FOV) of the camera $C$ at Pose 1. The camera moves to Pose 2 with the motion $(t_x, t_y)$ in the $xy$ plane and $t_z$ along the $z$ direction.
According to the basic properties of pinhole cameras, the zoom between the two frames captured at Pose 1 and Pose 2 is

$$z_i = \frac{\lambda_i}{\lambda_i - t_z}.$$

Similarly, we can derive the pixel translation between the two frames for a different depth $\lambda_j$; we use the index $j$ here because, in the algorithm, zoom and translation are calculated independently. The pixel translation between Pose 1 and Pose 2 is

$$\delta_j = \frac{f\, \lVert (t_x, t_y) \rVert}{\lambda_j - t_z},$$

where $f$ is the focal length of the camera. Then the ratio between the translation perpendicular to the imaging plane and that in the $xy$ plane can be calculated as

$$\frac{t_z}{\lVert (t_x, t_y) \rVert} = \frac{f\, (z_i - 1)}{\delta_j}$$

if and only if $\lambda_i = \lambda_j$, meaning that the same object distance is used.
We can use pattern matching between the zoom energy vector and the translation energy vector to find corresponding indices $i$ and $j$. For simplicity, in this paper we use maximum-energy finding: we determine the zoom $z_i$ with the highest peak in the zoom energy vector (which corresponds to the dominant depth $\lambda_i$), and in the translation energy vector for $z_i$ we then find the peak translation magnitude $\delta_j$, which corresponds to the same depth, $\lambda_j = \lambda_i$. This holds for all pixels of the same depth, without the limitation of lying in one continuous plane.
Then we can get the up-to-scale 3D translation $t$ between the camera poses,

$$t = \Big( \lVert (t_x, t_y) \rVert\, \hat{t}^{\,T},\; t_z \Big)^T,$$

where $\hat{t}$ is the unit translation vector.
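Under the assumption that the matched zoom peak and translation peak correspond to the same depth, the ratio above simplifies to $t_z / \lVert (t_x, t_y) \rVert = f(z - 1)/\delta$, with $\delta$ the pixel translation magnitude of the matched peak. The composition of the up-to-scale 3D translation can then be sketched as follows (illustrative names, not the paper's implementation):

```python
import numpy as np

def compose_translation(t_hat_xy: np.ndarray, zoom: float,
                        delta_px: float, f: float) -> np.ndarray:
    """Up-to-scale 3D translation from matched zoom and translation peaks
    of the same depth: t ~ (t_hat_xy, f * (zoom - 1) / delta_px).

    t_hat_xy : 2D unit translation vector from the translation PSD
    zoom     : zoom value of the matched peak in the zoom energy vector
    delta_px : pixel translation magnitude of the matched peak
    f        : focal length in pixels
    """
    ratio = f * (zoom - 1.0) / delta_px          # t_z / ||(t_x, t_y)||
    t = np.array([t_hat_xy[0], t_hat_xy[1], ratio])
    return t / np.linalg.norm(t)                 # up-to-scale direction
```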
4.6. Summary of Key Ideas
The key ideas of eFMT are outlined as follows:
Observation that multiple depths will lead to multiple strong energies in the PSDs for zoom and translation, and that these signals are collinear.
Instead of finding one maximum peak, as classical FMT does, we represent the translation in a one-dimensional translation energy vector that encodes the number of pixels with certain amounts of motion, which correspond to certain depths. We treat the orientation and the magnitude independently. The orientation of the PSD sector from which the translation energy vector was sampled is the direction of the motion, represented as a unit translation vector. The zoom is represented analogously. Thus, eFMT keeps the accuracy and robustness of FMT compared to feature-based and direct methods, and improves the scale consistency of FMT.
We put the zoom and translation in the same reference frame by finding the correspondence between zoom and translation based on pattern matching.
Finally, we assign a magnitude to the second of the two found unit translation vectors of three consecutive frames by estimating a re-scaling factor between the translation energy vectors via pattern matching. The re-scaling for zoom is estimated analogously.
5. Implementation
This section introduces the implementation of a visual odometry framework based on eFMT. We first present this framework and then discuss in detail how to implement re-scaling for translation and zoom.
Algorithm 1 outlines the implementation of the eFMT-based visual odometry. FMT is directly applied to the first two frames to estimate the rotation $\theta$, the zoom $z$ and the unit translation vector $\hat{t}$. Additionally, the zoom and translation energy vectors $\mathbf{z}$ and $\mathbf{e}$, used for pattern matching in the next iteration, are generated from the corresponding PSDs. For the following frames, eFMT is performed to calculate the re-scaled zoom and translation, so that the 4DoF motion between frames can be estimated. Finally, the trajectory of the camera is generated via the chain rule on the relative transformations.
Algorithm 1: eFMT-based Visual Odometry
1: Input: image sequence $i_0, i_1, \ldots, i_n$
2: for $k = 1, \ldots, n$ do
3:   if $k = 1$ then ▹ Similar to FMT
4:     Estimate rotation $\theta$, zoom $z$ and unit translation vector $\hat{t}$ between $i_0$ and $i_1$
5:     Generate $\mathbf{z}_{01}$ from the rotation and zoom PSD
6:     Generate $\mathbf{e}_{01}$ from the translation PSD
7:   else ▹ Multi-zoom and multi-translation
8:     Calculate the rotation and zoom PSD between $i_{k-1}$ and $i_k$
9:     Estimate the rotation $\theta$ and the multi-zoom values $Z$ from the PSD
10:    Generate $\mathbf{z}_{(k-1)k}$ from the PSD
11:    for $z_i$ in $Z$ do
12:      Get translation energy vector $\mathbf{e}^{(i)}$ and unit translation vector $\hat{t}^{(i)}$
13:    end for
14:    Combine the translation energy vectors to $\mathbf{e}_{(k-1)k}$ with Equation (20)
15:    Estimate the re-scaling factor between $\mathbf{z}_{(k-2)(k-1)}$ and $\mathbf{z}_{(k-1)k}$ via pattern matching
16:    Estimate the re-scaling factor between $\mathbf{e}_{(k-2)(k-1)}$ and $\mathbf{e}_{(k-1)k}$ via pattern matching
17:    Update zoom and translation
18:    Perform the chain rule on the 4DoF transformation
19:  end if
20: end for
21: Output: camera poses corresponding to $i_0, \ldots, i_n$
As described above, for the translation calculation we find the sector $S^*$ with the maximum summed energy instead of the highest peak. Concretely, the PSD is divided into $n$ sectors around the center $b$. We then sum up the energy of the cells of each sector within a certain opening angle $o$ to find $S^*$. Afterwards, the direction from the center $b$ to the highest value of the sector $S^*$ is taken as the translation direction, i.e., the unit translation vector $\hat{t}$. Furthermore, we represent the values of the maximum sector $S^*$ as the 1D translation energy vector $\mathbf{e}$: we sample the energy in the maximum sector at uniform distances from the center to fill $\mathbf{e}$. Then the translation energy vectors of the different zoom values are combined with Equation (20).
Moreover, the pattern matching algorithms used in the re-scaling of translation and zoom (Equations (15) and (19)) are shown in Algorithms 2 and 3, respectively. Equations (15) and (19) are least-squares problems, which are often solved by gradient descent methods. However, since the function $g$ in Equation (15) acts pointwise on the variable $s$, it is difficult to construct the Jacobian needed for gradient descent. Additionally, gradient descent is prone to local minima, especially without a good initial guess, and our method does not provide any initial guess. Solving Equation (19) is analogous. Therefore, the pattern matching Algorithms 2 and 3 are used to find the re-scaling factors in this work. There are several ways to handle pattern matching, for example phase correlation, search algorithms and dynamic programming. Considering the robustness against outliers in the PSD signals, we use a search method in this paper.
Algorithm 2: Re-scaling for Translation
1: Input: $\mathbf{e}_{12}$ and $\mathbf{e}_{23}$
2: Initialize distance $d$ with infinity
3: for $s$ in the set of candidate scales do
4:   Scale $\mathbf{e}_{23}$ to $\mathbf{e}'_{23}$ with $s$
5:   Calculate the Euclidean distance $d' = \lVert \mathbf{e}_{12} - \mathbf{e}'_{23} \rVert$
6:   if $d' < d$ then
7:     $d \leftarrow d'$
8:     $s^* \leftarrow s$
9:   end if
10: end for
11: Output: re-scaling factor $s^*$
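A direct transcription of Algorithm 2 as a sketch; the search grid for $s$ is an assumption, and `np.interp` stands in for the scaling function $g$ of Equation (15), which scales the vector in length and value:

```python
import numpy as np

def rescale_translation(e12: np.ndarray, e23: np.ndarray,
                        s_grid=np.linspace(0.1, 10.0, 1000)) -> float:
    """Search the re-scaling factor s* minimizing the Euclidean distance
    between e12 and e23 scaled in length and value (Equation (15))."""
    x = np.arange(len(e12), dtype=float)
    best_s, best_d = 1.0, np.inf
    for s in s_grid:
        # Scale e23 horizontally by s (resample onto x) and vertically by s.
        scaled = s * np.interp(x, s * np.arange(len(e23)), e23,
                               left=0.0, right=0.0)
        d = np.linalg.norm(e12 - scaled)
        if d < best_d:
            best_d, best_s = d, s
    return best_s
```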
Algorithm 3: Re-scaling for Zoom
1: Input: $\mathbf{z}_{12}$ and $\mathbf{z}_{23}$
2: Initialize distance $d$ with infinity
3: for $r = -l+1, \ldots, l-1$ do ▹ $l$ is the length of $\mathbf{z}_{23}$
4:   Shift $\mathbf{z}_{23}$ to $\mathbf{z}'_{23}$ by $r$
5:   Calculate the Euclidean distance $d' = \lVert \mathbf{z}_{12} - \mathbf{z}'_{23} \rVert$
6:   if $d' < d$ then
7:     $d \leftarrow d'$
8:     $s_z^* \leftarrow$ shift_to_scale$\{r\}$
9:   end if
10: end for
11: Output: re-scaling factor $s_z^*$
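Algorithm 3 differs only in matching by shifting: since the zoom energy vector lives in log-zoom space, the best shift is converted back to a scale factor (shift_to_scale). The sketch below assumes a known log base per bin:

```python
import numpy as np

def rescale_zoom(z12: np.ndarray, z23: np.ndarray, log_base: float) -> float:
    """Search the shift minimizing the Euclidean distance between z12 and
    the shifted z23 (Equation (19)), then convert the shift in log-zoom
    space into a re-scaling factor."""
    best_shift, best_d = 0, np.inf
    for r in range(-len(z23) + 1, len(z23)):  # candidate shifts
        shifted = np.roll(z23, r)
        # Zero the wrapped-around part so it does not contribute.
        if r > 0:
            shifted[:r] = 0.0
        elif r < 0:
            shifted[r:] = 0.0
        d = np.linalg.norm(z12 - shifted)
        if d < best_d:
            best_d, best_shift = d, r
    return log_base ** best_shift             # shift_to_scale
```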
6. Results
In this section, we evaluate the proposed eFMT algorithm in both simulated and real-world multi-depth environments. Note again that there are multiple variants of FMT; we use the improved one from [19,20] for better robustness and accuracy. Since all FMT implementations only search for one peak in the PSDs, they run into difficulties in multi-depth environments, no matter which implementation is used.
We first present basic experiments on zoom and translation re-scaling in simulation. The scenario only includes two planes at different depths, to show the basic effectiveness of eFMT. Then eFMT is compared with FMT and the state-of-the-art VO methods ORB-SLAM3 [4], SVO [5] and DSO [7] in real-world environments. These three state-of-the-art VO methods, which do not rely on FMT, are the most popular and representative monocular ones, as pointed out in [38]. The tests in real-world environments consist of two parts: a toy example with two wooden boards and a large-scale UAV dataset (https://robotics.shanghaitech.edu.cn/static/datasets/eFMT/ShanghaiTech_Campus.zip, accessed on 5 March 2021). The toy example is similar to the simulation environment. Since the features on the wooden boards are very similar, this scenario is more difficult than general indoor environments, even though there are only two planes. To evaluate the eFMT algorithm in a more general case and to provide a potential use-case of eFMT, we conduct the second test with a down-looking camera mounted on a UAV. The scenario includes many different elements, such as building roofs, grass and a river. Since there are many different depths in the view, and since buildings in particular appear as slanted planes due to the perspective projection, it is challenging for FMT. In addition, the feature-deprived road surfaces and grass are a big challenge for classic VO methods. We will show that eFMT can handle both difficulties.
All experiments are conducted on an Intel Core i7-4790 CPU @ 3.60 GHz with 16 GB memory and without GPU. The algorithm is implemented in C++ using a single thread.
6.1. Experiments on the Simulated Datasets
In this test, images are collected in a Gazebo simulation to obtain accurate ground truth. As shown in Figure 8, the camera is mounted on the end-effector of a robot arm, such that we can control the robot arm to move the camera.
6.1.1. Zoom Re-Scaling
In this case, we move the robot arm along the z-axis to generate three simulated images of two planes at different depths. As shown in Figure 9b–d, the images are zoomed in from left to right. In each image, the left half is further away whereas the right half is closer. The corresponding rotation and zoom PSDs are shown in the second row of Figure 9. It can be seen that each diagram has two peaks, which indicates two different depths in the view. Moreover, the higher peak is not always on the left, which implies that the majority depth in the view changes; this destroys the scale consistency of FMT, which only uses the highest peak. Instead, the proposed eFMT takes the whole zoom energy vector into consideration and puts all zoom values into the same scale through re-scaling, up to one unknown scale factor.
Here we show that eFMT outperforms FMT by using the three images as a small loop closure: the zoom $z_{02}$ between images 0 and 2 should equal the product of the zoom $z_{01}$ between images 0 and 1 and the zoom $z_{12}$ between images 1 and 2. The result in Table 1 shows that eFMT estimates the zooms correctly, so that the zoom loop holds, i.e., $z_{02} \approx z_{01} \cdot z_{12}$. However, FMT only tracks the highest peak. The plane that the highest peak in Figure 9g corresponds to is different from that in Figure 9e,f, so $z_{01}$ and $z_{12}$ are calculated based on different planes with different depths. Thus, for FMT, the loop ratio $z_{01} \cdot z_{12} / z_{02}$ is further away from 1.
6.1.2. Visual Odometry in Simulated Scenario
In this case, the simulated robot arm moves in a plane containing the camera's z-axis, generating images with combined translation and zoom. Here, we compare the visual odometry based on eFMT and on FMT on this dataset. Figure 10 shows that eFMT tracks the correct re-scaling factor to the end, while FMT fails partway along the trajectory. This indicates that eFMT also works better than FMT under combined zoom and translation, which it owes to the re-scaling based on pattern matching introduced in Section 4.
6.2. Experiments on Real Datasets
After the preliminary tests in the simulated environment, we evaluate the performance of eFMT by comparing it with FMT and other state-of-the-art VO methods in real-world scenarios. The first example is similar to the simulation setting, with two wooden boards in the camera's view, as shown in Figure 11a. The ground truth is provided by a tracking system. For the second example, we collect a dataset with an unmanned aerial vehicle flying over our campus. More details are given in the following.
6.2.1. A Toy Example
In this case, we evaluate the visual odometry with translation only along the x-axis (see Figure 11a) with two different depths. Similar to the simulation, the wooden board at the smaller depth enters the camera's view first, then both boards are in the view, and finally only the wooden board at the larger depth is observed. Figure 11b compares the localization results of the different methods: FMT (green triangles), eFMT (blue stars), SVO (blue triangles) and ORB-SLAM3 (brown stars). The results of DSO are omitted here because it fails to track in this scenario. To compensate for the unknown scale factor, the estimated results are aligned to the ground truth (from the tracking system) by manual re-scaling. Since the camera only moves in the x direction, we only show the x positions versus the frames. The absolute error (Table 2) includes the errors in both the x and y directions.
We can see that FMT begins to suffer from scale drift at approximately the 20th frame, where FMT changes the tracked panel, because the new panel is now bigger in the view and thus has a higher peak. That new panel is further away, so its pixels move more slowly, and FMT underestimates the motion compared to the previous frames. In contrast, the proposed eFMT maintains the correct scale until the last frame, because our pattern matching re-scales all unit translation vectors correctly. Compared with SVO and ORB-SLAM3, eFMT tracks each frame more accurately. The absolute trajectory errors in Table 2, including the mean, maximum and median errors, also show that eFMT achieves the smallest error, followed by SVO and ORB-SLAM3; the mean error of eFMT is only a fraction of those of ORB-SLAM3, FMT and SVO. This test shows that eFMT outperforms the popular visual odometry algorithms in this challenging environment, thanks to the robustness of the spectral-based registration.
6.2.2. The UAV Dataset
In addition to the above toy example, we compare the proposed eFMT with FMT, ORB-SLAM3, SVO and DSO on a bigger UAV dataset. Note that even though there are several public UAV datasets [40,41,42,43,44], we could not use them in this paper because we require datasets without roll/pitch due to the properties of our algorithm. Ref. [41] provides such a dataset, the NPU dataset, with no roll/pitch, but the flying height of the UAV is so high that the scenario can be considered single-depth. We tested on one sequence of the NPU dataset, which the UAV collected over farmland. Since the features in this scenario are ambiguous, ORB-SLAM3, SVO and DSO failed to estimate the camera trajectory on this dataset. Both FMT and eFMT succeed in tracking the trajectory with some accumulated error, and their performance is similar due to the single depth. The algorithm presented in [41] works well on this data, but it uses GPS, assumes a planar environment and is a fully fledged SLAM system matching against map points, while eFMT just registers two consecutive image frames.
To show the performance of eFMT in a multi-depth setting, we collected the dataset mentioned above, which is released together with this paper. Our dataset is collected by a down-looking camera mounted on a DJI Matrice 300 RTK. The flying speed is set to 2 m/s and the image capture frequency is 0.5 Hz. The path of the drone over our campus is shown in Figure 1. The aerial vehicle collected 350 frames on a trajectory of about 1400 m. The height above ground is about 80 m, which is approximately 20 m higher than the highest building. As mentioned at the beginning of the experiments, this dataset contains all kinds of different elements, including roofs, road surfaces, a river and grass, some of which are challenging for classic VO methods that are not based on FMT. Furthermore, the multiple depths increase the difficulty for FMT. In this case, we will show that eFMT not only keeps the robustness of FMT but also overcomes its single-depth limitation.
The overall trajectories of the different approaches are shown in Figure 12. The trajectories are aligned with a scale and a rotation calculated from the poses of the 0th and the 80th frame. We refer readers to the attached Supplementary Video for the frame-by-frame results. Since SVO and DSO fail to estimate the camera poses, their trajectories are not included in this figure. It can also be seen that ORB-SLAM3 loses tracking several times, as indicated by the red stars; after each failure, the trajectory of ORB-SLAM3 is realigned. Both FMT and eFMT succeed in estimating the camera poses until the end of the dataset, though the translation has some drift. To evaluate the performance of FMT, eFMT and ORB-SLAM3, we compare these methods only up to the frame where ORB-SLAM3 fails. The performances of the different approaches are shown in Figure 13. From the enlarged view on the right, we can see that the estimated speeds of eFMT and ORB-SLAM3 are almost constant, as indicated by the equal distances between the frames. This is consistent with the centimeter-grade RTK-GPS ground truth. However, the estimated speed of FMT changes with the view: for instance, the speed is faster from frame 125 to 132 than from frame 132 to 138, because the dominant plane is the ground in the former case, whereas it changes to a roof in the latter case. In addition, Figure 14 displays the absolute translation error versus distance, computed with the evaluation tool from [45]. Comparing only the part where all three approaches track successfully, the performance of eFMT is on a par with ORB-SLAM3, and both are better than FMT, because FMT suffers from the different depths.
Please note that there are continuous line segments in the translation PSD when there are slanted planes in the view. As shown in Figure 15, the buildings in Images 1 and 2 appear inclined due to the perspective projection, which yields the line segments (left of the red center) in the translation PSD below. In the UAV dataset, such inclined planes are common, thus pattern matching is necessary for re-scaling. The estimated trajectory in Figure 12 shows that eFMT can handle such slanted-plane cases.
Thus, this experiment shows that eFMT has two advantages: (1) it successfully extends FMT to multi-depth environments, i.e., no matter whether the multiple depths are continuous (e.g., slanted planes) or discrete (e.g., roofs and ground), eFMT can track the camera motion; (2) it keeps the robustness of FMT, in that it can still track the camera motion in feature-deprived scenarios, such as building roofs, where classic VO methods may fail. The NPU experiment mentioned at the beginning of Section 6.2.2 also supports the second point.
6.3. Robustness on Continuous Depth
As observed in our UAV dataset, buildings in the camera's view may become slanted planes due to the perspective projection. In this case, there is a continuous line segment in the translation PSD corresponding to the slanted plane (see Figure 15). In this section, we explore the influence of continuous depths on the PSDs, which is important for the performance of eFMT. For that, we simulate a plane and a robot arm in Gazebo, as in Figure 8, and then tilt the robot arm away from the perpendicular to the plane, such that the plane becomes a slanted plane in the camera's view (see Figure 16p). In other words, the depth in the view is now continuous instead of consisting of discrete planes, which can be considered a limit condition of multi-depth. We collect images by mainly moving the camera along the y-axis of the camera frame (see Figure 8), since this is more complex than moving along the x-axis: when the camera moves along the x-axis, the relative depth in the camera view stays the same.
Figure 16 shows the translation PSDs for different magnitudes of the camera's motion. When the tilt is 0°, there is only a single depth in the view. It can be seen that the darker the blue is, the clearer the high values are. From the second row of Figure 16, we can see that the highest peak spreads out into a wide line and the highest energy becomes smaller when the tilt of the camera gets bigger. Since eFMT finds the sector with maximum energy, it can still find the unit translation vector in this condition and implement the re-scaling with all the energy in the maximum sector. In contrast, FMT may fail in this case because it only tracks the highest energy: the highest energy peak is prone to change, i.e., prone to be associated with different depths, due to noise and scenario similarity.
From each column of Figure 16, it can be seen that the energy spreads out more along the maximum sector as the camera moves more. This means that it becomes more evident that different depths contribute to different pixels; when the motion is small, different depths may contribute to the same pixel due to the limited image resolution. Please note that there are multiple peaks with high energy, also in the opposite direction, in Figure 16d,e, which is due to the periodic structures in the simulated images. eFMT still has a good chance to work in this case, because the maximum-energy sector can still be found reliably. Also, the pattern matching will still determine the re-scaling based on the best matching scaling, which should be the correct one, since its energy is highest and fits best.
Looking at the different tilt values in Figure 16, we can see that a higher tilt results in a longer line of high energy values and more noise in the PSD. In particular, the PSDs become too noisy to provide distinguishable high values if a big motion is combined with a big tilt: for the largest tilts, eFMT fails when the camera motion is bigger than 0.8 m.
The rotation and zoom PSDs are shown in Figure 17. Similar to the translation PSDs, more noise is introduced into the rotation and zoom PSDs with bigger tilts and bigger motions.
Overall, multiple depths in the scenario introduce more noise into the PSDs. When the motion between two frames is not too big, eFMT can still handle this case. However, if the motion is too large, no distinguishable energies can be found, and eFMT will fail to find the correct maximum-energy sector and thus cannot estimate the motion correctly. Luckily, in the visual odometry task, the motion between two frames is usually small. It will, however, introduce challenges when we want to perform loop closure on frames with big motion in our future work.
6.4. Computation Analysis
Ref. [39] pointed out that the image resolution has a big impact on the FMT algorithm and that image down-sampling does not hurt FMT's performance. In our preliminary tests, we found that this still holds for eFMT. With our single-threaded C++ implementation, eFMT takes roughly twice the run-time of FMT per frame, with the absolute run-time depending on the image resolution. However, the multi-translation calculations for the multi-zoom values are independent of each other, so they can be computed in parallel to speed up the algorithm. Thus, eFMT could run as fast as FMT.
7. Discussion
In the above experiments, we showed that there is no single energy peak in the PSDs when the scenario changes from single depth to multi-depth. Based on this, eFMT extracts the line with the maximum summed energy instead of a single peak, such that it achieves better performance, especially scale consistency, than FMT in multi-depth environments. Our experiments, including one example application on a UAV dataset, show that both FMT and eFMT are more robust than the state-of-the-art VO methods, thanks to the robustness of the spectral description against feature-deprived images, while eFMT successfully removes the single-depth constraint that FMT exhibits.
One main limitation of both eFMT and FMT is that they do not work when the camera rolls or pitches, which narrows their scope of application. Small deviations from this constraint may be compensated by obtaining the gravity vector from an IMU and rectifying the images accordingly, but this does not mean that eFMT could be extended to a general 3D VO algorithm this way.
A major influence on the success of an eFMT image registration is the overlap between the images, which depends on the translation amount versus the distance of the objects. In the general case, when most of the environment is not tilted by more than 45 degrees in the frames, our experiments in Section 6.3 indicate that an overlap of 65% or more seems to be sufficient for successful registration with eFMT. To improve on this, we consider the oversampling strategy in the frequency domain proposed in [34], which could make the high energies of the PSD more distinguishable even with smaller overlap.
In addition, the experiments reveal further issues of eFMT. One is that the accumulated error of eFMT may grow when the camera moves for a long time, as shown in Section 6.2.2. This is of course true for all incremental pose estimators. To overcome it, we will introduce loop closing and pose optimization in the same fashion as current popular VO methods in our future work.
Like any monocular VO algorithm, eFMT is subject to an unknown scale factor, and it will fail if the environment is highly repetitive or does not exhibit enough texture, even though it is better than most other approaches regarding the latter constraint.
8. Conclusions
This paper extends the classical FMT algorithm to handle zoom and translation in multi-depth scenes. We presented a detailed problem formulation and algorithm. Experiments show the clear benefit of our proper re-scaling for visual odometry in scenes with more than one depth and compare it to FMT, indicating that eFMT inherits the advantages of FMT and extends its application scenarios. Moreover, eFMT performs better than the popular VO methods ORB-SLAM3, DSO and SVO in all our experiments, which were performed on datasets collected in challenging scenarios.
In our future work, we will continue to make the proposed eFMT more robust and accurate in two ways: first, by using pose optimization to decrease the accumulated error; second, by exploiting the oversampling strategy to make the PSDs less noisy and the high energies more distinguishable.
As introduced in Section 1, FMT has already been used in all kinds of applications. We can consider eFMT for similar applications whenever the scenario is multi-depth, such as underwater robot localization: underwater turbidity usually prevents feature-based methods from working properly, while spectral methods still achieve acceptable performance. Since the bottom of an underwater scene may not be flat, eFMT is more suitable there than FMT.