Article

Hybrid Visual Odometry Algorithm Using a Downward-Facing Monocular Camera

by Basil Mohammed Al-Hadithi 1,2,*, David Thomas 3 and Carlos Pastor 1
1 Intelligent Control Group, Centre for Automation and Robotics UPM-CSIC, Universidad Politécnica de Madrid, C/J. Gutiérrez Abascal, 2, 28006 Madrid, Spain
2 Department of Electrical, Electronics, Control Engineering and Applied Physics, School of Industrial Design and Engineering, Universidad Politécnica de Madrid, C/Ronda de Valencia, 3, 28012 Madrid, Spain
3 School of Industrial Engineering, Universidad Politécnica de Madrid, C/J. Gutiérrez Abascal, 2, 28006 Madrid, Spain
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(17), 7732; https://doi.org/10.3390/app14177732
Submission received: 1 July 2024 / Revised: 28 July 2024 / Accepted: 28 August 2024 / Published: 2 September 2024
(This article belongs to the Topic Advances in Mobile Robotics Navigation, 2nd Volume)

Abstract:
The increasing interest in developing robots capable of navigating autonomously has led to the necessity of developing robust methods that enable these robots to operate in challenging and dynamic environments. Visual odometry (VO) has emerged in this context as a key technique, offering the possibility of estimating the position of a robot from sequences of images captured by onboard cameras. In this paper, a VO algorithm is proposed that achieves sub-pixel precision by combining optical flow and direct methods. This approach uses only a downward-facing, monocular camera, eliminating the need for additional sensors. The experimental results demonstrate the robustness of the developed method across various surfaces, achieving minimal drift errors in its calculations.

1. Introduction

In recent years, the market for autonomous mobile robots has experienced significant growth, driven by the need for systems that can navigate in dynamic environments. These robots are now commonplace, from robotic vacuum cleaners to industrial robots. All of these machines share a core need to position themselves in the environment so that they can navigate and be useful, which requires reliable algorithms that can perform with high precision at all times. To this end, mobile robots deploy an array of different technologies, the most widely used being odometry. This can be calculated using a variety of sensors, some of the most popular in this field being global positioning systems (GPSs), inertial measurement units (IMUs), and LiDARs. However, a GPS relies on external signals, which can limit its performance in certain conditions, such as underground environments. Additionally, IMUs and LiDARs may not always provide a reliable and cost-effective solution for a wide range of applications.
An alternative and classical approach to robot navigation consists of the use of onboard cameras for landmark-based and SLAM techniques. This method, known as visual odometry, estimates the movement of the robot by analyzing the sequence of captured images. Despite being common, this approach has several disadvantages, mainly its high cost, the significant computational resources it requires, and its elevated sensitivity to external conditions and dynamic environments. The objective of this work was to propose a new approach to relative robot positioning that combines the cost savings and simplicity of traditional odometry while addressing the core issues of traditional vision approaches. For this purpose, the proposed system uses a single downward-facing, monocular camera, which provides several benefits: Firstly, it allows for better control over the conditions under which images are captured, enabling the algorithm to perform consistently across a wider range of environmental conditions. Secondly, photos taken directly of the ground are less susceptible to the changes in the environment that outward-facing cameras might encounter (e.g., people passing by), which is a significant benefit in dynamic and unpredictable environments. The feasibility of this configuration has been proven in several studies [1].

1.1. Related Works

Various camera configurations may be used to develop VO systems. One of the most popular approaches involves the use of outward-facing sensors that allow for the calculation of the distance from a robot to surrounding objects. Stereo cameras are widely employed for this purpose [1,2,3,4,5]. For example, in [6], stereo, event-based cameras (sensors that respond to changes in the intensity values of an image) are combined with an IMU to compute VO, while in [7], stereo cameras are used in conjunction with a GPS and an IMU to achieve VO. Other commonly used sensors are RGB-D cameras [8,9], which capture both color (RGB) and depth (D) information, and omnidirectional cameras [10], which provide a 360-degree field of view, although they may require more complex image-processing algorithms.
Another popular solution is the use of monocular cameras [11,12,13], which have the advantage of a lower hardware cost, although they may not perform as competitively as other sensors due to their susceptibility to scale drift [14]. To address this drawback and achieve a reliable VO algorithm, sensor fusion can be employed. For example, in [11,15], a monocular camera is combined with an IMU to achieve a robust VO for micro-aerial vehicles and multicopters, respectively. However, the movement estimation problem with monocular cameras can be significantly simplified in the case of indoor mobile robots, for which an assumption of constant height can be made to reduce the problem to a planar motion estimation, which can be effectively handled using single monocular cameras, as in [12,13]. In [16], a VO system for these types of robots was developed using a downward-facing monocular camera in combination with a forward-looking stereo camera.
Once a sequence of images is obtained using these sensors, robust and reliable algorithms are needed to compute the pose of the robot. For this purpose, the most popular algorithms are those that take advantage of the geometry or features detected in the environment to compute the VO [14]. These algorithms can be broadly divided into two classes: feature-based methods and direct methods.
Feature detection algorithms aim to extract a set of image descriptors from each frame and track them across subsequent images to estimate the motion of the robot by minimizing the reprojection error. For this purpose, several types of feature descriptors can be used, with SIFT and ORB descriptors being among the most popular ones. Additionally, after the features are matched, a filtering process may be added to eliminate outliers that could introduce errors into the algorithm. The disadvantage of this approach is its reliance on the detection and matching of image descriptors, which presents problems in low-texture environments. Furthermore, most feature detectors are optimized for speed, rather than precision, necessitating compensation by averaging movement estimations over multiple features [15], which increases the computational cost of the method. These correspondences can then be used to obtain a set of 2D vectors, called the motion field, that describe the movement of all elements in an image and can subsequently be used to compute the relative 3D movement between a robot and the environment.
On the other hand, direct methods aim to estimate the motion of a robot from spatial and temporal variations in image intensity. These algorithms, which use all the information available in each frame, have exhibited better performance than feature-based methods in terms of robustness in scenes with low texture or in cases of camera defocus and motion blur [15]. However, the computational cost of obtaining the photometric errors needed for these methods is higher than that of the reprojection errors used in feature-based methods. In [12,13], a VO algorithm was developed by cutting a template centered on one frame of a sequence and progressively moving it around in the next frame, minimizing the photometric error to obtain the movement between consecutive images. Another tool for developing direct VO systems consists of optical flow methods, which describe the motion between two frames as the apparent movement of intensity patterns in an image. In [1,11], this type of technique was used with downward-facing cameras to estimate velocity in mobile robots and multicopters, respectively. This approach achieves a good approximation of the motion field under certain lighting conditions (no photometric distortion and constant brightness in all directions of an image). However, the calculation involves computing the inverse of a matrix of local brightness gradients, which is impossible if this matrix is singular for that region. To address this issue, the calculation can be performed by considering regions around significant points in the image obtained with a feature-detection method. This is the approach followed by the Lucas–Kanade (L-K) algorithm [17], which was used in [2,11].
Other methods combine the advantages of both feature-based and direct algorithms, like the hybrid, semidirect approach used in [15]. This technique extracts features only from selected keyframes, employing direct techniques to solve pose matching and an L-K optical flow algorithm to obtain sub-pixel accuracy. This method significantly reduces the processing required for feature matching and improves the robustness of the algorithm [18].
Although these previous approaches still dominate the field of VO [14], recent years have seen a significant increase in the importance of deep learning techniques. These methods have shown great performance both in the development of complete VO systems [8,14,19,20,21] and in improving specific tasks within geometry-based algorithms, such as aiding depth estimation in monocular cameras [22] or using the concept of aleatoric uncertainty to calculate photometric and depth uncertainties, leading to greater accuracy [23]. However, there are still many challenges to solve in order to achieve a robust and accurate VO using these techniques [24].
The remainder of this paper is organized as follows: Section 2 provides a detailed explanation of the algorithm process, while Section 3 describes the setup developed to validate the VO method and its calibration. Section 4 presents the results of the experiments and an analysis of these efforts, and in Section 5, a discussion of the proposed method is provided. Finally, Section 6 concludes this paper.

2. Proposed VO System

In this section, the working principles of the VO solution are introduced, and the objectives and methodology of the implementation are described in detail. The section begins with an overview of the entire system, followed by a detailed presentation of each step of the process.

2.1. Intended Application and Considerations

The objective of this work was to present a proof of concept for a new visual odometry solution that provides positional feedback to a mobile robot using only a single onboard visual sensor. This system is specifically designed for mobile robots that operate in dense and dynamic environments, such as museums or hospitals, where traditional methods may be difficult to apply because people and movable objects interfere with sensor readings.

2.2. System Overview

The developed structure was designed to include a monocular camera positioned inside a mobile robot, facing downward at a consistent height above the ground. This setup leverages the detected movement of the floor pattern, common in indoor environments, to estimate the robot’s trajectory, needing only to perform 2D calculations. This approach simplifies the process compared to other VO methods that use external cameras and may need to reconstruct three-dimensional points in order to execute the algorithm.
The proposed method follows the process illustrated in Figure 1 to calculate the VO. For each new frame, a flow analysis step is carried out to perform feature extraction and matching between these features and the ones in the previous frame. After this, the movement is estimated by minimizing the reprojection error of all key points using the least squares method. Finally, a template matching search is performed in a small area around the estimated movement, enabling sub-pixel accuracy for the algorithm.

2.3. Solution Implementation

This section presents the flow of the algorithm, covering the flow analysis process to track feature movement across frames, the estimation of the movement using least squares, and the sub-pixel accuracy method.
The system continuously collects a sequence of images that can be represented as $I_{0:k} = \{I_0, \ldots, I_k\}$, where $I_k$ denotes the image taken at time $k$. Once a new frame, $I_{k+1}$, is captured, an L-K algorithm is employed to extract the features of the image and match them to those of the previous frame in the sequence, $I_k$. This process results in a field of movements that describes how each key point moves across frames, which is used to estimate the global motion of the robot in the next step.
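As an illustration of this step, a minimal Python sketch is given below using OpenCV's pyramidal L-K implementation; the Shi–Tomasi detector and the parameter values are assumptions, since the text does not specify them.

import cv2
import numpy as np

def flow_analysis(prev_gray, next_gray, max_corners=200):
    """Extract key points in the previous frame and track them into the next
    frame with the pyramidal Lucas-Kanade algorithm (OpenCV)."""
    # Shi-Tomasi corners as key points (detector choice is an assumption).
    prev_pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=max_corners,
                                       qualityLevel=0.01, minDistance=7)
    if prev_pts is None:
        return None, None
    # Pyramidal L-K tracking of the key points into the new frame.
    next_pts, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray,
                                                      prev_pts, None,
                                                      winSize=(21, 21), maxLevel=3)
    ok = status.ravel() == 1
    # Matched key points: P_k (previous frame) and P_{k+1} (new frame).
    return prev_pts[ok].reshape(-1, 2), next_pts[ok].reshape(-1, 2)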

2.3.1. Movement Estimation

The movement between two frames, $I_k$ and $I_{k+1}$, can be represented by a transformation matrix, $T_{k+1,k} \in \mathbb{R}^{3 \times 3}$, which takes the following form:
$$T_{k+1,k} = \begin{bmatrix} R_{k+1,k} & t_{k+1,k} \\ 0 & 1 \end{bmatrix} \tag{1}$$
where $R_{k+1,k}$ represents the rotation matrix, and $t_{k+1,k}$ is the translation vector that describes the movement. On the other hand, the position of the robot at time $k$ can be represented as a vector, $C_k = [c_x, c_y, 1]^T$, where $c_x$ and $c_y$ represent the pose on the x and y axes, respectively. The sequence of positions of the robot can be described as $C_{0:k} = \{C_0, C_1, \ldots, C_k\}$. These positions refer to both the camera coordinates and the robot coordinates, since they can be assumed to be the same for the VO algorithm. In this way, the new pose of the robot, $C_{k+1}$, at time $k+1$ can be obtained from the previous one, $C_k$, and the calculated movement between the two frames, $T_{k+1,k}$, as follows:
$$C_{k+1} = T_{k+1,k} \cdot C_k \tag{2}$$
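A minimal sketch of Equations (1) and (2) in Python is shown below; the function names and the example values are illustrative only.

import numpy as np

def build_transform(R, t):
    """Assemble the homogeneous transformation matrix of Equation (1)."""
    T = np.eye(3)
    T[:2, :2] = R          # 2x2 rotation block R_{k+1,k}
    T[:2, 2] = t           # translation t_{k+1,k}
    return T

def update_pose(C_k, R, t):
    """Apply Equation (2): C_{k+1} = T_{k+1,k} · C_k, with C = [c_x, c_y, 1]^T."""
    return build_transform(R, t) @ C_k

# Illustrative use (values are placeholders):
C_k = np.array([0.0, 0.0, 1.0])
theta = np.deg2rad(1.0)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
C_k1 = update_pose(C_k, R, np.array([0.5, 0.2]))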
To obtain the transformation matrix $T_{k+1,k}$ between frames $k$ and $k+1$, the sets of features of each frame and their movement obtained in the flow analysis process are used. First, the translation vector, $t_{k+1,k}$, is calculated as the difference of the centers of mass of both sets of key points, $t_{k+1,k} = G_{k+1} - G_k$, where $G_i$ represents the center of mass of the set of key points in frame $i$. For the calculation of the rotation matrix, $R_{k+1,k}$, the reprojection error between the sets of key points of both frames is minimized. Denoting by $\hat{P}_k$ the set of features from frame $k$ displaced by the translation vector $t_{k+1,k}$, and by $P_{k+1}$ the set of key points from frame $k+1$, a rotation matrix $R_{k+1,k}$ can be obtained that fulfills the following:
$$\hat{P}_k = R_{k+1,k} \cdot P_{k+1} \tag{3}$$
This matrix is obtained by minimizing the quadratic rotation error, which poses the problem as the minimization of the following expression:
$$\frac{1}{N} \cdot \sum_{i=1}^{N} \left(\hat{P}_k - R_{k+1,k} \cdot P_{k+1}\right)^T \cdot \left(\hat{P}_k - R_{k+1,k} \cdot P_{k+1}\right) \tag{4}$$
By mathematically developing this equation, the problem can be transformed into maximizing the following expression:
$$\frac{1}{N} \cdot \sum_{i=1}^{N} \left(\hat{P}_k^{\,T} \cdot R_{k+1,k} \cdot P_{k+1}\right) = \mathrm{tr}\!\left(R_{k+1,k}^{T} \cdot \frac{1}{N} \cdot \sum_{i=1}^{N} \left(\hat{P}_k^{\,T} \cdot P_{k+1}\right)\right) \tag{5}$$
where $\mathrm{tr}(A)$ represents the trace of the matrix $A$. The term $\frac{1}{N} \cdot \sum_{i=1}^{N} (\hat{P}_k^{\,T} \cdot P_{k+1})$ can be represented as a matrix, $c$, also called the cross-dispersion or correlation matrix. This matrix can be decomposed into its singular values as $c = u \cdot w \cdot v^T$, where $u$ and $v$ are orthogonal matrices, and $w$ contains all the singular values of $c$. Therefore, Equation (5) can be written as follows:
$$\mathrm{tr}\!\left(R_{k+1,k}^{T} \cdot c\right) = \mathrm{tr}\!\left(R_{k+1,k}^{T} \cdot u \cdot w \cdot v^T\right) \tag{6}$$
Since $\mathrm{tr}(a \cdot b)$ equals $\mathrm{tr}(b \cdot a)$, the terms in (6) can be cyclically rearranged, leaving $w$ at the end:
$$\mathrm{tr}\!\left(v^T \cdot R_{k+1,k}^{T} \cdot u \cdot w\right) = \mathrm{tr}\!\left(c' \cdot w\right) \tag{7}$$
where $c'$ is defined as $c' = v^T \cdot R_{k+1,k}^{T} \cdot u$. Since $w$ is a diagonal matrix, only the diagonal elements of $c'$ contribute to the trace. Furthermore, since all three matrices composing $c'$ are orthogonal, $c'$ must also be orthogonal. As a consequence, $\mathrm{tr}(c' \cdot w)$ is maximized when $c'$ is equal to $I$. This finally allows the rotation matrix to be determined as shown below:
$$R_{k+1,k} = u \cdot v^T \tag{8}$$
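A sketch of this estimation step is given below. Centering both point sets on the common centroid and the determinant check against reflections are standard details of this SVD-based solution that the text does not spell out, so they should be read as assumptions; the variable names are illustrative.

import numpy as np

def estimate_motion(P_k, P_k1):
    """Estimate the planar rigid motion between two sets of matched key points.

    P_k, P_k1 : (N, 2) arrays of matched key points in frames k and k+1.
    Returns (R, t): rotation matrix R_{k+1,k} and translation t_{k+1,k}.
    """
    G_k, G_k1 = P_k.mean(axis=0), P_k1.mean(axis=0)
    t = G_k1 - G_k                       # difference of the centers of mass
    P_hat_k = P_k + t                    # key points of frame k displaced by t
    # Cross-dispersion (correlation) matrix between the two centered sets.
    c = (P_hat_k - G_k1).T @ (P_k1 - G_k1) / len(P_k)
    u, w, vT = np.linalg.svd(c)
    R = u @ vT                           # Equation (8): R_{k+1,k} = u · v^T
    if np.linalg.det(R) < 0:             # reflection guard (standard addition)
        u[:, -1] *= -1
        R = u @ vT
    return R, t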
These calculations finally allow the transformation matrix in Equation (1) to be built. However, it is possible that some false correspondences were made in the flow analysis process, introducing an error into the VO. Therefore, a filtering process is employed. This procedure operates under the principle that all key points from one image must comply with the transformation matrix $T_{k+1,k}$. Specifically, any point that does not align correctly will exhibit a significantly larger reprojection error with the transformation matrix compared to other points. Consequently, the proposed algorithm discards all points whose reprojection error exceeds the mean error by more than one standard deviation. The results of this process are shown in Figure 2, where a cluster of mismatched features in the lower right side of the image is removed after the filtering is applied. Finally, after the filtering process, the transformation matrix $T_{k+1,k}$ is recalculated.
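A sketch of this filter is shown below; the exact definition of the per-point reprojection error (here, the distance between each displaced key point and its rotated counterpart) is an interpretation of the text.

import numpy as np

def filter_matches(P_hat_k, P_k1, R):
    """Keep only matches whose reprojection error does not exceed the mean
    error by more than one standard deviation (Section 2.3.1)."""
    G = P_k1.mean(axis=0)
    projected = (P_k1 - G) @ R.T + G              # R_{k+1,k} applied to P_{k+1}
    errors = np.linalg.norm(P_hat_k - projected, axis=1)
    return errors <= errors.mean() + errors.std() # boolean inlier mask

After filtering, the motion estimation sketched above is run again on the surviving matches to recompute $T_{k+1,k}$, as described in the text.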

2.3.2. Sub-Pixel Precision

Optical flow allows for the development of a fast VO algorithm capable of running in real time. However, due to the nature of the algorithm and how the tracking is performed, the system is prone to deviations and drift. To address this issue and increase precision, a subsequent direct method is used to improve the overall estimate and obtain sub-pixel accuracy, as illustrated in Figure 3. This method, also called template matching, compares a fixed template of size $T_w \times T_h$ centered at $O$ in the new image of the sequence, $I_{k+1}$, with an equivalent template in the previous frame, $I_k$. This template is moved iteratively across a search area to find the maximum correspondence between both templates, represented as the area of size $T_w \times T_h$ surrounding $O'$. Using this representation, the true movement between both frames can then be described as the vector $\overrightarrow{OO'}$.
This method performs well as part of VO systems in environments with low texture [1,12,13]. However, due to its high computational cost, it is only applied within a small search area, based on the assumption that the true match $O'$ in frame $I_k$ will be situated near the point obtained by displacing $O$ by the translation vector $\hat{t}_{k+1,k}$ estimated previously. This structure allows for a rapid and robust execution and enables the calculation of a correction vector, $t_{TM}$, between this predicted point and $O'$, which finally allows the true movement across frames to be computed as follows:
$$t_{k+1,k} = \hat{t}_{k+1,k} + t_{TM} \tag{9}$$
where $\hat{t}_{k+1,k}$ represents the movement estimated after the filtering process. The size of the template, $T_w \times T_h$, is made as large as the frame allows in order to avoid losing information with small templates. These dimensions are determined using the following expressions:
$$T_w = I_w - S_w, \qquad T_h = I_h - S_h \tag{10}$$
where $I_w$ and $I_h$ are the width and height of the image, and $S_w$ and $S_h$ represent the width and height of the defined search area, respectively. In this work, a small search area of 5 × 5 pixels was chosen, assuming that the movement estimation from the previous process was close to the true solution; this significantly reduces the computational cost of this step. The comparison between templates is performed by calculating the normalized cross-correlation coefficient, which yields good results, reduces false matches, and is robust to changes in external illumination [12]. This coefficient is defined as follows:
$$R(x,y) = \frac{\sum_{i,j} T'(i,j) \cdot I'(x+i,\, y+j)}{\sqrt{\sum_{i,j} T'(i,j)^2 \cdot \sum_{i,j} I'(x+i,\, y+j)^2}} \tag{11}$$
where $R(x,y)$ represents the similarity between the fixed template of image $k+1$ and the moving template of image $k$ centered at pixel $(x,y)$. $(i,j)$ are the coordinates of the pixels in the moving template, which fulfill $0 \le i \le T_w - 1$ and $0 \le j \le T_h - 1$. Finally, $T'(i,j)$ and $I'(x+i, y+j)$ are obtained as follows:
$$T'(i,j) = T(i,j) - \frac{\sum_{x,y} T(x,y)}{T_w \cdot T_h}, \qquad I'(x+i,\, y+j) = I(x+i,\, y+j) - \frac{\sum_{x,y} I(x+i,\, y+j)}{T_w \cdot T_h} \tag{12}$$
where $T(i,j)$ represents the $(i,j)$ pixel value of the fixed template $T$, and $I(x+i, y+j)$ represents the $(i,j)$ pixel value of the moving template centered at $(x,y)$. The algorithm repeats this process for all pixels of the defined search area around $O$, producing the correlation map illustrated in Figure 4, whose maximum corresponds to $O'$.
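A sketch of this refinement step is given below. OpenCV's TM_CCOEFF_NORMED method corresponds to the mean-subtracted normalized cross-correlation of Equations (11) and (12); shifting the previous frame to center the search area, the zero-padded borders, and the parabolic interpolation used to extract a sub-pixel peak are assumptions, since the paper does not detail these points.

import cv2
import numpy as np

def subpixel_refinement(I_k, I_k1, t_est, S=5):
    """Refine the estimated inter-frame translation with template matching.

    I_k, I_k1 : previous and new grayscale frames (same size, uint8).
    t_est     : (tx, ty) feature displacement from frame k to k+1 estimated
                by the flow / least-squares step.
    S         : search-area size in pixels (5 x 5 in the paper).
    """
    h, w = I_k1.shape
    T_w, T_h = w - S, h - S                        # Equation (10): maximal template
    half = S // 2
    # Fixed template: central T_w x T_h patch of the new frame I_{k+1}.
    template = I_k1[half:half + T_h, half:half + T_w]
    # Shift the previous frame by the rounded estimate so the search is
    # centered on the predicted match (borders are zero-padded by warpAffine).
    tx, ty = int(round(t_est[0])), int(round(t_est[1]))
    M = np.float32([[1, 0, tx], [0, 1, ty]])
    shifted = cv2.warpAffine(I_k, M, (w, h))
    # Mean-subtracted normalized cross-correlation, Equations (11)-(12).
    R = cv2.matchTemplate(shifted, template, cv2.TM_CCOEFF_NORMED)  # (S+1, S+1)
    _, _, _, (mx, my) = cv2.minMaxLoc(R)
    mx_f, my_f = float(mx), float(my)
    # Sub-pixel peak via parabolic interpolation of the correlation surface
    # (an assumption: the paper does not state how the sub-pixel value is taken).
    if 0 < mx < R.shape[1] - 1:
        a, b, c = R[my, mx - 1], R[my, mx], R[my, mx + 1]
        if abs(a - 2 * b + c) > 1e-12:
            mx_f += 0.5 * (a - c) / (a - 2 * b + c)
    if 0 < my < R.shape[0] - 1:
        a, b, c = R[my - 1, mx], R[my, mx], R[my + 1, mx]
        if abs(a - 2 * b + c) > 1e-12:
            my_f += 0.5 * (a - c) / (a - 2 * b + c)
    t_TM = np.array([half - mx_f, half - my_f])    # offset of the peak from O
    return np.array([tx, ty], dtype=float) + t_TM  # Equation (9)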
Once the sub-pixel precision step has been completed and the movement between frames $I_k$ and $I_{k+1}$ has been estimated, the system updates the robot's new coordinates on the map. After this, a new iteration begins, comparing a newly captured image, $I_{k+2}$, with the last frame, $I_{k+1}$. However, if any error arises during the calculation process, the system starts the next iteration without updating the last frame, thus comparing the new frame $I_{k+2}$ with the frame $I_k$.
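Bringing the previous sketches together, the per-frame pipeline of Figure 1 and the frame-update logic described above can be outlined as follows. The names process_frame and run_odometry are illustrative, error conditions are simplified to a generic exception, and the helpers flow_analysis, estimate_motion, filter_matches, subpixel_refinement, and update_pose refer to the sketches given earlier in this section.

def process_frame(I_k, I_k1, C_k):
    """One iteration of the pipeline in Figure 1, built from the sketches above."""
    P_k, P_k1 = flow_analysis(I_k, I_k1)            # feature extraction + matching
    R, t_hat = estimate_motion(P_k, P_k1)           # least-squares motion estimate
    keep = filter_matches(P_k + t_hat, P_k1, R)     # discard outlier matches
    R, t_hat = estimate_motion(P_k[keep], P_k1[keep])
    t = subpixel_refinement(I_k, I_k1, t_hat)       # template-matching refinement
    return update_pose(C_k, R, t)                   # C_{k+1} = T_{k+1,k} · C_k

def run_odometry(frames, C_0):
    """Process a frame sequence, keeping the last good reference frame when an
    iteration fails so that the next image is compared against it."""
    C, ref = C_0, next(frames)
    for new in frames:
        try:
            C = process_frame(ref, new, C)
            ref = new                               # update the reference frame
        except Exception:
            continue                                # keep 'ref'; retry with next frame
    return C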

3. System Validation

This section aims to demonstrate the effective performance of the proposed solution in a controlled scenario. Firstly, the characteristics of the setup built for the experiments are described, and in the following section, the results obtained from these tests are presented and analyzed.

3.1. Experimental Setup

The VO algorithm proposed in this work was validated through a series of tests carried out in a controlled environment, as shown in Figure 5. For this purpose, a Cartesian gantry was built to allow the free movement of the camera in the horizontal plane. The sensor mounted on the setup was a 2-megapixel ELP USB camera (Ailipu Technology, Shenzhen, China) connected via USB to a basic laptop running the developed program in Python. During this initial testing, the program ran at 16 fps on a low-performance computer. This frame rate would only be suitable for slow-moving robots; for the intended application, this might be sufficient, but further optimization of the software and hardware would be needed to apply this system more widely while maintaining a good level of precision.
Under the camera, different patterns can be placed to evaluate the performance of the method in various environments. Specifically, the results were obtained using two different patterns, as shown in Figure 6: an Aruco pattern and a generic floor-tile texture. The Aruco pattern in Figure 6 (left) simulates a specialized surface with high texture where features are easier to extract. It also demonstrates one key feature of this system: placing scannable codes on the floor at specific points allows the robot to read information from the environment. This can be used in several ways, such as marking landmarks and keep-out areas or providing absolute positioning. The tile pattern in Figure 6 (right) mimics the random texture of ground tiles, validating the performance of the algorithm for use in indoor environments. This pattern was generated using an initial design obtained with the cv::randpattern::RandomPatternGenerator class from OpenCV. Additional processing was applied to smooth the original pattern and make it resemble a real tile.
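For illustration only, a rough approximation of this pattern-generation step is sketched below using plain NumPy noise smoothed with a Gaussian blur; the original pattern was produced with OpenCV's cv::randpattern::RandomPatternGenerator class, and all parameters here are arbitrary.

import cv2
import numpy as np

def make_tile_texture(size=(800, 800), blur_ksize=21, seed=0):
    """Generate a smoothed random texture that loosely mimics a floor tile
    (illustrative approximation of the described processing)."""
    rng = np.random.default_rng(seed)
    noise = rng.random(size).astype(np.float32)
    # Smooth the raw noise so it resembles the low-texture surface of a tile.
    smooth = cv2.GaussianBlur(noise, (blur_ksize, blur_ksize), 0)
    return cv2.normalize(smooth, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)

cv2.imwrite("tile_pattern.png", make_tile_texture())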

3.2. Initial Testing and Analysis

When the first tests of the algorithm were performed, a constant drift was observed, causing the estimated trajectory to consistently shift to the right. Specifically, this alteration caused all movements to the right along the horizontal axis to be magnified by a constant ratio. This issue is illustrated on the left side of Figure 7, where a test shows the estimated route followed by the camera after four loops of a 70 mm square path were completed. The initial pose is marked in red and the final pose in orange. This drift is caused by optical effects of the lens, the image capturing and processing, and errors in the system setup.

3.3. System Calibration

Since the sources of error are constant in nature, we opted to calibrate the algorithm in order to eliminate this drift. To achieve this compensation, a constant, $K_{compensation}$, was introduced into the algorithm to adjust the horizontal movement. This factor is calculated by averaging the ratio between the real length of every pure horizontal movement and its theoretical length, using the following expression:
$$K_{compensation} = \frac{1}{N} \cdot \sum_{i=1}^{N} \frac{\text{real length of the } i\text{-th horizontal movement}}{\text{theoretical length of the } i\text{-th horizontal movement}} \tag{13}$$
where $N$ represents the number of pure horizontal movements performed along the path. The results of this compensation are shown in the right illustration of Figure 7, where it can be seen that the drift visible on the left of Figure 7 has been eliminated. This solution will be further refined in the future integration of the algorithm into a mobile robot, which will finally demonstrate its validity in real-world applications. Using a faster camera with a global shutter would also greatly mitigate this issue, but we did not have one at hand, and it would have made the experimental system more costly.
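A sketch of this calibration step is shown below. The interpretation that the "real" length is the ground-truth displacement and the "theoretical" length is the one estimated by the algorithm, as well as the placeholder measurement values, are assumptions.

import numpy as np

def compensation_constant(real_lengths, estimated_lengths):
    """Equation (13): average ratio between the real (ground-truth) and the
    estimated lengths of the pure horizontal calibration movements."""
    real = np.asarray(real_lengths, dtype=float)
    est = np.asarray(estimated_lengths, dtype=float)
    return float(np.mean(real / est))

# Placeholder calibration data in mm (the actual measurements are not given).
K_comp = compensation_constant([70.0, 70.0, 70.0], [72.1, 71.8, 72.4])

def compensate(t):
    """Scale the horizontal component of an estimated translation."""
    return np.array([t[0] * K_comp, t[1]])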

4. Results

To test the algorithm, the camera was moved manually, following 70 × 70 mm square paths on each of the patterns. The paths estimated using the visual odometry algorithm are shown in Figure 8, highlighting the initial pose in red and the final pose in orange. In addition, the true trajectory of the camera is marked in red. However, in some figures, this path is difficult to see due to the low deviation of the system. The results obtained with these tests were analyzed qualitatively by comparing the shape of the actual path with the path estimated using the VO algorithm and quantitatively by analyzing the error of the method in the final position.
After multiple iterations, the errors in the final position were averaged; they are shown in Table 1. The absolute error corresponds to tests with a total traveled distance of 1120 mm.
The system was also tested using random motion via hand input. The objective of these tests was to evaluate the drift performance of the system. For this purpose, we moved the camera randomly over the pattern, making sure to start and end on the same point by visually aligning the camera with a dot on the pattern. These tests are shown in Figure 9. In these tests, it can be seen that the error between the initial and final position remained low, consistent with the results in Table 1.

4.1. Analysis

The system exhibited strong performance in both test scenarios, as evidenced by Figure 8. The estimated path of the algorithm aligned closely with the true 70 × 70 mm square path that the camera followed. Furthermore, the errors shown in Table 1 are minimal, even when the manual handling of the structure holding the camera is accounted for. Additionally, the performance of the algorithm in low-texture environments was proven by the tests using the tile pattern, which showed only slightly higher errors compared to the tests using the high-texture Aruco pattern.

5. Discussion

The developed solution provides robust pose estimation for any mobile robot using only a monocular camera. It was designed and developed for robots operating in dense public spaces, such as museums and hospitals, where determining a robot’s location is particularly challenging. While it can also be applied to industrial environments, achieving optimal performance in these settings would require more advanced hardware.
Compared with other pose-estimation methods in the state of the art, the proposed system offers the advantage of achieving high precision with a single downward-facing sensor. This configuration ensures better control of the image-capturing conditions and immunity to external perturbations that may interfere with location estimation, as can occur with VO systems employing outward-facing cameras or LiDARs. Furthermore, unlike traditional odometry, where the position is estimated by summing the movements of each motor, VO is immune to wheel slippage, which is common on polished indoor surfaces. Finally, this method is applicable to any mobile robot, unlike other state-of-the-art solutions, which may focus only on certain models, such as those in [12,13], whose study was limited to robots with the Ackermann steering model.
As is typical of odometry solutions, the system presents a degree of drift, as seen in the testing results. We did use the calibration procedure to mitigate this error, but unlike other odometry solutions, in our system, this issue can be addressed by strategically placing floor markers in an installation. These markers can be used by the system to determine its exact, absolute position, enabling the constant correction of the system. This can be achieved using stickers placed directly onto a floor, such as QR codes, or by using more subtle geometric patterns like dot or line patterns to encode information. For example, this can be applied directly by using floor-tile boundaries to minimize drift where available. These solutions allow the system to determine the robot’s absolute position or control the behavior of the robot with minimal installation costs.

6. Conclusions

In this work, a proof of concept of a hybrid visual odometry solution has been presented. This solution combines optical flow and direct methods, and it was designed to be used with a single downward-facing, monocular camera on low-texture surfaces in dynamic, indoor public spaces. The proposed solution offers several key advantages in comparison with other pose-estimation methods, such as being immune to environmental conditions that may affect external sensors and being robust to foot traffic and dynamic objects in the environment. In addition, the system maintains high precision, mitigating the drift issues present in other odometry techniques. Furthermore, the system provides a mechanism to identify landmarks and special areas with minimal installation effort and without changes in programming.
To achieve this result, the approach begins with the extraction and matching of image features using the Lucas–Kanade algorithm, followed by motion estimation, minimizing the error of the reprojection of the detected features across frames, and a filtering process to mitigate errors. Finally, sub-pixel accuracy is achieved by employing a template-matching algorithm that minimizes the photometric error between consecutive frames. The experimental results demonstrate the robustness of the algorithm on various surfaces, including tile-like patterns common in indoor environments in which the system is designed to operate. Furthermore, it has been shown how the method is able to maintain a low drift error.
The proposed method was tested on a system built as a demonstration platform designed to prove the feasibility of the method and verify that it achieves the desired objective. In subsequent iterations, we plan to address the low frame rate by using a faster, better-suited camera and by implementing the system in C instead of Python. We also believe the sub-pixel method could be further improved. Future work will include improving the setup with a higher-resolution camera, which is expected to increase the accuracy of the system, as well as integrating the system into an omnidirectional robot, which will facilitate extensive testing in real-world scenarios and provide further validation of the performance of the algorithm.

Author Contributions

Concept formulation, B.M.A.-H. and C.P.; methodology, B.M.A.-H., D.T. and C.P.; software development, D.T. and C.P.; validation, B.M.A.-H., D.T. and C.P.; formal analysis, B.M.A.-H., D.T. and C.P.; investigation, B.M.A.-H., D.T. and C.P.; resources, B.M.A.-H. and C.P.; original draft preparation, D.T.; review and editing, B.M.A.-H. and C.P.; visualization, D.T. and C.P.; supervision, B.M.A.-H. All authors have read and agreed to the published version of the manuscript.

Funding

This paper is part of the R&D project “Cognitive Personal Assistance for Social Environments (ACOGES)”, reference PID2020-113096RB-I00, funded by MCIN/AEI/10.13039/501100011033.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Song, X.; Althoefer, K.; Seneviratne, L. A robust downward-looking camera based velocity estimation with height compensation for mobile robots. In Proceedings of the 2010 11th International Conference on Control Automation Robotics & Vision, Singapore, 7–10 December 2010; IEEE: Piscataway, NJ, USA, 2010; pp. 378–383. [Google Scholar]
  2. Liu, Y.; Zhou, Z. Optical Flow-Based Stereo Visual Odometry with Dynamic Object Detection. IEEE Trans. Comput. Soc. Syst. 2022, 10, 3556–3568. [Google Scholar] [CrossRef]
  3. Cvišić, I.; Marković, I.; Petrović, I. Soft2: Stereo visual odometry for road vehicles based on a point-to-epipolar-line metric. IEEE Trans. Robot. 2022, 39, 273–288. [Google Scholar] [CrossRef]
  4. Yin, H.; Liu, P.X.; Zheng, M. Stereo visual odometry with automatic brightness adjustment and feature tracking prediction. IEEE Trans. Instrum. Meas. 2022, 72, 5000311. [Google Scholar] [CrossRef]
  5. Kottath, R.; Poddar, S.; Sardana, R.; Bhondekar, A.P.; Karar, V. Mutual information based feature selection for stereo visual odometry. J. Intell. Robot. Syst. 2020, 100, 1559–1568. [Google Scholar] [CrossRef]
  6. Niu, J.; Zhong, S.; Zhou, Y. IMU-Aided Event-based Stereo Visual Odometry. arXiv 2024, arXiv:2405.04071. [Google Scholar]
  7. Nezhadshahbodaghi, M.; Mosavi, M.R.; Hajialinajar, M.T. Fusing denoised stereo visual odometry, INS and GPS measurements for autonomous navigation in a tightly coupled approach. GPS Solut. 2021, 25, 47. [Google Scholar] [CrossRef]
  8. Liu, Q.; Zhang, H.; Xu, Y.; Wang, L. Unsupervised deep learning-based RGB-D visual odometry. Appl. Sci. 2020, 10, 5426. [Google Scholar] [CrossRef]
  9. Zhang, F.; Li, Q.; Wang, T.; Ma, T. A robust visual odometry based on RGB-D camera in dynamic indoor environments. Meas. Sci. Technol. 2021, 32, 044003. [Google Scholar] [CrossRef]
  10. Won, C.; Seok, H.; Cui, Z.; Pollefeys, M.; Lim, J. OmniSLAM: Omnidirectional localization and dense mapping for wide-baseline multi-camera systems. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; IEEE: New York, NY, USA, 2020; pp. 559–566. [Google Scholar]
  11. Deng, H.; Arif, U.; Yang, K.; Xi, Z.; Quan, Q.; Cai, K.Y. Global optical flow-based estimation of velocity for multicopters using monocular vision in GPS-denied environments. Optik 2020, 219, 164923. [Google Scholar] [CrossRef]
  12. Zeng, Q.; Ou, B.; Lv, C.; Scherer, S.; Kan, Y. Monocular visual odometry using template matching and IMU. IEEE Sens. J. 2021, 21, 17207–17218. [Google Scholar] [CrossRef]
  13. Yu, Y.; Pradalier, C.; Zong, G. Appearance-based monocular visual odometry for ground vehicles. In Proceedings of the 2011 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM), Budapest, Hungary, 3–7 July 2011; IEEE: Piscataway, NJ, USA, 2011; pp. 862–867. [Google Scholar]
  14. Yang, N.; Stumberg, L.V.; Wang, R.; Cremers, D. D3vo: Deep depth, deep pose and deep uncertainty for monocular visual odometry. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; IEEE: New York, NY, USA, 2020; pp. 1281–1292. [Google Scholar]
  15. Forster, C.; Pizzoli, M.; Scaramuzza, D. SVO: Fast semi-direct monocular visual odometry. In Proceedings of the 2014 IEEE International Conference on Robotics and Automation (ICRA), Hong Kong, China, 31 May–7 June 2014; IEEE: New York, NY, USA, 2014; pp. 15–22. [Google Scholar]
  16. Patruno, C.; Renò, V.; Nitti, M.; Mosca, N.; di Summa, M.; Stella, E. Vision-based omnidirectional indoor robots for autonomous navigation and localization in manufacturing industry. Heliyon 2024, 10, e26042. [Google Scholar] [CrossRef] [PubMed]
  17. Lucas, B.D.; Kanade, T. An iterative image registration technique with an application to stereo vision. In Proceedings of the 7th International Joint Conference on Artificial Intelligence (IJCAI’81), Vancouver, BC, Canada, 24–28 August 1981; pp. 674–679. [Google Scholar]
  18. He, M.; Zhu, C.; Huang, Q.; Ren, B.; Liu, J. A review of monocular visual odometry. Vis. Comput. 2020, 36, 1053–1065. [Google Scholar] [CrossRef]
  19. Morra, L.; Biondo, A.; Poerio, N.; Lamberti, F. MIXO: Mixture of Experts-based Visual Odometry for Multicamera Autonomous Systems. IEEE Trans. Consum. Electron. 2023, 69, 261–270. [Google Scholar] [CrossRef]
  20. Pandey, T.; Pena, D.; Byrne, J.; Moloney, D. Leveraging deep learning for visual odometry using optical flow. Sensors 2021, 21, 1313. [Google Scholar] [CrossRef] [PubMed]
  21. Zhan, H.; Weerasekera, C.S.; Bian, J.W.; Reid, I. Visual odometry revisited: What should be learnt? In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 4203–4210. [Google Scholar]
  22. Jan, A.; Seo, S. Monocular depth estimation using res-UNet with an attention model. Appl. Sci. 2023, 13, 6319. [Google Scholar] [CrossRef]
  23. Klodt, M.; Vedaldi, A. Supervising the new with the old: Learning SFM from SFM. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
  24. Wang, K.; Ma, S.; Chen, J.; Ren, F.; Lu, J. Approaches, challenges, and applications for deep visual odometry: Toward complicated and emerging areas. IEEE Trans. Cogn. Dev. Syst. 2020, 14, 35–49. [Google Scholar] [CrossRef]
Figure 1. Flow diagram of the algorithm.
Figure 2. Filtering process: matched features before filtering (left) and filtered features (right).
Figure 3. Schematic diagram of template matching.
Figure 4. Template matching results.
Figure 5. Setup for the experiments.
Figure 6. Employed patterns: Aruco (left) and tile pattern (right).
Figure 7. Path estimated using the VO: without (left) and with drift compensation (right).
Figure 8. Path estimated using the VO algorithm after four loops with different patterns: Aruco (left) and tile (right).
Figure 9. Path calculated using the VO algorithm when the camera was moved along random routes.
Table 1. Average errors in the final position.

Type             Aruco       Tile Pattern
Absolute error   1.255 mm    1.805 mm
Error ratio      0.112%      0.161%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
