1. Introduction
Estimating the geographic position is a fundamental requirement for UAVs, and it is usually achieved using GPS [1]. However, there exist situations where GPS is unreliable, such as when the GPS signal is jammed. Since ample, free, georeferenced maps covering many parts of the globe are available online, such as the satellite imagery from Google Maps or the road maps from OpenStreetMap (OSM), many researchers have tried to utilize georeferenced maps to solve the geolocalization problem for UAVs in GPS-denied environments.
Such methods model geolocalization using a georeferenced map as an image registration problem, where the core issue is to estimate the transformation that aligns the observed aerial image from the onboard camera to a known georeferenced map. The transformation is often modeled as a similarity transformation when the optical axis of the camera is perpendicular to the ground, or more generally as a homography transformation. Many researchers have tried to meet the challenge of the image registration problem using robust low-level vision features [2,3,4,5] or semantic vision features [6,7,8]. Generally speaking, geolocalization using georeferenced maps can be divided into two categories: geolocalization using original satellite imagery and geolocalization using semantic maps, such as road maps or building contours.
Geolocalization using satellite imagery: Geolocalization using satellite imagery is the more intuitive approach. To get around the difficulties in image registration caused by the significant difference between the observed aerial images and satellite imagery, early attempts usually utilized robust low-level vision features. In [2,3], cross-correlation and HOG features were used to measure the similarity between two images and estimate a 2D translation between two rotation-and-scale-aligned images. Ref. [4] used mutual information as the similarity metric and utilized template matching to estimate the similarity transformation. Recently, some researchers have tried to solve the satellite imagery registration problem with deep learning-based methods. Ref. [5] followed the idea proposed in [9] and registered aerial images to satellite imagery by aligning feature maps learned with a VGG16 network [10], reporting a localization accuracy of 8 m. Ref. [11] proposed a localization solution based on Monte Carlo localization, where the similarity between the onboard aerial image and the reference satellite images was measured using a convolutional neural network. Ref. [12] estimated the geographic position of UAVs by aligning onboard aerial images to satellite imagery using SuperPoint [13], a learned local feature detector and descriptor.
Geolocalization using semantic maps: Benefiting from the high stability and reliability of semantic maps and the improved performance of deep learning-based semantic segmentation [14,15,16], geolocalization using semantic maps has attracted the attention of more researchers. In [6], building contours were matched to semantic maps using Hu moments [17] within a carefully designed pipeline. In [7,8], road intersections were utilized to build feature correspondences between an aerial image and a reference road map to reduce the error accumulation caused by inertial navigation. In [18], a large-area road geolocalization method was proposed, where geometric hashing was utilized to align road fragments obtained by car tracking to a known road map. Different from the aforementioned methods, which rely on a similarity transformation hypothesis and thus only work when the camera is nadir, our prior work [19,20] proposed two projective-invariant geometric features and the accompanying matching methods, achieving road network alignment for aerial images with projective perspectives over a sizable search region. In contrast to these two-stage methods, where semantic segmentation and shape matching are separated, Ref. [21] proposed to regress the similarity transformation between an aerial image and its reference road map with a Siamese network. Their approach, however, is only useful at locations with complex-shaped roadways, such as highway intersections.
Camera pose estimation: Even though matching to georeferenced maps has been studied in many works, the subsequent issue, accurate geographic position estimation, is often disregarded. Camera pose estimation is a classical problem in multiple-view geometry and a fundamental task in many computer vision applications. The problem is well solved in the case of estimating the camera pose between two central projection images. It can be solved using the algorithms proposed in [22] or [23] when the scene is assumed to be planar. More generally, Refs. [24,25] proposed methods to recover the camera motion from the fundamental matrix between two frames without the planar assumption. In addition, when the depth of the scene is known, the camera pose can be estimated from correspondences between 2D image points and 3D object points using the methods in [26,27], which address the perspective-n-point (PnP) problem. However, there is still no published work addressing the problem of estimating the camera motion between a central projection image and an orthographic projection image, which is the case when we need to recover the geographic position of the camera by matching to a georeferenced orthographic map.
Some works on geolocalization use the translation of the estimated similarity transformation to recover the geographic position of the camera [2,3,4,21], which is equal to computing the projection point of the image center using the estimated transformation. Such methods only work properly when the optical axis of the camera is perpendicular to the ground or when external information is used to compensate for the deviation [12]. Some other works [18,19,20] reported the geolocalization result as the homography or simplified similarity transformation between the aerial image and the georeferenced map, so the geographic position of the UAV was unavailable. To the best of our knowledge, no published research article has addressed the problem of estimating the geographic position of the camera from a known transformation between the aerial image of an onboard camera and a georeferenced map without assuming that the optical axis of the camera is perpendicular to the ground.
In summary, the main contributions of this article are as follows:
- (1) The initial solution for estimating the camera geographic position and attitude from the homography transformation between the aerial image from an onboard camera and the georeferenced map is derived.
- (2) A fast and robust position refining method is proposed, which improves the accuracy of geographic position estimation even when the homography transformation estimation is noisy.
- (3) A real-time, continuous, road-matching-based geolocalization method for UAVs is presented.
2. Materials and Methods
In this section, we introduce in detail the method to estimate the geographic position of the UAV using road network alignment with a georeferenced map. We first formulate the problem and introduce the coordinate system used. Then, the relation between the geographic position $\mathbf{t}$ and attitude $\mathbf{R}$ of the UAV and the estimated road registration result, usually expressed as a homography transformation $\mathbf{H}$, is derived, from which the initial optimal solution of $\mathbf{R}$ and $\mathbf{t}$ is computed. Next, a pose refining method that aligns all observed roads to the georeferenced roads is applied to further improve the pose estimation accuracy. Finally, a continuous real-time geolocalization system using road network alignment is designed and presented based on the proposed camera geographic pose estimation algorithm. The detailed algorithm is described as follows.
2.1. Problem Formulation
The problem of estimating the geographic position of a UAV after road network alignment can be described as follows: given the georeferenced road map of a certain area with a known geographic boundary, whose ground sample distance (GSD) is $S$, and given the camera intrinsic parameter matrix $\mathbf{K}$, can the geographic pose $\{\mathbf{R}, \mathbf{t}\}$ of the UAV be recovered once the transformation between the observed aerial image and the reference road map has been estimated?
To address the issue, we introduce the east-north-up (ENU) coordinate system (shown in Figure 1), of which the coordinate origin is the southwest corner of the reference road map area, the x axis points to the east, the y axis points to the north, and the z axis is up. In practical applications, the position in a predefined ENU coordinate system is usually used, so in this paper we mainly focus on estimating the geographic position of the UAV in the ENU coordinate system.
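For instance, to express a GPS fix (e.g., a ground-truth position) in this map-anchored ENU frame, geodetic coordinates can be converted with an off-the-shelf routine. The sketch below uses the pymap3d package; the southwest-corner coordinates are hypothetical placeholders.

```python
import pymap3d as pm

# Southwest corner of the reference road map area (illustrative values).
LAT0, LON0, H0 = 30.0, 114.0, 0.0

def geodetic_to_map_enu(lat, lon, h):
    """Express a geodetic position in the ENU frame anchored at the
    map's southwest corner (x east, y north, z up)."""
    return pm.geodetic2enu(lat, lon, h, LAT0, LON0, H0)

x, y, z = geodetic_to_map_enu(30.01, 114.01, 500.0)
```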
Let $\mathbf{P} = [x, y, z]^{T}$ be a road point expressed in the ENU coordinate system, let $\mathbf{p} = [u, v, 1]^{T}$ be the homogeneous coordinate of the corresponding point in the observed aerial image, and let $\mathbf{R}$ and $\mathbf{t}$ be the rotation matrix and translation vector of the camera expressed in the ENU coordinate system. The transformation between $\mathbf{P}$ and $\mathbf{p}$ can be computed as

$$s\mathbf{p} = \mathbf{K}\left[\mathbf{R} \mid \mathbf{t}\right]\begin{bmatrix}\mathbf{P}\\1\end{bmatrix} \tag{1}$$

[28], where $s$ is the scale factor with which the third dimension of the vector is normalized to 1. Writing $\mathbf{R}$ as $\left[\mathbf{r}_{1}, \mathbf{r}_{2}, \mathbf{r}_{3}\right]$, we can obtain $s\mathbf{p} = \mathbf{K}\left(x\mathbf{r}_{1} + y\mathbf{r}_{2} + z\mathbf{r}_{3} + \mathbf{t}\right)$. Since we suppose the local roads lie on the same plane, the $z$ of $\mathbf{P}$ is always equal to 0 in the defined ENU coordinate system. Then, it can be deduced that $s\mathbf{p} = \mathbf{K}\left[\mathbf{r}_{1}, \mathbf{r}_{2}, \mathbf{t}\right]\mathbf{p}_{g}$. Writing

$$\mathbf{H} = \mathbf{K}\left[\mathbf{r}_{1}, \mathbf{r}_{2}, \mathbf{t}\right], \tag{2}$$

we can obtain $s\mathbf{p} = \mathbf{H}\mathbf{p}_{g}$, where $\mathbf{p}_{g} = [x, y, 1]^{T}$ is the corresponding homogeneous coordinate of the projection point of $\mathbf{P}$ in the $xy$ plane. We can see that the transformation between the aerial image and the $xy$ plane of the defined ENU coordinate system can be expressed as a simple homography transformation, which is determined only by the camera intrinsic parameter matrix $\mathbf{K}$ and the camera geographic pose $\{\mathbf{R}, \mathbf{t}\}$.

Moreover, the transformation between the $xy$ plane and the reference road map image can be computed as

$$\mathbf{H}_{m} = \begin{bmatrix} 1/S & 0 & 0 \\ 0 & -1/S & h_{m} \\ 0 & 0 & 1 \end{bmatrix}, \tag{3}$$

where $h_{m}$ is the height of the reference road map image in pixels, which is fixed once the reference road map is given. So, the camera pose in the defined ENU coordinate system is determined once the homography transformation between the aerial image and the reference road map is estimated.
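To make the chain of transformations concrete, the following NumPy sketch builds the map transformation of Equation (3) and composes it with an estimated aerial-image-to-map alignment to recover the homography $\mathbf{H}$ of Equation (2). The function names and the input `H_am` (the road alignment result mapping aerial image pixels to map pixels) are our own, and the form of $\mathbf{H}_m$ assumes the map image origin is at its top-left corner with the $v$ axis pointing south.

```python
import numpy as np

def enu_to_map_homography(S, h_m):
    """Equation (3): map ENU xy-plane coordinates (meters) to road map
    pixels; S is the GSD (m/pixel) and h_m the map image height (pixels)."""
    return np.array([[1.0 / S,  0.0,      0.0],
                     [0.0,     -1.0 / S,  float(h_m)],
                     [0.0,      0.0,      1.0]])

def enu_to_image_homography(H_am, S, h_m):
    """Compose the estimated alignment H_am (aerial image -> map pixels)
    with Equation (3) to obtain H of Equation (2), which maps ENU xy-plane
    points into the aerial image: p ~ H_am^{-1} H_m p_g."""
    return np.linalg.inv(H_am) @ enu_to_map_homography(S, h_m)
```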
2.2. Estimating the Initial Solution of the Camera Geographic Pose from the Homography Matrix
In Section 2.1, we deduced the formulation relating the camera geographic pose to the homography transformation that projects points from the defined ENU coordinate system to the aerial image coordinate system. In this section, the algorithm to recover the camera pose $\{\mathbf{R}, \mathbf{t}\}$ in the ENU coordinate system is introduced in detail.
2.2.1. Estimating the Geographic Attitude
Multiplying both sides of Equation (1) by $\mathbf{K}^{-1}$ and recalling that $z = 0$ gives

$$s\mathbf{K}^{-1}\mathbf{p} = \left[\mathbf{r}_{1}, \mathbf{r}_{2}, \mathbf{t}\right]\mathbf{p}_{g}.$$

Since the homography $\hat{\mathbf{H}}$ estimated by road network alignment equals $\mathbf{H}$ in Equation (2) only up to an unknown scale factor $\lambda$, writing $\mathbf{M} = \left[\mathbf{m}_{1}, \mathbf{m}_{2}, \mathbf{m}_{3}\right] = \mathbf{K}^{-1}\hat{\mathbf{H}}$, we obtain

$$\left[\mathbf{r}_{1}, \mathbf{r}_{2}, \mathbf{t}\right] = \lambda\mathbf{M}. \tag{4}$$

Here, $\mathbf{R}_{12} = \left[\mathbf{r}_{1}, \mathbf{r}_{2}\right]$ is subject to $\mathbf{R}_{12}^{T}\mathbf{R}_{12} = \mathbf{I}_{2}$, where $\mathbf{I}_{2}$ is a two-dimensional identity matrix.

Since there exist errors in the estimation of $\hat{\mathbf{H}}$ and $\mathbf{K}$, $\mathbf{M}$ may not be fully compatible with any camera pose $\{\mathbf{R}, \mathbf{t}\}$ that determines the matrix $\left[\mathbf{r}_{1}, \mathbf{r}_{2}, \mathbf{t}\right]$. We face the task of determining the optimal solution of $\{\mathbf{R}, \mathbf{t}\}$ given $\mathbf{M}$. Here, we use the Frobenius norm to measure the difference between the optimal $\mathbf{R}_{12}$ and the observed matrix $\mathbf{M}_{12} = \left[\mathbf{m}_{1}, \mathbf{m}_{2}\right]$, and then solving Equation (4) is equal to minimizing the following cost function:

$$J = \left\|\mathbf{R}_{12} - \lambda\mathbf{M}_{12}\right\|_{F}^{2}, \quad \text{s.t.}\ \mathbf{R}_{12}^{T}\mathbf{R}_{12} = \mathbf{I}_{2}. \tag{5}$$

Expressing the Frobenius norm in Equation (5) with the trace of a matrix gives

$$J = \operatorname{tr}\left(\left(\mathbf{R}_{12} - \lambda\mathbf{M}_{12}\right)^{T}\left(\mathbf{R}_{12} - \lambda\mathbf{M}_{12}\right)\right) = 2 + \lambda^{2}\left\|\mathbf{M}_{12}\right\|_{F}^{2} - 2\lambda\operatorname{tr}\left(\mathbf{R}_{12}^{T}\mathbf{M}_{12}\right). \tag{6}$$

For any fixed $\lambda > 0$, the minimum of Equation (6) is obtained when $\operatorname{tr}\left(\mathbf{R}_{12}^{T}\mathbf{M}_{12}\right)$ reaches its maximum, and the corresponding minimum is $2 + \lambda^{2}\left\|\mathbf{M}_{12}\right\|_{F}^{2} - 2\lambda\max\operatorname{tr}\left(\mathbf{R}_{12}^{T}\mathbf{M}_{12}\right)$. So, minimizing Equation (6) is equal to maximizing $\operatorname{tr}\left(\mathbf{R}_{12}^{T}\mathbf{M}_{12}\right)$.

We write the SVD decomposition of $\mathbf{M}_{12}$ as $\mathbf{M}_{12} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{T}$, where $\mathbf{U}^{T}\mathbf{U} = \mathbf{I}_{3}$ and $\mathbf{V}^{T}\mathbf{V} = \mathbf{I}_{2}$, and $\mathbf{I}_{3}$ and $\mathbf{I}_{2}$ are $3 \times 3$ and $2 \times 2$ identity matrices, respectively; we then obtain

$$\operatorname{tr}\left(\mathbf{R}_{12}^{T}\mathbf{M}_{12}\right) = \operatorname{tr}\left(\mathbf{R}_{12}^{T}\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{T}\right) = \operatorname{tr}\left(\mathbf{V}^{T}\mathbf{R}_{12}^{T}\mathbf{U}\boldsymbol{\Sigma}\right). \tag{7}$$

Writing $\mathbf{C} = \mathbf{U}^{T}\mathbf{R}_{12}\mathbf{V} = \left[\mathbf{c}_{1}, \mathbf{c}_{2}\right]$ gives

$$\operatorname{tr}\left(\mathbf{R}_{12}^{T}\mathbf{M}_{12}\right) = \operatorname{tr}\left(\mathbf{C}^{T}\boldsymbol{\Sigma}\right) = \sigma_{1}c_{11} + \sigma_{2}c_{22}, \tag{8}$$

where $\mathbf{c}_{1}$ and $\mathbf{c}_{2}$ are the first and second columns of $\mathbf{C}$, respectively, and $\sigma_{1}$ and $\sigma_{2}$ are the singular values of $\mathbf{M}_{12}$.

Since $\left|c_{11}\right| \leq \left\|\mathbf{c}_{1}\right\| = 1$ and $\left|c_{22}\right| \leq \left\|\mathbf{c}_{2}\right\| = 1$, where $\mathbf{c}_{1}$ and $\mathbf{c}_{2}$ are unit vectors, we obtain

$$\operatorname{tr}\left(\mathbf{C}^{T}\boldsymbol{\Sigma}\right) = \sigma_{1}c_{11} + \sigma_{2}c_{22} \leq \sigma_{1} + \sigma_{2}. \tag{9}$$

The equal relation holds if and only if $\mathbf{C} = \mathbf{I}_{3 \times 2}$, where $\mathbf{I}_{3 \times 2}$ denotes the first two columns of the $3 \times 3$ identity matrix. So, Equation (8) reaches the maximum when

$$\mathbf{R}_{12}^{*} = \mathbf{U}\mathbf{I}_{3 \times 2}\mathbf{V}^{T}. \tag{10}$$

Finally, the initial solution of the optimal geographic attitude is

$$\mathbf{R}^{*} = \left[\mathbf{r}_{1}^{*}, \mathbf{r}_{2}^{*}, \mathbf{r}_{1}^{*} \times \mathbf{r}_{2}^{*}\right], \tag{11}$$

where $\mathbf{r}_{1}^{*}$ and $\mathbf{r}_{2}^{*}$ are the first and second column of $\mathbf{R}_{12}^{*}$, respectively.
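As a concrete illustration, the closed-form attitude of Equations (4)–(11) takes only a few lines of NumPy. This is a minimal sketch under our own naming: `H_hat` is the estimated homography mapping ENU $xy$-plane points into the aerial image (e.g., the output of the composition sketched in Section 2.1), and the sign normalization assumes the ground plane lies in front of the camera.

```python
import numpy as np

def initial_attitude(H_hat, K):
    """Initial geographic attitude R* from Equations (4)-(11)."""
    M = np.linalg.inv(K) @ H_hat           # M = K^{-1} H_hat (Equation (4), up to scale)
    if M[2, 2] < 0:                        # fix the homography sign ambiguity:
        M = -M                             # the plane origin must have positive depth
    M12 = M[:, :2]                         # M12 = [m1, m2]
    U, _, Vt = np.linalg.svd(M12)          # full SVD: M12 = U Sigma V^T
    I32 = np.eye(3)[:, :2]                 # first two columns of the 3x3 identity
    R12 = U @ I32 @ Vt                     # Equation (10): closest orthonormal columns
    r3 = np.cross(R12[:, 0], R12[:, 1])    # Equation (11): r3* = r1* x r2*
    return np.column_stack([R12, r3])
```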
2.2.2. Estimating the Geographic Position
Since Equation (6) reaches its minimum with respect to $\lambda$ when $\partial J / \partial\lambda = 2\lambda\left\|\mathbf{M}_{12}\right\|_{F}^{2} - 2\operatorname{tr}\left(\mathbf{R}_{12}^{T}\mathbf{M}_{12}\right) = 0$, we obtain the optimal

$$\lambda^{*} = \frac{\operatorname{tr}\left(\left(\mathbf{R}_{12}^{*}\right)^{T}\mathbf{M}_{12}\right)}{\left\|\mathbf{M}_{12}\right\|_{F}^{2}} = \frac{\sigma_{1} + \sigma_{2}}{\sigma_{1}^{2} + \sigma_{2}^{2}}. \tag{12}$$

Combining Equation (4) and Equation (12), we obtain the geographic position

$$\mathbf{t}^{*} = \lambda^{*}\mathbf{m}_{3}, \tag{13}$$

where $\mathbf{m}_{3}$ represents the third column of the matrix $\mathbf{M}$.
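Continuing the sketch above with the same assumptions and naming, the optimal scale of Equation (12) and the translation of Equation (13) follow directly from the singular values of $\mathbf{M}_{12}$:

```python
def initial_position(H_hat, K):
    """Initial geographic translation t* from Equations (12)-(13)."""
    M = np.linalg.inv(K) @ H_hat
    if M[2, 2] < 0:                        # same sign normalization as above
        M = -M
    s1, s2 = np.linalg.svd(M[:, :2], compute_uv=False)
    lam = (s1 + s2) / (s1**2 + s2**2)      # Equation (12)
    return lam * M[:, 2]                   # Equation (13): t* = lambda* m3
```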
2.3. Refining the Camera Geographic Pose
In Section 2.2, we have shown that the camera geographic pose can be computed from the homography transformation estimated using the road network alignment method. The accuracy of the estimated camera geographic pose is determined directly by the accuracy of the estimated homography transformation, in which estimation errors unavoidably exist. To improve the accuracy and robustness of the camera geographic pose estimation, we model the camera geographic pose estimation as a problem of minimizing the alignment error between the reference road map and the observed roads:

$$\{\mathbf{R}^{*}, \mathbf{t}^{*}\} = \arg\min_{\mathbf{R}, \mathbf{t}} \sum_{\mathbf{p}_{i} \in \mathcal{P}_{a}} \rho\left(\min_{\mathbf{q}_{j} \in \mathcal{P}_{m}}\left\|\pi\left(\mathbf{p}_{i}\right) - \mathbf{q}_{j}\right\|\right), \tag{14}$$

where $\mathcal{P}_{a}$ is the road point set in the aerial image from the onboard camera, and $\mathcal{P}_{m}$ is the road point set in the reference road map. $\rho(\cdot)$ is the Huber loss function used to make the optimization robust to outliers. $\pi(\cdot)$ is the function that projects a point $\mathbf{p}$ in the aerial image to the reference road map, which can be computed using the camera model as follows:

$$\pi\left(\mathbf{p}\right) = \begin{bmatrix} \left(\mathbf{G}\mathbf{p}\right)_{1} / \left(\mathbf{G}\mathbf{p}\right)_{3} \\ \left(\mathbf{G}\mathbf{p}\right)_{2} / \left(\mathbf{G}\mathbf{p}\right)_{3} \end{bmatrix}, \quad \mathbf{G} = \mathbf{H}_{m}\left(\mathbf{K}\left[\mathbf{r}_{1}, \mathbf{r}_{2}, \mathbf{t}\right]\right)^{-1}, \tag{15}$$

where $\left(\mathbf{G}\mathbf{p}\right)_{i}$ refers to the $i$th row of the vector $\mathbf{G}\mathbf{p}$.
In Equation (14), the alignment error of a road point $\mathbf{p}_{i}$ in the aerial image is measured using the distance between its projection point and the road point in the reference road map nearest to that projection point. Since this kind of metric is nondifferentiable, it is difficult to solve Equation (14) directly. Fortunately, when the reference road map is given, the minimum distance to a road point at any position is determined and can be computed in advance with the distance transform algorithm [29]. Writing the Voronoi image computed by the distance transform algorithm as $\mathbf{D}_{V}$, we obtain the simplified formulation:

$$\{\mathbf{R}^{*}, \mathbf{t}^{*}\} = \arg\min_{\mathbf{R}, \mathbf{t}} \sum_{\mathbf{p}_{i} \in \mathcal{P}_{a}} \rho\left(\mathbf{D}_{V}\left(\pi\left(\mathbf{p}_{i}\right)\right)\right). \tag{16}$$

Equation (16) can be solved efficiently with the Levenberg-Marquardt algorithm, using the solution deduced in Section 2.2 as the initial value.
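A compact SciPy sketch of this refinement is given below, under stated assumptions: all parameter names are our own; SciPy's Levenberg-Marquardt mode does not support robust losses, so the Huber kernel of Equation (14) is applied through the default trust-region solver; and the distance image is sampled at the nearest pixel, whereas a practical implementation would interpolate it bilinearly for smoother gradients.

```python
import numpy as np
from scipy import ndimage, optimize
from scipy.spatial.transform import Rotation

def refine_pose(R0, t0, road_pts, K, H_m, road_mask):
    """Sketch of Equation (16): refine {R, t} so that projected aerial road
    points fall onto reference roads. road_pts: Nx3 homogeneous road pixels
    detected in the aerial image; road_mask: binary road map (1 on roads)."""
    # Distance image D_V: distance from each pixel to the nearest road pixel,
    # precomputed once with the distance transform [29].
    D_V = ndimage.distance_transform_edt(1 - road_mask)

    def residuals(x):
        R = Rotation.from_rotvec(x[:3]).as_matrix()
        G = H_m @ np.linalg.inv(K @ np.column_stack([R[:, :2], x[3:]]))
        q = (G @ road_pts.T).T                   # Equation (15): project to the map
        q = q[:, :2] / q[:, 2:3]
        r = np.clip(np.round(q[:, 1]).astype(int), 0, D_V.shape[0] - 1)
        c = np.clip(np.round(q[:, 0]).astype(int), 0, D_V.shape[1] - 1)
        return D_V[r, c]                         # Equation (16) residuals

    x0 = np.concatenate([Rotation.from_matrix(R0).as_rotvec(), t0])
    sol = optimize.least_squares(residuals, x0, loss='huber', f_scale=2.0)
    return Rotation.from_rotvec(sol.x[:3]).as_matrix(), sol.x[3:]
```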
2.4. Road Network Alignment-Based Real-Time Geolocalization Method
As demonstrated in Section 2.1, the geographic position of the camera can be computed once the aerial image is aligned to a georeferenced road map and the homography transformation between them is estimated. For practical applications, the estimation of the camera geographic position must run in real time, which cannot be achieved using the road network alignment-based method alone, since road network alignment is time-consuming. However, as shown in our previous work [30], real-time alignment to the georeferenced road map is possible when it is combined with the relative transformation estimation computed from ORB feature [31] matching between adjacent frames. Thus, we design a real-time geolocalization pipeline for UAVs by combining the relative transformation estimation from adjacent frames and geographic alignment to a given georeferenced road map.
As shown in Figure 2, the proposed road network alignment-based geolocalization method includes two threads: the geographic position estimation thread and the road network alignment thread.
Geographic position estimation thread: We use the method proposed in our previous work [30] to estimate the relative homography transformation to a local reference keyframe. Different from the method in [30], we stitch the RGB images into a mosaic image, instead of detecting and stitching roads in each frame, to achieve faster estimation. Thus, we can estimate the transformation $\mathbf{H}_{cr}$ between the current frame and the local reference frame (usually the first frame sent to the thread) and expand the mosaic of the scene in real time. With the geographic alignment result $\mathbf{H}_{rm}$ from the road network alignment thread, the homography transformation between the current frame and the georeferenced road map can be computed as $\mathbf{H}_{cm} = \mathbf{H}_{rm}\mathbf{H}_{cr}$. The geographic position of the camera is then estimated using the method proposed in Section 2.2 and refined using the method in Section 2.3. Once the current frame moves too far from its reference keyframe, a new keyframe is created, and the old keyframe is sent to the road network alignment thread.
Road network alignment thread: Upon receiving a keyframe from the geographic position estimation thread, road detection is conducted on the RGB image mosaic of the keyframe. For the first keyframe, a global road network feature search is performed; for later keyframes, the homography transformation is refined starting from the initial geographic alignment estimate provided by the geographic position estimation thread; both methods were proposed in our previous work [20]. Thus, the optimized transformation $\mathbf{H}_{km}^{*}$ between the keyframe and the georeferenced road map can be obtained, and it is then used to update the transformation between the local reference frame and the georeferenced road map using $\mathbf{H}_{rm} = \mathbf{H}_{km}^{*}\mathbf{H}_{rk}$, where $\mathbf{H}_{rk}$ is the relative transformation from the local reference frame to the keyframe.
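The bookkeeping that ties the two threads together reduces to a few homography compositions. The following minimal sketch (threading, mosaicking, road detection, and the keyframe policy omitted; all names are our own) shows how $\mathbf{H}_{cm} = \mathbf{H}_{rm}\mathbf{H}_{cr}$ is maintained as frames and optimized keyframes arrive.

```python
import numpy as np

class RoadGeoLocalizer:
    """Homography bookkeeping for the two threads of Figure 2 (sketch)."""

    def __init__(self, H_rm):
        # H_rm: local reference keyframe -> georeferenced road map,
        # initialized by the global road network feature search [20].
        self.H_rm = H_rm

    def on_frame(self, H_cr):
        """Geographic position estimation thread: H_cr maps the current
        frame to the local reference keyframe (e.g., from ORB matching [31]).
        Returns H_cm, from which the pose is estimated (Sections 2.2, 2.3)."""
        return self.H_rm @ H_cr

    def on_keyframe_aligned(self, H_km_opt, H_rk):
        """Road network alignment thread: H_km_opt is the optimized
        keyframe-to-map transformation; H_rk maps the local reference
        frame to that keyframe. Refreshes the reference-to-map alignment."""
        self.H_rm = H_km_opt @ H_rk
```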
4. Discussion
Results from experiments conducted on both synthetic and real-flight datasets show that the proposed method accurately estimates the geographic position of the UAV, irrespective of whether the camera's optical axis is perpendicular to the ground.
The results on the synthetic aerial image dataset shown in Figure 3 demonstrate that the maximum values and medians of the errors of the geographic positions estimated using the proposed initial solution were less than 10 m and 5 m, respectively, and that they were reduced to 4 m and 2 m, respectively, after the proposed pose refining procedure, under all poses. This suggests that the proposed geographic position estimation method can estimate accurate geographic positions and that the proposed pose refining algorithm is effective in reducing the positioning error. The position errors estimated using the projection point increased rapidly with the pitch and reached tens of meters even at small pitch angles, which means that method works only when the camera is nadir. The positioning accuracy after the pose refining procedure improved slightly with increasing pitch, which mainly benefits from the larger visual field under a larger pitch: in such cases, more roads are observed and provide more constraints for the pose optimization.
In the analysis of the three real-flight aerial image sequences, notable reductions in position estimation errors were observed at specific time points, such as 48.0 s and 77.5 s, as illustrated in Figure 5A; these were mainly due to successful road mosaic georeferencing. Since the image mosaicking algorithm computes the homography transformations of the image sequence in a recursive manner, error accumulation existed in the estimated homography transformations. In other words, the accuracy of the estimated homography transformation improved the closer a frame was to its corresponding georeferenced keyframe. This phenomenon led to the abrupt decreases in the positioning errors estimated with the projection point and indicates that estimating the position with the projection point is sensitive to errors in the computed homography transformation. Even though the proposed initial solution was also sensitive to homography estimation errors, the error could be reduced effectively by the pose refining procedure in most cases, thus making our complete position estimation algorithm robust to homography transformation estimation noise.
There existed differences in the geographic position estimation accuracy among the three flights. The differences may mainly come from two aspects: the flight height and the density of roads in the scene. Generally speaking, more observed roads provide more constraints when estimating the homography transformation and refining the camera pose, and when the roads of the scene are denser, more roads may be captured by the camera. As shown in Figure 4, among the three flights, the roads of the scene in flight A were much denser than those in flights B and C, resulting in the most accurate geographic position estimation. Also, a relatively high flight altitude may improve the estimation accuracy, because the field of view is larger and more roads may be captured by the camera. Nevertheless, aerial images captured at higher flight altitudes have a larger GSD, i.e., a lower spatial resolution, which tends to marginally decrease the accuracy of the geographic position estimation.