1. Introduction
3D reconstruction of urban scenes provides a fundamental data source for many smart city research topics and applications [1,2,3]. In recent years, airborne oblique photogrammetry has become one of the mainstream solutions for reconstructing photorealistic 3D models of urban scenes due to its high cost-effectiveness, high fidelity, and the high accessibility of professional equipment [4,5,6]. Although airborne oblique photogrammetry is widely adopted for 3D modeling at the city scale, the bottom parts of reconstructed models are often unsatisfactory due to the occlusion of ground objects and the large perspective distortion of aerial imagery, especially in complex urban scenarios. With the development of data acquisition techniques, the integration of aerial and terrestrial imagery is widely used for generating better 3D models in terms of completeness, accuracy, and fidelity [7,8,9].
One major challenge of terrestrial–aerial integrated 3D reconstruction is cross-platform image matching [10,11]. Establishing tie points between images with large viewpoint and illumination variations is difficult for SIFT-like image matching methods [12]. Recent learning-based image matching methods can extract more distinctive features by using deep neural networks [13,14], and these learned features exhibit better performance on benchmark datasets. Although a few methodologies based on handcrafted and learning-based image matching algorithms have been proposed to improve the robustness of cross-platform image matching [15,16], the problem has not been fully resolved.
Another challenge is the accurate fusion of terrestrial and aerial models. Most studies reconstructed terrestrial and aerial models separately and merged them via a 3D similarity transformation, which is usually estimated using 3D correspondences derived from cross-platform tie points. However, previous studies have shown that the epipolar constraint cannot remove all mismatches, and the remaining cross-platform mismatches introduce inaccurate observations into the estimation of the similarity transformation and the global optimization of the integrated model. Although outlier detection methods have been proposed to filter the remaining mismatches and incorrect 3D correspondences, the accuracy of model fusion can be further improved. Terrestrial and aerial models can also be merged based on point cloud registration [17,18]. However, these methods require accurate extraction and robust matching of common geometric features in the two point clouds, and establishing accurate correspondences between cross-platform point clouds is still challenging.
This paper presents a novel approach for integrated 3D reconstruction of urban scenes using aerial and terrestrial imagery. The main contributions of this paper are as follows. First, a robust image matching method is proposed to tackle the cross-platform image matching problem. Incremental Structure from Motion (SfM) with weighted Global Navigation Satellite System (GNSS) observations is used to reconstruct georeferenced terrestrial and aerial sparse models, respectively. Based on the sparse models, cross-platform match pairs are selected by projecting terrestrial points to the aerial images. Instead of matching rectified images or renderings, terrestrial and aerial images are directly matched based on the selected match pairs to generate cross-platform tie points. Second, an outlier detection algorithm is proposed to refine 3D correspondences between terrestrial and aerial models. The proposed algorithm is derived from the positioning uncertainty of photogrammetric reconstruction. The cross-platform tie points are robustly triangulated based on the terrestrial and aerial sparse models. Then, outliers are removed from the correspondences based on the statistics of positional differences. The similarity transformation from the terrestrial sparse model to the aerial sparse model is estimated based on the refined 3D correspondences. After merging the transformed terrestrial model with the aerial model, the integrated model is globally optimized.
The remainder of this paper is organized as follows. Section 2 reviews related research works. Section 3 elaborates on the proposed methodology, including the reconstruction of terrestrial and aerial sparse models, the robust matching of terrestrial and aerial images, and the accurate fusion of terrestrial and aerial sparse models. Section 4 presents experimental results on five benchmark datasets, where the performance of the proposed methodology is demonstrated by comparative experiments and ablation studies. Section 5 discusses the experimental results and the limitations of the proposed methodology. Finally, conclusions are drawn in Section 6.
2. Related Works
The core components of an image-based 3D reconstruction pipeline include image matching, image orientation, dense matching, and textured mesh construction. A traditional image matching procedure extracts feature points from images and finds initial matches between image pairs [19,20]. After mismatches are filtered based on the epipolar constraint within the random sample consensus (RANSAC) framework, tie points are established [21,22]. Then, the image orientation procedure estimates the optimal extrinsic parameters, intrinsic parameters (including camera calibration parameters), and the sparse structure of a scene based on the tie points. The Structure from Motion (SfM) framework is the de facto standard for fully automatic image orientation [23,24,25,26]. The dense matching procedure establishes pixel-wise correspondences between matched images and generates dense depth maps, from which the dense point cloud of the scene can be derived [27,28,29]. Based on the dense point cloud, the textured mesh construction procedure first builds the geometric model of the scene as a 3D triangular mesh and then textures the mesh with images [30].
Robust cross-platform image matching is generally required when applying the above pipeline to terrestrial–aerial integrated reconstruction in complex urban scenarios. It is well known that SIFT is sensitive to viewpoint changes larger than 50 degrees [12]. ASIFT improves the robustness of image matching by simulating all possible affine distortions and matching the simulated images using SIFT [31]. An image matching approach based on warping aerial images to the ground plane was proposed for matching nadir and oblique aerial images [32]. The method reduced the viewpoint difference between the nadir and oblique images, and the warped images were robustly matched using SIFT. A similar approach was proposed for matching terrestrial and aerial images toward the reconstruction of ancient Chinese architecture [15]. This approach first conducted terrestrial and aerial sparse reconstruction separately. Then, terrestrial–aerial image pairs were selected based on a co-visible mesh, and terrestrial images were warped to the perspectives of aerial images. The warped images were matched against the aerial images using SIFT, and the tie points were filtered and transferred back to the original terrestrial images. Similarly, a rendering-based approach was proposed for cross-platform image matching [9]. This method detected building facades from the dense point cloud derived from the aerial images and rectified each pair of images based on the detected facades. The method effectively increased the number of SIFT tie points between terrestrial and aerial images. A similar strategy was proposed to match terrestrial and aerial images based on rectifying images using a textured mesh [9]. This method rendered the textured mesh derived from the aerial images to the perspectives of the terrestrial images and matched the renderings with the terrestrial images using SIFT. The method exhibited high robustness for cross-platform image matching on five benchmark datasets. An approach based on refined image patches was proposed to match cross-platform images [33]. Sparse point clouds were derived from aerial and terrestrial images separately. Image patches were built based on the point clouds and optimized by variational patch refinement to lie close to the tangent plane of the object surface. The aerial and terrestrial image patches were then matched using SIFT. Although these methods improved the robustness of matching cross-platform images, their dependence on SIFT-like algorithms limits their capability in challenging urban scenarios.
In recent years, many learning-based methods have been proposed for robust image matching under challenging conditions [13,14,34]. These methods can extract more distinctive features using convolutional neural networks trained on benchmark datasets. Based on the Transformer architecture, tie points can even be established without a feature detector [35]. Learning-based methods have shown better adaptability to viewpoint and illumination variations than handcrafted methods on benchmark datasets [36], and they have been applied to matching cross-platform images. A learning-based framework was proposed for matching terrestrial and aerial images [37]. A dense correspondence network was trained to learn consistent features among terrestrial and aerial images and to generate dense correspondences. Then, sparse keypoints were extracted from each image, and tie points were established between each image pair based on the dense correspondences. The locations of the keypoints were further refined using the learned feature map to improve the quality of the tie points. A methodology based on the SuperGlue algorithm [38] was proposed for matching aerial, mobile mapping, and backpack images [8]. The method first generated a sparse point cloud from the aerial images. Then, the sparse point cloud was segmented, and facade planes were extracted from the segmented point cloud. Images acquired by different platforms were rectified onto the extracted facade planes, and the rectified images were matched using the SuperGlue algorithm. The method performed well on a challenging dataset. However, the point cloud segmentation results require manual checking and interactive improvements. Moreover, the SuperGlue algorithm extracts far fewer feature points from images with poor texture, and the unevenly distributed tie points require manual adjustments. Although learning-based image matching methods have shown promising performance, recent studies have demonstrated that they do not have obvious advantages over handcrafted ones in conventional 3D reconstruction tasks [39,40,41,42]. Directly applying these learning-based methods to match cross-platform images and reconstruct 3D models of complex urban scenes remains challenging.
Accurate fusion of terrestrial and aerial models is also required for high-quality terrestrial–aerial integrated reconstruction. Estimating an accurate similarity transformation between terrestrial and aerial models requires precise 3D correspondences, which are usually obtained by triangulating the cross-platform tie points. It is well known that outlier detection based on the epipolar constraint cannot eliminate all mismatches, and several methods have been proposed to remove the remaining ones. Mismatches were filtered using thresholds on the variations of scale and principal orientation of SIFT features [15]. An affine transformation model with RANSAC loops was also used for filtering mismatches. Then, terrestrial–aerial tracks were triangulated to obtain 3D correspondences, and a global bundle adjustment was performed to merge the terrestrial and aerial point clouds, in which the Huber loss was introduced to deal with false 3D correspondences. An outlier detection method based on geometric constraints was proposed to filter mismatches [9]. Length, intersection, and direction constraints defined on disparity vectors were used to remove outliers from initial matches, and the remaining mismatches were filtered using the epipolar constraint. The established tie points were further refined by matching local patches in the original terrestrial and aerial images: a normalized correlation coefficient search was used to find initial matches, and initial matches with a correlation score smaller than a threshold were pruned. A two-stage approach was proposed for outlier detection [33]. Outliers in initial matches were first filtered by cross-checking and saliency detection using the nearest neighbor distance ratio test. Then, a 3D similarity transformation between two sets of image patches was computed within the RANSAC framework to further remove outliers. The 3D similarity transformation was also used as an additional geometric constraint to limit the matching range of the image patches, which further improved the robustness of the approach. Geometric constraints were also proposed to filter mismatches in [8]. After mismatches were filtered using the epipolar constraint, a 3D point was calculated from each match pair. A match was considered an outlier if the corresponding 3D point was far from the facade plane or from the 3D points calculated from matches in other images. After outlier detection, tie points were linked to build tracks, and tie points with short track lengths were further removed. The advantage of these outlier detection methods is that the constraints have clear geometric meanings and are easy to understand. However, setting threshold values for these constraints requires practical experience, which can be challenging for complex datasets.
In summary, recent studies have shown the promising performance of learning-based image matching methods on benchmark datasets. However, the capabilities of these methods have not been effectively incorporated into the reconstruction pipeline. Furthermore, most outlier detection methods filter mismatches from the perspective of image matching; the positioning uncertainty inherent in the terrestrial–aerial integrated reconstruction problem has not been fully exploited. Innovative methods need to be developed to robustly match cross-platform images and achieve accurate integrated reconstruction.
3. Methodology
3.1. Overview of Proposed Methodology
The workflow of the proposed methodology is illustrated in Figure 1. Firstly, image matching is performed separately on the terrestrial and aerial images of a scene, and georeferenced sparse models are reconstructed from the terrestrial and aerial images, respectively. Then, match pair selection is performed to determine match pairs between cross-platform images. Based on the selected match pairs, robust image matching is conducted to generate tie points that connect terrestrial and aerial images. The cross-platform tie points are then triangulated to derive 3D points from the terrestrial and aerial sparse models, respectively. Correspondences between the terrestrial and aerial 3D points are determined, and outliers are filtered. Finally, the terrestrial and aerial sparse models are merged based on the refined correspondences, and the integrated model is globally optimized to obtain an accurate reconstruction of the scene.
3.2. Reconstruction of Terrestrial and Aerial Sparse Models
The georeferenced sparse models are reconstructed from the terrestrial and aerial images as follows. Image matching is first performed on the images. In the image matching process, RootSIFT [43] is used for feature point extraction and description. The feature points are matched using the approximate nearest neighbors (ANN) algorithm to determine initial matches. Then, the initial matches are verified based on the epipolar constraint within the RANSAC framework to generate geometrically consistent tie points. To speed up the matching process, match pairs are selected from the K nearest neighbors (KNN) of each image, and image matching is performed only on the selected match pairs.
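As an aside, RootSIFT is obtained from standard SIFT descriptors by L1-normalization followed by an element-wise square root. The following Python sketch illustrates the idea; it uses OpenCV purely for demonstration (the pipeline itself is built on OpenMVG), and the function name is hypothetical.

```python
# Minimal RootSIFT sketch (OpenCV used only for illustration).
import cv2
import numpy as np

def extract_rootsift(image_path):
    """Extract SIFT keypoints and convert descriptors to RootSIFT."""
    image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(image, None)
    if descriptors is None:
        return keypoints, None
    # L1-normalize each descriptor, then take the element-wise square root.
    descriptors /= (descriptors.sum(axis=1, keepdims=True) + 1e-7)
    descriptors = np.sqrt(descriptors)
    return keypoints, descriptors
```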
Based on the tie points, GNSS-aided incremental SfM is used to reconstruct the georeferenced terrestrial and aerial sparse models, in favor of its robustness and accuracy. The incremental SfM procedure first selects an image pair and reconstructs an initial stereo model. Then, it grows the model by adding new images and globally optimizing all parameters in a loop. The objective function for the global optimization is given by Equation (1):

$$E = \sum_{i}\sum_{j} v_{ij}\,\big\lVert \pi(C_j, X_i) - x_{ij} \big\rVert^{2} + \sum_{k} w_k\,\big\lVert \hat{p}_k - p_k \big\rVert^{2} \quad (1)$$

where $\pi(C_j, X_i)$ projects a 3D point $X_i$ onto an image $j$, $C_j$ represents the camera parameters of image $j$, $x_{ij}$ is an image observation, $\lVert\cdot\rVert$ denotes the L2-norm, and $v_{ij}$ is an indicator function: $v_{ij}$ equals 1 if $X_i$ is visible in image $j$ and 0 otherwise. $p_k$ is a position observation of image $k$, $\hat{p}_k$ is the estimate of the position, and $w_k$ is a weight calculated according to Equation (2):

$$w_k = \sigma_{img}^{2} \,/\, \sigma_{gnss}^{2} \quad (2)$$

where $\sigma_{img}$ is the accuracy of the image observations and $\sigma_{gnss}$ is the accuracy of the GNSS observations. After the GNSS-aided SfM, georeferenced terrestrial and aerial sparse models are obtained. In this study, a sparse model of a scene refers to the model reconstructed by the GNSS-aided SfM, which is composed of a sparse point cloud, the exterior and interior orientations of the images, and the camera calibration parameters. The sparse point cloud is derived from the geometrically consistent tie points. The exterior orientation of an image defines the position and rotation of the image in the object coordinate system. The interior orientation includes the focal length f and the offset of the principal point (cx, cy). Brown's radial distortion model with three parameters (k1, k2, and k3) is used for camera calibration.
3.3. Robust Matching of Terrestrial and Aerial Images
Based on the terrestrial and aerial sparse models, the terrestrial and aerial images are robustly matched as follows. Firstly, the normal vector of each 3D point from the terrestrial sparse model is estimated. Based on the estimated normal vectors, the observability of each terrestrial point in the aerial images is determined. Terrestrial–aerial match pairs are selected by projecting terrestrial points to the aerial images in which they are observable. Based on the selected match pairs, the terrestrial and aerial images are robustly matched.
The normal vector of a point in the terrestrial point cloud is estimated by averaging its normalized observation vectors, as illustrated in Figure 2. In this top-view illustration, an estimated 3D point P on the facade of a building is observable in images I1, I2, I3, and I4. To calculate the normal vector of P, the observation vectors PS1, PS2, PS3, and PS4 are first computed from the estimated position of P and the respective perspective centers. Assuming that the observation vectors are uniformly distributed in space, the normal vector N of P is approximated by normalizing and averaging the observation vectors.
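A minimal numpy sketch of this approximation is given below; the function name and argument layout are assumptions.

```python
import numpy as np

def estimate_normal(point, camera_centers):
    """Approximate the normal vector of a sparse point by normalizing and
    averaging its observation vectors (point -> perspective center)."""
    vectors = np.asarray(camera_centers) - point               # observation vectors PS_i
    vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)  # normalize each vector
    normal = vectors.mean(axis=0)                              # average direction
    return normal / np.linalg.norm(normal)                     # re-normalize
```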
To improve the robustness and efficiency of the integrated reconstruction pipeline, cross-platform image matching is conducted only on selected image pairs. Based on the estimated normal vectors, match pairs between cross-platform images are selected as follows. The procedure iterates through the terrestrial point cloud and calculates a virtual observation vector for each point and each aerial image, i.e., the vector from the point to the perspective center of the aerial image. An aerial image is considered a potential aerial image of a point if the angle between the normal vector of the point and the corresponding virtual observation vector is smaller than a given threshold VT. Each point is then projected onto its potential aerial images using the collinearity equations; if the projection falls within the valid area of an aerial image, the point is considered observable in that image. If a point is observable in both an aerial image and a terrestrial image, it is counted as a common point between the two images. If the number of common points between a terrestrial image and an aerial image is larger than a given threshold NT, the two images form a valid match pair. For each terrestrial image, all aerial images that form a match pair with it can thus be determined.
Based on the selected match pairs, an image matching scheme is proposed to match the cross-platform images. The relationship between the terrestrial and aerial images is illustrated in Figure 3, in which a point on the facade of a building is observable in three terrestrial images and three aerial images. Tj is a terrestrial image; Al, Am, and An are the selected aerial images that form match pairs with Tj; Ti and Tk are neighboring terrestrial images of Tj. In this study, the LoFTR algorithm [34] is used for matching the terrestrial and aerial images owing to its high robustness. It should be noted that image matching using LoFTR is directed, which means that the tie points generated by matching Ti against Tj differ from those generated by matching Tj against Ti. In this study, pairwise image matching is conducted as follows. Each terrestrial image Tj is matched against all of its selected aerial images and its K nearest terrestrial images. The initial matches are verified based on the epipolar constraint within the RANSAC framework to obtain geometrically consistent tie points. The complete procedure for robust cross-platform image matching, including the proposed match pair selection method and image matching scheme, is described by Algorithm 1, and a code sketch of the pairwise matching step follows the algorithm.
Algorithm 1 Robust cross-platform image matching
Input: terrestrial image set T, aerial image set A, terrestrial sparse point cloud P, threshold VT to constrain the angle between a normal vector and a virtual observation vector, minimum number NT of common points between a terrestrial image and an aerial image, K for searching the nearest neighbors of a terrestrial image
Output: a set of tie points G that records each group of tie points, including their positions in the observable images
Initialization: counter C, which records the number of common points between a terrestrial image and an aerial image
1: for each point p in P
2:   find the terrestrial images Tp in which p is observable
3:   calculate the observation vectors of p
4:   calculate the normal vector n of p by averaging the normalized observation vectors
5:   for each aerial image a in A
6:     calculate the virtual observation vector v of p
7:     if angle(n, v) < VT and the projection of p is within the valid area of a
8:       for each image t in Tp
9:         increment C[t, a]
10:      end for
11:    end if
12:  end for
13: end for
14: for each terrestrial image t in T
15:   for each aerial image a in A
16:     if C[t, a] > NT
17:       build a match pair (t, a)
18:     end if
19:   end for
20: end for
21: organize the match pairs into M, which stores the aerial images that form a match pair with each terrestrial image
22: for each terrestrial image t in T
23:   find the terrestrial images Nt that are the K nearest neighbors of t
24:   match t against the aerial images in M[t]
25:   match t against the terrestrial images in Nt
26:   add the tie points as a group to G
27: end for
28: return G
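To illustrate the pairwise matching step (lines 24 and 25 of Algorithm 1), the following sketch matches one image pair with the LoFTR model from the Kornia library; the confidence threshold and tensor preparation are assumptions rather than the settings used in this study.

```python
import torch
import kornia.feature as KF

# Load the LoFTR model pre-trained on MegaDepth ("outdoor" weights).
matcher = KF.LoFTR(pretrained="outdoor")
matcher.eval()

def match_pair(img0, img1, conf_threshold=0.5):
    """Match two grayscale images given as (1, 1, H, W) float tensors in [0, 1].
    Returns matched keypoints in both images and per-match confidences.
    Note that LoFTR matching is directed: swapping img0 and img1 yields
    different tie points, as discussed above."""
    with torch.no_grad():
        out = matcher({"image0": img0, "image1": img1})
    keep = out["confidence"] > conf_threshold  # assumed confidence threshold
    return out["keypoints0"][keep], out["keypoints1"][keep], out["confidence"][keep]
```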
3.4. Accurate Fusion of Terrestrial and Aerial Sparse Models
The terrestrial and aerial sparse models are merged to generate an integrated model of the scene as follows. First, the cross-platform tie points are separated into terrestrial and aerial groups. Second, the two groups of tie points are triangulated based on the terrestrial and aerial sparse models, respectively. Third, correspondences between the two groups of triangulated 3D points are found, and outliers are filtered from the correspondences. Finally, the terrestrial and aerial sparse models are merged based on the refined correspondences, and the integrated model is globally optimized to further improve the accuracy of the reconstruction.
The terrestrial and aerial groups of tie points are robustly triangulated within the RANSAC framework, as illustrated in Figure 4. The triangulated 3D points are marked in black; they are triangulated from the tie point observations of points A, B, and C, where the observations of the same point are marked in the same color. As Figure 4 illustrates, the points triangulated from the aerial images do not coincide with those triangulated from the terrestrial images.
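For reference, each tie point can be triangulated by linear least squares; the following sketch shows the classic two-view DLT formulation (the pipeline itself triangulates multi-view tracks within a RANSAC loop, so this is only a simplified illustration).

```python
import numpy as np

def triangulate_dlt(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one tie point from two 3x4 projection
    matrices P1, P2 and pixel observations x1, x2."""
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)   # least-squares solution: last right singular vector
    X = Vt[-1]
    return X[:3] / X[3]           # dehomogenize
```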
To precisely merge the terrestrial and aerial sparse models, a 3D similarity transformation from the terrestrial model to the aerial model is estimated based on the 3D correspondences found among the triangulated terrestrial and aerial 3D points. To improve the accuracy and robustness of the model fusion process, outliers in the correspondences are detected based on the following derivation. Assume that $X_t$ and $X_a$ are a pair of 3D correspondences triangulated from the terrestrial and aerial sparse models, respectively, and that

$$X_t = \bar{X} + e_t, \qquad X_a = \bar{X} + e_a \quad (3,4)$$

where $\bar{X}$ is the true position of the object point corresponding to the correspondences, and $e_t$ and $e_a$ are the residual error vectors of the terrestrial and aerial sparse models, respectively. Assume that $e_t$ and $e_a$ follow the three-dimensional normal distributions

$$e_t \sim N(\mu_t, \Sigma_t), \qquad e_a \sim N(\mu_a, \Sigma_a) \quad (5,6)$$

where $(\mu_t, \Sigma_t)$ and $(\mu_a, \Sigma_a)$ define the positioning bias and accuracy of the terrestrial and aerial sparse models, respectively. Under these assumptions, the positional difference between $X_t$ and $X_a$ follows the three-dimensional normal distribution given by Equation (7):

$$d = X_t - X_a \sim N(\mu_t - \mu_a,\; \Sigma_t + \Sigma_a) \quad (7)$$
Based on the above derivation, statistics of the positional differences are used to detect and remove outliers from the correspondences. Specifically, the mean and standard deviation along the X, Y, and Z axes are calculated from the positional differences of all pairs of 3D correspondences. A pair of correspondences is considered an outlier if the positional difference along any axis falls outside the range of the respective mean value plus or minus $k$ times the standard deviation. After outliers are removed from the correspondences, a 3D similarity transformation is estimated. For the refined correspondences $\{X_t^i\}$ and $\{X_a^i\}$, the similarity transformation is estimated by minimizing the objective function given by Equation (8):

$$E(s, R, t) = \sum_{i} \rho\left( \big\lVert s R X_t^i + t - X_a^i \big\rVert \right) \quad (8)$$

where $R$ is the rotation matrix, $t$ is the translation vector, $s$ is the scaling factor, $\rho$ is the Huber loss function, and $\lVert\cdot\rVert$ denotes the L2-norm. The estimated similarity transformation is then applied to the terrestrial sparse model, and the transformed terrestrial model is locally optimized with the correspondences fixed to their positions in the aerial model. After merging the locally optimized terrestrial model with the aerial model, the integrated model is globally optimized by minimizing the objective function given by Equation (1). The complete procedure for robust sparse model fusion is described by Algorithm 2, followed by a code sketch of its core steps.
Algorithm 2 Robust sparse model fusion
Input: terrestrial sparse model Mt, aerial sparse model Ma, tie points G
Output: an integrated model M
1: separate G into terrestrial tie points Gt and aerial tie points Ga
2: initialize terrestrial point set Pt and aerial point set Pa
3: for each group of tie points g in Gt
4:   robustly triangulate g to obtain a 3D point X based on Mt
5:   add X to Pt
6: end for
7: for each group of tie points g in Ga
8:   robustly triangulate g to obtain a 3D point X based on Ma
9:   add X to Pa
10: end for
11: find 3D correspondences C between Pt and Pa
12: calculate the positional differences of the 3D correspondences and derive the mean and standard deviation values along the three axes
13: filter outliers in C based on the Three-Sigma Rule
14: estimate a 3D similarity transformation S based on the refined correspondences
15: transform Mt to Mt' based on S, and locally optimize Mt'
16: merge Mt' and Ma into the integrated model M
17: globally optimize M
18: return M
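A compact sketch of steps 12 to 14 of Algorithm 2 is given below. For brevity, it estimates the similarity transformation with the closed-form Umeyama solution rather than the robust Huber-loss minimization of Equation (8); variable names are assumptions.

```python
import numpy as np

def filter_three_sigma(Xt, Xa, k=3.0):
    """Remove correspondences whose positional difference along any axis
    deviates more than k standard deviations from the mean (steps 12-13)."""
    d = Xt - Xa
    mean, std = d.mean(axis=0), d.std(axis=0)
    inliers = np.all(np.abs(d - mean) <= k * std, axis=1)
    return Xt[inliers], Xa[inliers]

def estimate_similarity(Xt, Xa):
    """Closed-form similarity transformation (s, R, t) mapping Xt to Xa
    (Umeyama, 1991); the paper instead minimizes Equation (8) with a
    Huber loss for robustness."""
    mt, ma = Xt.mean(axis=0), Xa.mean(axis=0)
    Ct, Ca = Xt - mt, Xa - ma
    H = Ca.T @ Ct / len(Xt)               # cross-covariance matrix
    U, S, Vt = np.linalg.svd(H)
    D = np.eye(3)
    if np.linalg.det(U @ Vt) < 0:         # enforce a proper rotation
        D[2, 2] = -1.0
    R = U @ D @ Vt
    s = np.trace(np.diag(S) @ D) / Ct.var(axis=0).sum()
    t = ma - s * R @ mt
    return s, R, t
```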
4. Experimental Results
The proposed methodology was evaluated using five publicly available benchmark datasets. Firstly, the specifications of the datasets are detailed. Secondly, the experimental results of sparse model reconstruction, terrestrial–aerial image matching, and terrestrial–aerial sparse model fusion are presented. Finally, the proposed methodology is compared with a state-of-the-art methodology and three software packages. The proposed methodology was implemented based on the open-source software OpenMVG (version 1.6) [44]. The LoFTR model from the Kornia library (version 0.7.0) [45] was used for matching the terrestrial and aerial images; the model was pre-trained on the MegaDepth dataset [46]. The proposed algorithms were mainly implemented in the C++ programming language, while the scripts for data preprocessing and LoFTR-based image matching were implemented in Python. All of the experiments were performed on a Dell Precision 7530 mobile workstation equipped with the Windows 10 Professional operating system, an Intel i9-8950HK CPU (6 cores, 2.9 GHz), an NVIDIA Quadro P3200 GPU, and 32 GB of memory.
4.1. Specifications of Datasets
The datasets used for the experiments were downloaded from the website provided by the research team of Southwest Jiaotong University (SWJTU), China [9]. The Center and Zeche datasets were initially provided by the International Society for Photogrammetry and Remote Sensing (ISPRS) and EuroSDR (European Spatial Data Research) and were acquired within the ISPRS scientific initiative in 2014 and 2015. These datasets were collected around two buildings in Dortmund, Germany. The SWJTU-LIB, SWJTU-BLD, and SWJTU-RES datasets were acquired and provided by the SWJTU research team around a library building, a research building, and a residential building, respectively. The specifications of the datasets are listed in Table 1.
Each dataset is composed of several hundred aerial and terrestrial images. The aerial and terrestrial images of the ISPRS datasets were acquired using the same camera, whereas those of the SWJTU datasets were acquired using two different cameras. The image resolution of most cameras is 4000 by 6000 pixels. Image positioning observations, including latitude, longitude, and altitude defined under the World Geodetic System 1984 (WGS 1984), are provided in the ISPRS datasets as EXIF tags. Positioning observations under WGS 1984, as well as the Universal Transverse Mercator (UTM) coordinate system, are provided in the SWJTU datasets. In this study, the East–North–Up (ENU) coordinate system is used as the object coordinate system for processing the ISPRS datasets, and the UTM coordinate system is used for processing the SWJTU datasets. Sample images of the datasets are shown in Figure 5: the left column shows aerial images of the five scenes, and the right column shows terrestrial images of the same scenes.
4.2. Terrestrial and Aerial Sparse Reconstructions
For each dataset, a terrestrial sparse model and an aerial sparse model were reconstructed from the terrestrial and aerial images, respectively. To match images efficiently, each image was matched against its ten nearest neighbors. For the GNSS-aided incremental SfM, the accuracy of the aerial GNSS observations was set to 0.1 m for all datasets. The accuracy of the positioning observations of the terrestrial images of the Center, Zeche, SWJTU-LIB, SWJTU-BLD, and SWJTU-RES datasets was set to 10 m, 10 m, 0.1 m, 0.1 m, and 0.1 m, respectively, and the accuracy of the image observations was set to 1 pixel. The statistics of the sparse reconstructions are listed in Table 2. All terrestrial and aerial images were registered during the sparse reconstruction of each dataset, and hundreds of thousands of 3D points were reconstructed from each dataset. The root-mean-squared error (RMSE) column shows that all of the sparse reconstructions achieved subpixel accuracy. These results demonstrate the high robustness and accuracy of the sparse reconstructions.
4.3. Cross-Platform Image Matching
The parameters for match pair selection between the terrestrial and aerial images were set as follows. The angle threshold VT between the normal vector of a point and a virtual observation vector was set to 40 degrees, and the threshold NT for the validation of a terrestrial–aerial match pair was set to 300. The statistics of the selected terrestrial–aerial match pairs are listed in Table 3. The table shows that hundreds to thousands of match pairs were selected for each dataset. The maximum number of match pairs of a dataset corresponds to the largest number of aerial images that overlap with a single terrestrial image. Most of the minimum numbers are zero, indicating that at least one terrestrial image does not overlap with any aerial image. The average number of match pairs correlates with the overlap ratio between the terrestrial and aerial images of a dataset, and the standard deviation (STD) shows the variation of the selected match pairs within a dataset.
Each terrestrial image was matched against the aerial images that formed a match pair with it, as well as against its two neighboring terrestrial images. It should be noted that all images were subsampled to 600 by 900 pixels to keep the cross-platform image matching process efficient. The positions of the established tie points in the subsampled images were then transferred back to the original images, as sketched below.
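Transferring the tie points back only requires rescaling the keypoint coordinates by the subsampling factors along each axis; the image sizes in this sketch are assumptions based on the resolutions reported above.

```python
def rescale_keypoints(keypoints, sub_size=(900, 600), full_size=(6000, 4000)):
    """Map keypoints from the subsampled image back to the original image.
    sub_size and full_size are (width, height) tuples; keypoints are (x, y)."""
    sx = full_size[0] / sub_size[0]   # horizontal scale factor
    sy = full_size[1] / sub_size[1]   # vertical scale factor
    return [(x * sx, y * sy) for (x, y) in keypoints]
```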
Figure 6 shows the matching results for the terrestrial and aerial images shown in Figure 5. For each dataset, the figure in the left column shows the geometrically verified tie points in two images, connected by green lines; the figures in the right column are enlargements of the red rectangles in the left figure. The green circles in the figures on the right show the tie points, and the blue points in the figures on the left are outliers detected in the initial matches. As can be seen from Figure 6, most of the matches are visually correct, which demonstrates the robustness of the LoFTR model. It can also be seen that far fewer tie points were established between the cross-platform images of the SWJTU datasets than those of the ISPRS datasets, which indicates the larger viewpoint and scale variations of the SWJTU datasets.
The statistics of the terrestrial–aerial tie points are listed in Table 4. As can be seen from the table, tens to hundreds of tie points, on average, were established between each terrestrial–aerial image pair. The average number of tie points of the SWJTU datasets is much lower than that of the ISPRS datasets, which reflects the difficulty of the SWJTU datasets. The standard deviation shows the variation of the established tie points among the selected match pairs within a dataset.
4.4. Triangulation of Tie Points and Sparse Model Fusion
The cross-platform tie points were triangulated based on the terrestrial and aerial sparse models, respectively. The numbers of triangulated points and 3D correspondences for each dataset are listed in Table 5. Hundreds to tens of thousands of 3D points were triangulated for the datasets. The number of triangulated points in the ISPRS datasets is much higher than that in the SWJTU datasets, as many more tie points were established in the ISPRS datasets. It is also found that, for each dataset, the number of points triangulated from the aerial sparse model is higher than that from the terrestrial sparse model. This is mainly because each terrestrial image was matched against 4.97 to 15.40 aerial images on average (cf. Table 3) but against only two neighboring terrestrial images. The last column shows that hundreds of 3D correspondences were established in the SWJTU datasets; more 3D correspondences were obtained in the ISPRS datasets, as more tie points were established.
Figure 7 shows the triangulated points overlaid on the respective sparse models. The triangulated points are marked in green. The left column shows the 3D points triangulated from the terrestrial sparse model of each dataset, and the aerial triangulated points are shown on the right. The triangulated points are mainly located on the facades of the buildings, and the terrestrial triangulated points correspond well to the aerial triangulated points for each dataset. These figures also confirm the correctness of the terrestrial–aerial image matching results.
Before merging the reconstructed terrestrial and aerial sparse models, false correspondences were detected using the proposed outlier detection method. The statistics of the positional differences between the correspondences are listed in Table 6. The maximum and minimum values show the range of the positional differences along the three axes. The Mean values of a dataset reflect the positional biases between its terrestrial and aerial sparse models, and the STD values reflect the overall spatial proximity of the two models. The Mean values of the ISPRS datasets are larger than those of the SWJTU datasets, which indicates larger positional biases between the terrestrial and aerial sparse models of the ISPRS datasets.
To detect outliers in the 3D correspondences, the threshold k was set to 3. The numbers of outliers detected in the 3D correspondences of the Center, Zeche, SWJTU-LIB, SWJTU-BLD, and SWJTU-RES datasets are 38, 16, 4, 21, and 5, respectively.
Figure 8 shows the tie point observations of an outlier detected in the 3D correspondences of the SWJTU-RES dataset. Figure 8a,b show the tie point observations in two terrestrial images, and Figure 8c,d show the tie point observations in two aerial images. The tie point observations are labeled with red circles in the terrestrial images and red plus signs in the aerial images. The figures show that this outlier escaped detection during both pairwise image matching and robust triangulation. Outliers of this type reduce the accuracy of integrated reconstruction and can even corrupt the terrestrial–aerial model fusion process. Figure 9 illustrates the distributions of the positional differences of the SWJTU-RES dataset after the removal of the outliers. The red curves show the normal distributions fitted to the inlier samples. The data fit the model well along each axis, which justifies the assumptions of the proposed outlier detection method.
After the removal of the outliers from the correspondences, the terrestrial and aerial sparse models were merged and optimized for each dataset. The optimized sparse models were exported to Metashape for dense reconstruction and texture mapping to generate textured models. Figure 10 shows the reconstructed sparse models and textured models. The reconstructed sparse models are shown in the left column, the textured models reconstructed from the integrated sparse models are shown in the middle column, and the right column shows the textured models reconstructed using only aerial images. It can be seen that the terrestrial and aerial sparse models were merged well for all the datasets. The sparse models are visually correct, and no observable distortion is found in the reconstructed scenes.
The textured models reconstructed using both terrestrial and aerial images show more details of the buildings than those reconstructed using only aerial images. The structures of the building facades are more complete and accurate, as the terrestrial images provide observations of structures that are heavily occluded in the aerial images. These results demonstrate the effectiveness of the proposed methodology.
4.5. Comparison of Pipelines
The proposed pipeline was compared with a state-of-the-art methodology [9] and three software packages: Metashape, COLMAP [47], and OpenMVG [44]. The configurations of the software packages are listed in Table 7.
Metashape is a commercial software package widely used by the research community and the industry for the photogrammetric processing of aerial images. In this study, match pairs were selected based on both the positions and the visual similarity of the images. The accuracy of the positioning observations of the images of the Center, Zeche, SWJTU-LIB, SWJTU-BLD, and SWJTU-RES datasets was set to 10 m, 10 m, 0.1 m, 0.1 m, and 0.1 m, respectively. Metashape leverages a hierarchical SfM for sparse reconstruction; the accuracy was set to the highest level, adaptive camera model fitting was enabled, and the other parameters were kept at their default values.
COLMAP is an open-source software package widely used by the research community for image-based 3D reconstruction. In this study, match pairs were selected using a vocabulary tree; the vocabulary tree file with 32K visual words was downloaded from the official website of COLMAP. SIFT was used for feature point extraction, and all SIFT feature points extracted from the raw images were used for pairwise image matching. An incremental SfM strategy was used for sparse reconstruction, and the other parameters were kept at their default values.
OpenMVG is an open-source software package widely used by the research community for image-based sparse reconstruction. In this study, an exhaustive strategy was used for match pair selection. RootSIFT was used for feature point extraction, and all extracted feature points were used for pairwise image matching. An incremental SfM strategy was used for sparse reconstruction, and the other parameters were kept at their default values.
The terrestrial–aerial integrated reconstruction results are listed in Table 8. The proposed pipeline registered all images for all datasets. Zhu et al.'s method [9] failed to register some aerial images of the Zeche and SWJTU-RES datasets. COLMAP achieved a complete reconstruction of the SWJTU-LIB dataset; however, the estimated positions of all terrestrial images lie below the ground points reconstructed from the aerial images, so the terrestrial images were considered unregistered for this dataset. COLMAP also failed to register the terrestrial images of the SWJTU-BLD dataset. OpenMVG was unable to reconstruct a visually correct model for the SWJTU-BLD and SWJTU-RES datasets, and Metashape failed to register the aerial images of the SWJTU-RES dataset. These results demonstrate the robustness of the proposed pipeline.
The reported accuracy values show that the proposed pipeline consistently achieved the highest accuracy on all the datasets. OpenMVG also obtained high accuracy on the Center, Zeche, and SWJTU-LIB datasets, while the accuracy achieved by COLMAP was slightly lower than that of OpenMVG. Although Metashape exhibited high robustness, it achieved relatively low SfM accuracy on the datasets. No reprojection errors of the SfM reconstructions were reported in [9].
The proposed pipeline reconstructed more 3D points than COLMAP and Metashape. It reconstructed more than one million 3D points for the ISPRS datasets and hundreds of thousands of 3D points for the SWJTU datasets. OpenMVG reconstructed more 3D points than the proposed pipeline, as the exhaustive strategy was used by OpenMVG for match pair selection.
4.6. Ablation Studies
To demonstrate the effectiveness of the proposed image matching and outlier detection methods, the following two experiments were conducted. In the first experiment, the proposed match pair selection was used to generate the match pairs, the RootSIFT algorithm was used for matching all the match pairs, and outliers in the initial matches were removed using the epipolar constraint within the RANSAC framework. A GNSS-aided incremental SfM reconstruction was then conducted based on the tie points to build a sparse model for each dataset. In the second experiment, match pair selection and pairwise image matching were conducted using the proposed methods, and a sparse model was reconstructed for each dataset using GNSS-aided incremental SfM. The proposed outlier detection method for filtering 3D correspondences was not used in either experiment. Both experiments were implemented based on the OpenMVG framework, and the results are listed in Table 9.
The results of the first experiment on the Center, Zeche, and SWJTU-LIB datasets are almost the same as those obtained by OpenMVG in Table 8, which indicates that the proposed match pair selection method has little influence on the integrated reconstruction of relatively simple datasets. However, the first experiment achieved a better reconstruction of the SWJTU-BLD dataset than OpenMVG, which indicates that precisely selected match pairs can improve the robustness of integrated reconstruction on complex datasets. Nevertheless, the first experiment still failed to register any terrestrial images of the SWJTU-BLD dataset and could not reconstruct the SWJTU-RES dataset. In comparison, the second experiment achieved a complete and accurate reconstruction of the SWJTU-BLD dataset, which demonstrates that the proposed image matching method can improve the robustness of integrated reconstruction on complex datasets. The second experiment also achieved comparable results on the ISPRS datasets. However, its reconstruction failed on the SWJTU-LIB dataset in the same way as COLMAP: the estimated positions of all terrestrial images lie below the ground points reconstructed from the aerial images. The reconstruction also failed on the SWJTU-RES dataset, which demonstrates that outliers in the LoFTR tie points corrupt the incremental SfM reconstruction process. As shown in Table 8, a complete reconstruction of the SWJTU-RES dataset was achieved with the proposed outlier detection method, which demonstrates that this method can further improve the robustness of integrated reconstruction on complex datasets. Comparing the accuracy achieved by the proposed pipeline in Table 8 with that achieved by the second experiment in Table 9, the proposed pipeline achieved higher reconstruction accuracy on all the datasets, which demonstrates that the proposed outlier detection method can improve the accuracy of integrated reconstruction.
The experimental results are visualized in Figure 11. The left column shows the sparse models reconstructed in the first experiment, and the right column shows those reconstructed in the second experiment. Both experiments achieved geometrically consistent reconstructions on the ISPRS datasets. Figure 11e shows that the sparse models reconstructed from the SWJTU-RES dataset are distorted and that the terrestrial images are misaligned with the aerial images. Figure 10 and Figure 11 together demonstrate the robustness and accuracy of the proposed pipeline for integrated reconstruction.
5. Discussion
(1) Robustness
There are three factors that affect the robustness of the proposed pipeline. First, the reconstruction of terrestrial and aerial sparse models forms the basis of an integrated reconstruction. The experimental results demonstrate that high-quality terrestrial and aerial sparse models can be obtained using RootSIFT-based image matching and incremental SfM.
Second, the robustness of integrated reconstruction is affected by the precision of the match pairs. LoFTR is known to generate tie points even between non-overlapping images; in this case, outliers in the tie points will likely affect the robustness of sparse model fusion and integrated reconstruction. Match pair selection in the proposed methodology is affected by the normal vector approximation. The approximation of a normal vector N is based on the assumption that the observation vectors are uniformly distributed in space. Ideally, normal vectors should be estimated using a dense point cloud; however, the proposed method avoids using one, as dense matching would make the pipeline more time-consuming. Although the approximation may be biased, it can still be used for cross-platform match pair selection by relaxing the angle constraint VT. In addition, the threshold NT for the validation of a match pair also influences the match pair selection: when the point density of a terrestrial point cloud is low, this threshold should be lowered to increase the number of match pairs. The values of VT and NT were set empirically in this study. A quantitative analysis of the influence of these threshold values on the final results is beyond the scope of this manuscript and will be investigated in future work.
Third, the quality of 3D correspondences affects the robustness of integrated reconstruction. As shown by the experimental results, the epipolar constraint cannot remove all mismatches. The remaining outliers will affect the robustness of model fusion. The proposed methodology removes outliers in 3D correspondences, which improves the robustness of integrated reconstruction.
(2) Accuracy
The accuracy of integrated reconstruction is affected by the accuracy of the GNSS observations and the quality of the tie points and 3D correspondences. First, it was found during the experiments that the elevation accuracy of the terrestrial GNSS observations of the Zeche dataset is low; therefore, low weights were given to these observations during sparse reconstruction and global optimization. In comparison, the positioning accuracy of the aerial images is generally higher, as airborne GNSS observations are not disturbed by ground object occlusion or the multipath effect; therefore, higher weights were given to the aerial GNSS observations. Similarly, the proposed methodology merges the terrestrial sparse model into the aerial sparse model, rather than the other way around, in consideration of the higher accuracy of the aerial GNSS-aided SfM reconstruction. The experimental results demonstrate that the proposed pipeline works as expected when the accuracy of the GNSS observations is set properly.
Second, the quality of the tie points and 3D correspondences also influences the accuracy of the integrated reconstruction. The tie point observations of the proposed methodology are generated by image matching using SIFT and LoFTR. As mentioned above, high-quality tie points can be obtained using SIFT-based image matching during terrestrial and aerial sparse reconstruction. Although mismatches remain in the cross-platform tie points, the proposed outlier detection method removes outliers in the 3D correspondences to mitigate their influence, which improves the accuracy of the integrated reconstruction.
(3) Efficiency
The proposed pipeline is fully automatic. No human intervention or intermediate processing, such as cross-view rendering, is required, which makes the pipeline more streamlined for integrated 3D reconstruction using multi-source images. The current bottleneck of the proposed pipeline is cross-platform image matching, owing to the low efficiency of the LoFTR implementation compared with the CPU-parallel RootSIFT implementation from OpenMVG.