1. Introduction
The 3D point cloud reconstruction and mesh modeling of UAV large-scale scenes are key technologies in various domains such as environmental quality monitoring, geographic resource surveys, and urban planning [1]. Typically, these techniques capture a series of images to reconstruct the 3D geometric information of scanned scenes. Consequently, a significant challenge in modern large-scale scene modeling lies in efficiently utilizing the multi-view images acquired from UAV platforms to reconstruct three-dimensional models of large-scale scenes.
However, prevailing methods [2,3,4] for modeling extensive 3D maps from UAV aerial images often encounter challenges such as incomplete scene modeling and blurred surface geometry and textures. These issues are frequently attributed to limitations in existing visual 3D reconstruction methods and suboptimal UAV aerial image acquisition techniques.
Current mainstream methods for large-scale multi-view 3D reconstruction, such as structure from motion (SfM) [5,6,7,8], multi-view stereo (MVS) [9,10,11], and simultaneous localization and mapping (SLAM) [12,13], have been successfully integrated into various commercial software packages, producing satisfactory high-quality scene models. However, challenges arise when applying these methods to UAV video data, particularly in complex large-scale geographic environments with diverse terrain such as plateaus, plains, hills, and mountains. Various factors contribute to these challenges, including numerous blind spots in the field of view, unstable signal propagation, and thin air at high altitudes, which collectively limit the operational radii of UAVs. These limitations affect the redundancy and effectiveness of the UAVs' multi-view image collection. Even when elements of the scene, such as buildings, appear in multiple frames, certain aspects of the target or scene may be overlooked due to geographical interference or insufficient coverage in some images. This directly affects the accuracy of camera pose estimation and loop closure detection, reducing the completeness and accuracy of the subsequent 3D map modeling and leading to deformations in the modeling process.
On the other hand, large-scale aerial surveys are affected by climatic conditions and the material properties of target surfaces. The complexity of large-scale environments, with variations in lighting at different times, contributes to inconsistent data collection. Many previous methods rely on multi-view stereo (MVS) [14,15,16] for scene reconstruction, predicting dense depth maps for each frame and fusing them to create a comprehensive model. However, depth-based methods, while adept at accurately reconstructing local geometric shapes, struggle to improve depth accuracy because of inconsistencies between different viewpoints. Additionally, in complex scenes, numerous textureless, repetitive, or reflective objects (e.g., fields, water surfaces, glass walls) can directly lead to depth estimation errors or missing data, making effective fusion of these depth maps difficult.
Finally, existing reconstruction methods [17,18,19,20,21,22,23,24] frequently omit images unsuitable for geometric modeling to mitigate the influence of redundant frames and images with excessively wide or narrow baselines on reconstruction accuracy. However, this curation process may exclude images containing vital texture information, thereby diminishing the quality of subsequent modeling. In scenarios with relatively sparse imagery of building surfaces, particularly in areas lacking texture, this selective approach can substantially affect the accuracy of measurements (depth estimation), lighting consistency, the reconstruction of dense point clouds, and the realism of the geometric modeling. Ultimately, for areas posing reconstruction challenges or exhibiting modeling gaps, manual filling and repair by professionals may be needed to establish a complete and detailed geometric model.
In summary, existing professional large-scale indoor modeling [25,26,27] requires extensive scene division, block-level computation, and data merging, with heightened computational complexity and reduced fault tolerance. The entire workflow encounters numerous challenges, so operators must possess extensive experience in both indoor and outdoor operations to handle the various situations that may arise. For the majority of non-professionally captured UAV imagery, the accuracy of existing reconstruction methods is significantly compromised, and reconstruction may even fail. The entire data collection and reconstruction process is tailored to professional operators, rendering it complex, time-consuming, and costly. Consequently, extending building scene modeling with UAVs to ordinary users and achieving consumer-level applications remains challenging.
Recently, the advent of NeRF technology [28] has facilitated the modeling of non-professional UAV-captured data [29]. Through the calibration of UAV image parameters, precise implicit model reconstruction is attainable, enabling the reconstructed implicit model to generate high-definition images from specified viewpoints. Addressing unbounded scenes, NeRF++ [30] partitions the space into foreground and background regions, employing separate MLP models for each region. These models independently conduct ray rendering before their final combination. Furthermore, there are implicit modeling methods specifically designed for large scenes, such as CityNeRF [31], which utilizes multiscale data inputs, including LiDAR, to achieve large-scale modeling. In contrast, BlockNeRF [32] and Mega-NeRF [33] spatially decompose the scene, aligning with our approach in the implicit modeling stage. However, the spatial decomposition algorithms in these methods are relatively straightforward, resulting in mediocre performance.
In this paper, our objective is to develop a more flexible and straightforward large-scale UAV scene modeling method that strikes a balance between high accuracy and efficiency. This method harnesses the mutually beneficial aspects of implicit and explicit models at different stages of modeling, employing a novel neural radiance field to synthesize images with spatiotemporal coherence between frames. Subsequently, it performs dense depth estimation unit-by-unit using co-visible clusters and integrates the implicit scene model to recover missing spatial information, resulting in a more comprehensive and high-quality dense point cloud. In the modeling phase, it reverses the neural radiance field to recover keyframe poses that were initially omitted during reconstruction initialization. Furthermore, it integrates more texture images into the signed distance field (SDF) [34] to generate a more realistic and high-quality scene mesh model. The comparative results can be observed in Figure 1. The innovations in this paper are as follows:
(1) A novel implicit–explicit coupling framework is designed for improving UAV large-scale scene modeling performance. The dense reconstruction module (explicit) supervises the neural radiance field (implicit) for the precise implicit modeling of large-scale UAV scenes. Expanding on this foundation, the neural radiance field employs neural rendering to synthesize high-precision scene images. This compensates for occlusion and missing depth data caused by obstacles and environmental noise during the dense reconstruction process, thereby further enhancing the accuracy of the dense point cloud reconstruction. Ultimately, in the mesh modeling stage, the texture-rich images overlooked in the dense reconstruction are recovered by reversing the neural radiance field, thereby improving the details of the mesh modeling. The experiments show that our method achieves SOTA performance compared with related mainstream methods and commercial software.
(2) An implicit synthetic co-visibility-cluster-guided dense point cloud reconstruction is proposed. We integrate novel-view images rendered by a neural radiance field implicit model into the construction of traditional co-visibility clusters. This addresses the viewpoint occlusion and low overlap caused by flight paths and environmental factors. In contrast to traditional co-visibility clusters, implicitly synthesized co-visibility clusters leverage the NeRF model to render new perspective images, compensating for viewpoint gaps induced by flight trajectories and environmental conditions. Because the NeRF model learns the depth and illumination of the 3D scene from images, the implicitly synthesized co-visibility clusters are robust to variations in lighting, occlusions, and low-texture regions, which in turn guides the dense point cloud reconstruction module toward more robust reconstructions.
(3) An implicit texture image-pose-recovery-based high-accuracy mesh modeling is proposed. This method innovatively restores image poses that were not considered in the reconstruction process by reversing the neural radiance field. This decouples the mapping of image textures from the constraints of the 3D scene reconstruction. Through the implicit model, we accurately recover image poses that were excluded during the reconstruction phase, thereby incorporating more texture information into the mesh model. Specifically, the mesh model takes dense point clouds as input and undergoes surface mesh computation through Poisson reconstruction. Additionally, local optimization of the mesh is performed by integrating illumination models, achieving a more realistic and finely detailed 3D mesh modeling of large-scale UAV scenes.
2. Implicit–Explicit Coupling-Enhancement-Based 3D Reconstruction
This article presents an implicit–explicit coupling framework for UAV large-scale scene modeling. This framework eliminates the necessity for professional UAV trajectory planning and data preprocessing, enabling high-quality large-scale 3D reconstruction with non-professional aerial photography. The resultant model surfaces also exhibit finer textures. The overall framework is illustrated in
Figure 2.
Specifically, the proposed framework addresses the challenge of processing non-trajectory-planned and non-preprocessed UAV cityscape aerial images. The framework is meticulously structured into two distinct stages, each contributing to the overall efficacy of the reconstruction process.
In the initial stage, a cutting-edge multi-view 3D reconstruction method is strategically employed. This method serves the crucial purpose of rapidly estimating camera parameters and depth information within tightly knit clusters of co-visible images. In this way, the framework gains access to highly accurate poses and depth information, which play a pivotal role in supervising the neural radiance field. This supervision is paramount for establishing the implicit representation of the large-scale scene, ensuring the faithful reconstruction of the intricate details within the cityscape.
Furthermore, to enhance the completeness and accuracy of the reconstructed scene, a sophisticated neural radiance field (NeRF) model is brought into play. This well-trained NeRF model is specifically tailored to the characteristics of the large scene under consideration. Leveraging this model, the framework synthesizes images and corresponding dense depth data for regions where information may be missing or overlap is limited due to unplanned flight paths. This integration of the NeRF model not only addresses potential data gaps but also significantly contributes to the overall robustness of the reconstruction process.
In summary, the framework's comprehensive approach begins with the rapid acquisition of camera parameters and depth information, followed by the supervision of the neural radiance field for implicit scene representation. The incorporation of a specialized NeRF model further enhances the reconstruction accuracy by synthesizing data for areas with limited information, ultimately culminating in an explicit high-accuracy mesh model of the UAV cityscape aerial images.
Subsequently, the synthesized data from the NeRF model are utilized to optimize the co-visible relationships within the multi-view 3D dense reconstruction. This optimization specifically addresses feature mismatch problems arising from low overlap, occlusion, and environmental noise caused by unplanned flight paths and the lack of pose planning. Concurrently, depth information is employed to rectify point cloud data in textureless or repetitive-texture areas, generating a dense, high-precision point cloud model. Finally, the image and depth information are fused into a signed distance field (SDF), and the reversal of the neural radiance field is applied to recover the excluded near-range image poses. This precise texturing of building exteriors results in a high-quality 3D model of large-scale scenes. Each key step of the algorithm is elaborated in the subsequent sections.
2.1. Implicit Synthetic Co-Visibility-Cluster-Guided Dense 3D Reconstruction
In this section, we provide a detailed explanation of the UAV-based large-scale scene dense point cloud reconstruction module, which is guided by implicit synthetic co-visibility clusters. The module first utilizes a generic visual odometry method to rapidly and accurately compute the pose and depth information of the input images, which are used for the supervised training of the implicit model. Subsequently, the trained implicit model is employed to render synthesized new perspectives along with their corresponding dense depth data. This process optimizes the spatiotemporal co-visible clusters used for dense point cloud reconstruction. Finally, robust and accurate reconstruction results are obtained through the explicit dense point cloud reconstruction model.
2.1.1. Implicit Representation of UAV Large-Scale Scenes
In this work, we commence with an input sequence of UAV-captured frames of a large-scale scene, denoted as $\{I_i\}_{i=1}^{N}$, without prior trajectory planning or preprocessing. Our approach initially employs the visual odometry provided by a dense reconstruction module [35] to generate accurate depth and pose information. Similar to the RAFT [36] framework, we iteratively and globally optimize camera pose and depth information with finer updates to the optical flow field. This iterative process is particularly well suited to addressing issues such as drift in long trajectories and loop closures, which commonly arise in large-scale UAV scene reconstruction. Upon determining the 3D point positions (D) and camera parameters (T) that minimize the reprojection error, these data are employed as supervision for the subsequent training of the large-scale implicit model for UAV-based scene reconstruction. This approach effectively leverages the accurate depth and pose information obtained from visual odometry to generate supervised data for training the implicit model, a step that is crucial for achieving high-quality large-scale scene reconstructions with UAVs without prior trajectory planning and preprocessing.
In contrast to conventional large-scale implicit modeling, this work primarily focuses on non-professional data acquisition with UAVs. Unplanned flights introduce greater variation in pitch and roll angles, resulting in irregular alignment of non-professional UAV images. This irregularity mainly manifests as significant variations in heading and cross-track overlap, so that some parts of the scene may be missed entirely, image overlap may be sparse, and yaw angles may change significantly between adjacent frames. Moreover, the single multi-layer perceptron (MLP) used in traditional neural radiance field (NeRF) models has limited capacity. If a specific area of the scene is not covered by a substantial number of images yet contains complex details, the neural radiance field underfits, producing significantly blurred synthesized novel views. This is detrimental to the subsequent training of implicit models for large-scale scenes and negatively impacts modeling accuracy.
Based on the characteristics of UAV-captured data, it is observed that the images in the sequence are arranged in chronological order, representing the progression of time. Consequently, adjacent images along the time axis are likely to exhibit high spatial similarity. However, within short time intervals, the number of images may be insufficient to supervise the construction of implicit models. Regardless of whether the aerial photography is professional or non-professional, the nature of UAV flights often involves multiple repeated scans and loops in a given region. This implies that images taken at different times might depict the same scene. Recognizing this pattern, the approach adopted in this work is to divide the entire large-scale scene into regions based on the UAV’s trajectory. For each region, a spatiotemporal consistency co-visible cluster is constructed, ensuring that each region of the large scene has a sufficient number of images with high overlap. This method provides each region of the large scene with good training data for implicit modeling.
Specifically, the approach commences with the first frame $I_1$ of the aerial image sequence as the starting point. It then utilizes the LoFTR [37] method, which performs keypoint detection with full-image coverage and is effective in handling low-texture or repetitive-pattern areas, to conduct similarity matching with subsequent images. Any image frame whose similarity exceeds a predefined threshold $\tau_{min}$ (typically around 20%) is considered to form a spatial co-visible relationship. The similarity $S_{1,j}$ between the reference frame $I_1$ and a candidate frame $I_j$ is computed as
$$S_{1,j} = \frac{M_{1,j}}{N_1},$$
where $N_1$ represents the number of LoFTR features in image $I_1$, and $M_{1,j}$ represents the number of matched feature pairs between the two images.
As the similarity between $I_1$ and subsequent images gradually decreases, the last image whose similarity is still greater than or equal to $\tau_{min}$ is selected as the central image $I_{c_1}$ of the first co-visible cluster. Using $I_{c_1}$ as the reference, subsequent images are compared for scene overlap: images in the sequence are sampled based on their temporal relationship with $I_{c_1}$, and those with a similarity exceeding $\tau_{min}$ are added to the co-visible cluster sequence $C_1$. This completes the selection of the first co-visible cluster. Similarly, among the frames before and after $C_1$, we find the last image whose regional overlap with $I_{c_1}$ is still greater than $\tau_{min}$, and this image becomes the central image of the second co-visible cluster. This process is repeated until the last co-visible center still has a similarity greater than $\tau_{min}$ with the final image of the sequence. At this point, all co-visible clusters within the image sequence have been identified, and the subsequent neural radiance field training phase begins.
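For concreteness, the following Python sketch illustrates the greedy cluster-selection logic described above. It assumes a hypothetical `loftr_match(ref, cand)` wrapper that returns the number of LoFTR features in the reference image and the number of matched pairs; both the interface and the 20% threshold are illustrative assumptions rather than the authors' exact implementation.

```python
# Sketch: greedy co-visible cluster selection from a UAV frame sequence.
# `loftr_match(img_a, img_b)` is a hypothetical wrapper returning
# (num_features_in_a, num_matched_pairs); tau_min ~= 0.20 as in the text.

def similarity(loftr_match, ref, cand):
    n_ref, n_matched = loftr_match(ref, cand)
    return n_matched / max(n_ref, 1)

def build_covisible_clusters(frames, loftr_match, tau_min=0.20):
    clusters = []
    start = 0
    while start < len(frames):
        # Walk forward until similarity to the starting frame drops below
        # tau_min; the last frame above the threshold becomes the cluster center.
        center = start
        for j in range(start + 1, len(frames)):
            if similarity(loftr_match, frames[start], frames[j]) >= tau_min:
                center = j
            else:
                break
        # Collect all frames that still share tau_min overlap with the center.
        members = [i for i in range(start, len(frames))
                   if similarity(loftr_match, frames[center], frames[i]) >= tau_min]
        clusters.append({"center": center, "members": members})
        # Continue after the last member of this cluster.
        start = max(members) + 1
    return clusters
```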
The method for establishing an individual neural radiance field is as follows: First, obtain the depth map $D_i$ and camera parameters $T_i$ for each frame from DROID-SLAM. Next, use an autoencoder structure to encode the input image into a latent vector and decode this latent vector into the parameters of a multi-plane NeRF; these parameters include the color and density of each plane. Then, employ a renderer to generate the image for each frame from the multi-plane NeRF parameters and the camera pose. The renderer uses the same volumetric rendering formula as Instant-NGP [38], taking into account the color and density of each plane and the distance between the camera and the planes to compute the color of each pixel. Finally, the network is trained by minimizing the reprojection loss, i.e., the L1 norm between the rendered image and the input image:
$$\mathcal{L}_{rgb} = \left\| \hat{I}_i - I_i \right\|_1,$$
where $I_i$ is the input image and $\hat{I}_i$ is the rendered image.
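A minimal PyTorch sketch of this photometric supervision is given below; `render_frame` is a hypothetical stand-in for the multi-plane NeRF renderer, whose interface is an assumption.

```python
import torch

# Sketch: per-frame photometric supervision as an L1 penalty between the
# rendered frame and the captured frame. `render_frame` stands in for the
# multi-plane NeRF renderer (assumed interface).

def reprojection_loss(render_frame, params, pose, target_image: torch.Tensor):
    rendered = render_frame(params, pose)             # \hat{I}_i
    return torch.abs(rendered - target_image).mean()  # mean-reduced L1 norm
```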
To generate more accurate modeling results, the neural implicit representation must provide not only color images from new viewpoints but also the corresponding dense depth maps, so as to eliminate the holes that MVS methods leave in textureless surface regions. However, the output of the neural radiance field does not directly contain depth; it only includes the volume density (opacity) of the scene, which depends on the three-dimensional position of the queried point and is independent of the selected viewpoint. Clearly, opacity cannot simply be used as depth. It is therefore necessary to derive the correspondence between the density predicted by the neural radiance field and depth. Consider a ray cast from the camera center toward a point $x$ in space. To obtain the depth from the viewpoint to $x$, we sample along the ray direction. Let $x_i$ be the $i$th sampling point along the ray, located at distance $t_i$ from the ray origin, with density value $\sigma_i$, which describes the probability of the point emitting or reflecting rays. To calculate depth, we must determine where the ray terminates when it encounters an opaque object: in the neural radiance field, a ray hitting an opaque surface is completely absorbed or reflected and cannot continue to propagate, and the position it reaches is the depth. The depth along the ray is therefore the weighted sum of the sampling distances, where the weight of each sampling point (its probability of being the termination position) equals the probability of emitting or scattering rays at that point multiplied by the probability of the ray not having been absorbed or reflected before reaching it:
$$d = \sum_{i=1}^{N} T_i \left(1 - e^{-\sigma_i \delta_i}\right) t_i, \qquad T_i = \exp\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right),$$
where $\delta_i = t_{i+1} - t_i$ is the distance between the $(i+1)$th sampling point and the $i$th sampling point.
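This expected-depth computation mirrors the standard NeRF volume-rendering weights; a compact PyTorch sketch, assuming per-ray density samples `sigma` at distances `t`, is shown below.

```python
import torch

# Sketch: expected ray-termination depth from NeRF density samples, matching
# the weighted-sum formulation above. `sigma` and `t` have shape
# [num_rays, num_samples].

def render_depth(sigma: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    # delta_i = t_{i+1} - t_i; pad the last interval with a large value.
    delta = t[..., 1:] - t[..., :-1]
    delta = torch.cat([delta, torch.full_like(delta[..., :1], 1e10)], dim=-1)
    alpha = 1.0 - torch.exp(-sigma * delta)            # per-sample opacity
    # T_i: probability that the ray survives all samples before i.
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[..., :1]), 1.0 - alpha + 1e-10], dim=-1),
        dim=-1)[..., :-1]
    weights = trans * alpha                            # termination probability
    return (weights * t).sum(dim=-1)                   # expected depth per ray
```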
So far, we have described the large-scale NeRF modeling method for the original images. Next, we need to supplement the high-overlap scene co-visibility clusters based on the modeling results of the neural implicit model.
2.1.2. Multi-View Implicit Synthetic Co-Visibility Cluster Generation
We now have the foundation for synthesizing novel-view color and depth images. However, we need to determine which viewpoints to output as supplements. As previously mentioned, images whose overlap exceeds $\tau_{max}$ (usually set to 80% as an empirical value) are considered dense. If every image within a co-visibility cluster has an overlap greater than $\tau_{max}$ with its surrounding images, we can consider the entire cluster sufficiently dense and capable of providing good modeling results. To achieve this, we establish a simple geometric model. First, we select an image frame $I_k$ from the co-visibility cluster and identify the set $V_k$ of images within the same cluster that share a co-visibility relationship with it. We then find the images in $V_k$ whose feature overlap with $I_k$ is less than $\tau_{max}$. Under the earlier assumption that all images were captured on the same plane, generating a new viewpoint only requires determining its plane coordinates. In plane coordinates, when the center-to-center distance between two image rectangles is smaller than a certain value, their overlap is guaranteed to be greater than $\tau_{max}$. Through a simple set-based deduction, this bound on the center-to-center distance $d$ can be expressed in terms of $\tau_{max}$ and the image width $W$ and height $H$, such that whenever $d$ stays below the bound, the overlap between the two images is greater than $\tau_{max}$.
Therefore, when generating the plane coordinates of new viewpoints, we only need to compute the equation of the line connecting image frame $I_k$ and each low-overlap image in its co-visibility set $V_k$, as well as the Euclidean distance $d$ between their centers. We then sample new viewpoints along this line at regular intervals chosen so that the spacing between adjacent viewpoints satisfies the distance bound above. We apply this process to every image within the co-visibility cluster that has a shared co-visibility relationship. Combined with height and camera angle information, the resulting viewpoints are passed to the NeRF model to render the corresponding RGB images and depth maps. This process produces a highly dense set of RGBD aerial viewpoints for the subsequent reconstruction phase, ensuring the availability of data.
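The sketch below illustrates this viewpoint densification for one image pair, assuming planar capture and a precomputed center-distance bound `max_center_dist` (a stand-in for the bound derived from $\tau_{max}$, $W$, and $H$).

```python
import numpy as np

# Sketch: sample supplementary viewpoint centers between a reference image
# center and a low-overlap neighbor, assuming planar capture. The bound
# `max_center_dist` is assumed to be derived from tau_max and the image size.

def sample_new_viewpoints(center_ref, center_nbr, max_center_dist):
    center_ref = np.asarray(center_ref, dtype=float)
    center_nbr = np.asarray(center_nbr, dtype=float)
    d = np.linalg.norm(center_nbr - center_ref)
    if d <= max_center_dist:
        return []  # already dense enough, no synthesis needed
    # Number of extra viewpoints so that adjacent spacing <= max_center_dist.
    n_new = int(np.ceil(d / max_center_dist)) - 1
    direction = (center_nbr - center_ref) / d
    step = d / (n_new + 1)
    return [center_ref + direction * step * (i + 1) for i in range(n_new)]
```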
2.1.3. Dense Point Cloud Reconstruction of UAV Large-Scale Scenes
In our algorithm, each co-visibility cluster corresponds to a series of multi-view depth images. However, the original depth maps of the input images may contain missing values because they are predicted by the dense reconstruction [35] method, whereas the depth generated by NeRF is dense. To address this, we use the depth maps of the new viewpoints to fill in the missing regions of the original depth maps, ultimately producing a complete depth map for each image frame within a co-visibility cluster. Next, we perform depth optimization on all generated depth maps to enforce depth consistency before merging the information from the multi-view depth images. For each pixel $p$ in an image $I_i$, we use its depth $D_i(p)$ and the camera parameters to back-project it into three-dimensional space:
$$X = T_i^{-1}\left(D_i(p)\,K^{-1}p\right),$$
where $p$ is the pixel's homogeneous coordinate, $K$ is the camera intrinsic matrix, $T_i$ is the camera pose of $I_i$, and $X$ is the resulting three-dimensional point in the world coordinate system. We then project $X$ into the neighboring images $N(I_i)$ of $I_i$. Let $I_k$ be the $k$th neighboring image in the set of adjacent depth images $N(I_i)$, let $D_k(p_k)$ be the depth stored in $I_k$ at the projected pixel $p_k$, and let $\hat{D}_k$ be the reprojected depth value of $X$ in $I_k$, as described above. If $D_k(p_k)$ is close enough to $\hat{D}_k$, we determine that $X$ is consistent in both $I_i$ and $I_k$, that is,
$$\left| D_k(p_k) - \hat{D}_k \right| < \tau_d,$$
where $\tau_d$ is a threshold. If $X$ is consistent in at least two neighboring images in $N(I_i)$, it is considered a reliable scene point, and the depth value of the corresponding pixel $p$ in $I_i$ is retained; otherwise, it is removed. Finally, all the depth maps are back-projected into 3D and merged into a single complete point cloud.
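A simplified NumPy sketch of this multi-view consistency filter is shown below; the camera conventions (world-to-camera poses, pinhole intrinsics) and the interfaces are assumptions rather than the authors' exact code.

```python
import numpy as np

# Sketch of the multi-view depth consistency filter. K is the 3x3 intrinsic
# matrix, poses are 4x4 world-to-camera matrices, depth maps are HxW arrays.
# tau_d and the "at least two consistent neighbors" rule follow the text.

def backproject(p_xy, depth, K, T_wc):
    """Lift pixel (x, y) with its depth into world coordinates."""
    p_h = np.array([p_xy[0], p_xy[1], 1.0])
    X_cam = depth * (np.linalg.inv(K) @ p_h)
    X_world = np.linalg.inv(T_wc) @ np.append(X_cam, 1.0)
    return X_world[:3]

def is_consistent(p_xy, depth_i, K, T_i, neighbors, tau_d=0.05, min_support=2):
    """neighbors: list of (depth_map_k, T_k) for adjacent frames."""
    X = backproject(p_xy, depth_i, K, T_i)
    support = 0
    for depth_k, T_k in neighbors:
        X_cam_k = (T_k @ np.append(X, 1.0))[:3]
        if X_cam_k[2] <= 0:
            continue  # point lies behind the neighbor camera
        p_k = K @ X_cam_k
        u, v = int(round(p_k[0] / p_k[2])), int(round(p_k[1] / p_k[2]))
        h, w = depth_k.shape
        if not (0 <= u < w and 0 <= v < h):
            continue
        d_stored = depth_k[v, u]   # depth stored in the neighbor map
        d_reproj = X_cam_k[2]      # reprojected depth of X in the neighbor view
        if d_stored > 0 and abs(d_stored - d_reproj) < tau_d:
            support += 1
    return support >= min_support
```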
2.2. Detail-Implicit Texture Mapping Enhancement for High-Accuracy Mesh Modeling
This section focuses on enhancing building model surface texture and modeling based on implicit 3D pose guidance. It aims to address the challenges of achieving high-accuracy modeling and handling common tasks in scene modeling, such as segmentation and texturing.
2.2.1. Implicit Pose-Recovery-Based Texture Refinement
In order to achieve higher modeling accuracy, it is necessary to collect information at multiple heights during aerial data acquisition. In addition to capturing data from scenes at higher altitudes, it is common to include close-range information such as details of buildings. However, traditional reconstruction methods often discard images containing fine details, as they typically lack a sufficient number of co-observable scene images and are, thus, excluded during scene modeling. These images, containing rich texture information, have the potential to significantly improve the accuracy of scene modeling.
To effectively utilize this information, it is essential to remap the multi-view images containing potential texture details back into three-dimensional space. However, the images filtered out during the reconstruction process have lost their camera poses, making texture mapping challenging. Inspired by the implicit modeling of the entire UAV scene and drawing on iNeRF [39], we establish a differentiable relationship between the camera pose $T$ and the rendered image $I$ through an end-to-end implicit neural rendering method. Given a camera pose $T$, the rendered image can be represented as $I = F_{\Theta}(T)$, where $\Theta$ denotes the learnable parameters of the neural network. Thus, if the scene image whose pose is to be estimated is denoted as $I_t$, we need to solve for its true pose $T_t$. Based on the differentiable relationship between images and poses established by NeRF, the problem of the missing image pose can be addressed by optimizing an estimated pose $\hat{T}$ to approach $T_t$ using the following equation:
$$\hat{T} = \underset{T}{\arg\min}\ \left\| F_{\Theta}(T) - I_t \right\|_2^2.$$
In principle, iterating the optimization above gradually yields precise pose estimates. Nevertheless, due to the lack of depth supervision, directly using the above formula can easily lead to local minima in complex scenes, making it difficult to reach the optimal result. Moreover, NeRF relies on the pose initialization for its iterative fitting, resulting in slow convergence and susceptibility to error accumulation. To address this, we combine NeRF depth estimation with feature-point-based pose computation: by rapidly obtaining a better initial pose for NeRF, we improve both the accuracy and the efficiency of pose recovery. Initially, for the scene image $I_t$ and a rendered reference image with known pose, we extract feature points and compute the set of matched feature pairs. For each matched point in the reference image, NeRF provides the corresponding depth. Thus, combining the 2D matches with these depths, the pose estimation problem becomes a PnP problem, and the EPnP algorithm is employed to rapidly solve for an appropriate initial pose. Finally, starting from this more accurate initialization, we iteratively fit the rendered image to $I_t$, obtaining a higher-precision camera pose $\hat{T}$. The corresponding dense depth map is then generated through the neural radiance field, and following the reconstruction method in Section 2.1.3, the texture details of the image are projected onto the reconstructed scene surface.
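As an illustration, the EPnP initialization step can be sketched with OpenCV as follows. The way the 3D points are obtained (back-projecting matched keypoints of a pose-known reference view using NeRF-rendered depth) is our reading of the text, and the subsequent iNeRF-style refinement is omitted.

```python
import cv2
import numpy as np

# Sketch: EPnP-based pose initialization for an image excluded from
# reconstruction. `pts3d` are 3D points lifted from matched keypoints of a
# pose-known reference view via NeRF depth; `pts2d` are the corresponding
# keypoints in the query image. Both inputs are assumptions about the pipeline.

def estimate_initial_pose(pts3d, pts2d, K):
    pts3d = np.asarray(pts3d, dtype=np.float64).reshape(-1, 3)
    pts2d = np.asarray(pts2d, dtype=np.float64).reshape(-1, 2)
    ok, rvec, tvec = cv2.solvePnP(pts3d, pts2d, K, None,
                                  flags=cv2.SOLVEPNP_EPNP)
    if not ok:
        raise RuntimeError("EPnP failed; not enough reliable correspondences")
    R, _ = cv2.Rodrigues(rvec)   # rotation vector -> 3x3 rotation matrix
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, tvec.ravel()
    return T  # world-to-camera pose used to initialize the NeRF-based refinement
```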
2.2.2. UAV Large-Scale Scene Voxel Mesh Modeling
To reconstruct more detailed scene models, it is crucial to consider the impact of lighting on the reconstructed textures and to project the rich textures, combined with the lighting model, at appropriate distances. To achieve this, we first use Poisson reconstruction to convert the previously generated point cloud into a mesh model. Poisson reconstruction is an implicit surface-based reconstruction method that recovers a smooth surface from an unordered point cloud. For the point cloud obtained from the dense reconstruction, we estimate a normal vector for each point; these normals serve as samples of a gradient field. We then solve a Poisson equation for an implicit function whose zero-level set represents the desired surface. The Poisson equation can be expressed as
$$\Delta f = \nabla \cdot \mathbf{v},$$
where $\nabla$ represents the gradient operator, $\mathbf{v}$ is the vector field obtained by interpolating the normal vectors of the point cloud, and $f$ is the unknown implicit function. To solve this equation, we discretize space into an octree grid. At each grid node, an unknown variable $f_i$ is defined to represent the value of the implicit function at that node. Approximating the Poisson equation with finite differences yields a system of linear equations:
$$L\mathbf{f} = \mathbf{b},$$
where $L$ is the Laplacian matrix, $\mathbf{f}$ is the vector of all node values $f_i$, and $\mathbf{b}$ is the vector of $\nabla \cdot \mathbf{v}$ values at the grid nodes. We use the preconditioned conjugate gradient method to iteratively solve this linear system, obtaining an approximate solution for $\mathbf{f}$. Finally, we apply the marching cubes isosurface extraction algorithm to extract the zero-level set of $f$, which represents the reconstructed surface.
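In practice, this stage can be prototyped with an off-the-shelf Poisson solver; the sketch below uses Open3D as a stand-in for the octree solver described above, with illustrative parameter values rather than the authors' settings.

```python
import open3d as o3d

# Sketch: Poisson surface reconstruction of the merged dense point cloud using
# Open3D. The octree depth and normal-estimation radius are illustrative only.

def poisson_mesh_from_points(ply_path, octree_depth=10):
    pcd = o3d.io.read_point_cloud(ply_path)
    # Estimate and orient per-point normals; they sample the gradient field v.
    pcd.estimate_normals(
        search_param=o3d.geometry.KDTreeSearchParamHybrid(radius=1.0, max_nn=30))
    pcd.orient_normals_consistent_tangent_plane(30)
    # Solve the Poisson equation on an octree grid and extract the isosurface.
    mesh, densities = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(
        pcd, depth=octree_depth)
    return mesh
```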
Next, to reconstruct the initial SDF from the mesh model, we simulate a virtual laser scanner that projects the mesh from multiple directions, producing a series of depth images. Each pixel of these depth images corresponds to a point in space. Using a kd-tree, we find the nearest surface point on the mesh for each such point and take the distance between the two points as the magnitude of the signed distance function (SDF) at that pixel; the sign of the SDF is then determined from either the normals or a depth buffer, as described below.
To determine the sign of the SDF (whether the point is inside or outside the surface), we employ two methods based on the characteristics of different implicit surfaces. For smooth and continuous surfaces, we use the angle between the point’s normal and the surface normal: if the angle is greater than 90 degrees, the point is considered inside; otherwise, it is considered outside. For complex or non-smooth surfaces, we introduce a depth buffer. If the depth value of a point is greater than the depth value of the corresponding surface point, the point is considered inside; otherwise, it is considered outside. This dynamic selection of methods based on the surface type ensures optimal results. Ultimately, we obtain the SDF values for each point, completing the initial SDF reconstruction.
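A minimal sketch of the normal-angle sign test for smooth surfaces is given below; the virtual-scanner sampling that produces the surface points and normals is omitted, and the interface is an assumption.

```python
import numpy as np
from scipy.spatial import cKDTree

# Sketch: initial SDF values for query points given mesh surface samples and
# their outward normals, using the normal-angle sign test for smooth surfaces.
# `surface_pts` / `surface_nrm` would come from the virtual-scanner depth images.

def signed_distance(query_pts, surface_pts, surface_nrm):
    tree = cKDTree(surface_pts)
    dist, idx = tree.query(query_pts)           # unsigned distance to nearest surface point
    to_query = query_pts - surface_pts[idx]     # vector from surface point to query point
    # Angle test: if the vector opposes the outward normal (dot < 0, i.e. the
    # angle exceeds 90 degrees), the query point lies inside the surface.
    inside = np.einsum('ij,ij->i', to_query, surface_nrm[idx]) < 0
    sign = np.where(inside, -1.0, 1.0)
    return sign * dist
```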
Finally, we utilize spatially varying spherical harmonic functions [40] to solve for the reflectance of each image frame and the scene illumination. Employing shape-from-shading (SfS) techniques, we perform local geometric optimization for each voxel within the SDF volume to minimize the reprojection errors of the image frames and to ensure voxel smoothness. We iterate through these steps until convergence or until the maximum iteration count is reached. This process results in a high-quality three-dimensional model with intricate geometric details and consistent surface textures.
3. Experiments
To demonstrate the effectiveness of our method, we conducted experiments on five different scenes, including urban, rural, countryside, mountain, and campus scenarios. The test data consisted of multiscale aerial images covering areas of several square kilometers. The UAV used for data collection was a DJI Mavic 2. We conducted aerial operations at an altitude of 120 m, capturing photographs at a 30° angle, with a final image resolution of 4000 × 3000 pixels. We employed a ground station for UAV positioning to ensure an image overlap rate exceeding 85%. The data included regions with rich textures (buildings), repetitive textures (roads and vegetation), and non-textured areas (water). Additionally, since the data were not acquired simultaneously, lighting conditions varied, making precise 3D reconstruction of the entire dataset challenging. We evaluated the dataset from multiple perspectives. In the implicit modeling phase, we compared our method to several advanced techniques (NeRF [28], TensoRF [41], Mega-NeRF [33]). In the explicit modeling phase, we compared our results to those of various existing aerial modeling software packages (COLMAP 3.9, Context Capture 2023, PhotoScan 2023, Pix4D 4.5.6).
3.1. Implicit Modeling for Image Generation
We tested the performance of implicit model generation, with the results shown in Figure 3 and Table 1. PSNR (peak signal-to-noise ratio), SSIM (structural similarity index), and LPIPS (learned perceptual image patch similarity) are metrics used for assessing image quality. Significant improvements were achieved in both the visual quality and the evaluation metrics. The neural radiance fields supervised by depth information outperformed methods relying solely on visual images, exhibiting fewer artifacts and achieving photo-realistic quality. This indicates that the new viewpoint images generated by our method are sufficiently realistic to be included in the co-visibility clusters to compensate for the missing views required for reconstruction.
3.2. The 3D Point Cloud Reconstruction Accuracy
Our GPU-accelerated offline modeling framework, based on the coupling of implicit–explicit representations for large-scale UAV aerial scenes, strikes a balance between reconstruction accuracy and efficiency in complex aerial environments. It is suitable for rapid presentation and analysis in practical applications. In order to demonstrate the effectiveness of our method based on implicit co-visibility clusters, we conducted ablation experiments by subsampling video frames. We simulated scenarios for non-professional UAV aerial scenes (overlap was only 50%). The results generated using traditional methods such as COLMAP for low-overlap images are often poor.
Figure 4 presents the comparative results of point cloud reconstruction for low-overlap scenes.
Our method achieves competitive quality in sparse and dense reconstruction compared to software like Context Capture 2023, Pix4D 4.5.6, and COLMAP 3.9. To assess the accuracy, we randomly selected several ROIs from the point cloud, searched for their nearest points in the reference model, and computed their mean Euclidean distance (MD) and standard deviation (SD). The results are provided in Table 2 (compared to PhotoScan).
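For reference, the MD/SD metric can be computed as a simple nearest-neighbor statistic between the evaluated and reference point clouds, as in the sketch below (the ROI selection step is omitted).

```python
import numpy as np
from scipy.spatial import cKDTree

# Sketch: mean Euclidean distance (MD) and standard deviation (SD) between an
# evaluated point cloud ROI and a reference point cloud (e.g., the PhotoScan
# result). Point arrays are Nx3.

def cloud_accuracy(eval_pts, ref_pts):
    tree = cKDTree(ref_pts)
    dists, _ = tree.query(eval_pts)   # nearest-neighbor distance per evaluated point
    return dists.mean(), dists.std()  # MD, SD
```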
To further demonstrate the accuracy and quality of our method in dense point cloud reconstruction of natural scenes, we compare our approach to RTSfM [42], a state-of-the-art method for 3D dense point cloud reconstruction. Our solution targets efficient, high-precision reconstruction of large-baseline, high-resolution aerial imagery; compared with the most advanced 3D reconstruction methods, our system can generate large-scale, high-quality dense point cloud models in real time.
Additionally, to illustrate our model's performance, we use very large aerial images (rural and mountain scenarios) as test data. These datasets have a wide range of capture times, different textures, and occlusions, which pose significant challenges for dense 3D point cloud reconstruction. As shown in Table 3, the point cloud results generated by our method are on a par with the state-of-the-art RTSfM. This indicates that our reconstruction accuracy is highly competitive.
3.3. The Accuracy of 3D Mesh Modeling
We tested the final mesh modeling results using PhotoScan as the baseline. We randomly sampled the triangular faces of the model and calculated the face centroids as sampling points. Similarly, we evaluated our final mesh modeling results by calculating their MD (mean Euclidean distance) and SD (standard deviation).
Table 4 shows the superiority of our method, which leverages co-visibility constraints for large-scale scene representation, uses neural implicit modeling to compensate for missing views, and finally, accurately reconstructs the point cloud through dense SLAM. After texture enhancement and smoothing, we obtain high-quality mesh models. It is evident that our method achieves good accuracy in most scenes, and it ranks first in terms of stability across all scenarios. Finally, in Figure 5, we show the final modeling results of our method for all the scenes we considered.
4. Conclusions
Due to challenges such as the trajectories of UAV data acquisition and environmental noise, UAV large-scale scene modeling often yields low model accuracy, incompleteness, and blurry geometric structures and textures. This paper proposes a novel implicit–explicit coupling framework for high-accuracy 3D modeling of UAV large-scale scenes. It comprises an implicit co-visibility-cluster-guided large-scale scene dense 3D reconstruction and an implicit 3D pose-recovery-based 3D modeling and surface texture enhancement.
Initially, the framework utilizes implicit rendering to synthesize new views, ensuring spatiotemporal consistency in co-visible relationships. Subsequently, within the co-visibility clusters, it employs implicit scene models to restore missing depth information from explicit multi-view pixel matching. Finally, the framework jointly optimizes image and depth information fused into the signed distance function (SDF) and estimates spatially varying lighting for building surface texture images and geometric shapes, resulting in high-quality 3D models of buildings. In comparison to existing methods, our approach achieves significant improvements in accuracy and reconstruction details, demonstrating strong competitiveness against commercial software.
Nevertheless, limitations exist, such as vulnerability to the influence of dynamic objects and restricted generalization to new scenes, necessitating substantial training data for different scenarios. On the other hand, due to the extensive use of implicit modeling in this paper to compensate for perspective and depth information loss caused by occlusion or blurriness, our approach inherits limitations from the neural radiance field (NeRF) method when dealing with large-scale, complex, or dynamic three-dimensional scenes involving UAVs. These limitations include:
(1) Memory and Computational Efficiency Issues: NeRF typically demands significant memory for storing and optimizing neural network parameters. In particular, as the scene size increases, the required parameter count rises sharply. Training time and computational costs grow significantly with the complexity and scale of the scene, limiting the practical applicability of our method in large-scale UAV scene modeling.
(2) Resolution Limitation Issues: The modeling resolution produced by our method is constrained by the number of points sampled by NeRF and the learning capacity of the network. In large scenes, maintaining overall and local accuracy may necessitate higher-resolution sampling, greatly escalating computational complexity. Therefore, the model may struggle to accurately capture intricate geometric and textural details in distant or extensive areas.
(3) Scalability and Updatability Issues in Scene Reconstruction: Original NeRF methods face challenges in achieving incremental updates and expansions for continuously expanding large or dynamically changing scenes. In other words, efficient learning and model updates focusing on newly added or altered parts are difficult. To address this, new algorithmic frameworks need to be designed to support modular, chunk-based, or adaptive updates, better meeting the modeling requirements of large-scale and dynamic environments.
Our current method solely focuses on dense 3D modeling and model surface texture accuracy in large-scale UAV scenarios. However, the resulting 3D scene models lack interactivity, editability, necessary semantic information for users, and generalization across different dynamic and static scenes, making them challenging for direct practical applications. Hence, future enhancements in this work will consider:
(1) Integration of Semantic and Instance Segmentation: Developing a 3D semantic reconstruction framework for large UAV scenes that integrates semantic information, not only for geometric modeling but also for semantic understanding of the scene and object-level decoupling mapping.
(2) Interactivity and Editability: Creating a large-scale reconstruction system with enhanced interactivity and editability, allowing users to modify or update constructed scene models in real time.
(3) Generalization and Open-World Adaptability: Improving the model’s generalization ability to adapt to different environmental and lighting conditions, as well as entirely new and unseen scenes.
In conclusion, the current research outcomes of this work can directly provide high-quality 3D map grid models for various mobile robots (such as autonomous cars and UAVs). The related technologies can also offer technical support for virtual roaming, urban planning, military simulation, forensic investigation, VR gaming, and other applications in virtual reality intelligence, demonstrating significant practical value.