
UnDER: Unsupervised Dense Point Cloud Extraction Routine for UAV Imagery Using Deep Learning

Department of Earth Observation Science, Faculty of Geo-Information Science and Earth Observation (ITC), University of Twente, 7522NH Enschede, The Netherlands
Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(1), 24;
Submission received: 30 September 2024 / Revised: 29 November 2024 / Accepted: 23 December 2024 / Published: 25 December 2024


Extraction of dense 3D geographic information from ultra-high-resolution unmanned aerial vehicle (UAV) imagery unlocks a great number of mapping and monitoring applications. This is facilitated by a step called dense image matching, which tries to find pixels corresponding to the same object within overlapping images captured by the UAV from different locations. Recent developments in deep learning utilize deep convolutional networks to perform this dense pixel correspondence task. A common theme in these developments is to train the network in a supervised setting using available dense 3D reference datasets. However, in this work we propose a novel unsupervised dense point cloud extraction routine for UAV imagery, called UnDER. We propose a novel disparity-shifting procedure to enable the use of a stereo matching network pretrained on an entirely different typology of image data in the disparity-estimation step of UnDER. Unlike previously proposed disparity-shifting techniques for forming cost volumes, the goal of our procedure was to address the domain shift between the images that the network was pretrained on and the UAV images, by using prior information from the UAV image acquisition. We also developed a procedure for occlusion masking based on disparity consistency checking that uses the disparity image space rather than the object space proposed in a standard 3D reconstruction routine for UAV data. Our benchmarking results demonstrated significant improvements in quantitative performance, reducing the mean cloud-to-cloud distance by approximately 1.8 times the ground sampling distance (GSD) compared to other methods.

1. Introduction

Extraction of dense 3D geographic information from ultra-high-resolution unmanned aerial vehicle (UAV) imagery unlocks a great number of mapping and monitoring applications [1,2] with an unprecedented level of detail, scalability, and cost-effectiveness, which would otherwise be too impractical when attempted via conventional surveying methods. Such applications range from mapping coastal [3] and agricultural environments [4], surface excavation sites [5,6], individual tree stems [7], and tree species [8] to monitoring landslide progression [9,10], developments in crop attributes [11,12], and changes in buildings [13,14]. This dense 3D reconstruction task is facilitated by finding pixels corresponding to the same object within overlapping images captured by the UAV from different locations—a step called dense image matching, typically undertaken using a semi-global [15] or a patch-based [16] approach.
These standard approaches in dense image matching generally rely on handcrafted features derived from relatively local contextual information to measure the similarity between corresponding pixels. With the resurgence of artificial neural networks over the past few decades, fueled by large labeled training datasets and more powerful computational hardware, many vision-related tasks, including those in the remote-sensing domain [17,18,19], have shown that features learned from deep networks are superior to conventional handcrafted features.
A considerable problem with applying a deep learning-based approach to a task such as dense 3D reconstruction from ultra-high-resolution UAV images is that most of these deep networks are trained in a supervised manner, i.e., they require a substantial amount of reference dense pixel-to-pixel correspondences from which to learn the similarity features. Generally, applying a network trained on an image dataset with characteristics entirely different to those of the target dataset, in order to perform the dense image matching, will produce poor results; and annotating reference training data for the dense-image-matching step for every UAV image acquisition would be impractical. Thus, an unsupervised procedure to utilize the rich features from pretrained deep networks for dense image matching is necessary.

1.1. Related Work

A well-known standard dense-image-matching method, available in most open-source and commercial photogrammetric software, is the census-based semi-global matching (SGM) algorithm. It uses a handcrafted feature, called the census transform, to measure the similarity between dense pixel correspondences [20]. Census encodes the local structure of an image by comparing pixel intensities within a neighborhood window using a fixed operation: it creates a binary code in which each bit is 1 if the neighboring pixel is greater than or equal to the central pixel, and 0 otherwise. A matching cost volume is derived by calculating the Hamming distances between the census values of all the pixels in a base image and the census values of all the pixels, within a search range, in the overlapping side image. The range of possible pixel correspondences in the side image is reduced to a one-dimensional search range by a step called rectification, which warps the base and side images using their camera orientation parameters so that corresponding pixels lie on the same rows of the base and side images. Finally, the dense pixel correspondences are given by the base- and side-image pixel pairs having the lowest cost value along the search range. Optimization to find the lowest cost value is usually conducted via cost aggregation, adding smoothness terms to the previously described cost volume, along different directions in the search space [21].
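The census-based matching cost described above can be illustrated with a minimal sketch. It is not the SGM implementation used by any particular software: the function names, the 3 × 3 window, and the winner-takes-all selection (without cost aggregation) are assumptions made for illustration only.

```python
def census_transform(img, radius=1):
    """Census code per pixel of a 2D list of intensities.

    Each neighbor contributes one bit: 1 if neighbor >= central pixel, else 0.
    Pixels closer than `radius` to the border are left as None.
    """
    h, w = len(img), len(img[0])
    codes = [[None] * w for _ in range(h)]
    for r in range(radius, h - radius):
        for c in range(radius, w - radius):
            code = 0
            for dr in range(-radius, radius + 1):
                for dc in range(-radius, radius + 1):
                    if dr == 0 and dc == 0:
                        continue  # skip the central pixel itself
                    code = (code << 1) | int(img[r + dr][c + dc] >= img[r][c])
            codes[r][c] = code
    return codes


def hamming(a, b):
    """Number of differing bits between two census codes."""
    return bin(a ^ b).count("1")


def best_disparity(base_codes, side_codes, row, col, max_disp):
    """Winner-takes-all search along one rectified row (no aggregation)."""
    best_d, best_cost = None, None
    for d in range(max_disp + 1):
        side_col = col - d  # disparity = base column - side column
        if side_col < 0 or side_codes[row][side_col] is None:
            continue
        cost = hamming(base_codes[row][col], side_codes[row][side_col])
        if best_cost is None or cost < best_cost:
            best_d, best_cost = d, cost
    return best_d, best_cost
```

In a full SGM pipeline the per-pixel costs would be aggregated along several directions with smoothness penalties before the minimum is taken; the sketch stops at the raw cost.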
More recently, learning-based dense image matching has been explored using deep convolutional networks [22,23,24] trained in a supervised manner on large labeled benchmark datasets such as KITTI [25], Middlebury [26], and Scene Flow [27]. The datasets contain already-rectified image pairs with reference dense pixel correspondences in the form of disparity maps, which encode, for every pixel in the base image, the difference between the column coordinate of the base image pixel and that of the equivalent pixel in the side image. MC-CNN [22] learns a similarity metric using a convolutional network trained on examples of similar and dissimilar patches. Like the conventional dense-image-matching methods, a matching cost volume is built using the features learned by the convolutional network, and it is subsequently optimized via cost aggregation [21]. Another approach is to learn the disparity map directly in an end-to-end manner, using a deep convolutional network such as GC-Net [23] or PSMNet [24].
The use of deep convolutional networks for similarity metrics and end-to-end disparity learning has been further adapted into remote-sensing applications to extract 3D information from very-high-resolution satellite [28,29,30], aerial [28,29,31], and UAV [32] stereo images. A common theme in these previous works has been the (re-)training of a deep network for similarity metrics or end-to-end disparity estimation, in a supervised manner, using available dense 3D reference datasets to improve the reconstruction results. In our previous work [33], we explored the use of a deep stereo matching network [34] that can be fine-tuned in an unsupervised manner for improving the extraction of a dense 3D point cloud from ultra-high-resolution UAV images. That work, however, had three shortcomings: its dataset lacked dense 3D reference data with which to reliably assess the results, its results were greatly affected by occlusion, and it lacked a model merging strategy to test combined results from multiple image pairs. The present work extends it by addressing these three key points.

1.2. Contributions

As previously outlined, supervised approaches for 3D reconstruction of ultra-high-resolution UAV imagery suffer from two limitations: domain dependency and annotation bottleneck. Networks trained on an entirely different typology of image data often perform poorly when tested on UAV data, due to the significant domain shift, and generating dense 3D reference data for every UAV image acquisition is labor-intensive and impractical for most relevant mapping applications. Therefore, in this work we propose an unsupervised dense point cloud extraction routine for UAV imagery. We summarize the main contributions of this paper as follows:
  • We propose to use a pretrained stereo matching network that can be fine-tuned in an unsupervised manner to perform the disparity estimation step in our routine, demonstrating that there is some performance gain in conducting this unsupervised fine-tuning process.
  • Naively applying the pretrained network produces poor results. Thus, we propose a method called disparity shifting, using prior information from the UAV image-acquisition setup to generalize the use of a network trained on an entirely different typology of image data. Unlike previously proposed disparity-shifting techniques that either shift the disparity maps to produce cost volumes [35] or shift stereoscopic scenes to improve visual comfort [36], our procedure addresses the domain shift between the images that the network was pretrained on and the UAV images by using prior information from the UAV image acquisition.
  • We propose a procedure called the left–right disparity consistency check, which masks out occluded pixels in the disparity image space rather than in the object space as proposed in a standard 3D reconstruction routine for UAV data [15].
We organized the remaining parts of the paper as follows: Section 2.1 presents the proposed framework, Section 2.2 describes the dataset and experimental setup used, Section 3 presents the results of these experiments, which are further discussed in Section 4, and, finally, the paper is concluded in Section 5. The code implementing our point cloud extraction routine will be freely available on GitHub upon publication of this work.

2. Materials and Methods

2.1. UnDER: Unsupervised Dense Point Cloud Extraction Routine for UAV Imagery

In this paper, we propose UnDER, an unsupervised dense point cloud extraction routine for ultra-high-resolution nadir-looking UAV imagery leveraging three key mechanisms: (1) unsupervised disparity estimation via a pretrained deep stereo matching network, (2) disparity shifting using prior and/or ancillary information on the UAV image-acquisition setup, and (3) occlusion masking via left–right disparity consistency checking. UnDER is built on top of a deep stereo matching network [34] originally trained on the Scene Flow [27] and KITTI [25] datasets. The routine starts with undistorted UAV images and their corresponding camera interior and exterior orientation parameters as input, and it produces a dense point cloud as output. Figure 1 provides an overview of the major steps included in the UnDER framework.

2.1.1. Image Rectification

UnDER starts with an input undistorted nadir-looking UAV image pair and its corresponding camera interior and exterior orientation parameters. In a simplified case where there is no scale difference and shear along the x and y axes of the image coordinate system, and where both the tangential and the radial distortions are corrected, the interior parameters include the focal length $c$ and the coordinates of the principal point $(x_H, y_H)$ of the camera. The exterior parameters include the coordinates of the camera projection center, given by the vector $Z$ with respect to a world coordinate system, and the rotation matrix $R$, parametrized by three rotation angles $(\omega, \phi, \kappa)$ that describe how to rotate the world coordinate system about its X-, Y-, and Z-axes to align it with the image coordinate system. Given the undistorted image pair and the corresponding orientation parameters, required as an input to UnDER, a camera matrix $P$ can be constructed as follows:
$$P = K R \, [\, I_3 \mid -Z \,] \tag{1}$$
where $I_3$ is a $3 \times 3$ identity matrix and $K$ is the calibration matrix given by
$$K = \begin{bmatrix} c & 0 & x_H \\ 0 & c & y_H \\ 0 & 0 & 1 \end{bmatrix} \tag{2}$$
The calibration matrix K and the rotation matrix R are used to rectify the undistorted UAV image pair according to the following equations:
$$x'_m = K R \, R'^{\mathsf{T}} K'^{-1} x', \qquad x''_m = K R \, R''^{\mathsf{T}} K''^{-1} x'' \tag{3}$$
where $x'_m$ and $x''_m$ are the rectified image coordinates, $K'$ and $K''$ are the calibration matrices, and $R'$ and $R''$ are the rotation matrices of cameras $'$ and $''$. For most UAV image acquisitions, a common calibration matrix $K = K' = K''$ applies, and the rotation matrix $R$ is the rotation defined by a common viewing direction that is the average of the viewing directions of the two cameras and is perpendicular to the baseline of the cameras [37] (p. 567). Both equations in Equation (3) are in closed form and can be independently used to warp each image in the stereo image pair to the rectified image coordinate system.
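To make the camera matrix and the rectifying warp more concrete, the following is a minimal sketch in plain Python. It assumes the usual photogrammetric convention in which a projection center $Z$ enters the camera matrix with a negative sign and in which each rectifying homography has the form $K R R_{cam}^{\mathsf{T}} K^{-1}$. All function names are hypothetical, and `rot_z` builds a rotation about the Z-axis only (a single angle, for illustration) rather than a full $(\omega, \phi, \kappa)$ rotation.

```python
import math


def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def calibration_matrix(c, x_h, y_h):
    """K for focal length c and principal point (x_h, y_h), no shear."""
    return [[c, 0.0, x_h], [0.0, c, y_h], [0.0, 0.0, 1.0]]


def calibration_inverse(c, x_h, y_h):
    """Closed-form inverse of the calibration matrix above."""
    return [[1.0 / c, 0.0, -x_h / c], [0.0, 1.0 / c, -y_h / c], [0.0, 0.0, 1.0]]


def rot_z(kappa):
    """Illustrative rotation about the Z-axis only (assumption)."""
    ck, sk = math.cos(kappa), math.sin(kappa)
    return [[ck, -sk, 0.0], [sk, ck, 0.0], [0.0, 0.0, 1.0]]


def camera_matrix(k, r, z):
    """P = K R [I_3 | -Z] for projection center z = (X, Y, Z)."""
    i_minus_z = [[1.0, 0.0, 0.0, -z[0]],
                 [0.0, 1.0, 0.0, -z[1]],
                 [0.0, 0.0, 1.0, -z[2]]]
    return matmul(matmul(k, r), i_minus_z)


def rectifying_homography(k, k_inv, r_common, r_cam):
    """H = K R R_cam^T K^-1, warping one image to the common orientation R."""
    return matmul(matmul(matmul(k, r_common), transpose(r_cam)), k_inv)


def apply_homography(h, x, y):
    """Map pixel (x, y) through the 3x3 homography h."""
    u, v, w = (row[0] * x + row[1] * y + row[2] for row in h)
    return u / w, v / w
```

A quick sanity check on such a sketch is that the projection center itself maps to the zero vector under $P$, and that the homography reduces to the identity when the camera rotation already equals the common rotation.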

2.1.2. Disparity Estimation Network

After obtaining the rectified images in the previous step, they are fed to a disparity estimation network, to derive an output disparity map. For UnDER, we adapted the deep stereo matching network proposed in [34] for the following reasons: (1) it does not involve a cost-volume-based approach and, therefore, it does not constrain the range of the output disparity values; (2) it is possible to fine-tune the network in an unsupervised manner not only using the well-known photometric loss [38,39] and smoothness loss [39,40] but also losses derived from parallax attention maps [34]. These two factors are important for reconstructing 3D scenes from ultra-high-resolution UAV imagery, since a limited disparity range will not be able to capture the potentially wide range of depth values, and because it is impractical to collect ground truth disparity values for supervised fine-tuning with every UAV image acquisition.
The network is composed of three stages: learned feature extraction, cascaded parallax attention modules, and disparity refinement, as shown in Figure 2. The feature-extraction stage employs an hourglass encoder–decoder architecture similar to the widely used SegNet [41], the main difference being that convolution with a stride of two is used, rather than max pooling, to downsample the feature maps, and that transposed convolutions are used to upsample it back. The encoder part of this hourglass network is composed of five blocks, each performing 3 × 3 convolutions with a stride of two pixels, effectively downsampling the consecutive feature maps by a scale of 1/2, batch normalization [42], and Leaky ReLU [43] as the activation function. The decoder part is composed of one block performing the same operations as the encoder blocks but without downsampling the bottleneck part of the hourglass, effectively maintaining the previous feature map dimensions, and three additional blocks performing upsampling using bilinear interpolation, by a factor of two, followed by 1 × 1 convolutions with a stride of one pixel, batch normalization, and Leaky ReLU activation. Both images in the stereo pair go through the same encoder-decoder block, producing left and right feature maps:
Selected multi-scale feature maps are then fed to parallax attention modules cascaded at three different scales: $P_1$ with dimensions $H/16$ by $W/16$, $P_2$ with dimensions $H/8$ by $W/8$, and $P_3$ with dimensions $H/4$ by $W/4$, each module producing an output disparity map and a validity mask, where $H$ and $W$ are the height and width of the input image pairs. The cascade starts with the $F^{1}$ feature map, produced by the decoder block right after the bottleneck feature maps, and initial left–right and right–left matching cost maps $C_0$ with all values set to zero. These three feature maps are then fed to four parallax attention blocks, each block performing $3 \times 3$ and $1 \times 1$ convolutions to compute the key and query feature maps ($K$ and $Q$), residualization via skip connections and summation, and array reshaping combined with matrix multiplication to derive parallax attention. Parallax attention is a modification of the attention mechanism [44] that measures the similarity or correspondence of a specific pixel to pixels in the same row, as opposed to the global comparison performed by the original self-attention variant. See Figure 3 for an illustration comparing self-attention and parallax attention.
Each scale $P_i$ stacks four parallax attention blocks, feeding the output of the previous block as an input to the consecutive block. Each parallax attention block produces four sets of outputs: two sets of intermediate feature maps derived from the left and right feature maps of the input feature map of the block, and a pair of intermediate matching cost maps derived from parallax attention comparing the feature maps from the left image with the right image and vice versa. The outputs of the fourth block are upsampled using linear interpolation (bilinear for the feature maps and trilinear for the matching cost maps) to produce the set of inputs for the next scale $P_{i+1}$, concatenating the feature map with the same scale, $F^{i+1}$, from the decoder block of the previous feature-extraction stage.
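The difference between row-wise (parallax) attention and global self-attention can be sketched in a few lines of plain Python. This is only an illustration of the row-restricted similarity computation, not the network of [34]: the convolutions, residual connections, and cost accumulation are omitted, and the function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def parallax_attention(q_left, k_right):
    """Row-wise attention between left queries and right keys.

    q_left, k_right: nested lists shaped [H][W][C].
    Returns att shaped [H][W][W], where att[r][i][j] is the similarity of
    left pixel (r, i) to right pixel (r, j), normalized along j.
    """
    att = []
    for q_row, k_row in zip(q_left, k_right):
        att.append([
            softmax([sum(a * b for a, b in zip(q, k)) for k in k_row])
            for q in q_row
        ])
    return att


def attention_sizes(h, w):
    """Entries in a row-wise attention map vs. a global self-attention map."""
    return h * w * w, (h * w) ** 2
```

Because each pixel is compared only with the pixels of its own row, the attention map grows with `H * W * W` rather than `(H * W) ** 2`, which is what makes the mechanism practical for rectified stereo pairs.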
Each group of four parallax attention blocks in a scale is connected to an output module applying softmax activation to the last matching cost map, to obtain parallax attention maps that are then used to generate a validity mask and to perform disparity regression. By default, these validity masks are only used during training, to exclude occluded pixels using partial convolution [45].
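One common way to turn a row-wise attention map into a disparity value is to take the attention-weighted expected matching column and subtract it from the pixel's own column. The sketch below assumes this soft-argmax style of regression; it is an illustration and may differ in detail from the output module of [34], and the validity mask is left out.

```python
def regress_disparity(att_row):
    """Expected disparity per left-image column for one image row.

    att_row: [W][W] left-to-right attention, att_row[i][j] being the weight
    of right column j for left column i (each inner list sums to 1).
    Disparity is defined as left column minus expected right column.
    """
    return [
        i - sum(p * j for j, p in enumerate(probs))
        for i, probs in enumerate(att_row)
    ]


# Hypothetical 3-pixel row: the last pixel splits its weight between
# right columns 1 and 2, giving a sub-pixel disparity.
example = regress_disparity([
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.5, 0.5],
])
```

Because the result is an expectation over columns, the output is not restricted to a fixed, discretized disparity range.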
The output disparity of the cascaded parallax attention module in the last scale $P_3$ goes through another hourglass architecture, to perform learned disparity refinement. This initial disparity $\hat{D}_{init}$ is concatenated to the left feature map of the second block of the encoder in the feature-extraction stage, $F^{4}_{left}$, as the left image is designated to be the base image. The result is then fed to the hourglass architecture performing the disparity refinement, to produce the residual disparity $\hat{D}_{res}$ and the confidence map $M_{con}$. The refined disparity $\hat{D}_{refined}$ is calculated as follows:
$$\hat{D}_{refined} = (1 - M_{con}) \times \uparrow\!\hat{D}_{init} + M_{con} \times \hat{D}_{res} \tag{4}$$
where $\uparrow$ is the upsampling operation using bilinear interpolation.
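The refinement step is a per-pixel convex blend of the initial and residual disparities, weighted by the confidence map. A minimal sketch, assuming the initial disparity has already been upsampled to the output resolution (the bilinear upsampling itself is not shown) and using a hypothetical function name:

```python
def refine_disparity(d_init_up, d_res, m_con):
    """Blend per pixel: (1 - m) * d_init_up + m * d_res.

    All three arguments are 2D lists of equal shape; m_con holds
    confidences in [0, 1]. d_init_up is assumed to be already upsampled.
    """
    return [
        [(1.0 - m) * di + m * dr for di, dr, m in zip(row_i, row_r, row_m)]
        for row_i, row_r, row_m in zip(d_init_up, d_res, m_con)
    ]


# Hypothetical values: confidence 0 keeps the initial disparity,
# confidence 1 takes the residual branch, 0.25 mixes the two.
refined = refine_disparity(
    [[10.0, 10.0, 10.0]],
    [[14.0, 14.0, 14.0]],
    [[0.0, 1.0, 0.25]],
)
```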
We use the same unsupervised loss $L_{unsup}$ as introduced by [34], given by
$$\begin{aligned} L_{unsup} &= L_p + \lambda_s L_s + \lambda_a \left( 0.2\, L_a^{1} + 0.3\, L_a^{2} + 0.5\, L_a^{3} \right) \\ L_a^{s} &= L_{ap}^{s} + \lambda_{as} L_{as}^{s} + \lambda_{ac} L_{ac}^{s} \end{aligned} \tag{5}$$
$L_p$ is the photometric loss, $L_s$ is the smoothness loss, and $L_a^{s}$ is the parallax attention mechanism loss calculated in three scales, $s = (1, 2, 3)$; $L_a^{s}$ includes three additional terms: a photometric loss $L_{ap}^{s}$ and a smoothness loss $L_{as}^{s}$ calculated from the parallax attention maps, and a cycle loss $L_{ac}^{s}$, each with a corresponding weight term $\lambda$. These weights are set manually before training, like the other hyperparameters of the network.
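The way the loss terms are weighted and summed can be written out as a short sketch. The individual loss values are taken as given numbers here (computing them from images is outside the scope of the example), the function names are hypothetical, and the default weights are the ones reported for fine-tuning in Section 2.2.2.

```python
SCALE_WEIGHTS = (0.2, 0.3, 0.5)  # weights of the three attention scales


def attention_loss(l_ap, l_as, l_ac, lam_as=1.0, lam_ac=1.0):
    """Parallax attention loss at one scale: photometric + smoothness + cycle."""
    return l_ap + lam_as * l_as + lam_ac * l_ac


def unsupervised_loss(l_p, l_s, l_a_scales, lam_s=0.5, lam_a=0.5):
    """Total loss from photometric, smoothness, and per-scale attention losses.

    l_a_scales: the three attention losses for scales 1, 2, and 3.
    """
    weighted = sum(w * l for w, l in zip(SCALE_WEIGHTS, l_a_scales))
    return l_p + lam_s * l_s + lam_a * weighted
```

The scale weights increase toward the finest scale, so errors in the highest-resolution attention maps contribute most to the attention term.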

2.1.3. Disparity Shifting

Directly applying a deep stereo matching network originally trained on an entirely different dataset, like KITTI, to a UAV image dataset for the purpose of deriving output disparity maps fails, due to the considerable differences in image-acquisition setups. In general, computer vision datasets like KITTI use a fixed stereo camera setup with (an assumed-to-be) rigid baseline and fronto-parallel orientation, i.e., the viewing directions of the two cameras are parallel. In UAV image acquisition, however, this stereo camera setup is simulated by a UAV capturing images from different points of view (POVs), i.e., different Z and R. Therefore, the baselines between POVs are not only much wider—only about 54 cm for KITTI, while for UAV image acquisition they can span more than 10 m—but they also vary greatly across image pairs. This introduces a wide and varying gap in disparity ranges between the rectified UAV image pairs and the disparity range considered in the KITTI dataset, and, thus, prevents the stereo matching network from providing any disparity map accurate enough to be useful for 3D reconstruction.
We addressed this issue by introducing a technique called disparity shifting. Referring to Figure 4, we define disparity shifting to be
$$x_{shifted} = x - \delta \, p_x \tag{6}$$
where $x_{shifted}$ is the shifted image coordinate, $p_x$ is the disparity of the principal point referring to the basis depth, and $\delta$ is a hyperparameter called the disparity shift ratio. In Figure 4, $x = P X$, and the basis depth can be derived from any prior information in the UAV image dataset that helps to select a depth reasonably within or close to the expected range of depth values present in the scene. For instance, we can subtract the mean elevation value derived from the GCPs acquired during the UAV image-acquisition mission from the mean elevation value as provided in the exterior orientation parameters of the images. If GCPs are not acquired during the UAV mission, the mean elevation of the ground can be estimated using freely available elevation information, e.g., Google Earth. With a proper choice of $\delta$ (generally, a value between 0 and 1), we are able to obtain reasonably accurate disparity maps from image pairs with varying and considerably wider baselines from a stereo matching network trained on image pairs with a fixed and substantially narrower baseline. Using a value of $\delta$ greater than 1 would potentially result in negative disparity values, and the stereo matching network was originally designed to only produce positive disparity value estimates.
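A small worked example may help to show the effect of the shift. The sketch below assumes the standard normal-case stereo relation, disparity = focal length (in pixels) × baseline / depth, to obtain the principal-point disparity at the basis depth; this relation, the function names, and all numbers are illustrative assumptions, not values from the datasets.

```python
def basis_depth(camera_elevations, ground_elevations):
    """Mean camera elevation minus mean ground elevation (e.g., from GCPs)."""
    mean_cam = sum(camera_elevations) / len(camera_elevations)
    mean_ground = sum(ground_elevations) / len(ground_elevations)
    return mean_cam - mean_ground


def principal_point_disparity(focal_px, baseline_m, depth_m):
    """Normal-case stereo disparity in pixels at the given depth (assumption)."""
    return focal_px * baseline_m / depth_m


def residual_disparity(true_disparity, p_x, delta):
    """Disparity left for the network after shifting by delta * p_x."""
    return true_disparity - delta * p_x


# Hypothetical acquisition: 4000 px focal length, 10 m baseline.
depth_b = basis_depth([130.0, 130.0], [49.0, 51.0])      # 80 m above ground
p_x = principal_point_disparity(4000.0, 10.0, depth_b)   # 500 px
near = principal_point_disparity(4000.0, 10.0, 70.0)     # roof, about 571 px
far = principal_point_disparity(4000.0, 10.0, 100.0)     # depression, 400 px

small_shift = residual_disparity(far, p_x, 0.5)   # 150 px, still positive
large_shift = residual_disparity(far, p_x, 0.9)   # -50 px, negative
```

In this hypothetical scene, a shift ratio of 0.9 brings the residual disparities close to the range seen in narrow-baseline training data, but points noticeably farther away than the basis depth already become negative, which illustrates why the ratio has to be chosen with the scene's depth range in mind.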

2.1.4. Triangulation

UnDER combines the output disparity map produced in the disparity-estimation step with the corresponding base image of the (multi-)stereo pair and the camera interior and exterior orientation parameters of the images in the pair, in the third and last major step, which is triangulation. We follow the geometric solution discussed in [46] (pp. 6–17) by solving the following system of linear equations in 3D:
$$f = Z' + \psi \, r, \qquad g = Z'' + \eta \, s \tag{7}$$
where $Z'$ and $Z''$ are the projection center coordinates, $\psi$ and $\eta$ are unknown scaling factors, and $r$ and $s$ are the direction vectors of cameras $'$ and $''$ given by
$$r = R'^{\mathsf{T}} x', \qquad s = R''^{\mathsf{T}} x'' \tag{8}$$
where $R'^{\mathsf{T}}$ and $R''^{\mathsf{T}}$ are the transposed rotation matrices and $x'$ and $x''$ are the rectified image coordinates of cameras $'$ and $''$. The final triangulated point is given by
$$j = \frac{f + g}{2}. \tag{9}$$
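The geometric idea behind this triangulation can be sketched as a midpoint computation between two rays. The example below is an assumption-laden illustration: it solves for the two scaling factors in a least-squares sense via the 2 × 2 normal equations, which is one common way to handle rays that do not intersect exactly, and it is not claimed to reproduce the solution of [46] line by line. Function names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def triangulate_midpoint(z1, r, z2, s):
    """Midpoint of the closest points on rays z1 + psi*r and z2 + eta*s.

    z1, z2: projection centers; r, s: ray direction vectors (3-element lists).
    Returns (point, psi, eta). Raises ValueError for parallel rays.
    """
    b = [q - p for p, q in zip(z1, z2)]  # vector from z1 to z2
    rr, ss, rs = dot(r, r), dot(s, s), dot(r, s)
    rb, sb = dot(r, b), dot(s, b)
    denom = rr * ss - rs * rs
    if abs(denom) < 1e-12:
        raise ValueError("rays are (nearly) parallel")
    # Least-squares solution of z1 + psi*r = z2 + eta*s
    psi = (rb * ss - sb * rs) / denom
    eta = (rb * rs - sb * rr) / denom
    f = [p + psi * d for p, d in zip(z1, r)]
    g = [p + eta * d for p, d in zip(z2, s)]
    point = [(a + c) / 2.0 for a, c in zip(f, g)]
    return point, psi, eta
```

When the two rays intersect, `f` and `g` coincide and the midpoint is the intersection itself; otherwise the distance between `f` and `g` gives a simple indication of how inconsistent the correspondence is.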

2.1.5. Occlusion Masking

Naively applying Equations (7)–(9) to all pixels within the output disparity map results in a substantial amount of outliers, due to pixels being visible in the base image but occluded in the side image(s) of the (multi-)stereo pair. We attempted to use the validity masks derived from the parallax attention maps in the stereo matching network [34] used in the disparity-estimation step, but this failed to mask out the majority of the occluded pixels, possibly due to the fact that the validity masks are produced on a much lower scale than the output disparity map. Thus, we introduced two occlusion-masking procedures conducted in an unsupervised manner, i.e., without the need for reference occlusion masks on which to possibly train an occlusion mask detector.
The first of the two procedures uses a mask based on the estimated geographic footprint overlap of the base and side images. This geographic footprint is estimated by projecting the image corner coordinates, using the camera matrix $P$, onto the basis depth $D_b$. We further apply an inner buffer to this footprint overlap to ensure that all points within the overlap can be visible, if not occluded, in all the base and side images.
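A strongly simplified sketch of the footprint-overlap idea follows. It assumes perfectly nadir images without rotation, so that each footprint on the basis-depth plane is an axis-aligned rectangle centered below the camera; the routine described above instead projects the image corners with the full camera matrix. Function names and numbers are hypothetical.

```python
def nadir_footprint(center_xy, width_px, height_px, focal_px, depth_b):
    """Axis-aligned ground footprint (xmin, ymin, xmax, ymax) in meters.

    Assumes a nadir view without rotation; the ground size of one pixel
    at the basis depth is depth_b / focal_px.
    """
    gsd = depth_b / focal_px
    half_w = 0.5 * width_px * gsd
    half_h = 0.5 * height_px * gsd
    cx, cy = center_xy
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


def buffered_overlap(a, b, buffer_m):
    """Intersection of two footprints, shrunk inward by buffer_m.

    Returns None if nothing remains after applying the inner buffer.
    """
    xmin = max(a[0], b[0]) + buffer_m
    ymin = max(a[1], b[1]) + buffer_m
    xmax = min(a[2], b[2]) - buffer_m
    ymax = min(a[3], b[3]) - buffer_m
    if xmin >= xmax or ymin >= ymax:
        return None
    return (xmin, ymin, xmax, ymax)
```

Pixels of the base image whose ground position falls outside the buffered overlap would then be excluded from triangulation.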
The second procedure entails a left–right disparity consistency check, similar to that conducted by [15], but instead of performing it in the object space we perform it in the disparity map space. The consistency check is conducted between two output disparity maps obtained by switching the base image in the image pair. An illustration of this disparity consistency check is shown in Figure 5. The procedure starts with two image pair setups: (1) one image of the pair as base image and the other as side image, and (2) the same two images with the base and side roles swapped. We perform the same rectification and disparity estimation as described in Section 2.1.1, Section 2.1.2 and Section 2.1.3 on both setups, shown in Figure 5 as “1. rectify” and “2. estimate disparity”. There will be two resulting disparity maps: one from image pair setup 1 and another from image pair setup 2. We proceed with warping back the disparity map of image pair setup 2 to the original image space, labeled as “3. unrectify” in Figure 5. We then warp the resulting “unrectified” disparity map to the rectified image space of image pair setup 1, labeled as “4. rectify” in Figure 5. We obtain the difference between this resulting disparity map and the disparity map obtained from the previous disparity-estimation step of image pair setup 1, labeled as “5. difference” in Figure 5. The difference between these two disparity maps is then warped back to the original image space, labeled as “6. unrectify”, and reclassified, labeled as “7. reclassify” in Figure 5, to the output occlusion mask, using
$$m = \begin{cases} 1 & \text{if } d < \epsilon \\ 0 & \text{if } d \geq \epsilon \end{cases} \tag{10}$$
where $d$ is the pixel value in the difference of the disparity maps, i.e., the output of “5. difference” in Figure 5, $m$ is the pixel value in the reclassified raster, i.e., the output occlusion mask, and $\epsilon$ is a hyperparameter called the disparity difference threshold. The hyperparameter $\epsilon$ should be a positive value limiting the difference, in terms of pixels, between the left–right and right–left disparity maps. It can be greater than one, i.e., more than a pixel difference, or less than one, i.e., a sub-pixel difference. Pixels whose disparity difference reaches the threshold receive $m = 0$, so decreasing $\epsilon$ flags more points as occluded, and increasing it makes the filtering less restrictive. The two output occlusion masks from the footprint overlap estimation and the disparity consistency check are combined to obtain the final occlusion mask used to exclude occluded pixels in the triangulation step discussed in Section 2.1.4.
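The reclassification and the combination of the two masks can be sketched as follows. The example assumes that the two disparity maps have already been warped into the same image space, that the difference is taken as an absolute value, and that the masks are combined with a logical AND; these choices and the function names are assumptions for illustration.

```python
def consistency_mask(disp_a, disp_b_warped, eps):
    """1 where the two disparity maps agree to within eps pixels, else 0.

    disp_a, disp_b_warped: 2D lists of equal shape, already aligned.
    The absolute difference is an assumption of this sketch.
    """
    return [
        [1 if abs(a - b) < eps else 0 for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(disp_a, disp_b_warped)
    ]


def combine_masks(mask_a, mask_b):
    """Keep a pixel only if both masks keep it (logical AND, assumed)."""
    return [
        [x & y for x, y in zip(row_a, row_b)]
        for row_a, row_b in zip(mask_a, mask_b)
    ]


# Hypothetical 1 x 4 disparity maps and footprint mask.
d1 = [[20.0, 21.0, 35.0, 22.0]]
d2 = [[20.5, 29.5, 34.0, 22.0]]
footprint = [[1, 1, 1, 0]]
final = combine_masks(consistency_mask(d1, d2, 8.0), footprint)
```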

2.2. Data and Experiments

2.2.1. Datasets

We evaluated the proposed UnDER framework using three datasets: (i) the UseGeo dataset [47], (ii) the UAV-Nunspeet dataset [33], and (iii) a subset of the UAV-Zeche-Zollern dataset [48]. The UseGeo dataset consists of ultra-high-resolution nadir-looking UAV images of an urban scene in Southern Italy. It comprises three subsets, of which we used Dataset-1 for our experiments, to have some outputs comparable to previously published results [32]. Dataset-1 of UseGeo includes undistorted images with an average ground sampling distance (GSD) of about 1.7 cm. The dataset also provides undistorted images resampled at a 1/4 resolution with corresponding reference depth maps derived from a reference LiDAR point cloud. For this work, we used the undistorted images with the original resolution, the corresponding camera interior and exterior orientation parameters, the reference LiDAR point cloud, and an additional point cloud derived using Pix4D [49], all provided in the dataset, for evaluating our proposed method. The dataset consists of 224 undistorted UAV images with dimensions 7953 × 5279. Figure 6 shows an overview of Dataset-1 of UseGeo. We used UseGeo for the majority of our experiments, and we used the additional two datasets specifically for unsupervised fine-tuning setup comparison and computational time assessment.
The UAV-Nunspeet dataset consists of ultra-high-resolution nadir-looking UAV images of a town in the central Netherlands. There are a total of 312 undistorted UAV images with dimensions 4032 × 3024 having an average ground sampling distance of about 1.7 cm. Unlike the UseGeo dataset, the UAV-Nunspeet dataset lacks a high-resolution LiDAR point cloud that can be used for accuracy assessment. Therefore, we only used a point cloud derived using Pix4D. Figure 7 shows an overview of the UAV-Nunspeet dataset.
The subset of the UAV-Zeche-Zollern dataset used in this work consists of ultra-high-resolution nadir-looking UAV images of an area in West Germany. This is a much smaller dataset containing only 35 undistorted UAV images with dimensions 4592 × 3448, having an average ground sampling distance of 2.4 cm. Similar to the UAV-Nunspeet dataset, we used a point cloud derived from Pix4D as reference. Figure 8 shows the subset of the UAV-Zeche-Zollern dataset used in our experiments. In Table 1, we summarize the properties of the images in the three datasets described above.

2.2.2. Method Setups

We mainly compared three different setups of our proposed UnDER framework. The first setup was as described in Section 2.1, using the network in [34] pretrained on the Scene Flow and KITTI datasets. The second setup used the same network fine-tuned in an unsupervised manner, using the UAV-Nunspeet dataset and the training settings described in [33]: $\lambda_{as}$, $\lambda_{ac}$, $\lambda_s$, and $\lambda_a$ were set to 1, 1, 0.5, and 0.5, and the initial learning rate was set to $2 \times 10^{-4}$ for 15 epochs, decreasing to $2 \times 10^{-5}$ for 5 more epochs. The third setup used the same network fine-tuned in an unsupervised manner on the UseGeo dataset. The UAV-Nunspeet dataset is similar, in terms of the image-acquisition setup, to the UseGeo dataset, with the images having slightly smaller dimensions, but they were acquired from a totally different location, in a different country with substantial terrain differences. However, unlike UseGeo, UAV-Nunspeet lacks a high-resolution LiDAR point cloud that can be used as a reference for more reliable performance assessment. For notational purposes, we called the first, second, and third setups UnDER-P, UnDER-FN, and UnDER-FU, respectively. For the unsupervised fine-tuning of UnDER-FU, we prepared the training images in the same manner as described in [33], resulting in 13,912 image pairs with dimensions 960 × 540, taken by subsetting the rectified base and side images. We conducted the unsupervised fine-tuning to examine whether deep learning-based methods are comparable with traditional methods that are also optimized in an unsupervised manner for practical-use cases. This differed from most previous studies, which trained the disparity estimation network in a supervised manner, using dense 3D reference data.
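The fine-tuning schedule can be written as a small helper. The sketch reads the two learning rates as 2e-4 and 2e-5 and assumes a single step change after epoch 15 with zero-based epoch indices; the function name is hypothetical and the actual training script may implement the schedule differently.

```python
def learning_rate(epoch, base_lr=2e-4, reduced_lr=2e-5,
                  base_epochs=15, total_epochs=20):
    """Step schedule: base_lr for the first 15 epochs, reduced_lr for 5 more.

    epoch is zero-based; a single step change is an assumption.
    """
    if not 0 <= epoch < total_epochs:
        raise ValueError("epoch outside the fine-tuning schedule")
    return base_lr if epoch < base_epochs else reduced_lr


# Loss weights reported for fine-tuning, kept together for reference.
LOSS_WEIGHTS = {"lambda_as": 1.0, "lambda_ac": 1.0,
                "lambda_s": 0.5, "lambda_a": 0.5}
```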

2.2.3. Ablations

We performed ablation experiments that varied selected configurations in the UnDER framework. For these experiments, we used the UnDER-P setup. We analyzed how the accuracy of the output point cloud is affected by the disparity shift ratio $\delta$, by the disparity difference threshold $\epsilon$, and by using a single stereo pair versus a multi-stereo pair with two side images. For these ablation experiments, we randomly selected 8 of the 224 images of the UseGeo dataset to be used as the base images for the image pair(s).

2.2.4. Accuracy Assessment

The accuracy of the resulting point cloud was assessed using the mean and standard deviation of the cloud-to-cloud (C2C) distance with respect to a high-resolution reference LiDAR point cloud provided in the UseGeo dataset. The reference point cloud had an average point density of about 50 points per square meter. The resulting dense point clouds were produced following the UnDER framework by using all the available 224 images in Dataset-1 of UseGeo as base images for an image pair. The corresponding side images were chosen to be the images with the highest overlaps with the base image. In the case of a multi-stereo pair, the top two images with the highest overlaps were used as corresponding side images, and the mean coordinates of a triangulated point from the same base image pixel were used in the output dense point cloud. For the ablation experiments, the results from each base image were used to visualize the trend of varying a specific configuration. For benchmarking against previously available results, the point clouds from all the base images were merged together and assessed against the whole reference LiDAR point cloud. A subset of the results was also visualized for qualitative performance analysis. The improvement in accuracy by applying an off-the-shelf filtering method named FPCfilter, used by OpenDroneMap [50], on top of the three UnDER setups was included.
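The accuracy measure can be illustrated with a brute-force sketch of the cloud-to-cloud distance: for each extracted point, the distance to its nearest neighbor in the reference cloud, summarized by mean and standard deviation. This is a conceptual illustration only; CloudCompare uses spatial indexing and offers further options (such as local surface models), and the use of the population standard deviation here is an assumption.

```python
import math
import statistics


def c2c_distances(cloud, reference):
    """Nearest-neighbor distance from each point in cloud to reference.

    Brute force, O(len(cloud) * len(reference)); suitable for tiny
    examples only.
    """
    return [min(math.dist(p, q) for q in reference) for p in cloud]


def c2c_summary(cloud, reference):
    """Mean and (population) standard deviation of the C2C distances."""
    d = c2c_distances(cloud, reference)
    return statistics.fmean(d), statistics.pstdev(d)


# Hypothetical clouds with coordinates in meters.
extracted = [(0.0, 0.0, 0.0), (5.0, 0.0, 3.0)]
lidar = [(0.0, 0.0, 1.0), (5.0, 0.0, 0.0)]
mean_c2c, std_c2c = c2c_summary(extracted, lidar)
```

Dividing such a mean distance by the ground sampling distance expresses the error in units of GSD, which is how the benchmarking results are reported.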

2.2.5. Implementation

We used the network implemented in PyTorch by [34] for the disparity estimation. The UnDER framework was fully written in Python, utilizing a number of freely available packages. We also used a PostgreSQL database to organize and keep track of the progress when running UnDER. Prototyping of the UnDER framework, unsupervised fine-tuning of the disparity estimation network, and testing of UnDER to produce the point clouds for the 224 base images of UseGeo Dataset-1 were run on multiple machines equipped with the following GPUs: NVIDIA Jetson AGX (only used for prototyping), NVIDIA GeForce RTX 2080 Ti, NVIDIA Titan XP, and NVIDIA A40 (NVIDIA Corporation, Santa Clara, CA, USA). Running the whole framework for all 224 base images took between 1 and 3 days, depending on the setup and GPU used. CloudCompare software was used to calculate the C2C distances. For the benchmarking results, the merged output point cloud from all 224 images was further split into subsets to facilitate the accuracy assessment, as the merged result was too large to be loaded and analyzed fully in memory.
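The text does not state how the statistics of the individual subsets were combined into one overall figure. One standard possibility, shown here purely as an assumption, is to pool the per-subset point counts, means, and population standard deviations; the function name `pooled_mean_std` and the toy numbers are hypothetical.

```python
import math


def pooled_mean_std(subsets):
    """Combine (count, mean, population std) triples into overall statistics."""
    subsets = list(subsets)
    total_n = sum(n for n, _, _ in subsets)
    mean = sum(n * m for n, m, _ in subsets) / total_n
    # E[x^2] of each subset is std^2 + mean^2; average it with count weights.
    second_moment = sum(n * (s * s + m * m) for n, m, s in subsets) / total_n
    variance = max(second_moment - mean * mean, 0.0)  # guard against round-off
    return total_n, mean, math.sqrt(variance)


# Toy check: the values 1..7 split into the subsets {1, 2, 3} and {4, 5, 6, 7}.
subset_a = (3, 2.0, math.sqrt(2.0 / 3.0))
subset_b = (4, 5.5, math.sqrt(1.25))
n, mu, sigma = pooled_mean_std([subset_a, subset_b])
```

Pooling in this way gives the same mean and standard deviation as computing them over the full merged cloud, without requiring all distances to be held in memory at once.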

3. Results

3.1. Ablation Experiments

3.1.1. Disparity Shift Ratio

Figure 9 shows how the disparity shift ratio ( δ ) influenced the accuracy of the resulting extracted dense point cloud. The first solid curve, δ = 0.0, corresponds to the setup without applying any disparity shifting; and the following solid curves correspond to the setups with increasing δ : 0.25, 0.5, 0.7, and 0.9. The horizontal axis shows the IDs of the eight base images used in the multi-stereo pairs tested in this experiment. The left vertical axis shows the mean C2C distance—comparing the dense point cloud extracted from the corresponding multi-stereo to the reference LiDAR point cloud. The mean C2C distances were log-transformed, to better visualize the solid curves; otherwise, the values of the first two solid curves, having substantially larger values, would have further obscured the differences between the last three solid curves. For this experiment, the disparity difference threshold ( ϵ ) was set to 8.
For most cases, as we increased δ, the accuracy of the resulting point cloud improved, as shown by the solid curves with higher δ values having lower mean C2C distances. The first two solid curves show substantially larger errors than the setups using higher δ values; these correspond to the absence of disparity shifting and to a δ value smaller than that necessary to shift the stereo pairs into a disparity range closer to that of the dataset used to pretrain the disparity estimation network. In Figure 9, the differences between the last three solid curves appear relatively small because of the large range of mean C2C distances included in the plot. However, we calculated the mean absolute differences between the solid curves of δ = 0.5 and 0.7 and between the solid curves of δ = 0.7 and 0.9 to be 0.037 m and 0.031 m, respectively. Both values amount to about two times the average GSD of the UseGeo dataset, which adds up to a substantial improvement of about four GSD when moving from δ = 0.5 to δ = 0.9. Thus, for the overall assessment of the proposed methods, we set the δ value to 0.9.
We also plotted the mean baseline length of the image pairs used in each multi-stereo, shown by the dashed curve in Figure 9, with the curve’s range given by the right vertical axis. Notably, the first two solid curves, δ = 0.0 and 0.25, follow the same trend as the mean baseline length curve—indicating substantial errors for wider-baseline image pairs in the absence or lack of disparity shifting. Such large baseline variation is common in most UAV image acquisition for mapping purposes, and, thus, further underscores the importance of using disparity shifting when using disparity estimation networks without supervised fine-tuning for these applications.

3.1.2. Disparity Difference Threshold

Figure 10 shows how the disparity difference threshold (ϵ) influenced the accuracy of the resulting extracted dense point cloud. The plotted curves correspond to setups with decreasing ϵ: 8.0, 4.0, 2.0, 1.0, and 0.75. The horizontal and vertical axes are the same as those described for Figure 9. For this experiment, δ was set to 0.9; thus, the second curve in Figure 10 (ϵ = 8.0) is the same as the last solid curve plotted in Figure 9.
For most cases, as we decreased ϵ, the accuracy of the resulting point cloud improved, as shown by the curves with lower ϵ values having lower mean C2C distances. A substantial difference can be observed between the absence of masking via ϵ and the use of even an eight-pixel disparity difference threshold. In the zoomed-in portion of the plot, we can see similar improvements each time we halved ϵ, from 8.0 down to 1.0. These improvements ranged from 0.006 m to 0.008 m, in terms of the mean absolute difference between a pair of curves, which is roughly 1/3 to 1/2 of the average GSD of the UseGeo dataset, and they add up to a 1.2 GSD improvement when changing ϵ from 8.0 to 1.0. When the threshold was decreased further, below one pixel (ϵ = 0.75), the observed average improvement dropped substantially to 0.003 m; thus, we fixed the ϵ value to 0.75 for the overall assessment of the proposed methods.
Most of the masked pixels consisted of regions only visible in one image of the stereo pair and occluded in the other. UnDER filtered about 53% of the points when applying ϵ = 0.75, compared to applying no occlusion filter. The results from this experiment emphasize the importance of masking occluded pixels, especially in ultra-high-resolution scenes with considerable variation in the types and depths of objects present, factors that further accentuate the presence of occlusion.
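The role of the threshold ϵ can be illustrated with a generic left-right disparity consistency check. This is a textbook-style sketch under stated assumptions, not the exact procedure of Section 2.1: it assumes rectified images, the convention that a left pixel at column x matches column x − d in the right image, and that pixels mapping outside the image are masked. The function name `lr_consistency_mask` and the toy disparity maps are hypothetical.

```python
import numpy as np


def lr_consistency_mask(disp_left, disp_right, eps=0.75):
    """Return True where the left and right disparities agree within eps pixels."""
    disp_left = np.asarray(disp_left, dtype=float)
    disp_right = np.asarray(disp_right, dtype=float)
    height, width = disp_left.shape
    rows, cols = np.indices((height, width))
    # Assumed convention: left column x corresponds to right column x - d.
    cols_right = np.rint(cols - disp_left).astype(int)
    inside = (cols_right >= 0) & (cols_right < width)
    cols_right = np.clip(cols_right, 0, width - 1)
    diff = np.abs(disp_left - disp_right[rows, cols_right])
    # Keep a pixel only if its match lies inside the image and is consistent.
    return inside & (diff <= eps)


# Toy one-row example with a constant disparity of 2 pixels.
disp_left = np.full((1, 6), 2.0)
disp_right = np.full((1, 6), 2.0)
disp_right[0, 1] = 5.0  # inconsistent estimate, e.g., in an occluded region
mask = lr_consistency_mask(disp_left, disp_right)
kept_fraction = mask.mean()
```

Lowering `eps` makes the check stricter, so more pixels are masked; this mirrors the trade-off between accuracy and point count discussed above.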

3.1.3. Multi-Stereo

Figure 11 compares the setup of using multi-stereo image pairs with a setup only using single-stereo image pairs in the triangulation step of the point cloud extraction routine. For this experiment, the δ value was set to 0.9, and the ϵ value was set to 0.75. Just by looking at the two solid curves, the better setup appears to be inconsistent across the base images included in this experiment. However, after adding the mean κ difference plot, shown by the dashed curve, a pattern emerged, i.e., for image pairs with higher κ deviation, using a single-stereo setup appeared to be more beneficial, while for image pairs with close-to-zero mean κ deviation, the multi-stereo setup was consistently better.
The disparity estimation network was pretrained with images captured by a fixed stereo camera setup, where the mean κ difference can be assumed to be zero. Thus, it did not seem to generalize well to image pairs with substantial differences in camera rotation angles, specifically the κ angle, which measures the rotation about the z-axis. In the UseGeo dataset and most UAV datasets captured for mapping purposes, the majority of the image pairs are oriented along the same direction, except for images captured when the UAV is making its turn to the next flight line. Thus, for the point cloud extraction routine, we proceeded with using the multi-stereo setup.

3.2. Performance Comparison

Table 2 compares the different setups of our point cloud extraction routine with the dense image matching result provided in the UseGeo dataset [47] (UseGeo DIM) and the results from previous work [32] on the same dataset (MSP and Re-trained MVSFormer). The comparison uses the mean μ and standard deviation σ of the C2C distances between each dense point cloud extracted from Dataset-1 of UseGeo and the reference LiDAR point cloud. The first six entries are results from our own point cloud extraction routine and the last three entries are from previously available results. The total number of densely extracted points n was also included, when available, for comparison.
With proper selection of the hyperparameters in our point cloud extraction routine, such as δ and ϵ, a disparity estimation network pretrained on a dataset with an entirely different image-acquisition setup provided accuracy results comparable to the previously available results. The results from UnDER-P lagged behind the point cloud included in the dataset (UseGeo DIM), which was extracted using Pix4D [49], by only 4.4 mm, about 1/4 of the GSD, and behind the hierarchical SGM-based method (MSP) [51] by only 1.3 mm, less than 1/10 of the GSD. Interestingly, the results from UnDER-P outperformed those of a multi-view stereo network based on vision transformers [52] retrained on ground truth depth maps provided in the UseGeo dataset (Re-trained MVSFormer) by 45.8 mm in terms of μ, approximately a 2.6 GSD improvement.
Performing the unsupervised fine-tuning on UAV images with an image-acquisition setup (UnDER-FN) similar to the dataset being evaluated and executing the same fine-tuning in the exact dataset being evaluated (UnDER-FU) improved the accuracy of the resulting point cloud over using the pretrained model, by at least 12.8 mm, which is about 0.7 GSD, in terms of μ , and by at least 15.2 mm, which is about 0.9 GSD, in terms of σ . However, there seemed to be a lack of any substantial improvement between unsupervised fine-tuning on the exact dataset being evaluated and unsupervised fine-tuning on a similar UAV image dataset. Practically, this result can be viewed positively, i.e., it is unnecessary to repeat the same unsupervised fine-tuning step when applying our point cloud extraction routine based on a disparity estimation network already fine-tuned on another UAV dataset with a similar image-acquisition setup.
As a considerable number of redundant points were produced in our point cloud extraction routine, we further improved the setup by including an off-the-shelf filtering method named FPCfilter, used by OpenDroneMap [50]. This filtering step enhanced the resulting accuracy of the extracted point clouds by at least 17.8 mm, which is about one GSD, in terms of μ , and by at least 33.4 mm, which is about 1.9 GSD, in terms of σ ; and the resulting total number of points n was reduced to only about half of the unfiltered one—which was still substantially dense and redundant.
The lowest error obtained by a setup of our point cloud extraction routine, UnDER-FN+FPCfilter, was 54.0 mm, which was about 3 times the average GSD, in terms of μ , and 45.4 mm, which was about 2.6 GSD, in terms of σ . This result was a total improvement of about 1.8 GSD over using a pretrained model for the disparity estimation network in our own point cloud extraction routine, at least a 1.5 GSD improvement over the dense-image-matching-derived point cloud provided in the dataset, a 1.7 GSD improvement over an SGM-based approach, and a 4.4 GSD improvement over a multi-view stereo network based on vision transformers retrained on the ground truth depth maps provided in the dataset.
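The conversions between millimeters and GSD multiples quoted in this section can be checked with a short calculation. The average GSD is not stated in this section, so the value of 17.6 mm used below is an assumption back-derived from the statement that 45.8 mm is approximately 2.6 GSD; the names `ASSUMED_GSD_MM` and `to_gsd` are hypothetical.

```python
# Assumption: average GSD back-derived from "45.8 mm is approximately 2.6 GSD".
ASSUMED_GSD_MM = 17.6


def to_gsd(value_mm: float, gsd_mm: float = ASSUMED_GSD_MM) -> float:
    """Express a distance in millimeters as a multiple of the GSD."""
    return value_mm / gsd_mm


# Differences and errors quoted in the text, in millimeters.
reported_mm = {
    "UnDER-P vs UseGeo DIM (mean)": 4.4,
    "UnDER-P vs MSP (mean)": 1.3,
    "UnDER-P vs Re-trained MVSFormer (mean)": 45.8,
    "UnDER-FN+FPCfilter (mean)": 54.0,
    "UnDER-FN+FPCfilter (std)": 45.4,
}

for label, value_mm in reported_mm.items():
    print(f"{label}: {value_mm:.1f} mm = {to_gsd(value_mm):.2f} GSD")
```

Under this assumed GSD, the quoted millimeter values reproduce the reported multiples (about 1/4, less than 1/10, 2.6, 3, and 2.6 GSD) to within rounding.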
Figure 12 shows the error maps derived from comparing the resulting point clouds of the UnDER-P and UnDER-FN+FPCfilter methods with the reference LiDAR point cloud. There is no apparent systematic spatial distribution of the C2C distance values. Comparing the two setups, a clear improvement in the mean C2C distances can be seen when using a disparity estimation network fine-tuned on another UAV dataset with similar image acquisition characteristics, together with an off-the-shelf filtering method, rather than a pretrained model. Noticeable improvements can be seen in objects with relatively flat surfaces, like roads and buildings. The point cloud extraction routine clearly struggled to reconstruct points within taller vegetation, where many points were excluded during the occlusion-masking step; this is to be expected when comparing point clouds from LiDAR and photogrammetry.
Figure 13 shows the distributions of the mean C2C distance values of UseGeo DIM, UnDER-P, and UnDER-FN+FPCfilter. The distribution of the mean C2C distance values of UnDER-FN+FPCfilter had a narrower width and a peak closer to zero compared to both UnDER-P and UseGeo DIM—further validating the quantitative performance comparison in Table 2. The majority of the values were below 0.1 m, and all the distributions appeared to be unimodal and positively skewed, as expected of an unsigned error measure.

3.3. Performance on Additional Datasets

Table 3 shows the accuracy metrics assessing UnDER-P+FPCfilter on the two additional datasets: UAV-Nunspeet and the subset of UAV-Zeche-Zollern. Unlike the assessment done in [33], where only two stereo image pairs were used, all the available 312 images of UAV-Nunspeet were used to produce the resulting point cloud. The resulting point cloud from UnDER-P+FPCfilter on the UAV-Nunspeet dataset (first row) differed from the reference point cloud by 58.5 mm in terms of μ and by 64.7 mm in terms of σ, differences that were roughly equivalent to 3.4 GSD and 3.8 GSD, respectively. This was a considerable improvement over the average μ and σ of 4.4 GSD and 44.2 GSD reported for the two image pairs evaluated in our previous study [33]. This result supports the importance of the disparity-shifting and occlusion-masking techniques proposed in this study.
The same table also shows the quantitative results on the subset of the UAV-Zeche-Zollern dataset used in this study. The resulting point cloud from UnDER-P+FPCfilter differed by 101 mm and 130 mm in terms of μ and σ, respectively, differences that were approximately equivalent to 4.2 and 5.4 times the average GSD and thus less accurate than the UAV-Nunspeet results. This may have been due to the relatively higher proportion of objects that were more difficult to reconstruct, such as vegetation and the two tower-like structures in the images.
We also assessed the computational time of our proposed routine by comparing the total running time of UnDER-P+FPCfilter with that of Pix4D. For practical purposes, we used the much smaller dataset, i.e., the subset of the UAV-Zeche-Zollern dataset, for this comparison. To make the comparison as fair as possible, we configured the second processing step in Pix4D (dense image matching) to use the original resolution of the images (the default option was 1/2 of the original resolution) and high point density (the default value was optimal), and to exclude mesh generation (by default, a mesh is generated from the point cloud). Table 4 shows the results of this comparison on the 35 images of this dataset. As expected, the commercial software was considerably faster, by about five times, than the prototype software we used to implement UnDER. Considering that optimization of the code was outside the scope of this work, a running time of the same order of magnitude as that of commercial software like Pix4D is satisfactory for academic and exploratory purposes.

4. Discussion

Our results from the ablation experiments show the effect of varying selected configurations in the UnDER setup. The experiment in varying the disparity shift ratio δ showed the importance of applying disparity shifting when using a stereo matching network pretrained on an entirely different typology of image data. Considerably inaccurate results were obtained in the absence of disparity shifting, and this was more apparent for image pairs with substantially longer baselines. Using a ratio closer to 1 consistently produced improved results. Using a value greater than 1 would have risked producing negative disparity values, whereas the stereo matching network was originally designed to produce only positive disparity estimates. The only downside of using disparity shifting is the need to define a basis depth that is within or close to the range of depth values expected in the object scene. However, in practice, such information is generally available in this type of UAV image acquisition. If it is not available, the basis depth can be derived from freely available information, such as elevation information from Google Earth and the UAV flying height.
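The sign argument above (a ratio greater than 1 risks negative disparities) can be illustrated with the pinhole stereo relation d = f·B/Z. The sketch below assumes that the shift equals δ times the disparity of the basis depth; this is a simplifying assumption for illustration and may differ from the definition given in Section 2.1. The function names and all numeric values are hypothetical.

```python
def disparity_px(focal_px: float, baseline_m: float, depth_m: float) -> float:
    """Disparity in pixels of a point at the given depth (rectified pinhole pair)."""
    return focal_px * baseline_m / depth_m


def shifted_disparity_px(focal_px, baseline_m, depth_m, basis_depth_m, delta):
    """Disparity after subtracting an assumed shift of delta * basis disparity."""
    shift = delta * disparity_px(focal_px, baseline_m, basis_depth_m)
    return disparity_px(focal_px, baseline_m, depth_m) - shift


# Hypothetical acquisition: 4000 px focal length, 20 m baseline, 80 m basis depth.
FOCAL_PX, BASELINE_M, BASIS_DEPTH_M = 4000.0, 20.0, 80.0

unshifted = disparity_px(FOCAL_PX, BASELINE_M, BASIS_DEPTH_M)
at_basis_09 = shifted_disparity_px(FOCAL_PX, BASELINE_M, 80.0, BASIS_DEPTH_M, 0.9)
at_basis_11 = shifted_disparity_px(FOCAL_PX, BASELINE_M, 80.0, BASIS_DEPTH_M, 1.1)
rooftop_09 = shifted_disparity_px(FOCAL_PX, BASELINE_M, 70.0, BASIS_DEPTH_M, 0.9)
```

In this toy setting, a point at the basis depth keeps a small positive disparity with δ = 0.9 but becomes negative with δ = 1.1, while a longer baseline inflates the unshifted disparity in proportion, which is consistent with the baseline trend noted for Figure 9.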
The experiment in varying the disparity difference threshold ϵ showed the importance of applying a left–right disparity consistency check to mask occluded pixels. Considerably inaccurate results were obtained in the absence of this occlusion mask; opting to use the validity mask derived from parallax attention maps proposed by [34] in the stereo matching network used in the disparity estimation step did not provide reasonably good results either. The disparity consistency check depends on the choice of a basis depth just like δ , which, as previously discussed, is generally available or can be derived from the information on the UAV image-acquisition setup. A more substantial downside of masking using the disparity consistency check is the possible removal of non-occluded regions where the disparity estimation network fails to provide consistent disparity values. We tried to mitigate this by merging dense point clouds derived from using all images as a base image; however, there still seemed to be problematic regions with missing reconstruction, such as textureless areas, as shown in Figure 12. Future extensions of this work will include an approach to filling these gaps in the reconstruction by, for example, utilizing semantic information to guide data-imputation methods or using diffusion models to generate data within these gaps.
The experiment comparing single- and multi-stereo setups confirmed the importance of leveraging multi-view geometry in the dense 3D reconstruction task, specifically for image pairs with a lower mean κ angle difference. The disadvantage of using a multi-stereo setup is the additional computational time required to process the additional image pair. This was alleviated in our implementation by storing intermediate results from previously processed image pairs; specifically, half of the disparity estimation performed for the left–right disparity consistency check could be reused in the image pair of the consecutive base image. The setup could be further improved by using the mean κ difference as a threshold to choose between a single- and a multi-stereo setup, as performing multi-stereo when the mean κ difference is substantially large can further degrade the results, as shown in Figure 11 for base image IDs "17-14", "17-16", and "17-18".
Our results from the benchmarking experiments showed the accuracy improvement gained by performing unsupervised fine-tuning of the disparity estimation network. They also showed that unsupervised fine-tuning on a UAV dataset with a similar image-acquisition setup produces results similar to performing the same unsupervised fine-tuning on the target dataset. The unsupervised fine-tuning step requires few resources. There is no need for the impractical preparation of reference dense pixel correspondences for a newly acquired UAV image dataset. The training images can be easily prepared via rectification, disparity shifting, and image subsetting, which can be done in a couple of hours for the 224 images of the UseGeo dataset, and the actual unsupervised fine-tuning of the network takes an additional couple of hours. This highlights the main advantage of using UnDER for 3D reconstruction on very-to-ultra-high-resolution nadir-looking remotely sensed images over other previously published 3D reconstruction methods, which rely heavily on a dense 3D reference dataset to retrain the reconstruction method in a supervised manner.
The experiments on the two additional datasets further showed that UnDER produced results comparable to those of Pix4D, whether the point cloud produced by Pix4D was used as the reference, as done for UAV-Nunspeet and UAV-Zeche-Zollern, or the results from both UnDER and Pix4D were assessed against a reference LiDAR point cloud, as done for the UseGeo dataset. Our experiments on the computational time required to run the whole routine also revealed that it was within the same order of magnitude of running time (about 5 times slower) as industry-standard commercial software like Pix4D.
In addition to the disadvantages discussed above (identification of the basis depth, possible masking of non-occluded regions, the additional computational burden of the multi-stereo setup, and slower processing compared to industry-grade software), another drawback of the currently proposed UnDER framework is the redundancy in the resulting merged point cloud, as no view-filtering step is performed, i.e., all 224 images of the UseGeo dataset are used as base images. This is mitigated by using an off-the-shelf filtering method called FPCfilter; however, there still seems to be substantial redundancy in the output points, with multiple reconstructions for corresponding pixels in different base images, as shown by the point count in Table 2, alongside the previously discussed gaps in reconstruction. Additionally, running the routine for a couple of hundred images still takes a considerable amount of processing time, on the order of a couple of days. Future extensions of this work will include a dedicated procedure to address this point redundancy, e.g., a smarter multi-view selection or filtering approach, and a modification of the stereo matching network in the disparity-estimation step to improve the computational speed of UnDER.
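The suggested improvement of switching between single- and multi-stereo based on the mean κ difference could look like the following sketch. It is not part of the implemented routine: the function names are hypothetical, and the 5-degree threshold is an arbitrary placeholder that would need to be tuned on data such as that shown in Figure 11.

```python
def angle_diff_deg(a: float, b: float) -> float:
    """Smallest absolute difference between two angles in degrees (0 to 180)."""
    diff = (a - b) % 360.0
    return min(diff, 360.0 - diff)


def choose_stereo_mode(kappa_base, kappa_sides, max_mean_diff_deg=5.0):
    """Pick 'multi' when the side images are rotated similarly to the base image.

    kappa_base: kappa angle of the base image in degrees.
    kappa_sides: kappa angles of the candidate side images in degrees.
    max_mean_diff_deg: hypothetical threshold on the mean kappa difference.
    """
    diffs = [angle_diff_deg(kappa_base, kappa) for kappa in kappa_sides]
    mean_diff = sum(diffs) / len(diffs)
    mode = "multi" if mean_diff <= max_mean_diff_deg else "single"
    return mode, mean_diff


# Same flight line: nearly identical kappa angles (hypothetical values).
mode_line, diff_line = choose_stereo_mode(10.0, [11.0, 9.0])
# UAV turning to the next flight line: one side image is rotated by 90 degrees.
mode_turn, diff_turn = choose_stereo_mode(0.0, [90.0, 0.0])
```

The wrap-around handling in `angle_diff_deg` matters because κ values near 0 and 360 degrees describe almost the same orientation.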

5. Conclusions

In this work, we have demonstrated that the accuracy of dense 3D point clouds extracted from ultra-high-resolution UAV images can be improved via unsupervised fine-tuning of a disparity-estimation network pretrained on an entirely different typology of image data. Our experimental results show the importance of the proposed disparity-shifting and disparity-consistency-check procedures for the accuracy of the results, as well as cases where using a single- or multi-stereo image pair in the proposed UnDER framework can be favorable. The lowest error obtained by a setup of our point cloud extraction routine on the UseGeo Dataset-1, UnDER-FN+FPCfilter, was 54.0 mm, which is about 3 times the average GSD, in terms of the mean C2C distance μ. This result is a total improvement of about 1.8 GSD over only using the pretrained model, without unsupervised fine-tuning, for the disparity estimation network in our own point cloud extraction routine; and it is at least a 1.5 GSD improvement over the dense-image-matching-derived point cloud provided in the dataset. Future developments of this work will include a gap-filling mechanism to complement the occlusion-masking procedure, a dedicated method for adequately filtering redundancy in the merged point cloud, possibly including a view-filtering/selection step, and integration of semantic information, for instance, by leveraging vision foundation models in the point cloud extraction routine.

Author Contributions

Conceptualization, J.R.B.; methodology, J.R.B.; software, J.R.B.; validation, J.R.B.; formal analysis, J.R.B.; investigation, J.R.B.; resources, J.R.B. and F.N.; data curation, J.R.B. and F.N.; writing—original draft preparation, J.R.B.; writing—review and editing, J.R.B. and F.N.; visualization, J.R.B. All authors have read and agreed to the published version of the manuscript.


Funding

This research received no external funding.

Data Availability Statement

The UseGeo dataset can be found in: (accessed 30 September 2024).

Conflicts of Interest

The authors declare no conflicts of interest.


Abbreviations

The following abbreviations are used in this manuscript:
UAV: unmanned aerial vehicle
GSD: ground sampling distance


  1. Nex, F.; Remondino, F. UAV for 3D mapping applications: A review. Appl. Geomat. 2014, 6, 1–15. [Google Scholar] [CrossRef]
  2. Nex, F.; Armenakis, C.; Cramer, M.; Cucci, D.; Gerke, M.; Honkavaara, E.; Kukko, A.; Persello, C.; Skaloud, J. UAV in the advent of the twenties: Where we stand and what is next. ISPRS J. Photogramm. Remote Sens. 2022, 184, 215–242. [Google Scholar] [CrossRef]
  3. Mancini, F.; Dubbini, M.; Gattelli, M.; Stecchi, F.; Fabbri, S.; Gabbianelli, G. Using unmanned aerial vehicles (UAV) for high-resolution reconstruction of topography: The structure from motion approach on coastal environments. Remote Sens. 2013, 5, 6880–6898. [Google Scholar] [CrossRef]
  4. Meinen, B.U.; Robinson, D.T. Mapping erosion and deposition in an agricultural landscape: Optimization of UAV image acquisition schemes for SfM-MVS. Remote Sens. Environ. 2020, 239, 111666. [Google Scholar] [CrossRef]
  5. Siebert, S.; Teizer, J. Mobile 3D mapping for surveying earthwork projects using an Unmanned Aerial Vehicle (UAV) system. Autom. Constr. 2014, 41, 1–14. [Google Scholar] [CrossRef]
  6. Shahbazi, M.; Sohn, G.; Théau, J.; Ménard, P. UAV-based point cloud generation for open-pit mine modelling. Int. Arch. Photogramm. Remote. Sens. Spat. Inf. Sci. 2015, 40, 313–320. [Google Scholar] [CrossRef]
  7. Fritz, A.; Kattenborn, T.; Koch, B. UAV-based photogrammetric point clouds—Tree stem mapping in open stands in comparison to terrestrial laser scanner point clouds. Int. Arch. Photogramm. Remote. Sens. Spat. Inf. Sci. 2013, 40, 141–146. [Google Scholar] [CrossRef]
  8. Liu, B.; Hao, Y.; Huang, H.; Chen, S.; Li, Z.; Chen, E.; Tian, X.; Ren, M. TSCMDL: Multimodal Deep Learning Framework for Classifying Tree Species Using Fusion of 2-D and 3-D Features. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–11. [Google Scholar] [CrossRef]
  9. Al-Rawabdeh, A.; Moussa, A.; Foroutan, M.; El-Sheimy, N.; Habib, A. Time Series UAV Image-Based Point Clouds for Landslide Progression Evaluation Applications. Sensors 2017, 17, 2378. [Google Scholar] [CrossRef]
  10. Xu, Q.; Li, W.l.; Ju, Y.z.; Dong, X.j.; Peng, D.l. Multitemporal UAV-based photogrammetry for landslide detection and monitoring in a large area: A case study in the Heifangtai terrace in the Loess Plateau of China. J. Mt. Sci. 2020, 17, 1826–1839. [Google Scholar] [CrossRef]
  11. Bendig, J.; Yu, K.; Aasen, H.; Bolten, A.; Bennertz, S.; Broscheit, J.; Gnyp, M.L.; Bareth, G. Combining UAV-based plant height from crop surface models, visible, and near infrared vegetation indices for biomass monitoring in barley. Int. J. Appl. Earth Obs. Geoinf. 2015, 39, 79–87. [Google Scholar] [CrossRef]
  12. Chang, A.; Jung, J.; Maeda, M.M.; Landivar, J. Crop height monitoring with digital imagery from Unmanned Aerial System (UAS). Comput. Electron. Agric. 2017, 141, 232–237. [Google Scholar] [CrossRef]
  13. Chen, B.; Deng, L.; Duan, Y.; Huang, S.; Zhou, J. Building change detection based on 3D reconstruction. In Proceedings of the 2015 IEEE International Conference on Image Processing (ICIP), Quebec City, QC, Canada, 27–30 September 2015; pp. 4126–4130. [Google Scholar] [CrossRef]
  14. Vacca, G.; Furfaro, G.; Dessì, A. The use of the uav images for the building 3D model generation. Int. Arch. Photogramm. Remote. Sens. Spat. Inf. Sci. 2018, 42, 217–223. [Google Scholar] [CrossRef]
  15. Rothermel, M.; Wenzel, K.; Fritsch, D.; Haala, N. SURE: Photogrammetric Surface Reconstruction From Imagery. In Proceedings of the LC3D Workshop, Berlin, Germany, 4–5 December 2012. [Google Scholar]
  16. Furukawa, Y.; Hernández, C. Multi-View Stereo: A Tutorial. Found. Trends Comput. Graph. Vis. 2015, 9, 1–148. [Google Scholar] [CrossRef]
  17. Bergado, J.R.; Persello, C.; Gevaert, C. A deep learning approach to the classification of sub-decimetre resolution aerial images. In Proceedings of the 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Beijing, China, 10–15 July 2016; pp. 1516–1519. [Google Scholar] [CrossRef]
  18. Mboga, N.; Persello, C.; Bergado, J.R.; Stein, A. Detection of Informal Settlements from VHR Images Using Convolutional Neural Networks. Remote Sens. 2017, 9, 1106. [Google Scholar] [CrossRef]
  19. Persello, C.; Tolpekin, V.; Bergado, J.; de By, R. Delineation of agricultural fields in smallholder farms from satellite images using fully convolutional networks and combinatorial grouping. Remote Sens. Environ. 2019, 231, 111253. [Google Scholar] [CrossRef] [PubMed]
  20. Zabih, R.; Woodfill, J. Non-parametric Local Transforms for Computing Visual Correspondence. In Proceedings of the Third European Conference-Volume II on Computer Vision—Volume II, Stockholm, Sweden, 2–6 May 1994; Springer: Berlin/Heidelberg, Germany, 1994. ECCV ’94. pp. 151–158. [Google Scholar]
  21. Hirschmuller, H. Stereo Processing by Semiglobal Matching and Mutual Information. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 30, 328–341. [Google Scholar] [CrossRef] [PubMed]
  22. Zbontar, J.; LeCun, Y. Stereo matching by training a convolutional neural network to compare image patches. J. Mach. Learn. Res. 2016, 17, 1–32. [Google Scholar]
  23. Kendall, A.; Martirosyan, H.; Dasgupta, S.; Henry, P.; Kennedy, R.; Bachrach, A.; Bry, A. End-To-End Learning of Geometry and Context for Deep Stereo Regression. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
  24. Chang, J.R.; Chen, Y.S. Pyramid Stereo Matching Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 5410–5418. [Google Scholar]
  25. Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The KITTI dataset. Int. J. Robot. Res. 2013, 32, 1231–1237. [Google Scholar] [CrossRef]
  26. Scharstein, D.; Hirschmüller, H.; Kitajima, Y.; Krathwohl, G.; Nešić, N.; Wang, X.; Westling, P. High-Resolution Stereo Datasets with Subpixel-Accurate Ground Truth. In Proceedings of the Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; Jiang, X., Hornegger, J., Koch, R., Eds.; Springer: Cham, Switzerland, 2014; pp. 31–42. [Google Scholar]
  27. Mayer, N.; Ilg, E.; Häusser, P.; Fischer, P.; Cremers, D.; Dosovitskiy, A.; Brox, T. A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
  28. Albanwan, H.; Qin, R. A comparative study on deep-learning methods for dense image matching of multi-angle and multi-date remote sensing stereo-images. Photogramm. Rec. 2022, 37, 385–409.
  29. Chebbi, M.A.; Rupnik, E.; Pierrot-Deseilligny, M.; Lopes, P. DeepSim-Nets: Deep Similarity Networks for Stereo Image Matching. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Vancouver, BC, Canada, 17–24 June 2023; pp. 2097–2105.
  30. Gao, J.; Liu, J.; Ji, S. A general deep learning based framework for 3D reconstruction from multi-view stereo satellite images. ISPRS J. Photogramm. Remote Sens. 2023, 195, 446–461.
  31. Liu, J.; Gao, J.; Ji, S.; Zeng, C.; Zhang, S.; Gong, J. Deep learning based multi-view stereo matching and 3D scene reconstruction from oblique aerial images. ISPRS J. Photogramm. Remote Sens. 2023, 204, 42–60.
  32. Nex, F.; Zhang, N.; Remondino, F.; Farella, E.M.; Qin, R.; Zhang, C. Benchmarking the extraction of 3D geometry from UAV images with deep learning methods. Int. Arch. Photogramm. Remote. Sens. Spat. Inf. Sci. 2023, XLVIII-1/W3-2023, 123–130.
  33. Bergado, J.R.; Nex, F. Dense point cloud extraction from UAV imagery using parallax attention. ISPRS Ann. Photogramm. Remote. Sens. Spat. Inf. Sci. 2023, X-1/W1-2023, 1027–1032.
  34. Wang, L.; Guo, Y.; Wang, Y.; Liang, Z.; Lin, Z.; Yang, J.; An, W. Parallax Attention for Unsupervised Stereo Correspondence Learning. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 2108–2125.
  35. He, S.; Zhou, R.; Li, S.; Jiang, S.; Jiang, W. Disparity Estimation of High-Resolution Remote Sensing Images with Dual-Scale Matching Network. Remote Sens. 2021, 13, 5050.
  36. Jung, Y.J.; Sohn, H.; Lee, S.i.; Ro, Y.M. Visual comfort improvement in stereoscopic 3D displays using perceptually plausible assessment metric of visual comfort. IEEE Trans. Consum. Electron. 2014, 60, 1–9.
  37. Forstner, W. Photogrammetric Computer Vision; Springer: Berlin/Heidelberg, Germany, 2016.
  38. Godard, C.; Mac Aodha, O.; Brostow, G.J. Unsupervised monocular depth estimation with left-right consistency. arXiv 2017, arXiv:1609.03677.
  39. Li, A.; Yuan, Z. Occlusion Aware Stereo Matching via Cooperative Unsupervised Learning. In Proceedings of the Computer Vision—ACCV 2018, Perth, Australia, 2–6 December 2018; Jawahar, C., Li, H., Mori, G., Schindler, K., Eds.; Springer: Cham, Switzerland, 2019; pp. 197–213.
  40. Yin, Z.; Shi, J. GeoNet: Unsupervised Learning of Dense Depth, Optical Flow and Camera Pose. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1983–1992.
  41. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495.
  42. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv 2015, arXiv:1502.03167.
  43. Maas, A.L.; Hannun, A.Y.; Ng, A.Y. Rectifier Nonlinearities Improve Neural Network Acoustic Models. In Proceedings of the 30th International Conference on Machine Learning, Atlanta, GA, USA, 16–21 June 2013; Volume 30.
  44. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.U.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30.
  45. Liu, G.; Reda, F.A.; Shih, K.J.; Wang, T.C.; Tao, A.; Catanzaro, B. Image Inpainting for Irregular Holes Using Partial Convolutions. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 17–24 May 2018.
  46. Stachniss, C. Triangulation and Absolute Orientation. 2021. Available online: (accessed on 16 February 2024).
  47. UseGeo—ISPRS Scientific Initiative 2021–2022. Available online: (accessed on 9 February 2024).
  48. Nex, F.; Gerke, M.; Remondino, F.; Przybilla, H.J.; Bäumker, M.; Zurhorst, A. ISPRS benchmark for multi-platform photogrammetry. ISPRS Ann. Photogramm. Remote. Sens. Spat. Inf. Sci. 2015, II-3/W4, 135–142.
  49. Pix4D. Pix4D Software. 2024. Available online: (accessed on 9 February 2024).
  50. OpenDroneMap Authors. ODM–A Command Line Toolkit to Generate Maps, Point Clouds, 3D Models and DEMs from Drone, Balloon or Kite Images. 2024. Available online: (accessed on 9 February 2024).
  51. Qin, R. RPC stereo processor (RSP)–A software package for digital surface model and orthophoto generation from satellite stereo imagery. ISPRS Ann. Photogramm. Remote. Sens. Spat. Inf. Sci. 2016, III-1, 77–82.
  52. Cao, C.; Ren, X.; Fu, Y. MVSFormer: Multi-View Stereo by Learning Robust Image Features and Temperature-based Depth. Transactions of Machine Learning Research. 2023. Available online: (accessed on 9 February 2024).
Figure 1. An overview of the proposed UnDER framework, which consists of three main steps: image rectification, disparity estimation, and triangulation. UnDER accepts the following as input: undistorted UAV image pairs, camera interior and exterior orientation parameters, and a disparity estimation network. As its final output, UnDER produces a dense point cloud corresponding to the overlapping area of the image pairs.
Figure 2. An overview of the parallax attention stereo matching network used in the disparity estimation step of UnDER.
Figure 3. Comparison of self-attention and parallax attention. The similarity of the selected pixel (green) to other pixels (in different colors) is measured in the same feature map (self-attention), or in a feature map extracted from a paired right image (parallax attention).
Figure 4. Reference figure for defining disparity shifting. It shows the image planes of a stereo pair, a basis depth for deriving the disparity shift, the projection centers of the two cameras, the image points of the left principal point in both image planes, the corresponding object point lying on the basis depth, and the disparity of the left principal point.
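As a rough illustration of the quantity shown in Figure 4, the sketch below computes the disparity of a point lying at a chosen basis depth, using the standard rectified-stereo relation d = f · B / Z. This is the textbook relation and not necessarily the exact formulation used in UnDER; the function name and the numeric values (focal length, baseline, basis depth) are hypothetical.

```python
def disparity_at_depth(focal_px: float, baseline_m: float, depth_m: float) -> float:
    """Disparity (in pixels) of a point at depth_m in a rectified stereo pair.

    Assumes both images share the same focal length (in pixels) and that the
    image planes are parallel to the baseline, as after rectification.
    """
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    return focal_px * baseline_m / depth_m


# Hypothetical acquisition values, chosen only for illustration.
shift_px = disparity_at_depth(focal_px=8000.0, baseline_m=20.0, depth_m=80.0)
print(shift_px)
```

Under these assumed values the disparity is on the order of thousands of pixels, which suggests why a shift derived from the acquisition geometry may help a network pretrained on other imagery.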
Figure 5. Reference figure for the disparity consistency check. It shows how the occlusion mask is calculated by comparing the disparity maps obtained when the base image of the image pair is switched. The two images of the pair are captured at two different locations of the camera projection center, and M is the output mask.
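The consistency check in Figure 5 can be sketched in a few lines. The version below is a minimal plain-Python illustration, not the authors' implementation: the function name, the sign convention (a left pixel at column x with disparity d matches the right pixel at column x - d), and the default threshold are assumptions. The threshold `eps` plays the role of the disparity difference threshold ϵ.

```python
from typing import List

DisparityMap = List[List[float]]


def consistency_mask(disp_left: DisparityMap,
                     disp_right: DisparityMap,
                     eps: float = 1.0) -> List[List[bool]]:
    """Return True where the two disparity maps agree within eps pixels.

    disp_left uses the left image as base, disp_right the right image.
    Pixels whose match falls outside the right image are marked False.
    """
    height, width = len(disp_left), len(disp_left[0])
    mask = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            d_left = disp_left[y][x]
            # Assumed convention: the matching column in the right image.
            x_right = int(round(x - d_left))
            if 0 <= x_right < width:
                d_right = disp_right[y][x_right]
                mask[y][x] = abs(d_left - d_right) <= eps
    return mask


left = [[2.0, 1.0, 1.0, 1.0]]
right = [[1.0, 1.0, 4.0, 1.0]]
print(consistency_mask(left, right, eps=1.0))
```

In this toy row, the first pixel is rejected because its match falls outside the right image, and the last pixel is rejected because the two disparities differ by more than `eps`. A production pipeline would vectorize this over whole images.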
Figure 6. Dataset-1 of the UseGeo dataset: full extent of the dataset, a sample undistorted image, and a corresponding subset of the reference LiDAR point cloud (left to right). The area of the sample image is located in the yellow box annotated on the extent of Dataset-1.
Figure 7. The UAV-Nunspeet dataset: full extent of the dataset, a sample undistorted image, and a corresponding subset of the point cloud derived from Pix4D. The area of the sample image is located in the yellow box annotated on the extent of the dataset.
Figure 8. Subset of the UAV-Zeche-Zollern dataset: the extent of the subset and the corresponding reference Pix4D point cloud.
Figure 9. Plot showing the effect of varying the disparity shift ratio (δ) used in the disparity-estimation step of the point cloud extraction routine. Each solid curve corresponds to a different δ value. The horizontal axis shows the base images used in each multi-stereo pair. The left vertical axis shows the natural logarithm (log) of the mean cloud-to-cloud (C2C) distance, comparing the point cloud extracted from each multi-stereo pair with the reference LiDAR point cloud. The dashed curve shows the mean baseline length of the image pairs used in the multi-stereo. The right vertical axis provides the range of values of the mean baseline length.
Figure 10. Plot showing the effect of varying the disparity difference threshold (ϵ) used in the occlusion-masking step of the point cloud extraction routine. Each curve corresponds to a different ϵ value. The horizontal axis shows the base images used in each multi-stereo pair. The vertical axis shows the natural logarithm (log) of the mean cloud-to-cloud (C2C) distance, comparing the point cloud extracted from each multi-stereo pair with the reference LiDAR point cloud. A zoomed-in portion of the graph is included to further highlight the differences in the setups with increasing ϵ.
Figure 11. Plot showing the effect of using a multi-stereo setup compared to a single-stereo setup in the triangulation step of the point cloud extraction routine. The first solid curve corresponds to the single-stereo setup while the second solid curve corresponds to the multi-stereo setup. The horizontal axis shows the base images used in each single-stereo or multi-stereo pair. The left vertical axis shows the natural logarithm (log) of the mean cloud-to-cloud (C2C) distance, comparing the point cloud extracted from each multi-stereo pair with the reference LiDAR point cloud. The dashed curve shows the mean absolute difference in κ values of the images used in each single-stereo and multi-stereo pair. The right vertical axis displays the range of the mean differences in κ angles.
Figure 12. A subset of the UseGeo Dataset-1 showing the UseGeo DIM point cloud and the mean cloud-to-cloud (C2C) distances of UnDER-P and UnDER-FN+FPCfilter (left to right) with respect to the reference LiDAR point cloud. The bottom row shows a zoomed-in portion of the subset from the top row, indicated by the yellow box. All C2C distances greater than 0.1 m are displayed in red, all C2C distances less than 0.02 m are displayed in blue, and everything in between is displayed in a gradient from green to yellow.
Figure 13. Histogram of mean C2C distance values of UseGeo DIM, UnDER-P, and UnDER-FN+FPCfilter. Values beyond 0.5 m were truncated for better visualization.
Table 1. Image details of the three datasets used in our experiments.
Dataset | Location | Average GSD (m) | Dimensions (pixels) | Count
UseGeo | Italy | 0.017 | 7953 × 5279 | 224
UAV-Nunspeet | Netherlands | 0.017 | 4032 × 3024 | 312
UAV-Zeche-Zollern | Germany | 0.024 | 4592 × 3448 | 35
Table 2. Quantitative performance comparison of the different setups of our point cloud extraction routine. Accuracy metrics included the mean μ and standard deviation σ of cloud-to-cloud (C2C) distances, comparing the point clouds extracted from Dataset-1 of UseGeo with the reference LiDAR point cloud. The total number of points n is also included in the table, for comparison.
Method | μ (m) | σ (m) | n
UnDER-P (ours) | 0.0858 | 0.0978 | 2.31 × 10^9
UnDER-FN (ours) | 0.0730 | 0.0799 | 2.62 × 10^9
UnDER-FU (ours) | 0.0728 | 0.0826 | 2.50 × 10^9
UnDER-P+FPCfilter (ours) | 0.0680 | 0.0494 | 1.31 × 10^9
UnDER-FN+FPCfilter (ours) | 0.0540 | 0.0454 | 1.51 × 10^9
UnDER-FU+FPCfilter (ours) | 0.0546 | 0.0492 | 1.45 × 10^9
UseGeo DIM [47] | 0.0814 | 0.0940 | 5.84 × 10^7
MSP [32] | 0.0845 | 0.0805 | -
Re-trained MVSFormer [32] | 0.1316 | 0.1099 | -
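For readers who want to reproduce the μ and σ columns on their own data, the following sketch computes both statistics, assuming C2C is the nearest-neighbour distance from each extracted point to the reference cloud (the usual meaning in point cloud tools; the paper's exact tooling is not restated here). The function names are hypothetical, the population standard deviation is an assumption, and the brute-force search is only practical for small clouds.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def c2c_distances(cloud: Sequence[Point], reference: Sequence[Point]) -> List[float]:
    """Distance from each point in cloud to its nearest point in reference."""
    return [min(math.dist(p, q) for q in reference) for p in cloud]


def c2c_summary(cloud: Sequence[Point], reference: Sequence[Point]) -> Tuple[float, float]:
    """Mean and (population) standard deviation of the C2C distances."""
    distances = c2c_distances(cloud, reference)
    return mean(distances), pstdev(distances)


# Toy clouds with coordinates in metres, for illustration only.
extracted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
lidar = [(0.0, 0.0, 0.1), (1.0, 0.0, 0.3)]
mu, sigma = c2c_summary(extracted, lidar)
print(mu, sigma)
```

For clouds with millions or billions of points, as in Table 2, a spatial index such as a k-d tree would replace the inner `min` loop.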
Table 3. Quantitative performance of UnDER-P+FPCfilter on the two additional datasets, UAV-Nunspeet and the subset of UAV-Zeche-Zollern. Accuracy metrics included the mean μ and standard deviation σ of cloud-to-cloud (C2C) distances, comparing the point clouds extracted from the UAV-Nunspeet dataset and the UAV-Zeche-Zollern subset with the point cloud extracted by Pix4D as reference.
Dataset | μ (m) | σ (m)
Table 4. Running time of UnDER-P+FPCfilter and Pix4D on the subset of the UAV-Zeche-Zollern dataset.
Method | Running Time (min)
Share and Cite

MDPI and ACS Style

Bergado, J.R.; Nex, F. UnDER: Unsupervised Dense Point Cloud Extraction Routine for UAV Imagery Using Deep Learning. Remote Sens. 2025, 17, 24.