1. Introduction
Reforestation is widely recognized as one of the most important tools in combating climate change [1], offering numerous benefits such as carbon sequestration and the restoration of biodiversity for native species. However, legacy tools for quantifying reforestation efforts face significant challenges in terms of accuracy, cost, and scalability, holding back capital investments and preventing reforestation from playing a larger role in regulating our climate. Specifically, traditional field work for forest inventories to obtain tree positions and diameter at breast height (DBH) measurements is labor-intensive and time-consuming [2,3,4,5]. Moreover, field samples today are measured using simple tools like tape measures, calipers, and clinometers, all of which are susceptible to human error and lead to inaccuracies in collected data [6]. Traditional fieldwork, being labor-intensive, is also constrained in its ability to cover large areas. This limitation forces forestry projects to rely on sampling small regions, which increases the likelihood of measurement errors [7].
A recent NASA project, the Global Ecosystem Dynamics Investigation (GEDI), provides a repeated canopy height map for the first time, collecting relative heights and canopy heights at a resolution of 25 m [8]. The Ice, Cloud, and land Elevation Satellite-2 (ICESat-2) collects canopy height measurements with a smaller footprint [9]. To improve the accuracy of spaceborne LiDAR data from GEDI or ICESat-2, several projects have proposed using multi-sensor fusion to enhance aboveground biomass mapping [10], including integration with optical imagery from Sentinel-2 [11]. Additionally, two recent global vegetation height maps are freely available at 10-m and 30-m spatial resolutions, derived from Sentinel-2 [12] and Landsat [13], respectively. However, while these maps are useful for scientific applications, their relatively coarse resolution and large uncertainties limit their value for local forestry applications that require very high accuracy and resolution.
On the other hand, studies that apply state-of-the-art deep network models directly to satellite imagery to estimate forest details, such as canopy height and above-ground carbon stocks in forest ecosystems [14,15,16,17], are becoming increasingly popular. Mugabowindekwe et al. applied deep learning techniques to nationwide tree-level carbon stock estimation in Rwanda, leveraging high-resolution imagery for precise mapping [14]. Other studies have explored different methodologies for carbon stock estimation, including the integration of multi-modal satellite time-series data [16] and the use of self-supervised vision transformers trained on aerial LiDAR to generate high-resolution canopy height maps from RGB satellite images [17]. However, these models are still limited by satellite resolution. Even high-resolution commercial satellites such as WorldView-3 typically provide ground sample distances (GSDs) of about 0.3–3 m, and publicly available datasets, including Landsat and Sentinel-2, have even lower spatial resolutions, with GSDs of 30 and 20 m, respectively [15]. Satellite imagery is also predominantly captured from near-vertical perspectives, which limits the accuracy of canopy height estimation due to occlusion and the overlap of tree crowns [14]. Moreover, weather conditions, such as cloud cover and atmospheric interference, pose significant challenges for optical satellite imagery [18,19]. These limitations hinder the ability to capture detailed forest features such as crown size and canopy height [20]. These issues are particularly critical in reforestation settings, where measuring year-over-year changes in carbon stock is essential for generating carbon credits, yet satellite-based models often struggle to perform accurately.
Airborne LiDAR is considered the gold standard and ground truth for many forestry applications [19]. Previous studies demonstrate the effectiveness of airborne LiDAR for 3D reconstruction in forestry mapping, canopy height calculations, and Digital Terrain Models (DTMs), all of which are highly important for reforestation projects [21,22,23]. However, one key challenge is the high operational cost associated with LiDAR surveys, which require specialized aircraft and sensors. This makes frequent large-scale monitoring financially unfeasible, particularly for developing regions and conservation projects with limited budgets. Additionally, the high costs of equipment and logistics further constrain the scalability of LiDAR deployment, rendering it impractical for most forestry projects [24].
Three-dimensional Gaussian Splatting [25] and its scene reconstruction capabilities have shown promise in various applications such as robotics and autonomous systems, reproducing texture more faithfully than traditional methods like LiDAR and point cloud techniques [26,27,28,29]. In this project, we present a novel reforestation MRV system integrating large-scale 3D Gaussian Splatting. Our framework relies solely on camera footage from inexpensive consumer-grade hardware such as DJI Mini drones, bridging the current gap between inaccurate hand measurements and expensive LiDAR scans and enabling large-scale, cost-effective, and accurate measurement for forestry.
Technically, we propose a comprehensive and cost-effective pipeline for accurate and scalable forestry modeling using only high-resolution images captured by a consumer-grade drone. For Gaussian Splatting, we introduce a simple yet effective method to address training challenges in large-scale environments by integrating a neural-agnostic scaffold densification strategy with a lightweight partitioning process. Additionally, we present an efficient approach for estimating canopy height maps from Multi-View Stereo point clouds and 3D Gaussian models.
Furthermore, we open-source a large-scale forestry dataset covering 200 acres, which includes LiDAR data and 13,657 drone images. We believe this dataset will serve as a valuable benchmark and comparison baseline for advancing MRV methodologies using Gaussian Splatting or other computer vision-based approaches.
3. ForestSplat Pipeline
We introduce ForestSplat, a large-scale Gaussian Splatting-based pipeline designed for mapping vegetation in reforestation projects. ForestSplat leverages a novel combination of Structure from Motion (SfM), Multi-View Stereo (MVS), and 3D Gaussian Splatting (GS) to enhance the fidelity of modeled trees and successfully generate canopy height maps (CHMs) from GS models.
Figure 2 illustrates the overall design of our approach.
3.1. Problem Statement
Given a set of N aerial images $\mathcal{I} = \{I_i\}_{i=1}^{N}$, the objective is to derive a high-resolution canopy height map representing vegetation height relative to the ground. This process involves reconstructing a sparse 3D model (Section 3.3) by leveraging SfM to estimate sparse 3D points $\mathbf{P}_{\text{sparse}}$ and camera poses $\{\mathbf{R}_i, \mathbf{t}_i\}_{i=1}^{N}$, densifying the reconstruction using MVS (Section 3.5) to obtain $\mathbf{P}_{\text{dense}}$, and fitting Gaussian distributions to the dense point cloud through GS to obtain the Gaussian map $\mathcal{G}$ (Section 3.6). The final step involves extracting height values from the Gaussian splats and differentiating canopy and ground levels to produce the CHM (Section 3.7). The overall pipeline is summarized as follows:
$$\mathcal{I} \xrightarrow{\text{SfM}} \left( \mathbf{P}_{\text{sparse}}, \{\mathbf{R}_i, \mathbf{t}_i\} \right) \xrightarrow{\text{MVS}} \mathbf{P}_{\text{dense}} \xrightarrow{\text{GS}} \mathcal{G} \xrightarrow{\text{CSF, rendering}} \text{CHM}.$$
3.2. Preprocessing
Image Pairs Generation
Initially, SfM requires a set of image pairs to guide feature matching. However, generating image pairs exhaustively would be inefficient, as it creates $\binom{N}{2}$ image pairs, which is impractical for thousands of images. Instead, we leverage GNSS coordinates for more efficient image pair generation, focusing on images with high overlap. This approach reduces unnecessary computations while still maintaining a highly accurate 3D sparse model. Specifically, we calculate the ground footprint size of each image as follows:
$$W = \frac{A \, s_w}{f}, \qquad H = \frac{A \, s_h}{f},$$
where A is the altitude, f is the focal length, and $s_w$ and $s_h$ are the sensor width and height of the camera, respectively. From this, we derive the 2D corner coordinates of the initial ground footprint as:
$$\mathbf{c} = \left\{ \left( \pm \tfrac{W}{2}, \, \pm \tfrac{H}{2} \right) \right\}.$$
We then calculate the exact ground coordinates for each frame by
$$\mathbf{c}'_{g,i} = \begin{bmatrix} \mathbf{R}(\psi_i) & \mathbf{u}_i \\ \mathbf{0}^{\top} & 1 \end{bmatrix} \mathbf{c}',$$
where ′ indicates homogeneous coordinates, $\mathbf{R}(\psi_i)$ is the 2D rotation given by the heading of image i, and $\mathbf{u}_i$ is its UTM coordinates. Since the ground footprints $F_i$ are represented as polygons, we calculate the intersection over union as
$$\mathrm{IoU}(i, j) = \frac{\left| F_i \cap F_j \right|}{\left| F_i \cup F_j \right|}$$
and use this to represent the score value of each image pair.
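To make this pairing step concrete, the following is a minimal Python sketch of footprint construction and IoU scoring, assuming nadir imagery; the function names, the IoU cutoff, and the use of shapely are illustrative assumptions, not our released implementation.

```python
# Hypothetical sketch of GNSS-guided image-pair generation (Section 3.2).
import numpy as np
from shapely.geometry import Polygon
from shapely.affinity import rotate, translate

def ground_footprint(altitude, sensor_w, sensor_h, focal):
    """Footprint of a nadir image on the ground (pinhole model)."""
    W = altitude * sensor_w / focal
    H = altitude * sensor_h / focal
    # Corners of the footprint centered at the origin.
    return Polygon([(-W/2, -H/2), (W/2, -H/2), (W/2, H/2), (-W/2, H/2)])

def placed_footprint(poly, utm_xy, yaw_deg):
    """Rotate by the GNSS heading and translate to the UTM position."""
    return translate(rotate(poly, yaw_deg), xoff=utm_xy[0], yoff=utm_xy[1])

def pair_scores(footprints, min_iou=0.05):
    """Score candidate pairs by footprint IoU; keep only overlapping ones.
    For ~13K images, a spatial index (e.g., shapely's STRtree) would
    replace this quadratic scan in practice."""
    pairs = {}
    n = len(footprints)
    for i in range(n):
        for j in range(i + 1, n):
            inter = footprints[i].intersection(footprints[j]).area
            union = footprints[i].union(footprints[j]).area
            iou = inter / union if union > 0 else 0.0
            if iou >= min_iou:
                pairs[(i, j)] = iou
    return pairs
```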
3.3. Structure from Motion
Now, given a limited number of robust image pairs generated using the technique presented in the previous section, we aim to determine the camera extrinsics $(\mathbf{R}_i, \mathbf{t}_i)$ for each image i, where $\mathbf{R}_i \in SO(3)$ and $\mathbf{t}_i \in \mathbb{R}^{3}$, as well as a set of sparse 3D coordinates $\mathbf{X}_j \in \mathbb{R}^{3}$ corresponding to each feature j.
For the forestry-based dataset, we observed that sparse keypoints present significant challenges for achieving accurate matching between image pairs. To address this issue, we leverage semi-dense features for local feature matching of each pair. Specifically, we use TopicFM [30] to extract 2D–2D correspondences. Due to the snake-pattern flight path, consecutive flight legs result in alternate images being flipped by 180° or rotated by 90°. To improve feature matching accuracy during the Structure from Motion (SfM) process, we rotate these images to maintain a consistent orientation.
For global applicability to any rotation angle present in the image pairs, we calculate a global yaw angle $\psi_g$ that appears most frequently across the dataset. The rotation needed for each image i is then determined as $\theta_i = \psi_i - \psi_g$. After completing all matching steps for the aligned images, we reapply the inverse transformation $\mathbf{R}(\theta_i)^{-1}$ to the matched pixels j to restore the original orientation.
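A minimal sketch of this orientation normalization follows, assuming per-image GNSS yaw angles are available; the histogram-mode estimate of the global yaw and the helper names are our own simplifications.

```python
# Hypothetical sketch of the yaw normalization used before matching.
import numpy as np

def global_yaw(yaws_deg, bin_width=5.0):
    """Most frequent heading across the dataset (histogram mode)."""
    bins = np.arange(0.0, 360.0 + bin_width, bin_width)
    hist, edges = np.histogram(np.mod(yaws_deg, 360.0), bins=bins)
    k = np.argmax(hist)
    return 0.5 * (edges[k] + edges[k + 1])

def alignment_angle(yaw_i_deg, yaw_global_deg):
    """Rotation applied to image i so all frames share one orientation."""
    return np.mod(yaw_i_deg - yaw_global_deg, 360.0)

def restore_matches(pixels, angle_deg, image_center):
    """Apply the inverse rotation to matched pixels (n x 2 array)."""
    a = np.deg2rad(-angle_deg)  # inverse of the alignment rotation
    R = np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])
    return (pixels - image_center) @ R.T + image_center
```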
Additionally, we observed that the traditional incremental SfM pipeline [31] requires a significant amount of time to fully reconstruct 13K images, because incremental methods reconstruct the scene starting from two views and then sequentially register new camera images along with their associated 3D structure. To reduce processing time, we instead leverage a global SfM pipeline for the sparse 3D reconstruction. Specifically, we use Glomap [32], a method that combines camera positioning, bundle adjustment, and structure refinement into a single global positioning step. This approach reduces the sparse 3D reconstruction time on our large dataset from several days to just a couple of hours.
3.4. Transformation to World Coordinates
Sparse 3D models reconstructed using SfM are inherently unscaled [32], as the camera poses and 3D scene coordinates do not align with a real-world coordinate system. To address this limitation, we propose a simple method for estimating a transformation model that maps SfM-derived coordinates to real-world coordinates using noisy GNSS data.
Given $\mathbf{p}_i^{\text{sfm}}$, the camera position in the original coordinate system of the SfM model, and $\mathbf{p}_i^{\text{gnss}}$, the corresponding camera position in the real-world coordinate system derived from GNSS data, we aim to estimate a similarity transformation T. The transformation T is represented as:
$$T = \begin{bmatrix} s\mathbf{R} & \mathbf{t} \\ \mathbf{0}^{\top} & 1 \end{bmatrix},$$
where s is a scaling factor, $\mathbf{R} \in SO(3)$ is a rotation matrix, and $\mathbf{t} \in \mathbb{R}^{3}$ is the translation vector. The transformation from $\mathbf{p}_i^{\text{sfm}}$ to $\mathbf{p}_i^{\text{gnss}}$ can be expressed as:
$$\mathbf{p}_i^{\text{gnss}} = s\mathbf{R}\,\mathbf{p}_i^{\text{sfm}} + \mathbf{t}.$$
To obtain a robust estimation of T, we leverage the RANSAC algorithm [33] to minimize the following objective function:
$$\min_{s, \mathbf{R}, \mathbf{t}} \sum_{i} \left\| \mathbf{p}_i^{\text{gnss}} - \left( s\mathbf{R}\,\mathbf{p}_i^{\text{sfm}} + \mathbf{t} \right) \right\|^{2}.$$
In comparison to traditional photogrammetry methods, such as those employed by Pix4D software [34], which typically require multiple ground control points (GCPs) to establish proper scale and coordinate alignment, our approach eliminates this dependency. This significantly enhances the level of automation in the reconstruction pipeline.
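As an illustration, this alignment can be sketched as a closed-form similarity (Umeyama) fit inside a RANSAC loop; the sample size, inlier threshold, and refit step below are our assumptions, not the exact implementation.

```python
# Sketch of GNSS alignment (Section 3.4): Umeyama fit + RANSAC.
import numpy as np

def umeyama(src, dst):
    """Closed-form s, R, t minimizing ||dst - (s R src + t)||^2."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0  # keep R a proper rotation
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / xs.var(0).sum()
    t = mu_d - s * R @ mu_s
    return s, R, t

def ransac_similarity(src, dst, iters=1000, thresh=1.0, rng=None):
    """Robust estimate of T from noisy GNSS correspondences."""
    rng = rng or np.random.default_rng(0)
    best, best_inliers = None, 0
    for _ in range(iters):
        idx = rng.choice(len(src), size=4, replace=False)
        s, R, t = umeyama(src[idx], dst[idx])
        err = np.linalg.norm(dst - (s * src @ R.T + t), axis=1)
        inliers = (err < thresh).sum()
        if inliers > best_inliers:
            best, best_inliers = (s, R, t), inliers
    s, R, t = best
    # Refit on all inliers of the best model for the final estimate.
    err = np.linalg.norm(dst - (s * src @ R.T + t), axis=1)
    return umeyama(src[err < thresh], dst[err < thresh])
```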
3.5. Multi-View Stereo
Following SfM, we densify the sparse 3D SfM point clouds using Multi-View Stereo (MVS) algorithms. Specifically, we use ET-MVSNet (Enhanced Texture Multi-View Stereo), which leverages enhanced texture information to generate high-fidelity, dense point clouds [35]. This method significantly improves the detail and provides a more comprehensive representation of the surveyed environment, forming a better foundation for subsequent splatting.
In detail, given the camera poses $\{\mathbf{R}_i, \mathbf{t}_i\}$, intrinsic matrix K, and images $\{I_i\}$, as well as the sparse 3D coordinates $\mathbf{P}_{\text{sparse}}$ obtained from the previous SfM step, the goal is to densify the sparse model from $\mathbf{P}_{\text{sparse}}$ to a denser representation $\mathbf{P}_{\text{dense}}$, where $|\mathbf{P}_{\text{dense}}| \gg |\mathbf{P}_{\text{sparse}}|$.
Specifically, given a reference image i and its source images, the method first extracts feature representations for all images using a Feature Pyramid Network (FPN) integrated with an Epipolar Transformer module at the coarsest resolution. These enhanced features are then propagated through subsequent layers of the pipeline.
Next, the cost volumes are constructed by warping source image pixels into the reference camera frustum and measuring their similarity. These feature volumes are aggregated to construct a 3D cost volume for each depth hypothesis of the reference image. Finally, a 3D CNN is applied to the cost volume for regularization, enabling the inference of the most likely depth hypothesis for each pixel in the reference image.
Leveraging this, each reference image i now has a depth map $D_i$. In the subsequent Gaussian Splatting step, this depth map is further utilized in a robust depth loss for training a dense and robust 3D model.
3.6. Gaussian Splatting
We develop a custom large-scale Gaussian Splatting model called ForestSplat that is built on the gsplat [36] framework and incorporates elements from the implementation of Level of Gaussians [36,37]. Gaussian splats are especially adept at modeling tree features like leaves and branches, and allow us to further enrich the models developed from MVS with pixel-perfect 3D reconstructions. Our approach captures fine-grained details and accurate texture representations, which are crucial for precise carbon stock estimation and tree dimension measurements. It also ensures efficient processing and rendering of large-scale datasets, making it suitable for extensive reforestation projects.
As a preliminary, 3D-GS [36] represents a scene using a set of anisotropic 3D Gaussians, denoted as $\mathcal{G} = \{(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k, \alpha_k)\}_{k=1}^{K}$, where $\boldsymbol{\mu}_k \in \mathbb{R}^{3}$ is the 3D position, typically initialized from Structure from Motion (SfM) models. $\boldsymbol{\Sigma}_k$ is the covariance matrix of the 3D Gaussian, encoding its scale and orientation in 3D space, while $\alpha_k$ represents its opacity, which is used for rendering and pruning. The Gaussian splats are projected onto the image plane during rendering, where gradients in 2D image space are leveraged for optimization.
In this paper, instead of initializing $\boldsymbol{\mu}_k$ from $\mathbf{P}_{\text{sparse}}$, we initialize the Gaussian means $\boldsymbol{\mu}_k$ from the results of the prior coarse densification process, denoted as $\mathbf{P}_{\text{dense}}$. This approach significantly enhances the density and quality of the surface Gaussian model $\mathcal{G}$, which is critical for generating a high-accuracy, high-resolution CHM model. In addition, we also transform the initial 3D points and camera poses to a world coordinate system using the transformation T estimated in Section 3.4.
Inspired by GS-scaffold [38], our work introduces a neural-agnostic scaffold densification strategy that enhances scene representation without directly relying on neural features. Specifically, we address challenges in dense scene coverage by introducing gradient-driven anchor growing and structured pruning methods, which avoid reliance on computationally expensive neural predictions while maintaining high-quality scene representation. We present these techniques in the following sub-sections.
3.6.1. Gradient-Driven Anchor Growing
In the proposed approach, new Gaussians are added adaptively based on 2D image-plane gradients, allowing for a more geometrically informed densification process. Each Gaussian splat $g_k$ contributes a normalized 2D image-plane gradient, denoted as $\nabla_k$. Anchor candidates are selected based on a threshold $\tau_g$, forming the growth mask:
$$M_{\text{grow}} = \left\{ k \mid \nabla_k > \tau_g \right\}.$$
Similar to [38], an anchor is treated as the center of each voxel that represents the $M_{\text{grow}}$ point cloud, as:
$$\mathbf{V} = \left\{ \left\lfloor \frac{\boldsymbol{\mu}_k}{\epsilon} \right\rfloor \cdot \epsilon + \frac{\epsilon}{2} \;\middle|\; k \in M_{\text{grow}} \right\},$$
where $\mathbf{V}$ denotes voxel centers, and $\epsilon$ is the voxel size.
To prevent over-densification, candidate positions are quantized into a voxel grid of size $\epsilon$, ensuring spatial consistency. Unique positions within the voxel grid are retained, and new anchors are initialized with appropriate scales and opacities. This geometric-only approach allows efficient expansion of the Gaussian set, even in scenarios where neural features are unavailable.
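A compact sketch of this growing step under the definitions above, assuming per-Gaussian accumulated gradient norms and positions as PyTorch tensors; the threshold and voxel-size values are placeholders.

```python
# Sketch of gradient-driven anchor growing (Section 3.6.1).
import torch

def grow_anchors(grad_norm, positions, tau_g=2e-4, voxel=0.5):
    """Select high-gradient Gaussians and quantize them to voxel centers."""
    mask = grad_norm > tau_g              # growth mask M_grow
    cand = positions[mask]                # anchor candidate positions
    # Quantize to a voxel grid of size `voxel` to prevent over-densification;
    # only unique voxels are kept, yielding one new anchor per voxel.
    keys = torch.unique(torch.floor(cand / voxel), dim=0)
    centers = (keys + 0.5) * voxel        # voxel centers V
    return centers
```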
3.6.2. Structured Pruning
To maintain computational efficiency and avoid overgrowth, low-opacity and geometrically inconsistent Gaussians are pruned periodically. The pruning process uses three criteria:
Opacity Threshold: Gaussians with opacity $\alpha_k$ below a threshold $\tau_\alpha$ are removed:
$$M_\alpha = \left\{ k \mid \alpha_k < \tau_\alpha \right\}.$$
Scale Constraint: Gaussians with overly large scales (as determined by the eigenvalues of $\boldsymbol{\Sigma}_k$) are pruned:
$$M_s = \left\{ k \mid \lambda_{\max}(\boldsymbol{\Sigma}_k) > \tau_s \right\}.$$
Geometric Bounds: Gaussians outside the scene’s vertical bounds $[z_{\min}, z_{\max}]$ are removed:
$$M_z = \left\{ k \mid \mu_k^{z} < z_{\min} \ \text{or} \ \mu_k^{z} > z_{\max} \right\}.$$
The final pruning mask is the union of these individual masks:
$$M_{\text{prune}} = M_\alpha \cup M_s \cup M_z.$$
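The three criteria map directly to boolean masks. A sketch follows, assuming log-scale and opacity-logit parameterizations; per-axis scales stand in for the covariance eigenvalues, and the threshold values are examples.

```python
# Sketch of structured pruning (Section 3.6.2): union of three masks.
import torch

def prune_mask(opacity_logits, log_scales, mu_z,
               tau_alpha=0.005, tau_s=5.0, z_bounds=(-10.0, 120.0)):
    m_opacity = torch.sigmoid(opacity_logits) < tau_alpha      # too transparent
    m_scale = log_scales.exp().max(dim=-1).values > tau_s      # overly large
    m_bounds = (mu_z < z_bounds[0]) | (mu_z > z_bounds[1])     # outside scene
    return m_opacity | m_scale | m_bounds                      # M_prune
```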
3.6.3. Gradient-Driven Updates
To refine the representation over time, we track the running average of the gradient norms for each Gaussian:
$$\bar{\nabla}_k^{(T)} = \frac{1}{n_k} \sum_{t=1}^{T} \left\| \nabla_k^{(t)} \right\|,$$
where T is the current epoch, and $n_k$ is the number of frames in which the Gaussian $g_k$ is visible. This allows us to prioritize highly relevant regions for future densification or refinement.
3.6.4. Loss Function
To train the 3D Gaussian models, we use the following loss function:
$$\mathcal{L} = \mathcal{L}_{\text{rec}} + \lambda_{\text{scale}}\,\mathcal{L}_{\text{scale}} + \lambda_{\text{op}}\,\mathcal{L}_{\text{op}} + \lambda_{\text{depth}}\,\mathcal{L}_{\text{depth}},$$
where each $\lambda$ is the balance coefficient for the corresponding loss term. In detail, these loss functions are defined as:
Reconstruction Loss: The reconstruction loss ensures that the rendered colors $\hat{C}$ match the ground truth $C$:
$$\mathcal{L}_{\text{rec}} = (1 - \lambda_{\text{ssim}}) \left\| \hat{C} - C \right\|_1 + \lambda_{\text{ssim}} \left( 1 - \mathrm{SSIM}(\hat{C}, C) \right),$$
where $\lambda_{\text{ssim}}$ balances the L1 loss and the Structural Similarity Index Measure (SSIM).
Scale Regularization: The scale regularization penalizes excessively large scales of Gaussians:
$$\mathcal{L}_{\text{scale}} = \frac{1}{K} \sum_{j=1}^{K} \left\| \exp(\mathbf{s}_j) \right\|_1,$$
where $\mathbf{s}_j$ represents the logarithmic scale parameters of Gaussian j.
Opacity Regularization: The opacity regularization ensures meaningful opacity values:
$$\mathcal{L}_{\text{op}} = \frac{1}{K} \sum_{j=1}^{K} \sigma(o_j)\left( 1 - \sigma(o_j) \right),$$
where $\sigma$ is the sigmoid function applied to the opacity parameters $o_j$.
Depth Loss: The depth loss enforces consistency between rendered depth and ground-truth depth:
$$\mathcal{L}_{\text{depth}} = \frac{1}{|M_d|} \sum_{p \in M_d} \left| \hat{D}(p) - D_{\text{mvs}}(p) \right|,$$
where $M_d$ is a mask selecting pixels with valid depth measurements, and $D_{\text{mvs}}$ is derived from MVS in Section 3.5.
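A PyTorch sketch of this composite objective, assuming an external SSIM implementation and MVS depth maps with a validity mask; the regularizer forms mirror the definitions above, and the λ values are placeholders, not our trained settings.

```python
# Sketch of the training loss (Section 3.6.4); lambda values are examples.
import torch
import torch.nn.functional as F

def total_loss(render_rgb, gt_rgb, render_depth, mvs_depth, valid,
               log_scales, opacity_logits, ssim_fn,
               lam_ssim=0.2, lam_scale=0.01, lam_op=0.01, lam_depth=0.1):
    # Reconstruction: L1 blended with (1 - SSIM).
    l_rec = (1 - lam_ssim) * F.l1_loss(render_rgb, gt_rgb) \
            + lam_ssim * (1.0 - ssim_fn(render_rgb, gt_rgb))
    # Scale regularization: discourage excessively large Gaussians.
    l_scale = log_scales.exp().mean()
    # Opacity regularization: push opacities toward informative values.
    alpha = torch.sigmoid(opacity_logits)
    l_op = (alpha * (1.0 - alpha)).mean()
    # Depth consistency against MVS, restricted to valid pixels.
    l_depth = F.l1_loss(render_depth[valid], mvs_depth[valid])
    return l_rec + lam_scale * l_scale + lam_op * l_op + lam_depth * l_depth
```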
3.6.5. Partitioning for Training Large-Scale GS Model
Since the reconstructed 3D sparse model from SfM is too large to train a single 3D Gaussian Splatting (GS) model, we propose a simple partitioning process to divide the SfM model into multiple smaller partitions. Each partition is then trained individually using the GS settings described above.
Figure 3 illustrates the process of dividing a large SfM model into multiple sub-models.
Specifically, we calculate the origin point of the SfM model by averaging the latitude and longitude of all images. Using this origin point as a reference, we define the position of each partition within a grid of boxes, each measuring 300 m in width and height. This approach results in multiple boxes, as illustrated in Figure 3. To better accommodate large-scale datasets, we further employ a Level of Detail (LoD) strategy [39] to enhance rendering quality and train multiple detail levels for each partition.
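The grid assignment itself reduces to binning camera positions around the mean; a short sketch, with helper names of our own choosing:

```python
# Sketch of the partitioning grid (Section 3.6.5): 300 m boxes around the origin.
from collections import defaultdict
import numpy as np

def partition_ids(utm_xy, box=300.0):
    """Assign each image to a box-sized grid cell around the mean position."""
    origin = utm_xy.mean(axis=0)           # average of all image positions
    cells = np.floor((utm_xy - origin) / box).astype(int)
    return [tuple(c) for c in cells]       # (col, row) id per image

def group_by_partition(utm_xy, box=300.0):
    """Group image indices by partition for per-partition GS training."""
    groups = defaultdict(list)
    for i, pid in enumerate(partition_ids(utm_xy, box)):
        groups[pid].append(i)
    return groups
```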
3.7. Estimate Canopy Height Map from Gaussian Models
In this section, we describe the process of estimating the canopy height map (CHM) from the dense Gaussian models introduced earlier. The CHM construction begins with calculating the Digital Terrain Model (DTM) using the Cloth Simulation Filter (CSF) [40], which estimates the ground surface from the given point cloud data. The CSF algorithm simulates a cloth draped over the inverted point cloud, classifying the points touched by the cloth as ground points.
Starting with the dense point cloud $\mathbf{P}_{\text{dense}}$, derived from ET-MVS [35] (see Section 3.5), which represents a combination of surfaces including the ground, vegetation, and other objects, we extract a set of cloth nodes $\mathbf{Q}$ from $\mathbf{P}_{\text{dense}}$ using CSF [40]. Note that the cloth nodes are not directly extracted from the MVS point cloud; rather, they are points from the simulated cloth. Specifically, the z positions of the cloth nodes represent the simulated cloth’s height at certain $(x, y)$ positions, which are arranged in a grid formation. These cloth nodes are subsequently used to interpolate the DTM as a continuous surface, $f_{\text{DTM}}(x, y)$.
Once the DTM is generated, the CHM is computed by estimating the height of the Gaussian splats relative to the ground surface $f_{\text{DTM}}$. Using the 3D Gaussian model $\mathcal{G}$ obtained in Section 3.6, we render a dense depth map $D_{\text{ortho}}$ via orthographic projection [25], ensuring a high-resolution and geometrically accurate representation of the canopy. The initial CHM is then computed as:
$$\mathrm{CHM}_0(x, y) = \left( h_{\text{cam}} - D_{\text{ortho}}(x, y) \right) - f_{\text{DTM}}(x, y),$$
where $h_{\text{cam}}$ is the camera height above the reference plane. To mitigate noise and interpolation errors, we use an external vegetation filter based on a U-Net [41] model $F_{\text{veg}}$, which predicts a vegetation mask $\mathbf{V}$ from the orthographic RGB image $I_{\text{ortho}}$ rendered from $\mathcal{G}$:
$$\mathbf{V} = F_{\text{veg}}(I_{\text{ortho}}),$$
where $\mathbf{V}(p) \in [0, 1]$ represents the probability that pixel p corresponds to vegetation. This filter refines the CHM as:
$$\mathrm{CHM}(p) = \mathrm{CHM}_0(p) \cdot \mathbb{1}\left[ \mathbf{V}(p) > \tau_v \right].$$
However, the cloth nodes $\mathbf{Q}$ obtained from CSF may include unreliable points, especially in areas with canopy cover. To address this, we refine the cloth nodes using the vegetation filter $F_{\text{veg}}$. Specifically, we create a binary mask $\mathbf{B}$ for the cloth nodes based on a threshold $\tau_v$:
$$\mathbf{B}(p) = \mathbb{1}\left[ \mathbf{V}(p) > \tau_v \right].$$
For each cloth node $q \in \mathbf{Q}$, we compute a vegetation score by extracting a local region of the vegetation mask within a radius r:
$$s(q) = \frac{1}{|\Omega_r(q)|} \sum_{p \in \Omega_r(q)} \mathbf{B}(p),$$
where $\Omega_r(q)$ is defined as the set of pixels p within a circular region centered at q with radius r. Cloth nodes with $s(q) > \tau_q$ are considered invalid and removed. Heights for these invalid nodes are interpolated linearly using their nearest valid neighbors.
Finally, the refined ground model is used to compute the final CHM using the earlier equations, ensuring accurate canopy height estimation.
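A condensed sketch of this CHM computation under the equations above, assuming a rasterized orthographic depth map, a DTM interpolated from the (refined) cloth nodes with scipy, and a U-Net vegetation probability map; a square window stands in for the circular region $\Omega_r$, and all names and default thresholds are illustrative.

```python
# Sketch of CHM estimation (Section 3.7) on co-registered rasters.
import numpy as np
from scipy.interpolate import griddata

def interpolate_dtm(cloth_xy, cloth_z, grid_x, grid_y):
    """Continuous ground surface f_DTM from CSF cloth nodes."""
    return griddata(cloth_xy, cloth_z, (grid_x, grid_y), method="linear")

def canopy_height_map(ortho_depth, dtm, cam_height, veg_prob, tau_v=0.5):
    """CHM = surface elevation above the DTM, kept only on vegetation."""
    surface = cam_height - ortho_depth      # elevation from ortho depth
    chm = surface - dtm                     # height above ground
    chm[veg_prob <= tau_v] = 0.0            # vegetation filter
    return np.clip(chm, 0.0, None)          # heights cannot be negative

def refine_cloth_nodes(cloth_xy_pix, veg_mask, radius=10, tau_q=0.5):
    """Flag cloth nodes whose neighborhood is mostly vegetation (unreliable
    ground); their heights are later re-interpolated from valid neighbors."""
    h, w = veg_mask.shape
    invalid = np.zeros(len(cloth_xy_pix), dtype=bool)
    for k, (x, y) in enumerate(cloth_xy_pix.astype(int)):
        x0, x1 = max(x - radius, 0), min(x + radius + 1, w)
        y0, y1 = max(y - radius, 0), min(y + radius + 1, h)
        invalid[k] = veg_mask[y0:y1, x0:x1].mean() > tau_q
    return invalid
```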
3.8. Evaluation Metrics and Baselines
Evaluation metrics: Following previous works [12,17], we use several metrics to evaluate the CHM estimation results of the proposed method. These metrics include the RMSE (Equation (23)), which emphasizes larger errors in height estimation; the MAE (Equation (24)), which averages all height errors equally; and the ME (Equation (25)), which quantifies height bias, with a negative bias indicating that the estimated height $\hat{h}$ is lower than the LiDAR reference ground truth h.
We additionally report the R2-block ($R^2$), which is defined as follows:
$$R^{2} = 1 - \frac{\sum_{i} \left( y_i - \hat{y}_i \right)^{2}}{\sum_{i} \left( y_i - \bar{y} \right)^{2}},$$
where $\bar{y}$ is the mean of the actual target values. The $R^2$, also known as the coefficient of determination, is a statistical measure that indicates how well a regression model fits the observed data. In this paper, we use it to evaluate how well a block of 150 × 150 pixels (∼15 m × 15 m) corresponds to the same-sized block of the LiDAR results. Unlike [17], we select a block size of 15 m instead of 30 m to achieve more reliable results, ensuring higher accuracy for future applications such as biomass measurement. Furthermore, we also report the performance of the proposed method using smaller, more challenging block sizes of 10 m and 5 m.
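For reference, a short sketch of this block-wise evaluation, assuming both CHMs are co-registered rasters on the same ~10 cm grid (150 px ≈ 15 m); the crop-and-reshape block averaging is our own simplification, and filtering of invalid blocks is omitted.

```python
# Sketch of block-wise RMSE/MAE/ME/R2 evaluation (Section 3.8).
import numpy as np

def block_means(chm, block=150):
    """Average a raster over non-overlapping block x block windows."""
    h, w = chm.shape
    h, w = h - h % block, w - w % block          # crop to whole blocks
    v = chm[:h, :w].reshape(h // block, block, w // block, block)
    return v.mean(axis=(1, 3)).ravel()

def block_metrics(pred_chm, lidar_chm, block=150):
    y_hat = block_means(pred_chm, block)
    y = block_means(lidar_chm, block)
    rmse = np.sqrt(np.mean((y_hat - y) ** 2))    # penalizes large errors
    mae = np.mean(np.abs(y_hat - y))             # equal weighting
    me = np.mean(y_hat - y)                      # bias (negative = too low)
    r2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
    return dict(RMSE=rmse, MAE=mae, ME=me, R2=r2)
```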
Baselines: To assess the modeling fidelity of ForestSplat, we compare its canopy height models (CHMs) with those derived from airborne LiDAR data. Canopy height is a critical proxy for biomass and serves as the sole metric where airborne LiDAR is unequivocally considered the ground truth. The CHM for airborne LiDAR is generated using ArcGIS [42], a widely used tool for processing and analyzing LiDAR data in geospatial research.
To benchmark ForestSplat, we select two baseline methods for comparison, ensuring an objective comparison of our technique against industry gold standards. The first is SSL-Satellite [17], a state-of-the-art approach for producing very high-resolution canopy height maps from RGB imagery using a self-supervised vision transformer and a convolutional decoder. This method can predict CHMs with high accuracy directly from satellite images.
The second baseline is a traditional photogrammetry-based approach [34], which employs Pix4D software to generate CHMs. This method relies on well-established photogrammetric techniques to reconstruct canopy height models from imagery.
4. Experiments
This section presents a detailed explanation of the experimental setups and the corresponding results, benchmarked against the baselines, for reconstructing high-resolution and highly accurate CHMs. The data collection methodology, including drone flight parameters and image acquisition details, is described in
Section 2.
4.1. Experimental Setups
SSL-Satellite [17]: Using Google Earth Pro, we downloaded satellite imagery of the survey area corresponding to the same time in November 2023 as the LiDAR data collection. The satellite image is shown in Figure 1, where the region of interest, enclosed by the red polygon, represents the study area initially used for reforestation research and serves as the primary benchmark dataset in this work. Specifically, the satellite image was downloaded at a high resolution of 8192 × 4437 pixels (approximately 0.34 m GSD per pixel). Following the settings in [17], we cropped the satellite image into a set of 256 × 256 pixel tiles, resulting in 544 non-overlapping crops. We then applied the same configurations proposed in [17] to generate CHMs for each crop, where each loaded RGB satellite image was normalized using a mean of (0.420, 0.411, 0.296) and a standard deviation of (0.213, 0.156, 0.143). Note that the GSD and image normalization configurations were recommended by the authors of [17]. Since SSL-Satellite provides several pre-trained models for predicting CHMs, we selected the same evaluation model as [17], namely SSLhuge-Satellite.
After obtaining all CHMs for the 544 non-overlapping crops through the SSLhuge-Satellite model, we merged them back in their correct order to reconstruct an image of the original size, 8192 × 4437 pixels, and converted it to the GeoTIFF format for accurate comparison with our method and the other baselines.
Photogrammetry-based [34]: For this baseline, we used the 13K images captured at this site in January 2024. We then divided the site into four non-overlapping areas for processing. This step was necessary because the entire survey area was too large to be processed at once using Pix4D. Additionally, since the proposed dataset lacks the recommended 70% side overlap for Pix4D (having only 30% side overlap, as mentioned in Section 2, which makes the task highly challenging), we had to manually label matching pixels between images to ensure acceptable results. Without this manual intervention, the CHM performance would have been significantly worse. This process required at least 24 h of manual labor. Finally, after exporting the point cloud from Pix4D, we used CloudCompare, which employs the CSF algorithm [40], to estimate the ground surface and calculate the CHMs. The four CHM parts obtained from this model were also merged into a unified map and registered as a GeoTIFF for later comparison.
The Proposed ForestSplat Pipeline: We used the following configurations. We set the footprint IoU threshold for image pair generation so that approximately 500K image pairs were retained, compared to the 93M pairs generated using an exhaustive approach. For Gaussian Splatting, we used a partition size of 300 m × 300 m (Section 3.6.5), which created 18 non-overlapping partitions. Each partition contains a different number of valid regions, as some partitions are located near the borders of the survey area.
To train each Gaussian Splatting partition model, we set fixed values for the growing anchor threshold $\tau_g$, the pruning opacity threshold $\tau_\alpha$, and the scale constraint $\tau_s$. The weighting coefficients in the total loss function were configured as follows: $\lambda_{\text{scale}}$ to regulate the Gaussian scales, $\lambda_{\text{op}}$ to penalize opacity values, $\lambda_{\text{depth}}$ to enforce depth consistency, and $\lambda_{\text{ssim}}$ to balance the reconstruction loss between the L1 term and the SSIM term.
For CHM generation, we used fixed values of $\tau_v$, r, and $\tau_q$ for all partitions. For training the vegetation filter $F_{\text{veg}}$, we randomly selected 90 tiles (each covering 50 m × 50 m of ground area) from nearby regions of the proposed dataset and manually labeled them with binary masks to distinguish tree areas from non-tree areas. To ensure that the model does not encounter out-of-distribution data when applied to the proposed dataset, we included several randomly labeled tiles selected from the proposed dataset of the 200-acre area. Note that the total number of labeled tiles used for training and evaluation remained at 90. The filter was then trained for 50 epochs on the training split (67 tile images) using a learning rate of 0.001. The trained filter was subsequently applied to the entire 200-acre dataset. Interestingly, we found that training on this small amount of data was sufficient to achieve good filtering results in the final CHM calculation.
Since the images were captured two months after the LiDAR data, we adjusted the CHMs by reducing all height values greater than 1 m by 10 cm to enhance the reliability of comparisons. This adjustment assumes a tree growth rate of 5 cm per month and was applied to the photogrammetry-based method [34] and the proposed method. In contrast, the SSL-Satellite results [17] were left unchanged, as the satellite images were obtained at the same time as the LiDAR data.
4.2. Results
4.2.1. Results at 15 m × 15 m Block Size
We present the obtained results in Table 1 and Table 2. Specifically, Table 1 summarizes the comparison results of the proposed pipeline against SSL-Satellite [17] and the photogrammetry-based method [34] across all partitions in terms of RMSE, MAE, and ME metrics. The proposed pipeline consistently outperformed the other methods, achieving the lowest weighted average RMSE (0.272 m), MAE (0.172 m), and ME (0.007 m) across all partitions. In contrast, the SSL-Satellite method exhibited the highest average errors, with RMSE, MAE, and ME values of 0.854 m, 0.522 m, and 0.090 m, respectively. The photogrammetry-based method demonstrated intermediate performance, with average errors of 0.653 m (RMSE), 0.433 m (MAE), and 0.280 m (ME). These results highlight the robustness and reliability of the proposed approach in reconstructing CHMs with high precision. In Table 2, we provide the detailed number of valid blocks, $R^2$ scores, and the accuracy at an error threshold of 0.5 m. In the weighted average over all partitions, the proposed ForestSplat achieved the highest accuracy at the 0.5 m threshold and demonstrated an $R^2$ score of 0.79, compared to −1.21 and −0.99 for the photogrammetry-based and SSL-Satellite methods, respectively. This high $R^2$ score indicates that ForestSplat’s estimations align closely with the ground truth, outperforming the baseline methods, which fail to fit the ground truth effectively.
Figure 4 and Figure 5 show the comparative results of canopy height models (CHMs) generated by the different methods, using two example partitions (#19 and #24), evaluated against LiDAR as the ground truth. For both partitions, the proposed ForestSplat pipeline consistently demonstrates superior performance compared to the SSL-Satellite and photogrammetry-based methods. The photogrammetry-based method completely failed to produce CHMs in areas with water, leaving these regions blank. Similarly, the SSL-Satellite method shows poor performance when merging crop regions, as evidenced by the frequent appearance of sharp edges in its CHMs, disrupting the continuity of the height maps. Additionally, the block-wise CHMs and scatter plots visually highlight ForestSplat’s ability to deliver consistent and accurate height estimations, further underscoring its effectiveness in producing accurate and reliable canopy height maps.
4.2.2. Merged CHM Results
Here, we present a visual comparison of all methods for generating merged CHMs, including ForestSplat, crop-based CHMs from SSL-Satellite [17], and four sub-sites from photogrammetry [34], with LiDAR serving as the ground truth. The visual results, shown in Figure 6, highlight that the proposed ForestSplat method achieves a consistent match with the LiDAR CHM. In contrast, the photogrammetry-based method demonstrates the poorest performance in the merging process, failing to cover many areas and leaving significant gaps. Meanwhile, the SSL-Satellite method struggles to ensure smooth transitions between adjacent crop regions, resulting in noticeable discontinuities and sharp edges in the merged CHMs.
4.2.3. Changing Block Sizes
To evaluate the performance of each method under different spatial resolutions, we analyzed the results of canopy height models (CHMs) using varying block sizes: 15 m × 15 m, 10 m × 10 m, and 5 m × 5 m. The scatter plots in Figure 7 illustrate the correlation between the predicted block mean heights and the LiDAR ground truth for SSL-Satellite, photogrammetry-based, and the proposed ForestSplat method at these resolutions.
Across all block sizes, ForestSplat consistently demonstrates superior performance, achieving the highest $R^2$ values (e.g., 0.788 at 15 m, 0.775 at 10 m, and 0.759 at 5 m) and the lowest mean absolute error (MAE) and mean error (ME). These results indicate that ForestSplat maintains high accuracy and robustness in capturing canopy height distributions at finer resolutions, even as block sizes decrease.
In contrast, the photogrammetry-based method shows moderate performance at larger block sizes, but its reliability deteriorates significantly as the block size decreases, with lower $R^2$ scores and increasing error. Additionally, large gaps and deviations from the LiDAR ground truth are observed in smaller blocks.
The SSL-Satellite method performs poorly across all block sizes, with consistently negative $R^2$ values. This method struggles to capture fine-grained canopy height variations and exhibits higher MAE and ME compared to the other approaches.
These results demonstrate that ForestSplat is robust to changes in block sizes, offering reliable estimations at varying spatial resolutions, while the other methods exhibit significant limitations, particularly at smaller block sizes or very high resolutions. This is because ForestSplat leverages 3D Gaussian Splatting, which preserves fine-scale details by continuously optimizing scene representation at a sub-pixel level. On the other hand, the SSL-Satellite method relies on coarse-resolution satellite imagery, leading to blurred height estimations at smaller block scales. Photogrammetry-based CHMs often require high image overlap (≥70%) for accurate reconstruction; however, since our dataset had only 30% side overlap, its reliability at finer spatial resolutions was significantly reduced. Additionally, the point clouds produced by the photogrammetry-based method remain relatively coarse, whereas ForestSplat refines them further—first through MVS (Multi-View Stereo) to improve density and accuracy, and then through Gaussian Splatting, which continuously optimizes and smooths the representation for a more precise canopy height model (CHM).
4.2.4. Adjustment for Tree Growth
To account for an assumed tree growth of 5 cm per month and the two-month gap between the LiDAR data collection and the image acquisition, all CHM partitions were adjusted by reducing height values greater than 1 m by 10 cm. As shown in Table 3, this adjustment led to a general improvement in performance metrics, highlighted in green, with notable enhancements in $R^2$ and a reduction in MAE on average. While some partitions experienced slight reductions in performance, marked in red, the overall metrics demonstrate better alignment with the LiDAR ground truth after this correction, achieving a higher average accuracy within the 0.5 m error threshold. This adjustment reflects the significance of accounting for tree growth when comparing CHMs to ground truth data.
4.2.5. Additional Results
In this section, we present additional example results that demonstrate the high fidelity of the proposed rendered images compared to satellite images. We also provide qualitative vegetation prediction results obtained using the proposed method on the rendered images.
In Figure 8, we present two examples of rendered image tiles, each covering an area of 50 m × 50 m, compared to satellite images. These examples demonstrate that the proposed method can achieve the ultra-high resolution required for MRV. The rendered images have a ground sampling distance (GSD) of 1 cm per pixel, with each tile containing 5000 × 5000 pixels. The last column displays the predicted vegetation masks generated from the corresponding rendered images.
Since no hand-labeled ground truth is available for these images, we cannot provide quantitative metrics for these vegetation predictions. However, we found that the current filter is sufficient for removing incorrect height estimates from non-vegetation regions. Nonetheless, we believe that further improving the vegetation filter could potentially enhance CHM estimation accuracy. Therefore, we identify this as a potential direction for future work.
5. Discussion
Discussion. This work demonstrates the promise of using computer vision-based 3D reconstruction (ForestSplat) as a highly accurate yet more scalable alternative to airborne LiDAR for reforestation MRV. ForestSplat relies solely on a low-cost camera to achieve competitive results, with the potential to reduce operational costs by up to 100× compared to airborne LiDAR scans.
ForestSplat proves most effective in forestry settings where ground visibility is preserved within localized areas. As reforested canopies close over time, accurate geographical understanding of the ground can still be maintained by incorporating temporal dimensions into the modeling process. However, in forest conservation scenarios with enclosed, dense canopies or challenging terrains, such as primary tropical rainforests, the performance of Gaussian Splatting may be limited due to its inability to perceive the ground effectively.
Limitations. While ForestSplat demonstrates high accuracy and cost-effectiveness for canopy height estimation, there are several limitations to consider. First, the method relies on high-quality aerial imagery, meaning that variations in lighting conditions, camera calibration, and flight stability can impact reconstruction quality. Second, dense forest canopies, such as those found in tropical rainforests, may obstruct ground visibility, leading to errors in height estimation. Additionally, complex terrains, such as steep slopes or mountainous regions, may introduce distortions in 3D reconstruction if not properly corrected with terrain-aware adjustments or external Digital Terrain Models (DTMs). Another limitation is the dependence on GNSS accuracy for aligning SfM reconstructions to real-world coordinates; in GNSS-denied environments, alternative localization methods, such as visual-inertial odometry (VIO) or SLAM, may be required.
Furthermore, while LiDAR serves as the primary comparison ground truth in this study, it is not without its own errors, particularly in sparse vegetation regions or dense canopies where penetration is limited, potentially affecting the accuracy of reference canopy height maps. Additionally, variations in LiDAR point density and ground filtering algorithms can introduce biases in height estimation, especially in regions with mixed vegetation structures. Differences in georeferencing accuracy and processing pipelines between LiDAR and photogrammetry-based CHMs may also contribute to local misalignments, affecting direct comparisons.
Finally, scalability remains a challenge—while ForestSplat has been validated on a 200-acre site, further optimizations in data processing and computational efficiency will be needed for large-scale deployments spanning thousands of acres. Addressing these challenges is crucial to ensuring ForestSplat’s broader applicability and effectiveness in more diverse and demanding environments.
Future work. To overcome these limitations, future work will explore the applicability of ForestSplat to diverse forest types, including tropical rainforests, temperate forests, and coniferous woodlands, where differences in canopy density and structure may require adaptations in Gaussian Splatting parameters and multi-temporal data collection. Additionally, extending the method to complex terrain conditions, such as mountainous regions, will involve integrating Digital Terrain Models (DTMs) and exploring GNSS-free camera pose optimization for areas with poor GNSS signals. To enhance scalability, future research will investigate distributed processing techniques and cloud-based CHM computation to extend ForestSplat’s usability to large-scale forest monitoring.
Another important direction is improving tree-level segmentation of 3D Gaussian Splat models to enable more precise biomass estimation and carbon stock analysis. By refining individual tree extraction, ForestSplat could provide valuable insights into forest structure, growth dynamics, and carbon sequestration potential, with applications in forest conservation and climate impact studies. Validation will involve comparisons against comprehensive ground truth data, including hand-measured transect data and additional scans from airborne and terrestrial LiDAR systems. Developing a cost-effective and scalable MRV system with LiDAR-comparable accuracy is vital for advancing forestry management and nature-based climate solutions. Such a system would empower foresters to transition from simple drone scans to detailed, tree-specific intelligence, enhancing forestry practices such as planting strategies, maintenance, and carbon credit generation. We hope this work contributes to accelerating the growth of the emerging forestry-based carbon removal sector.