Article

SatelliteRF: Accelerating 3D Reconstruction in Multi-View Satellite Images with Efficient Neural Radiance Fields

1 Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China
2 Key Laboratory of Target Cognition and Application Technology (TCAT), Beijing 100190, China
3 Key Laboratory of Network Information System Technology (NIST), Beijing 100190, China
4 School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100190, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(7), 2729; https://doi.org/10.3390/app14072729
Submission received: 3 March 2024 / Revised: 21 March 2024 / Accepted: 22 March 2024 / Published: 25 March 2024


Featured Application

Our approach uses multi-view satellite images to achieve a fast and efficient reconstruction of the Earth’s surface, generating digital surface models (DSMs), point clouds, and mesh models, which have important applications in key areas such as urban mapping, autonomous navigation, and ecological monitoring.

Abstract

In the field of multi-view satellite photogrammetry, the neural radiance field (NeRF) method has received widespread attention due to its ability to provide continuous scene representation and realistic rendering effects. However, the satellite radiance field methods based on the NeRF are limited by the slow training speed of the original NeRF, and the scene reconstruction efficiency is low. Training for a single scene usually takes 8–10 h or even longer, which severely constrains the utilization and exploration of the NeRF approach within the domain of satellite photogrammetry. In response to the above problems, we propose an efficient neural radiance field method called SatelliteRF, which aims to quickly and efficiently reconstruct the Earth’s surface through multi-view satellite images. By introducing innovative multi-resolution hash coding, SatelliteRF enables the model to greatly increase the training speed while maintaining high reconstruction quality. This approach allows for smaller multi-layer perceptron (MLP) networks, reduces the computational cost of neural rendering, and accelerates the training process. Furthermore, to overcome the challenges of illumination changes and transient objects encountered when processing multi-date satellite images, we adopt an improved irradiance model and learn transient embeddings for each image. This not only increases the adaptability of the model to illumination variations but also improves its ability to handle changing objects. We also introduce a loss function based on stochastic structural similarity (S3IM) to provide structural information of the scene for model training, which further improves the quality and detailed performance of the reconstructed scene. Through extensive experiments on the DFC 2019 dataset, we demonstrate that SatelliteRF is not only able to significantly reduce the training time for the same region from the original 8–10 h to only 5–10 min but also achieves better performance in terms of rendering and reconstruction quality.

1. Introduction

As high-resolution satellite images play an increasingly important role in geospatial cognition, using these images to build large-scale 3D models of the Earth’s surface has become a research hotspot in the fields of remote sensing and photogrammetry [1,2,3,4]. Compared with traditional stereo or tri-stereo photogrammetry methods and the use of active sensor technologies such as lidar, 3D reconstruction methods based on multi-view satellite images are widely used in urban mapping [5,6], environmental monitoring [7], navigation [8], and emergency rescue [9,10] due to their advantages in terms of easily accessible data, high collection efficiency, and low cost. However, 3D reconstructions based on multi-view satellite images still face many challenges. Unlike stereo or tri-stereo images, which are acquired almost simultaneously, multi-view satellite images are usually obtained during multiple acquisitions at different times, which means that images from different viewpoints may vary due to illumination conditions, seasonal changes, and changes in the surface content [11]. In addition, multi-view satellite images are often few in number and sparsely distributed across viewpoints. These characteristics greatly increase the difficulty of 3D reconstruction from satellite remote sensing imagery.
In traditional 3D reconstruction methods for satellite images, stereo matching methods like semiglobal matching (SGM) [12] are usually first used to generate depth maps, and then these depth maps are fused to build the digital surface model (DSM). The entire 3D reconstruction pipeline involves several steps such as stereo pair selection, dense stereo matching, triangulation, and depth map fusion [13]. When processing multi-view satellite images with appearance variations, this approach may struggle to extract effective visual features, resulting in holes or complete failure. Additionally, stereo pair selection and feature design often depend on manual engineering and expert knowledge. Deep-learning-based approaches improve robustness to appearance changes in multi-view satellite images. This is achieved through an encoder–decoder architecture, which establishes a mapping relationship between features and 3D volumes [14]. However, although deep-learning-based reconstruction methods overcome the challenges faced by traditional methods and show advantages in dealing with appearance variations, their generalization capabilities are still limited when facing unseen viewpoints and scenes. Furthermore, these methods require large amounts of data and computing resources to train the models and rely on ground-truth 3D models for supervised learning.
Recently, methods based on neural radiance fields [15] have demonstrated great potential in the area of novel view synthesis and 3D reconstruction due to their compact model and realistic rendering effects. Refs. [11,16] studied 3D reconstruction from multi-view satellite images based on neural radiance fields. They used satellite images captured from different viewpoints to implicitly model the scene and synthesize novel views and 3D models of the scene. As shown in Figure 1, a set of multi-view satellite images and their corresponding pose information are taken as inputs to generate a novel view, depth map, or mesh model of the scene. Compared with traditional stereo-vision-based methods and deep learning methods, the 3D reconstruction of satellite images based on neural radiance fields has the advantages of being end-to-end and self-supervised. At the same time, the NeRF learns a continuous volume representation of the scene, enabling it to uniquely integrate different properties of physical scenes, such as surface radiance and illumination. This allows it to handle complex situations, including changes in appearance, in a natural manner.
Derksen et al. [11] first explored the use of the NeRF for 3D reconstructions of satellite images and proposed the shadow neural radiance field (S-NeRF), which is specially designed to reconstruct the appearance and geometry of the terrestrial surface from satellite images. Sat-NeRF [16] further combines some recent trends in neural rendering with the rational polynomial coefficient (RPC) model and uses a shadow-aware irradiance model and uncertainty weights to deal with transient phenomena, such as cars and vegetation, that cannot be explained by the sun’s position. These studies demonstrate the great potential of using the NeRF for satellite photogrammetry missions. However, these methods, like most NeRF model methods, usually require 8–10 h or even longer for training and optimization, which limits their application in practical scenarios.
In this work, we propose a new efficient remote sensing neural radiance field variant, called SatelliteRF, to efficiently model the Earth’s terrain based on satellite imagery. SatelliteRF uses multi-resolution hash position encoding [17] to reduce the size of the multi-layer perceptron (MLP), which speeds up model training and improves the efficiency of scene reconstruction. To address issues such as illumination variations in satellite images, SatelliteRF represents the scene as a static surface with an albedo color and a radiance model, with the rendered color determined by the product of the albedo and radiance models. In addition, SatelliteRF effectively models transient elements, such as cars and shadows, in the scene by learning the transient embeddings for each image. In order to further improve the quality of scene reconstruction, we adopt a stochastic structural similarity loss function and utilize the rich structural information in remote sensing images to train the model. Our experiments on the DFC 2019 dataset [18,19] show that SatelliteRF not only achieves a superior reconstruction quality but also reduces the training time for the same region from 8–10 h to 5–10 min.
Our key contributions are summarized as follows:
  • We significantly improve the speed of the 3D reconstruction of multi-view satellite images by using multi-resolution hash encoding and smaller MLP networks.
  • The challenge of dynamic changes in satellite images, such as illumination changes, shadows, and transient objects, is effectively addressed by radiance modeling and learning transient embeddings.
  • Stochastic structural similarity loss is used to exploit structural information in satellite images to improve the quality of scene reconstruction.

2. Related Work

Our method learns a volumetric representation of a scene from a set of multi-view satellite images and captures both the geometry and view-dependent appearance. In this section, we review representative studies in related fields and discuss the connections between these studies and our approach to provide a solid background for our work.

2.1. Neural Radiance Fields for Fast Training

Since the NeRF was proposed, it has inspired a large amount of research work. These studies aim to expand the capabilities of the NeRF from multiple aspects, including improving the quality of visual synthesis [20,21], speeding up rendering [17,22,23], reducing the number of input views required [24,25,26], improving generalization to different scenes [27], reconstructing large-scale urban scenes [28,29,30,31], and implementing scene editing [32]. A review of related work can be found in the literature [33]. Here, we will focus on those research efforts aimed at speeding up NeRF training.
Traditional NeRF methods are generally slow in terms of training speed, which has motivated a series of studies, such as DVGO [34], Plenoxels [35], TensoRF [36], and Instant-NGP [17], which aim to accelerate the training process of the NeRF. DVGO represents the scene using an optimized dense voxel grid, which consists of a density grid for the scene geometry and a feature grid to capture view-dependent appearance details. Impressively, DVGO’s training time on the Synthetic-NeRF dataset is only 15 min, which is a significant speedup over the time required by the traditional NeRF. Plenoxels represents the scene as a sparse 3D grid with spherical harmonics without any neural network, and its optimization speed is two orders of magnitude faster than that of the NeRF. TensoRF also adopts an MLP-free strategy by decomposing the 3D scene, storing the scalar density and vector color features in a three-dimensional voxel grid, and expressing them as tensors. By further decomposing these tensors into more compact vector and matrix factors, TensoRF is able to represent the scene in a more compact form and speed up the training process.
Instant-NGP [17] proposes a multi-resolution hash position encoding technique and utilizes small but efficient MLPs to represent the scene. This method achieves an extremely fast training speed. Furthermore, Instant-NGP can even complete the reconstruction of small-scale scenes within 1 min by adopting an empty-space-skipping strategy and an efficient CUDA implementation. Most importantly, the Instant-NGP framework is designed to be flexible, allowing it to be easily extended to different application tasks. In particular, for large-scene reconstruction, Instant-NGP demonstrates unique advantages: it can handle the detail and complexity of large-scale scenes while mitigating the time and resource consumption commonly encountered in processing them.

2.2. Neural Radiance Fields for Satellite Photogrammetry

Reconstructing 3D geometry from satellite images has become a key research topic in the fields of remote sensing and geographic information science. Satellite images are almost always taken at oblique angles, and a collection of images acquired from different angles provides a multi-view perspective of the scene that a single image cannot. Reconstructing a 3D model of the Earth’s surface from such collections therefore plays a vital role in urban planning, environmental monitoring and protection, and disaster assessment and management.
Neural radiance field technology has been successfully applied to reconstruct the Earth’s surface from multi-view satellite imagery, demonstrating its potential in the domain of 3D reconstruction from satellite imagery. Refs. [11,16,37,38,39] have explored satellite 3D reconstruction through the neural radiance field method. Derksen et al. [11] took the lead in exploring remote sensing 3D reconstruction based on neural radiance fields and proposed S-NeRF, a significant advancement in this discipline. A basic assumption of the NeRF is that the scene is static and captured at the same time. However, unlike stereo or tri-stereo images, multi-view satellite images are often captured at different times, resulting in obvious appearance and color differences between different images. S-NeRF adapts to multi-view satellite images with different illumination conditions by modeling radiance as the product of irradiance and albedo. In addition, Sat-NeRF [16] introduces the RPC camera model to more accurately compute the origin and direction of rays for ray marching in the NeRF, and it addresses transient phenomena that are not easily explained by the sun’s position through a shadow-aware irradiance model and uncertainty weights. Sat-Mesh [39], on the other hand, uses an MLP to learn a signed distance function (SDF) and integrates it within the volume-rendering framework to achieve multi-view satellite reconstruction. Season-NeRF [37] enables the NeRF model to learn and render seasonal features by including time as an additional input variable. SpS-NeRF [38] uses low-resolution dense depth generated by traditional multi-view stereo (MVS) for supervision, which enables the network to learn geometric structures even with few satellite image inputs.
Although the above works have achieved good results in remote sensing 3D reconstruction, they usually require 8–10 h or even longer for training and optimization, as shown in Table 1, which significantly limits further research and applications in practical scenarios. Therefore, one of the goals of our research work is to accelerate neural radiance field methods for multi-view satellite imagery.
In this study, we thoroughly explore and extend Instant-NGP and Sat-NeRF. By adopting the hash grid encoding of Instant-NGP, we significantly improve the efficiency of model training, which provides new possibilities for processing large-scale datasets. At the same time, we improve the irradiance model of Sat-NeRF to cope with illumination changes and transient objects (cars, shadows, etc.) in multi-date images. In addition, we use a stochastic structural similarity loss function to capture the rich structural information in remotely sensed images and improve the model’s ability to reconstruct geometric details in complex scenes.

3. Method

In Section 3.1, we review the core principles of neural radiance fields to provide the necessary theoretical foundation for our approach; Section 3.2 details improvements to the irradiance model; the network model structure is presented in Section 3.3, while Section 3.4 delves into the multi-resolution hash grid encoding; and finally, in Section 3.5, we introduce the joint loss function that combines uncertainty awareness and stochastic structural similarity.

3.1. Preliminary

Representation. The NeRF [15] models the structure and appearance of a scene from a limited collection of images, representing the 3D scene as a continuous 5D function that takes as input the 3D position $\mathbf{x} = (x, y, z)$ and the viewing direction $\mathbf{d} = (\theta, \phi)$ and predicts the radiance $c(\mathbf{x}, \mathbf{d}) = (r, g, b)$ and the volume density $\sigma(\mathbf{x})$. The color depends on both the 3D position $\mathbf{x}$ and the viewing direction $\mathbf{d}$ to capture view-dependent effects, while the density depends only on $\mathbf{x}$ to maintain geometric consistency. The NeRF is typically parameterized by a multi-layer perceptron (MLP):

$(c, \sigma) = F_\theta(\gamma_x(\mathbf{x}), \gamma_d(\mathbf{d}))$

where $F_\theta$ represents the MLP with weights $\theta$, and $\gamma(\cdot)$ is the positional encoding that maps each element of the input vector into a higher-dimensional space of $L$ frequencies:

$\gamma_L(x) = \left( \sin(\pi x), \cos(\pi x), \ldots, \sin(2^{L-1}\pi x), \cos(2^{L-1}\pi x) \right)^T$

In the implementation, $\mathbf{x}$ is first fed into the MLP, which outputs $\sigma$ and intermediate features; the features and the positional encoding of the viewing direction are then passed to additional fully connected layers to predict the color $c$, as shown in Figure 2. Therefore, the volume density is determined by the spatial position $\mathbf{x}$ alone, while the color $c$ is determined by both the spatial position and the viewing direction:

$(\sigma, \mathbf{z}) = F_{\theta_1}(\gamma_x(\mathbf{x}))$

$c = F_{\theta_2}(\gamma_d(\mathbf{d}), \mathbf{z})$

where $\theta_1$ and $\theta_2$ are the MLP parameters and $\gamma_x$ and $\gamma_d$ are the positional encoding functions applied to each component of $\mathbf{x}$ and $\mathbf{d}$, respectively.
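To make the frequency encoding concrete, the following is a minimal NumPy sketch; the function name and the choice of $L$ are illustrative (the original NeRF uses L = 10 for positions and L = 4 for viewing directions, which is not necessarily the configuration used in SatelliteRF):

```python
import numpy as np

def positional_encoding(p, L=10):
    """Map each coordinate to 2L sinusoidal features, as in the equation above."""
    p = np.asarray(p, dtype=np.float32)            # e.g. shape (3,) for x = (x, y, z)
    freqs = (2.0 ** np.arange(L)) * np.pi          # pi, 2*pi, ..., 2^(L-1)*pi
    angles = p[..., None] * freqs                  # shape (..., 3, L)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*p.shape[:-1], -1)          # shape (..., 3 * 2L)
```

For example, a 3D position encoded with L = 10 becomes a 60-dimensional input vector to the MLP.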
Render. To render each pixel in the image, a camera ray $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$ is cast from the camera origin $\mathbf{o}$ along the direction $\mathbf{d}$ toward the scene, as shown in Figure 3. Following volume rendering [15], the NeRF computes samples along the ray and accumulates the colors of the sampled points across the camera ray to visualize the scene from any given camera position. The color $\hat{C}(\mathbf{r})$ of the pixel associated with the ray $\mathbf{r}(t)$ is determined via discrete integration as follows:

$\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \left(1 - \exp(-\sigma_i \delta_i)\right) c_i$

$T_i = \exp\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right)$

$\delta_i = t_{i+1} - t_i$

where $c_i$ and $\sigma_i$ are the color and density of the sample point $\mathbf{r}(t_i)$, $\delta_i$ is the distance between adjacent samples, and $T_i$ is the cumulative transmittance along the ray, which denotes the probability that the ray travels up to $t_i$ without hitting any particle.
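The discrete integration can be written compactly in code. The sketch below handles a single ray in NumPy and is intended only to make the accumulation explicit; it is not the optimized NerfAcc-based renderer used in our implementation:

```python
import numpy as np

def composite_ray(sigmas, colors, t_vals):
    """Accumulate sampled colors along one ray per the equations above.

    sigmas: (N,) densities sigma_i at the sampled points
    colors: (N, 3) RGB values c_i at the sampled points
    t_vals: (N,) sample depths t_i along the ray
    """
    deltas = np.append(t_vals[1:] - t_vals[:-1], 1e10)      # delta_i = t_{i+1} - t_i
    alphas = 1.0 - np.exp(-sigmas * deltas)                  # per-segment opacity
    trans = np.cumprod(np.append(1.0, 1.0 - alphas[:-1]))    # transmittance T_i
    weights = trans * alphas                                 # contribution of each sample
    return (weights[:, None] * colors).sum(axis=0)           # pixel color C_hat(r)
```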
Optimization. The NeRF network is trained under the guidance of a color-loss function, which minimizes the MSE between the rendered color of each ray $\mathbf{r}$ and its corresponding ground-truth color:

$\mathcal{L}_{MSE} = \sum_{\mathbf{r} \in R} \left\| \hat{C}(\mathbf{r}) - C(\mathbf{r}) \right\|_2^2$

where $C(\mathbf{r})$ represents the ground-truth color and $\hat{C}(\mathbf{r})$ represents the rendered color of the ray $\mathbf{r}$.

3.2. Irradiance Model

Multi-date satellite images are collected at different times under varying environmental conditions such as illumination, resulting in significant appearance variations between images, as shown in Figure 4, which violates the NeRF’s assumption that the world is photometrically static. Therefore, we follow the irradiance model used in [11,16] to explicitly model the photometric differences between images. The irradiance model outputs the albedo $c_a(\mathbf{x})$ from the network rather than modeling the radiance $c(\mathbf{x})$ directly, and the scene radiance is expressed as the product of the albedo color of the static surface $c_a(\mathbf{x})$ and the ambient light irradiance $l(\mathbf{x}, \mathbf{d}_s)$:

$c(\mathbf{x}, \mathbf{d}_s) = l(\mathbf{x}, \mathbf{d}_s) \cdot c_a(\mathbf{x})$

where

$l(\mathbf{x}, \mathbf{d}_s) = t(\mathbf{x}) + (1 - t(\mathbf{x})) \cdot a(\mathbf{d}_s)$

Here, $l(\mathbf{x}, \mathbf{d}_s)$ represents the ambient light irradiance, $t(\mathbf{x})$ is a transient scalar with a value between 0 and 1, and $a(\mathbf{d}_s)$ is the global ambient color, which is determined by the sun direction $\mathbf{d}_s$ and is independent of the scene geometry. Different from Sat-NeRF, we model $t(\mathbf{x})$ as a transient scalar rather than just a shadow-aware scalar and eliminate the sun direction vector from its input.
For dynamic changes such as the shadows and cars present in multi-date images, we follow Sat-NeRF [16] and NeRF-W [40] by embedding an image-specific latent vector $\mathbf{t}_j$ into the input layer to capture the transient phenomena observed in the input image and separate them from the static representation of the 3D world.
In summary, according to the irradiance model, the color $c$ of each point $\mathbf{x}$ along the ray is

$c(\mathbf{t}_j, \mathbf{x}, \mathbf{d}_s) = c_a(\mathbf{x}) \cdot \left( t(\mathbf{x}) + (1 - t(\mathbf{x})) \cdot a(\mathbf{d}_s) \right)$
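The color composition of the irradiance model is straightforward to express in code. The sketch below assumes per-point tensors whose shapes are purely illustrative:

```python
import numpy as np

def shade(albedo, transient_t, ambient_a):
    """Compose c = c_a(x) * (t(x) + (1 - t(x)) * a(d_s)) from the model above.

    albedo:      (N, 3) albedo colors c_a(x) of the sampled points
    transient_t: (N, 1) transient scalars t(x) in [0, 1]
    ambient_a:   (3,)   ambient color a(d_s) for the image's sun direction
    """
    irradiance = transient_t + (1.0 - transient_t) * ambient_a   # l(x, d_s), broadcast to (N, 3)
    return albedo * irradiance
```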

3.3. Network Architecture

The network architecture is shown in Figure 5. The network takes the 3D point $\mathbf{x}$ as input, predicts the volume density $\sigma$, and outputs intermediate features. The intermediate features and the viewing direction are passed to the albedo head to predict the albedo $c_a$. In addition, the intermediate features are combined with the transient embedding $\mathbf{t}_j$ in the transient head to predict the transient scalar and the uncertainty. The sun direction is fed separately into an MLP to predict the ambient color $a$, which is constant for all points $\mathbf{x}$. For the input position, we use multi-resolution hash grid encoding (described in the next section), while for the direction vectors $\mathbf{d}$ and $\mathbf{d}_s$, we use spherical harmonics encoding.
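As a sketch of how the heads in Figure 5 fit together, the following PyTorch module reproduces the data flow described above. The layer widths, embedding size, and activations are assumptions rather than the exact SatelliteRF configuration, and the hash and spherical harmonics encoders are assumed to have been applied to the inputs beforehand:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SatelliteRFHeads(nn.Module):
    """Illustrative head structure following Figure 5 (widths are assumptions)."""
    def __init__(self, pos_feat=32, dir_feat=16, embed_dim=4, hidden=64):
        super().__init__()
        # trunk: hash-encoded position -> density + intermediate feature
        self.trunk = nn.Sequential(nn.Linear(pos_feat, hidden), nn.ReLU(),
                                   nn.Linear(hidden, 1 + hidden))
        # albedo head: feature + encoded viewing direction -> albedo color c_a
        self.albedo = nn.Sequential(nn.Linear(hidden + dir_feat, hidden), nn.ReLU(),
                                    nn.Linear(hidden, 3), nn.Sigmoid())
        # transient head: feature + per-image embedding -> transient scalar t, uncertainty beta
        self.transient = nn.Sequential(nn.Linear(hidden + embed_dim, hidden), nn.ReLU(),
                                       nn.Linear(hidden, 2))
        # ambient head: encoded sun direction -> global ambient color a
        self.ambient = nn.Sequential(nn.Linear(dir_feat, hidden), nn.ReLU(),
                                     nn.Linear(hidden, 3), nn.Sigmoid())

    def forward(self, pos_enc, dir_enc, sun_enc, t_embed):
        h = self.trunk(pos_enc)
        sigma, feat = F.relu(h[..., :1]), h[..., 1:]                 # density and features
        albedo = self.albedo(torch.cat([feat, dir_enc], dim=-1))
        t_raw, beta_raw = self.transient(torch.cat([feat, t_embed], dim=-1)).split(1, dim=-1)
        t = torch.sigmoid(t_raw)                                     # transient scalar in [0, 1]
        beta = F.softplus(beta_raw)                                  # non-negative uncertainty
        ambient = self.ambient(sun_enc)                              # constant per sun direction
        return sigma, albedo, t, beta, ambient
```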

3.4. Multi-Resolution Hash Encoding

Due to the large amount of training time required, it is difficult to extend the NeRF to handle large-scale scenarios such as cities. Recently, Instant-NGP proposed a learned, parameterized multi-resolution hash encoding that is trained jointly with the MLP network of the NeRF model. Thanks to the representation capability of multi-resolution hash encoding, Instant-NGP can accurately represent scenes with tiny and efficient MLPs, which greatly improves the training and inference speed of NeRF models. Our work uses multi-resolution hash encoding of the position coordinates to accelerate model training.
Multi-resolution hash encoding transforms each input 3D point into a higher-dimensional concatenation of features learned at each resolution level by defining a multi-scale grid structure. This structure finds embeddings for input coordinates in O(1) time by using hash tables [41] and interpolates them to obtain a feature value for each position coordinate. The hash tables are arranged by level from coarse to fine; at each resolution level, the corner points of the voxel containing the coordinate are located, their encodings are retrieved, and these are interpolated to create a feature vector for the queried coordinate. The per-level feature vectors are then concatenated to form a single input embedding.
Specifically, multi-resolution hash encoding is structured into $L$ levels, with each level comprising $T$ feature vectors, each of dimension $F$. These levels operate independently and hold the feature vectors at the vertices of a grid. The resolution of this grid follows a geometric progression across the levels, ranging from the coarsest to the finest resolution:

$N_l = N_{min} \cdot b^l$

$b = \exp\left( \frac{\ln N_{max} - \ln N_{min}}{L - 1} \right)$

where $N_{max}$ represents the maximum resolution and $N_{min}$ represents the base resolution.
As shown in Figure 6, for each point $\mathbf{x} \in \mathbb{R}^d$, we first find the grid cell in which the point lies and then perform $d$-linear interpolation on the features at the grid vertices to generate the $F$-dimensional feature vector of the point $\mathbf{x}$. Finally, the features from the multiple resolution levels are concatenated into a single vector. For a given level $l$, the input coordinates $\mathbf{x} \in \mathbb{R}^d$ are scaled according to the grid resolution and rounded down and up to map to a grid cell within level $l$:

$\lfloor \mathbf{x}_l \rfloor = \lfloor \mathbf{x} \cdot N_l \rfloor$

$\lceil \mathbf{x}_l \rceil = \lceil \mathbf{x} \cdot N_l \rceil$

The interpolation weight of the feature vector of each corner point is calculated from its position relative to $\mathbf{x}$ as

$\mathbf{w}_l = \mathbf{x}_l - \lfloor \mathbf{x}_l \rfloor$

where $\mathbf{x}_l = \mathbf{x} \cdot N_l$.
In multi-resolution hash encoding, the total number of F-dimensional features at each level is fixed to T. At the coarse level, each vertex has a unique coordinate, but at the fine level, if more than T features are required, a spatial hash function [41] is used to extract the required feature vectors.
The size of the hash table T can be a trade-off between time and quality. The learned feature-embedding vectors allow the dimensionality of the MLP network to be reduced, thereby significantly reducing the overall time of model training. Furthermore, by implementing this process in optimized CUDA code, the convergence time of the model is orders of magnitude less than that of the original model implementation.
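The lookup-and-interpolate procedure can be sketched as follows. This simplified PyTorch version assumes coordinates normalized to the unit cube and, unlike Instant-NGP, hashes every level rather than indexing coarse levels directly; the primes are those of the spatial hash in [41] as used by Instant-NGP, while the defaults for $N_{min}$ and $b$ are illustrative:

```python
import torch

# Per-dimension primes of the spatial hash from Teschner et al. [41].
PRIMES = torch.tensor([1, 2654435761, 805459861])

def hash_encode(x, tables, n_min=16, b=1.5):
    """Multi-resolution hash encoding sketch for 3D points x in [0, 1]^3.

    x:      (N, 3) query coordinates
    tables: list of L learnable feature tables, each of shape (T, F)
    """
    feats = []
    for level, table in enumerate(tables):
        T, F = table.shape
        n_l = int(n_min * b ** level)                      # N_l = N_min * b^l
        xl = x * n_l
        lo = torch.floor(xl).long()                        # lower corner of the voxel
        w = xl - lo                                        # interpolation weights w_l
        out = torch.zeros(x.shape[0], F)
        for corner in range(8):                            # 8 corners of a 3D voxel
            offs = torch.tensor([(corner >> d) & 1 for d in range(3)])
            idx = (lo + offs) * PRIMES                     # integer corner coords * primes
            h = (idx[:, 0] ^ idx[:, 1] ^ idx[:, 2]) % T    # XOR-combine, wrap into the table
            wc = torch.prod(torch.where(offs.bool(), w, 1 - w), dim=-1)
            out = out + wc[:, None] * table[h]             # trilinear interpolation
        feats.append(out)
    return torch.cat(feats, dim=-1)                        # concatenate the L levels
```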

3.5. Loss Function

Traditional NeRF training paradigms usually optimize only a point-wise MSE loss, ignoring the rich structural information contained in pixel groups. High-resolution remote sensing images contain rich structural textures, and different types of targets, such as airports, ports, buildings, green spaces, and sea surfaces, have different texture structures. Inspired by this, in addition to the MSE loss, we introduce a new structural supervision loss $\mathcal{L}_{S3IM}$ based on stochastic structural similarity [42] to capture the structural information in the input satellite images.
The stochastic structural similarity (S3IM) index measures the similarity between two sets of pixels and captures non-local structural similarity information from randomly sampled pixels. S3IM is model-independent and can be universally applied to all types of neural field methods with limited coding and computational costs. It has been demonstrated in [42] that S3IM is very effective in improving the NeRF learning of correct colors and geometric structures and is robust even when the input is sparse and the scene content is complex.
S3IM builds on SSIM and is a stochastic variant of SSIM to preserve positional relationships in local blocks in random training. SSIM is a commonly used image quality metric used to evaluate local structural similarities between images. SSIM is believed to have a good correlation with the quality perception of the human visual system and is widely used to evaluate the quality of NeRF-rendered images. Assuming that x and y represent two different image blocks, SSIM consists of three indicators: luminance, contrast, and structure:
$\text{SSIM}(x, y) = l(x, y) \cdot c(x, y) \cdot s(x, y)$
In the proposed method, we quantify image quality by evaluating three key components: luminance $l(x, y)$, contrast $c(x, y)$, and structure $s(x, y)$. These components are defined with respect to the local statistics of an image as follows.
The luminance component $l(x, y)$ is calculated as

$l(x, y) = \dfrac{2\mu_x\mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}$

where $\mu_x$ and $\mu_y$ represent the mean luminance values of the two images (or two regions within an image) being compared, and $C_1$ is a constant to avoid division by zero.
The contrast component $c(x, y)$ is given by

$c(x, y) = \dfrac{2\sigma_x\sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}$

where $\sigma_x$ and $\sigma_y$ are the standard deviations of the luminance values, indicating the image contrast levels, and $C_2$ is a constant for stabilization.
Lastly, the structure component $s(x, y)$ is defined as

$s(x, y) = \dfrac{\sigma_{xy} + C_3}{\sigma_x\sigma_y + C_3}$

where $\sigma_{xy}$ denotes the covariance between the two images (or regions), indicating the structural similarity, and $C_3$ is a constant to ensure mathematical stability.
S3IM is computed on local blocks that retain some positional information:

$S3IM(\hat{R}, R) = \dfrac{1}{M} \sum_{m=1}^{M} \text{SSIM}\left( \mathcal{P}_m(\hat{C}), \mathcal{P}_m(C) \right)$

where $\mathcal{P}_m(\hat{C})$ and $\mathcal{P}_m(C)$ denote a rendered block and the corresponding ground-truth image block, each randomly formed from $B$ rays/pixels in the batch $R$. $\text{SSIM}\left( \mathcal{P}_m(\hat{C}), \mathcal{P}_m(C) \right)$ denotes the computation of SSIM with a kernel size of $K \times K$ and a stride of $s$ on the rendered block and the corresponding ground-truth patch. The final S3IM is obtained by averaging the $M$ estimated SSIM values.
Although S3IM requires multiple computations of SSIM over multiple random patches, it only needs to be backpropagated once per iteration, and thus it does not incur significant additional computational costs. Since S3IM lies in $[-1, 1]$ and is positively correlated with image quality, the S3IM-based loss $\mathcal{L}_{S3IM}$ is defined as

$\mathcal{L}_{S3IM} = 1 - \dfrac{1}{M} \sum_{m=1}^{M} \text{SSIM}\left( \mathcal{P}_m(\hat{C}), \mathcal{P}_m(C) \right)$
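The following is a simplified sketch of how this loss can be computed over a batch of rays. It uses a box-filter SSIM rather than the Gaussian-windowed variant, and the patch size, kernel size, stride, and number of repetitions M are illustrative values, not necessarily those used in our experiments:

```python
import torch
import torch.nn.functional as F

def ssim(a, b, C1=0.01 ** 2, C2=0.03 ** 2, kernel=4, stride=4):
    """Mean SSIM between two image tensors of shape (1, 3, H, W), box-filter windows."""
    mu_a, mu_b = F.avg_pool2d(a, kernel, stride), F.avg_pool2d(b, kernel, stride)
    var_a = F.avg_pool2d(a * a, kernel, stride) - mu_a ** 2
    var_b = F.avg_pool2d(b * b, kernel, stride) - mu_b ** 2
    cov = F.avg_pool2d(a * b, kernel, stride) - mu_a * mu_b
    num = (2 * mu_a * mu_b + C1) * (2 * cov + C2)
    den = (mu_a ** 2 + mu_b ** 2 + C1) * (var_a + var_b + C2)
    return (num / den).mean()

def s3im_loss(pred, gt, M=10, patch_h=32, patch_w=32):
    """Stochastic structural similarity loss, following the definition above.

    pred, gt: (B, 3) rendered and ground-truth colors of the B rays in a batch,
              with B >= patch_h * patch_w.
    """
    total = 0.0
    for _ in range(M):
        idx = torch.randperm(pred.shape[0])[: patch_h * patch_w]   # random pixel group P_m
        p = pred[idx].T.reshape(1, 3, patch_h, patch_w)            # fold rays into a patch
        g = gt[idx].T.reshape(1, 3, patch_h, patch_w)              # same permutation for GT
        total = total + ssim(p, g)
    return 1.0 - total / M                                         # L_S3IM = 1 - mean SSIM
```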
Our total loss function includes two parts: a color loss and a structure loss. For the color loss, we use a loss function similar to NeRF-W [40] and Sat-NeRF [16], reducing the contribution of camera rays emitted from pixels at transient locations via an uncertainty scalar $\beta$:

$\mathcal{L}_{color} = \sum_{\mathbf{r} \in R} \left[ \dfrac{\left\| \hat{C}(\mathbf{r}) - C(\mathbf{r}) \right\|_2^2}{2\beta'(\mathbf{r})^2} + \dfrac{\log \beta'(\mathbf{r}) + \eta}{2} \right]$

where $R$ represents all rays, $C(\mathbf{r})$ represents the ground-truth color, $\hat{C}(\mathbf{r})$ represents the rendered color of the ray $\mathbf{r}$, and $\beta'(\mathbf{r}) = \beta(\mathbf{r}) + \beta_{min}$. To prevent negative values, we set $\beta_{min} = 0.05$ and $\eta = 3$.
The total loss $\mathcal{L}$ is the weighted sum of the color loss and the structural supervision loss:

$\mathcal{L} = \mathcal{L}_{color} + \lambda \mathcal{L}_{S3IM}$

where $\lambda$ is the weight of the structural supervision loss.
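Putting the two terms together, a minimal sketch of the training objective is given below, reusing the s3im_loss sketch above. The weight lambda and the use of a mean rather than a sum over rays are illustrative choices, not the exact settings of our experiments:

```python
import torch

def color_loss(pred, gt, beta, beta_min=0.05, eta=3.0):
    """Uncertainty-weighted color loss following NeRF-W / Sat-NeRF.

    pred, gt: (B, 3) rendered and ground-truth ray colors
    beta:     (B,)   predicted uncertainty of each ray
    """
    b = beta + beta_min                                  # beta'(r) = beta(r) + beta_min
    sq_err = ((pred - gt) ** 2).sum(dim=-1)              # ||C_hat(r) - C(r)||_2^2
    return (sq_err / (2 * b ** 2) + (torch.log(b) + eta) / 2).mean()

def total_loss(pred, gt, beta, lam=0.1, **s3im_kwargs):
    """L = L_color + lambda * L_S3IM (lambda = 0.1 is an illustrative value)."""
    # s3im_loss is the sketch from the previous listing
    return color_loss(pred, gt, beta) + lam * s3im_loss(pred, gt, **s3im_kwargs)
```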

4. Experiment

4.1. Experiments Setting

We evaluate our method on the dataset from the 2019 IEEE GRSS Data Fusion Contest [18,19]. This dataset provides 26 WorldView-3 images at 0.3 m resolution collected in Jacksonville, Florida, between 2014 and 2016. For comparison with other methods, we followed the settings of [16] and selected the scenes numbered 004, 068, 214, and 260, covering vegetation, urban buildings, water bodies, etc., as shown in Figure 7. Following [16], the original 2048 × 2048 images were cropped according to the DSM geographical area of each scene to approximately 800 × 800 pixels at a resolution of 0.3 m, so each AOI covers 256 × 256 m, approximately 60,000 m². Table 2 displays the comprehensive details of each scene within the dataset.
Metric. For quantitative comparison, the Peak Signal-to-Noise Ratio (PSNR), which computes the mean square error (MSE) in a logarithmic scale, and the Structural Similarity Index Measure (SSIM), designed to reflect human visual perception, are employed. The calculation of SSIM metrics is described in Section 3.5. The PSNR metric is calculated as follows:
$PSNR(I) = 10 \cdot \log_{10}\left( \dfrac{MAX_I^2}{MSE(I)} \right)$

where $MAX_I$ represents the maximum possible pixel value of the image (255 for an eight-bit image), and $MSE(I)$ is the pixel mean square error calculated over all color channels.
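For reference, the PSNR metric is directly implemented as follows; the sketch assumes 8-bit images:

```python
import numpy as np

def psnr(img, ref, max_val=255.0):
    """PSNR in dB between a rendered image and its reference, per the equation above."""
    mse = np.mean((img.astype(np.float64) - ref.astype(np.float64)) ** 2)  # over all channels
    return 10.0 * np.log10(max_val ** 2 / mse)
```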
Baseline. We compare our method with the original NeRF [15] method, as well as S-NeRF [11] and Sat-NeRF [16], which are specifically designed for remote sensing scenes. The same framework and data loading are used for all NeRF variants.
Implementation details. Our approach builds on NerfAcc [43]. NerfAcc is a plug-and-play toolbox that provides efficient sampling for volume rendering and is universally applicable to various radiance fields. For the training process, we employ the Adam optimizer and select a batch size of 1024 rays. The computational environment for our experiments is detailed as follows: our models are trained on a machine equipped with an NVIDIA RTX 3090 GPU. The CPU is an Intel Core i9-10900K, complemented by 64 GB RAM. The system runs on Ubuntu 20.04 LTS, and we use Python 3.8 for our implementations. Each model converges within a timeframe of 5–10 min, showcasing the efficiency of our setup.

4.2. Results

In the quantitative analysis, we present the performance metrics of novel view synthesis for selected scenes in Table 3, which includes the PSNR and SSIM scores for our proposed method across various scenarios within the experimental dataset. Notably, our method exhibits superior performance over previous methods in terms of both the PSNR and SSIM metrics, indicating its effectiveness in generating high-quality synthesized views. However, a closer examination of the SSIM scores reveals a particular case where the SatelliteRF method on Scene 004 demonstrates a lower SSIM score compared to other NeRF-based approaches. Scene 004 is distinguished by its highly dynamic range and complex geometric structures, which pose significant challenges for depth estimation and texture rendering. The SatelliteRF method, while robust in handling various scenarios, might be less effective in accurately capturing the intricate details and maintaining texture consistency in such a complex environment. This discrepancy underscores the importance of further refining the depth estimation and texture rendering components of SatelliteRF to enhance its adaptability to scenes with complex geometries and dynamic ranges.
Figure 8 illustrates that, in the presence of varying illumination conditions typical of remote sensing scenarios, the original NeRF fails to accurately capture the scene’s appearance. In contrast, S-NeRF, Sat-NeRF, and our proposed method are capable of effectively managing these changes in illumination. As shown in Figure 9, our method has significant improvements in image reconstruction quality and visual effects. Especially when dealing with complex scenes and detail-rich areas, our method can better capture the characteristics of the original scene and generate more realistic and natural images.
To validate that our method is able to learn high-quality 3D models, we qualitatively compare the mesh models generated by SatelliteRF with those generated by LiDAR and S2P [1]. S2P won the championship of the 2016 IARPA Multi-View Stereo 3D Mapping Challenge. As shown in Figure 10, although the model surface generated by S2P is smoother, the model obtained by SatelliteRF shows finer structural details. The unevenness of the surface of the SatelliteRF-generated model is mainly due to the fact that the NeRF is unable to sample exactly on the surface and can only sample near it. This problem may be alleviated by introducing depth supervision or geometric constraints. In addition, we further visualize the reconstruction results of the 3D scene, including point clouds and meshes, in Figure 11, further demonstrating the potential of our method in practical applications.
A comparative analysis of training times for multi-view satellite image reconstruction methods based on neural radiance fields reveals the temporal demands of various scene representation techniques across a range of application scenarios, as summarized in Table 4. Notably, our method requires a mere 6.1 min of training time, yet it sustains competitive accuracy and efficiency in scene reconstruction tasks. This underscores the effectiveness of our approach in contexts that require rapid model development. Given the extensive training durations typically required by conventional methods, our approach marks a significant advancement, facilitating quicker iterations and deployment in time-sensitive applications. This advantage is especially pertinent in dynamic settings where swift model updates are crucial for maintaining operational accuracy and relevance.

5. Conclusions

In this paper, we propose SatelliteRF, a novel neural radiance field variant designed for the fast and efficient 3D reconstruction of the Earth’s surface from satellite images. By introducing multi-resolution hash encoding, SatelliteRF is able to significantly improve the training speed of the model and the efficiency of scene reconstruction while using a smaller network size without losing reconstruction quality. In addition, by combining irradiance models and random structural similarity loss, SatelliteRF further improves the novel view synthesis and 3D reconstruction quality. The method supports the export of 3D models such as the digital surface model (DSM), point cloud, and mesh, providing data support for applications in urban planning, environmental monitoring, disaster assessment, and other fields. Although SatelliteRF provides a powerful tool for fast and efficient reconstruction using remote sensing images, there are still challenges in modeling special scenes such as specular reflection on water surfaces and scaling up to large-scale scenes, which require further research to improve its versatility and adaptability, and we leave them for future work.

Author Contributions

Conceptualization, X.Z., Y.W. and D.L.; methodology, X.Z. and D.L.; software, X.Z.; validation, X.Z., Y.W. and D.L.; formal analysis, X.Z.; investigation, X.Z.; resources, J.L.; data curation, B.L.; writing—original draft preparation, X.Z.; writing—review and editing, J.L.; visualization, Z.C. and B.L.; supervision, J.L.; project administration, Y.W. and J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Facciolo, G.; De Franchis, C.; Meinhardt-Llopis, E. Automatic 3D reconstruction from multi-date satellite images. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, USA, 21–26 July 2017; pp. 57–66. [Google Scholar]
  2. Zhang, K.; Snavely, N.; Sun, J. Leveraging vision reconstruction pipelines for satellite imagery. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Republic of Korea, 27–28 October 2019; pp. 2139–2148. [Google Scholar]
  3. Leotta, M.J.; Long, C.; Jacquet, B.; Zins, M.; Lipsa, D.; Shan, J.; Xu, B.; Li, Z.; Zhang, X.; Chang, S.-F. Urban semantic 3D reconstruction from multiview satellite imagery. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Long Beach, CA, USA, 16–17 June 2019; pp. 1451–1460. [Google Scholar]
  4. Zhao, L.; Wang, H.; Zhu, Y.; Song, M. A review of 3D reconstruction from high-resolution urban satellite images. Int. J. Remote Sens. 2023, 44, 713–748. [Google Scholar] [CrossRef]
  5. Huang, X.; Wen, D.; Li, J.; Qin, R. Multi-level monitoring of subtle urban changes for the megacities of China using high-resolution multi-view satellite imagery. Remote Sens. Environ. 2017, 196, 56–75. [Google Scholar] [CrossRef]
  6. Li, S.; Zhu, Z.; Wang, H.; Xu, F. 3D virtual urban scene reconstruction from a single optical remote sensing image. IEEE Access 2019, 7, 68305–68315. [Google Scholar] [CrossRef]
  7. Zhao, S.; Wang, Q.; Li, Y.; Liu, S.; Wang, Z.; Zhu, L.; Wang, Z. An overview of satellite remote sensing technology used in China’s environmental protection. Earth Sci. Informatics 2017, 10, 137–148. [Google Scholar] [CrossRef]
  8. Huang, Y.; Dugmag, H.; Barfoot, T.D.; Shkurti, F. Stochastic planning for asv navigation using satellite images. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; IEEE: New York, NY, USA, 2023; pp. 1055–1061. [Google Scholar]
  9. Barrile, V.; Bilotta, G.; Fotia, A.; Bernardo, E. Road extraction for emergencies from satellite imagery. In Computational Science and Its Applications–ICCSA 2020: 20th International Conference, Cagliari, Italy, 1–4 July 2020, Proceedings; Part IV 20; Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 767–781. [Google Scholar]
  10. Liu, C.; Szirányi, T. Road Condition Detection and Emergency Rescue Recognition Using On-Board UAV in the Wildness. Remote Sens. 2022, 14, 4355. [Google Scholar] [CrossRef]
  11. Derksen, D.; Izzo, D. Shadow neural radiance fields for multi-view satellite photogrammetry. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 1152–1161. [Google Scholar]
  12. Hirschmuller, H. Stereo processing by semiglobal matching and mutual information. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 30, 328–341. [Google Scholar] [CrossRef] [PubMed]
  13. Marí, R.; Facciolo, G.; Ehret, T. Multi-Date Earth Observation Nerf: The Detail Is in the Shadows. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 2034–2044. [Google Scholar]
  14. Li, X.; Fan, Z.; Liu, X.; Zhang, Y.; Ge, Y.; Wen, L. Photogrammetry for Unconstrained Optical Satellite Imagery with Combined Neural Radiance Fields. IEEE Geosci. Remote Sens. Lett. 2023, 21, 3337352. [Google Scholar] [CrossRef]
  15. Mildenhall, B.; Srinivasan, P.P.; Tancik, M.; Barron, J.T.; Ramamoorthi, R.; Ng, R. Nerf: Representing scenes as neural radiance fields for view synthesis. Commun. ACM 2021, 65, 99–106. [Google Scholar] [CrossRef]
  16. Marí, R.; Facciolo, G.; Ehret, T. Sat-nerf: Learning multi-view satellite photogrammetry with transient objects and shadow modeling using rpc cameras. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 1311–1321. [Google Scholar]
  17. Müller, T.; Evans, A.; Schied, C.; Keller, A. Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph. 2022, 41, 1–15. [Google Scholar] [CrossRef]
  18. Bosch, M.; Foster, K.; Christie, G.; Wang, S.; Hager, G.D.; Brown, M. Semantic stereo for incidental satellite images. In Proceedings of the 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa Village, HI, USA, 7–11 January 2019; IEEE: New York, NY, USA, 2019; pp. 1524–1532. [Google Scholar]
  19. Le Saux, B.; Yokoya, N.; Hansch, R.; Brown, M.; Hager, G. 2019 data fusion contest [technical committees]. IEEE Geosci. Remote Sens. Mag. 2019, 7, 103–105. [Google Scholar] [CrossRef]
  20. Barron, J.T.; Mildenhall, B.; Tancik, M.; Hedman, P.; Martin-Brualla, R.; Srinivasan, P.P. Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 10–17 October 2021; pp. 5855–5864. [Google Scholar]
  21. Verbin, D.; Hedman, P.; Mildenhall, B.; Zickler, T.; Barron, J.T.; Srinivasan, P.P. Ref-nerf: Structured view-dependent appearance for neural radiance fields. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; IEEE: New York, NY, USA, 2022; pp. 5481–5490. [Google Scholar]
  22. Garbin, S.J.; Kowalski, M.; Johnson, M.; Shotton, J.; Valentin, J. Fastnerf: High-fidelity neural rendering at 200fps. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 10–17 October 2021; pp. 14346–14355. [Google Scholar]
  23. Reiser, C.; Peng, S.; Liao, Y.; Geiger, A. Kilonerf: Speeding up neural radiance fields with thousands of tiny mlps. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 10–17 October 2021; pp. 14335–14345. [Google Scholar]
  24. Chen, A.; Xu, Z.; Zhao, F.; Zhang, X.; Xiang, F.; Yu, J.; Su, H. Mvsnerf: Fast generalizable radiance field reconstruction from multi-view stereo. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 10–17 October 2021; pp. 14124–14133. [Google Scholar]
  25. Yu, A.; Ye, V.; Tancik, M.; Kanazawa, A. pixelnerf: Neural radiance fields from one or few images. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 4578–4587. [Google Scholar]
  26. Deng, K.; Liu, A.; Zhu, J.Y.; Ramanan, D. Depth-supervised nerf: Fewer views and faster training for free. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 12882–12891. [Google Scholar]
  27. Jain, A.; Tancik, M.; Abbeel, P. Putting nerf on a diet: Semantically consistent few-shot view synthesis. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 10–17 October 2021; pp. 5885–5894. [Google Scholar]
  28. Rematas, K.; Liu, A.; Srinivasan, P.P.; Barron, J.T.; Tagliasacchi, A.; Funkhouser, T.; Ferrari, V. Urban radiance fields. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 12932–12942. [Google Scholar]
  29. Turki, H.; Ramanan, D.; Satyanarayanan, M. Mega-nerf: Scalable construction of large-scale nerfs for virtual fly-throughs. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 12922–12931. [Google Scholar]
  30. Tancik, M.; Casser, V.; Yan, X.; Pradhan, S.; Mildenhall, B.; Srinivasan, P.P.; Barron, J.T.; Kretzschmar, H. Block-nerf: Scalable large scene neural view synthesis. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 8248–8258. [Google Scholar]
  31. Xiangli, Y.; Xu, L.; Pan, X.; Zhao, N.; Rao, A.; Theobalt, C.; Dai, B.; Lin, D. Bungeenerf: Progressive neural radiance field for extreme multi-scale scene rendering. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2022; pp. 106–122. [Google Scholar]
  32. Yuan, Y.J.; Sun, Y.T.; Lai, Y.K.; Ma, Y.; Jia, R.; Gao, L. Nerf-editing: Geometry editing of neural radiance fields. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 18353–18364. [Google Scholar]
  33. Gao, K.; Gao, Y.; He, H.; Lu, D.; Xu, L.; Li, J. Nerf: Neural radiance field in 3d vision, a comprehensive review. arXiv 2022, arXiv:2210.00379. [Google Scholar]
  34. Sun, C.; Sun, M.; Chen, H.-T. Direct voxel grid optimization: Super-fast convergence for radiance fields reconstruction. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 5459–5469. [Google Scholar]
  35. Fridovich-Keil, S.; Yu, A.; Tancik, M.; Chen, Q.; Recht, B.; Kanazawa, A. Plenoxels: Radiance fields without neural networks. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 5501–5510. [Google Scholar]
  36. Chen, A.; Xu, Z.; Geiger, A.; Yu, J.; Su, H. Tensorf: Tensorial radiance fields. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2022; pp. 333–350. [Google Scholar]
  37. Gableman, M.; Kak, A. Incorporating season and solar specificity into renderings made by a NeRF architecture using satellite images. arXiv 2023, arXiv:2308.01262. [Google Scholar] [CrossRef] [PubMed]
  38. Zhang, L.; Rupnik, E. Sparsesat-NeRF: Dense depth supervised neural radiance fields for sparse satellite images. arXiv 2023, arXiv:2309.00277. [Google Scholar] [CrossRef]
  39. Qu, Y.; Deng, F. Sat-Mesh: Learning Neural Implicit Surfaces for Multi-View Satellite Reconstruction. Remote Sens. 2023, 15, 4297. [Google Scholar] [CrossRef]
  40. Martin-Brualla, R.; Radwan, N.; Sajjadi, M.S.; Barron, J.T.; Dosovitskiy, A.; Duckworth, D. Nerf in the wild: Neural radiance fields for unconstrained photo collections. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 7210–7219. [Google Scholar]
  41. Teschner, M.; Heidelberger, B.; Müller, M.; Pomerantes, D.; Gross, M.H. Optimized spatial hashing for collision detection of deformable objects. In Proceedings of the 8th International Fall Workshop on Vision, Modeling, and Visualization, VMV 2003, Munchen, Germany, 19–21 November 2003; Ertl, T., Ed.; Aka GmbH: Munchen, Germany, 2003; pp. 47–54. [Google Scholar]
  42. Xie, Z.; Yang, X.; Yang, Y.; Sun, Q.; Jiang, Y.; Wang, H.; Cai, Y.; Sun, M. S3im: Stochastic structural similarity and its unreasonable effectiveness for neural fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 18024–18034. [Google Scholar]
  43. Li, R.; Tancik, M.; Kanazawa, A. Nerfacc: A general nerf acceleration toolbox. arXiv 2022, arXiv:2210.04847. [Google Scholar]
Figure 1. The task of 3D reconstruction of satellites using neural radiance fields involves processing a set of satellite images taken from multiple viewpoints along with their associated poses. An MLP (multi-layer perceptron) network is employed to simulate the scene’s color and density attributes. The outcome is the generation of a rendered scene in the form of RGB images, digital surface models (DSM), point clouds, or mesh representations.
Figure 2. NeRF network architecture. The position encoding of the input position is passed through 8 fully connected layers with 256 channels each. Additional layers output volumetric density and feature vectors. This feature vector is concatenated with the position encoding of the input viewing direction, and finally the RGB color is output.
Figure 3. Ray casting and ray marching. The camera position is connected to each image pixel in the world coordinate system to generate a ray $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$. The NeRF computes sample points along the rays and applies volume-rendering techniques to determine each ray’s color. The 3D coordinates $\mathbf{x} = (x, y, z)$ and ray direction $\mathbf{d} = (\theta, \phi)$ of each sampled point are processed through a positional encoding $\gamma$ and subsequently fed into a fully connected network. This network is responsible for predicting the volume density $\sigma$ and RGB value $c$.
Figure 4. Illumination changes and transient objects in multi-date satellite images. The left half shows scene 214 of the DFC 2019 dataset. The right half clearly reveals illumination changes and the existence of transient objects including cars and shadows through the comparison of local areas of 214 scenes in different satellite images. The red box in the left figure indicates the local area.
Figure 5. Network architecture. The inputs to the network are the position $(x, y, z)$ of the 3D point $\mathbf{x}$, the viewing direction $\mathbf{d}$, the sun direction $\mathbf{d}_s$, and the transient embedding $\mathbf{t}_j$. The outputs of the network are the albedo $c_a$, density $\sigma$, transient scalar $t$, uncertainty $\beta$, and ambient color $a$.
Figure 6. Example of 2D multi-resolution hash encoding. For a given input coordinate, we find the surrounding voxels at L = 2 resolution levels and assign indices to their corner points by hashing their integer coordinates. For each corner index, we look up the corresponding F-dimensional feature vector from the hash table and perform linear interpolation based on the relative position of the coordinate within each voxel. Finally, the results of each level are concatenated to generate the input y of the MLP. Scene 068 in the DFC 2019 dataset is used as an example.
Figure 7. DFC 2019 dataset examples. The scene contains vegetation, urban buildings, water bodies, etc.
Figure 8. Visualization of novel view synthesis results on four scene data of the DFC 2019 dataset.
Figure 9. Visualization of new view synthesis results for DFC 2019 dataset 260 scenes. We zoom in on local areas to demonstrate the ability of our method to reconstruct scene details. The red boxes indicate the local areas.
Figure 10. 3D visualization of mesh models generated by LiDAR, S2P, and SatelliteRF. Compared with S2P, SatelliteRF delivers more intricate details and crisper edges, yet displays some local inconsistencies.
Figure 11. Point cloud and mesh models of an example object. (a) shows the point cloud model generated by our method, highlighting the detailed representation of the object’s surface as a collection of points. (b) illustrates the mesh model derived from the point cloud, demonstrating how the points are interconnected to form a continuous surface. Our method supports exporting both point clouds and mesh models for various applications, facilitating flexible use in different domains.
Table 1. Comparison of training time of multi-view satellite image reconstruction methods based on neural radiance fields. This table shows the training time of multi-view satellite image reconstruction algorithms for different scene representation techniques and their diverse application scenarios. The training time data are obtained from the respective original papers. It should be noted that it is not fair to directly compare the training times of different methods due to different computing devices, but this does not affect the overall conclusion that the satellite radiance field method requires a lot of time to train.
Representation | Method | Taxonomy | Training Steps | Training Time
NeRF | S-NeRF [11] | Reconstruction | 100k | 8 h
NeRF | Sat-NeRF [16] | Reconstruction | 300k | ~10–20 h
NeRF | Season-NeRF [37] | Editing | 50k | ~8 h
NeRF | SpS-NeRF [38] | Few-shot, depth supervision | 30k | 2 h
SDF | Sat-Mesh [39] | Reconstruction | 300k | ~8 h
Table 2. Dataset details. The table includes the number of images, train and test split, and height ranges.
Area | 004 | 068 | 214 | 260
Number of images | 11 | 19 | 24 | 17
Train/test split | 9/2 | 17/2 | 21/3 | 15/2
Height range | [−24, 1] | [−27, 30] | [−29, 73] | [−30, 13]
Table 3. Novel view synthesis metrics. We report image rendering metrics for test views of selected scenes.
Area | 004 | 068 | 214 | 260
PSNR (dB)
NeRF | 20.72 | 20.99 | 18.42 | 20.08
S-NeRF | 25.86 | 24.29 | 24.16 | 21.37
Sat-NeRF | 26.32 | 25.11 | 24.99 | 21.79
SatelliteRF (Ours) | 26.59 | 25.29 | 25.51 | 22.03
SSIM
NeRF | 0.640 | 0.826 | 0.808 | 0.773
S-NeRF | 0.864 | 0.897 | 0.936 | 0.816
Sat-NeRF | 0.877 | 0.912 | 0.946 | 0.842
SatelliteRF (Ours) | 0.858 | 0.924 | 0.958 | 0.857
Table 4. Training time comparison for multi-view satellite image reconstruction methods.
Method | Average Training Steps | Average Training Time
S-NeRF [11] | 100k | 8.04 h
Sat-NeRF [16] | 300k | 18.06 h
Ours | 50k | 6.1 min

Share and Cite

MDPI and ACS Style

Zhou, X.; Wang, Y.; Lin, D.; Cao, Z.; Li, B.; Liu, J. SatelliteRF: Accelerating 3D Reconstruction in Multi-View Satellite Images with Efficient Neural Radiance Fields. Appl. Sci. 2024, 14, 2729. https://doi.org/10.3390/app14072729

