Technical Note

Neural Radiance Fields for High-Resolution Remote Sensing Novel View Synthesis

1 Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China
2 Key Laboratory of Technology in Geo-Spatial Information Processing and Application Systems, Chinese Academy of Sciences, Beijing 100190, China
3 School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 101408, China
* Author to whom correspondence should be addressed.
Remote Sens. 2023, 15(16), 3920; https://doi.org/10.3390/rs15163920
Submission received: 26 June 2023 / Revised: 21 July 2023 / Accepted: 6 August 2023 / Published: 8 August 2023
(This article belongs to the Section Remote Sensing Image Processing)

Abstract

Remote sensing images play a crucial role in remote sensing target detection and 3D remote sensing modeling, and enhancing their resolution holds significant application value. Remote sensing target detection requires a substantial amount of high-resolution remote sensing images, while 3D reconstruction tasks generate denser models from diverse view perspectives. However, high-resolution remote sensing images are often limited due to their high acquisition costs, a scarcity of acquisition views, and restricted view perspective variations, which pose challenges for remote sensing tasks. In this paper, we propose an advanced method for high-resolution remote sensing novel view synthesis that integrates attention mechanisms with neural radiance fields to address the scarcity of high-resolution remote sensing images. To enhance the relationships between sampled points and rays and to improve the 3D implicit model representation capability of the network, we introduce a point attention module and a batch attention module into the proposed framework. Additionally, a frequency-weighted position encoding strategy is proposed to determine the significance of each frequency for position encoding. The proposed method is evaluated on the LEVIR-NVS dataset and demonstrates superior performance in quality assessment metrics and visual effects compared to baseline NeRF (Neural Radiance Fields) and ImMPI (Implicit Multi-plane Images). Overall, this work presents a promising approach for remote sensing novel view synthesis by leveraging attention mechanisms and frequency-weighted position encoding.


1. Introduction

Remote sensing imaging has advanced significantly, enabling the capture of Earth’s features at local, regional, and global spatial scales. The proliferation of remote sensing satellites over the years has resulted in their extensive utilization across various domains, including road planning, urban construction, target detection, and 3D scene reconstruction. This evolution of Earth-observation technology has facilitated the acquisition of high-resolution imagery, enabling the precise identification of substantial structures such as large buildings, roads, rivers, and artificial features.
The improvement in remote sensing image resolution has facilitated the accomplishment of remote sensing object detection tasks and 3D reconstruction tasks. However, the availability of high-resolution remote sensing images within the same geographical area is often limited due to constraints associated with satellite shooting angles and frequencies. Consequently, the synthesis of novel views based on limited view perspective observation data emerges as an effective approach for expanding the dataset. This technique holds immense research significance in various applications, including 3D modeling [1], target refinement, information extraction, and beyond.
Novel view synthesis endeavors to generate new images from the arbitrary view perspective, utilizing a collection of pre-existing multi-view images. While classical methods such as Structure from Motion (SFM) [2], blender, and image-based rendering have made initial progress in this task, current research in view synthesis demonstrates promising potential through the application of neural rendering. Neural rendering-based view synthesis methods typically employ an intermediate 3D scene representation to produce high-quality virtual views. Three-dimensional scene representations can be categorized as explicit or implicit. Explicit representations of 3D scenes often rely on mesh rendering [3,4,5] and volume representation [6,7,8], among other approaches. However, these explicit representations typically pertain to single objects or manually crafted models, akin to the 3D models provided by ShapeNet [9]. Additionally, there exist differentiable rendering methods for explicit view synthesis and 3D scene reconstruction, such as SoftRasterizer [10] and Neural 3D Mesh Renderer [11], among others. Nevertheless, the discretization of 3D representations, such as meshes, point clouds, voxels, etc., may lead to artifacts such as overlapping and the absence of fine details. Moreover, the storage requirements for explicit 3D scenes can be substantial, and the limitations of memory impede the application of high-resolution scenes.
The implicit representation of a 3D scene utilizes a function to describe its geometry, effectively storing intricate scene information within the parameters of the function. Implicit representation offers a more compact parameterization compared to explicit representation, enabling the expression of high-resolution scenes. Its continuous nature allows for a refined representation of the scene without necessitating 3D signal supervision. However, prior to the introduction of NeRF [12], implicit representations such as occupancy fields [13,14] or signed distance functions [15,16] struggled to synthesize photo-realistic virtual views. NeRF employs neural networks to model the radiance field and voxel density of an object or scene, utilizing minimal storage space, and subsequently performs novel view synthesis through classical volume rendering. However, NeRF is not well-suited for high-resolution remote sensing images, as the acquisition method for these images differs from standard cameras, and the distance between the target and the Unmanned Aerial Vehicle (UAV) [17] or carrier is often substantial. DoNeRF [18] addresses this challenge by utilizing true depth information and focusing on important samples around the object surface, resulting in faster and superior novel views. However, obtaining accurate depth information is often impractical for sparse high-resolution remote sensing images. PixelNeRF [19] demonstrates effective multi-view synthesis using a small number of images but is typically employed for ShapeNet [9] or single objects. In more complex scenes, the views generated from a limited number of images may suffer from blurring issues. ImMPI [20] introduces an implicit Multi-Plane Images (MPIs) representation of 3D scenes and conducts novel view synthesis for remote sensing images; its authors also provide a robust remote sensing dataset captured by UAVs. However, in some cases, the surrounding area of the test novel view may exhibit blurring, which can affect the accuracy of 3D reconstruction in the remote sensing region.
To address the challenges associated with novel view synthesis, we propose a novel method that utilizes neural radiance fields and incorporates attention mechanisms, namely the point attention module and the batch attention module. Effective novel view synthesis necessitates a comprehensive understanding of the overall scene as well as meticulous attention to individual targets, capturing their distinct colors at various positions to reflect the inherent characteristics of each point. The quality of the synthesized image relies heavily on the clarity and accuracy of these intricate details. The point attention module enhances the characteristics of the learned points themselves, improving color expression at each sampled point and, consequently, enhancing the quality of image details. By employing two layers of nonlinear activation functions, the point attention module surpasses the capabilities of the original network, bolstering the representation ability of the 3D implicit model. Additionally, the batch attention module enables the network to grasp the intricate relationships between points within each training batch. By comprehending the connections between different points on rays, the internal constraints of the spatial point-set are enhanced, resulting in a clearer and more detailed 3D scene representation, which is particularly beneficial for enhancing remote sensing image representation. Furthermore, incorporating residuals into the advanced network enhances its optimization. Images consist of high and low-frequency features corresponding to areas of sharp and gradual intensity changes. To achieve optimal novel view synthesis results, it is crucial to consider the various frequency features of the image through position encoding. However, in remote sensing image capturing, the distance between the camera and the target plays a significant role. For distant positions, the scene is relatively larger, with smaller details, where low-frequency Fourier transform suffices. Conversely, for close range positions, more attention is required for the additional details and locations, demanding the use of high-frequency Fourier transform. To overcome this challenge, we introduce a frequency-weighted position encoding approach that assigns different weights to different frequency components. Our technique achieves exceptional quality assessment metrics on remote sensing datasets, and our main contributions can be summarized as follows:
  • A new method is proposed for remote sensing novel view synthesis based on Neural Radiance Fields with an attention mechanism;
  • A point attention module is added to increase the nonlinear capabilities of the network and the ability of implicit 3D representation;
  • A batch attention module is introduced to enhance the relationship between different rays and sampled points to improve the constraint inside the spatial points;
  • A frequency-weighted position encoding is proposed to make the network focus on the most significant feature in different frequencies.
The rest of this paper is organized as follows. In Section 2, we introduce the related work. Section 3 introduces the materials and methods in detail. The experimental data and results are displayed in Section 4. Specific discussions are carried out in Section 5, and Section 6 concludes the paper.

2. Related Work

2.1. NeRF and NeRF Variants

NeRF is an advanced technique for synthesizing novel views of 3D scenes. It employs an implicit neural scene representation through a Multi-Layer Perceptron (MLP) and volume rendering. NeRF [12], introduced by Mildenhall et al., is the first method to utilize a continuous 5D function to represent a 3D implicit model, resulting in state-of-the-art visual quality for rendering novel views.
Building upon NeRF, Mip-NeRF [21] improves the volume rendering process by using cone tracing instead of ray tracing and introduces Integrated Position Encoding, which performs well at lower resolutions. PixelNeRF [19] introduces a CNN-based encoder to learn scene priors and performs well in synthesizing new views from a limited number of input images, such as only three. Ray Prior NeRF [22] explores a neural model better suited for view extrapolation; it leverages a Ray Atlas technique by extracting a rough 3D mesh and introduces Random Ray Cast augmentations that enhance training rays with a predetermined probability. Nerfies [23] introduces a deformation field to significantly improve performance by accommodating nonrigid transformations in the scene. Depth-Supervised NeRF [24] extracts sparse point clouds from the training input images and uses them to provide depth supervision. NerfingMVS [25] employs multi-view images as the input to the NeRF model, with a focus on depth reconstruction. COLMAP [2] is commonly used for sparse or dense reconstruction from multi-view images in the aforementioned depth reconstruction or depth supervision methods. PointNeRF [26] employs a pre-trained 3D CNN to generate depth and surface probability, utilizing feature point clouds as an intermediate step for volume rendering. Neural Sparse Voxel Fields [27] proposes a voxel-based NeRF model that represents the scene as radiance fields organized into bounded voxels. FastNeRF [28] accelerates inference by caching radiance values at spatial positions. KiloNeRF [29] enhances training speed by employing thousands of small MLPs instead of the baseline NeRF’s single large MLP. Sat-NeRF [30] utilizes RPC (Rational Polynomial Camera) models to learn from multi-view images with transient objects and shadow models.

2.2. Remote Sensing Novel View Synthesis and 3D Reconstruction

In the domain of remote sensing, Zhao et al. proposed a framework that reconstructs buildings in 3D space from single-view remote sensing images using 3D-VAE-IWGAN [31]. Matsunaga et al. suggested a time-efficient 3D reconstruction method for nearly planar surfaces from UAV images [32]. Fraundorfer et al. explored the utilization of small-scale UAVs for 3D building and urban site reconstruction, discussing the current state of the art and the benefits of using UAV imagery [33]. Wu et al. derived large-scale annotated UAV datasets from 3D city models and simulated UAV image sequences from virtual flights in Google Earth Studio [34]. Chen et al. proposed a 3D reconstruction method based on a multi-view 3D occupancy network that achieves high accuracy and rich details in remote sensing image reconstruction [35].
Furthermore, Wu et al. introduced ImMPI [20], which presents an implicit MPI representation for depicting remote sensing 3D scenes, along with a new dataset for remote sensing novel view synthesis that is utilized in this paper’s experiments.

3. Materials and Methods

3.1. Preliminaries on NeRF

Our proposed method builds upon the foundation of NeRF [12], which leverages a neural network to model the scene as a continuous 5D radiance field. This radiance field is represented implicitly, and through training the network on a sparse collection of multi-angle images along with their corresponding poses, a neural radiance field model can be acquired. By employing volume rendering techniques, our method facilitates the generation of photo-realistic images from the arbitrary view perspective.
In general, NeRF [12] is an implicit rendering process that can be divided into two stages for novel view synthesis, as shown in Figure 1: (1) 5D neural radiance field representation and (2) volume rendering with radiance fields. The first stage involves a 2D-to-3D modeling process, where the 3D point coordinates $\mathbf{x} = (x, y, z)$ and the 2D viewing direction $(\theta, \phi)$ are used as inputs for implicit modeling through an MLP network:
$$F_\Theta : (\mathbf{x}, \mathbf{d}) \rightarrow (\mathbf{c}, \sigma),$$
which optimizes its weights $\Theta$ to map each input 5D coordinate to its corresponding volume density $\sigma$ and directional emitted color $\mathbf{c}$. Since the volume density $\sigma$ depends only on the location of a sampled point, while the RGB color $\mathbf{c}$ depends on both the location and the viewing direction, NeRF predicts $\mathbf{c}$ from both inputs to encourage the representation to be multi-view consistent. To achieve this, the MLP $F_\Theta$ first processes the input 3D coordinates $\mathbf{x}$ with 8 fully connected layers of 256 channels and outputs the density $\sigma$ together with a 256-dimensional feature vector. This feature vector is then concatenated with the viewing direction of the camera ray and passed to an additional fully connected layer with 128 channels to output the view-dependent RGB color $\mathbf{c}$. In the second stage, NeRF renders the color of each ray by classical volume rendering [36], which is a 3D-to-2D modeling process. The volume density $\sigma(\mathbf{x})$ can be interpreted as the differential probability of a ray terminating in an infinitesimal particle at position $\mathbf{x}$. The rendered color $C(\mathbf{r})$ of the camera ray $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$ is computed as:
$$C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t), \mathbf{d})\,dt, \quad \text{where} \quad T(t) = \exp\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,ds\right).$$
The function $T(t)$ is the accumulated transmittance along the ray from $t_n$ to $t$, i.e., the probability that the ray travels from $t_n$ to $t$ without hitting any other particle. A stratified sampling strategy divides the ray range $[t_n, t_f]$ into $N$ evenly spaced bins and draws one sample uniformly at random from each bin:
$$t_i \sim \mathcal{U}\left[t_n + \frac{i-1}{N}(t_f - t_n),\; t_n + \frac{i}{N}(t_f - t_n)\right].$$
After sampling the points along each ray, the rendered color estimated with the quadrature rule of volume rendering is:
$$\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \left(1 - \exp(-\sigma_i \delta_i)\right) \mathbf{c}_i, \quad \text{where} \quad T_i = \exp\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right),$$
where $\delta_i = t_{i+1} - t_i$ is the distance between adjacent sampled points.
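To make the two-stage rendering concrete, the following is a minimal PyTorch sketch of the stratified sampling and the quadrature estimate of $\hat{C}(\mathbf{r})$ described above; the tensor shapes and function names are illustrative assumptions, not the authors' implementation.

```python
import torch

def stratified_samples(t_n, t_f, n_rays, n_samples):
    """Draw one random sample from each of N uniform bins in [t_n, t_f]."""
    bin_edges = torch.linspace(0.0, 1.0, n_samples + 1)[:-1]   # left edge of each bin
    u = torch.rand(n_rays, n_samples) / n_samples              # random offset inside each bin
    return t_n + (bin_edges + u) * (t_f - t_n)                 # (n_rays, n_samples)

def render_rays(sigma, rgb, t_vals):
    """Quadrature of the volume rendering integral.

    sigma:  (n_rays, n_samples)     predicted volume densities
    rgb:    (n_rays, n_samples, 3)  predicted radiance values
    t_vals: (n_rays, n_samples)     sample positions along each ray
    """
    deltas = t_vals[..., 1:] - t_vals[..., :-1]                # delta_i = t_{i+1} - t_i
    deltas = torch.cat([deltas, 1e10 * torch.ones_like(deltas[..., :1])], dim=-1)
    alpha = 1.0 - torch.exp(-sigma * deltas)                   # opacity of each segment
    # T_i = exp(-sum_{j<i} sigma_j * delta_j), as an exclusive cumulative product
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=-1)
    trans = torch.cat([torch.ones_like(trans[..., :1]), trans[..., :-1]], dim=-1)
    weights = trans * alpha
    return (weights.unsqueeze(-1) * rgb).sum(dim=-2)           # (n_rays, 3) pixel colors
```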

3.2. Overall Architecture

This paper presents an advanced method for high-resolution remote sensing view synthesis, as illustrated in Figure 2. Prior to inputting the data into the network, our approach incorporates frequency-weighted position encoding to map the sampled points and view perspective along the ray to different frequencies, assigning weights to each position encoding. Subsequently, the encoded points and view perspective are fed into a neural network to represent the implicit model of 3D remote sensing scenes. The network model employed in our method comprises several modules. Firstly, the point attention module utilizes fully connected layers with residual connections. By introducing an attention mechanism to the channels of the input point set, we enhance the relationships between spatial points, enabling the network to focus more on discovering the importance of specific channels and better capture the scene features in the image. Secondly, the radiance module primarily predicts the RGB values of each point in the neural radiance field. During the prediction process, we introduce a batch attention module, allowing the network to extract features for each spatial point and simultaneously process points on a ray. This enhances the internal constraints of the points on the ray, resulting in more accurate viewpoint and radiance values. The density module, consisting of only a linear layer, predicts the volume density of each spatial point. The radiance values and volume density values in the neural radiance field are then used for rendering novel views through differentiable volume rendering techniques. This method enables high-quality image synthesis from any viewpoint, offering a valuable data augmentation approach for remote sensing scenes.

3.3. 3D Scene Representation Network with Attention Mechanism

3.3.1. Network Design

As depicted in Figure 2, our network architecture comprises the frequency-weighted position encoding module, point module, radiance module, and density module. The frequency-weighted module is responsible for assigning weights to the position encoding of each frequency. The point module is composed of MLPs, residual blocks [37], and point attention. The utilization of residual blocks in the network is recommended due to their importance in deep networks and their suitability for learning the neural radiance field. By incorporating point attention, which operates on the channel dimension and consists of two linear layers and two non-linear activation functions, we enhance the network’s non-linear capacity and improve the learning of spatial point features.
The output features of the point module directly contribute to the generation of the volume density for each spatial point within the neural radiance field by the density module. The volume density can be interpreted as the opacity of the ray at the spatial point. In addition to the output features of the point module, the radiance module also requires the viewing direction encoded by the frequency-weighted position encoding. Within the radiance module, we introduce batch attention to enhance the correlation among points on each ray by operating on the quantity channel of the spatial points. As mentioned in NeRF [12], the radiance value of a spatial point depends not only on its spatial position but also on the viewing direction, while the volume density relies solely on the spatial position. Since the generation and sampling of each ray are based on the viewing direction, we introduce batch attention only in the radiance module and not in the density module to better capture the constraints imposed by the viewing direction and spatial position on the points.

3.3.2. Frequency-Weighted Position Encoding

Different images exhibit various frequency components: high-frequency features reflect detailed variations, while low-frequency features often characterize flat areas. Deep learning has been shown to be less effective at learning high-frequency features of images [38], as networks tend to favor lower-frequency functions. However, the addition of position encoding techniques [12] can enhance the network’s capability to capture high-frequency features. By mapping the input point coordinates and view perspective into a high-dimensional feature space before feeding them into the neural network, the network can effectively learn both low-frequency and high-frequency features. This approach underlies the baseline NeRF, which has demonstrated excellent performance on single objects or simple scenes, such as synthetic renderings of Lego models or chairs [39] and ShapeNet objects [9].
Nevertheless, when it comes to remote sensing scene images, the shooting height and range of the camera result in high-resolution remote sensing images containing a multitude of positions from both distant and nearby areas, far beyond what is typically found in optical images of individual objects. As shown in the baseline NeRF results in our experimental section, the position encoding strategy proposed in the baseline NeRF cannot fully capture the contextual information and complex details of the central object in remote sensing images. In order to display more complex and detailed information in the new view, such as specific detailed targets in the scene, a certain amount of high-frequency information is required. For distant scenes and scenes with insignificant color changes, such as the shadows or side areas of some buildings, low-frequency Fourier position encoding may be sufficient for representation. The fixed-frequency position encoding utilized in the baseline NeRF fails to consider both high-frequency and low-frequency information present in remote sensing scenes. To address this issue, we propose a frequency-weighted position encoding strategy, which assigns weights to each Fourier position encoding at different frequencies, as illustrated below:
$$\gamma(p) = \left(\gamma(p)_0,\ \gamma(p)_1,\ \gamma(p)_2,\ \ldots,\ \gamma(p)_L\right),$$
$$\gamma(p)_i = \omega_i \left(\sin\left(2^{i-1} \pi p\right),\ \cos\left(2^{i-1} \pi p\right)\right), \quad i = 0, 1, 2, \ldots, L,$$
where $\omega_i$ is the frequency weight of each position encoding, and the position encoding function still contains the sine and cosine functions.
In views that contain a wide range of distances, the weight assigned to low-frequency encoding will be significantly larger, potentially several times greater than that of high-frequency encoding. In views that consist of a greater number of nearby positions and intricate details, the focus should shift toward capturing high-frequency features. Regarding the selection of frequency weights, we initialize an encoding weight module prior to the network input, allowing the network to learn the relative importance of position encoding at different frequencies. This enables the network to dynamically allocate frequency weights based on the characteristics of the specific scene being analyzed. Since each frequency’s position encoding consists of a pair of sine and cosine functions, in order to streamline the process, we assign identical weights to both the sine and cosine functions within the position encoding of the same frequency.
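As a concrete illustration, the sketch below implements the weighted encoding above with one learnable weight per frequency, shared by its sine and cosine terms; the class name and dimensions are our own assumptions rather than the exact configuration.

```python
import math
import torch
import torch.nn as nn

class FreqWeightedEncoding(nn.Module):
    """Frequency-weighted position encoding: each sine/cosine pair at frequency
    2^(i-1) * pi is scaled by a learnable weight w_i shared by the pair."""

    def __init__(self, num_freqs: int):
        super().__init__()
        freqs = 2.0 ** (torch.arange(num_freqs, dtype=torch.float32) - 1.0) * math.pi
        self.register_buffer("freqs", freqs)                  # (L,)
        self.weights = nn.Parameter(torch.ones(num_freqs))    # learned jointly with the network

    def forward(self, p: torch.Tensor) -> torch.Tensor:
        # p: (..., D) point coordinates or viewing directions
        angles = p.unsqueeze(-1) * self.freqs                  # (..., D, L)
        enc = torch.cat([self.weights * torch.sin(angles),
                         self.weights * torch.cos(angles)], dim=-1)
        return enc.flatten(-2)                                 # (..., D * 2L)
```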

3.3.3. Point Module with Point Attention

The first primary component of our network architecture is the point module, which plays a crucial role in processing spatial point coordinates $(x, y, z)$ encoded with weighted position information. The core structure of this module consists of a residual fully connected layer network that incorporates point attention, as depicted in Figure 3a. In the first half of this structure, the spatial point coordinates, already encoded with position information, traverse three fully connected layers, each followed by a non-linear activation function. Subsequently, the output is fed into our point attention module and combined with the input through a residual connection. The structure of the point attention module is presented on the right side of Figure 3b. The size of the network input is denoted as $N \times c$, where $N$ represents the number of sampled points in space and $c$ signifies the number of channels. To discern the significance of each channel within the discrete spatial points, we introduce an attention mechanism for each channel. Following the first fully connected layer, the input is compressed into a smaller vector, enhancing its ability to extract global features from the channels. Subsequently, the second fully connected layer expands the output back to the original feature size, while simultaneously converting it into a vector containing the weights assigned to each channel. Furthermore, the point attention module encompasses two non-linear layers, namely ReLU and Sigmoid, which contribute to the network’s nonlinear capacity and improve the representation ability of 3D models.
Furthermore, the incorporation of residual networks proves to be advantageous in optimizing neural networks that employ attention mechanisms. The inclusion of cross-layer connections facilitates smoother gradient flow, mitigates the issue of gradient vanishing, and enhances the network’s training ability. Additionally, these cross-layer connections enable the direct propagation of original features within the network, thereby preserving crucial details of the input and enhancing the accuracy and robustness of the model. When combined with point attention, the point module becomes adept at adjusting the weight of each channel. This adjustment process significantly improves the model’s responsiveness to important features and intricate details, while simultaneously reducing reliance on redundant features. Consequently, the overall performance of the model is enhanced.
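The structure in Figure 3 can be sketched as follows; the reduction ratio, layer widths, and class names are illustrative assumptions rather than the exact configuration.

```python
import torch
import torch.nn as nn

class PointAttention(nn.Module):
    """Channel attention over (N, c) point features: compress, expand,
    and squash to per-channel weights with ReLU and Sigmoid."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),   # compress to extract global channel cues
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),   # expand back to one weight per channel
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, c) features of the sampled spatial points
        return x * self.fc(x)                             # channel-wise re-weighting


class PointBlock(nn.Module):
    """Residual sub-block of the point module: three FC + ReLU layers,
    point attention, and a skip connection back to the input."""

    def __init__(self, channels: int = 256):
        super().__init__()
        layers = []
        for _ in range(3):
            layers += [nn.Linear(channels, channels), nn.ReLU(inplace=True)]
        self.mlp = nn.Sequential(*layers)
        self.attn = PointAttention(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.attn(self.mlp(x))                 # residual connection
```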

3.3.4. Radiance Module with Batch Attention

In the radiance module, within the neural radiance field framework, the radiance value corresponds to the color value of the image at the sampled point. It is evident that the RGB values of an object within a scene typically vary across different positions. When observing the same position from a different view perspective, factors such as occlusion, shadows, and lighting conditions can cause variations in the color value of the corresponding spatial point. The process of ray sampling in the scene is inherently view perspective based. Therefore, strengthening the relationship among all sampled points along the same ray could enhance the constraints imposed by view perspective and spatial position on these spatial points.
The network architecture of the radiance module is illustrated in Figure 4. The input to the radiance module consists of spatially correlated point features generated by the point module, as well as the view perspective with frequency-weighted encoding. The batch attention module primarily reshapes the input data from $N_p \times c$ to $N_r \times N_{rp} \times c$ based on the view perspective, the number of rays, and the number of sampled points on each ray. Here, $N_p$ denotes the total number of spatial points, $N_r$ represents the number of rays, and $N_{rp}$ indicates the number of sampled points on each ray. Subsequently, the module utilizes the multi-layer self-attention mechanism derived from the transformer encoder [40]. This enables the network to learn the intricate relationships between different points along the rays, thereby effectively extracting the features of spatial points under the joint influence of view perspective and spatial position. Finally, the output is passed through a fully connected layer, which yields radiance values for all sampled points along each ray.
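The batch attention step can be sketched with a standard transformer encoder, as below; the fusion layer, feature dimensions, and head counts are assumptions for illustration, not the exact settings.

```python
import torch
import torch.nn as nn

class RadianceModule(nn.Module):
    """Radiance head: point features are fused with the encoded viewing direction,
    regrouped into one sequence per ray, passed through transformer self-attention
    (the batch attention), and mapped to RGB."""

    def __init__(self, feat_dim: int = 256, view_dim: int = 24,
                 n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        self.fuse = nn.Linear(feat_dim + view_dim, feat_dim)
        enc_layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=n_heads,
                                               batch_first=True)
        self.batch_attn = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        self.to_rgb = nn.Sequential(nn.Linear(feat_dim, 3), nn.Sigmoid())

    def forward(self, point_feat, view_enc, n_rays, n_samples):
        # point_feat: (N_p, feat_dim), view_enc: (N_p, view_dim), N_p = n_rays * n_samples
        x = self.fuse(torch.cat([point_feat, view_enc], dim=-1))
        x = x.view(n_rays, n_samples, -1)      # (N_r, N_rp, c): one sequence per ray
        x = self.batch_attn(x)                 # self-attention among samples on the same ray
        return self.to_rgb(x)                  # (N_r, N_rp, 3) radiance values
```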

3.3.5. Density Module

In cases where the sampled points reside within the scene’s object, they remain invisible in the image, which is intuitively understandable. Consequently, the volume density of a spatial point relies solely on its spatial position. Given that spatial position features have already been extracted for all sampled points during the point module stage, the density module merely necessitates the output of the volume density for each spatial sampling point. This can be accomplished by passing the features generated by the point module through a fully connected layer within the density module.
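A minimal sketch of the density head described above is given below; the text only specifies a single linear layer, so the ReLU that keeps the density non-negative is our added assumption.

```python
import torch.nn as nn

class DensityModule(nn.Module):
    """Density head: a single linear layer maps point-module features to volume
    density; the trailing ReLU keeping sigma non-negative is an added assumption."""

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.to_sigma = nn.Sequential(nn.Linear(feat_dim, 1), nn.ReLU(inplace=True))

    def forward(self, point_feat):
        # point_feat: (N_p, feat_dim) -> (N_p, 1), independent of the viewing direction
        return self.to_sigma(point_feat)
```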

3.4. Sampling and Volume Rendering

Furthermore, we improve the representation of 3D remote sensing scenes through the implementation of a hierarchical volume sampling strategy and a “coarse to fine” approach, which resemble the techniques employed in the baseline NeRF framework [12]. Leveraging the differentiable properties of the classical volume rendering process, we seamlessly integrate it with MLP networks to facilitate the training of network parameters.
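The following is a simplified sketch of inverse-transform sampling from the coarse rendering weights, in the spirit of the baseline NeRF's hierarchical "coarse to fine" sampling; unlike the original, this version snaps fine samples to bin midpoints instead of interpolating within bins.

```python
import torch

def hierarchical_samples(t_mid, weights, n_fine):
    """Draw fine samples by inverse-transform sampling of the coarse rendering
    weights along each ray (simplified: samples snap to bin midpoints).

    t_mid:   (n_rays, n_bins)  midpoints of the coarse sampling intervals
    weights: (n_rays, n_bins)  rendering weights from the coarse pass
    """
    pdf = weights + 1e-5                                  # avoid an all-zero distribution
    pdf = pdf / pdf.sum(dim=-1, keepdim=True)
    cdf = torch.cumsum(pdf, dim=-1)
    cdf = torch.cat([torch.zeros_like(cdf[..., :1]), cdf], dim=-1)  # (n_rays, n_bins + 1)

    u = torch.rand(weights.shape[0], n_fine)              # uniform samples in [0, 1)
    idx = torch.searchsorted(cdf, u, right=True).clamp(1, t_mid.shape[-1]) - 1
    return torch.gather(t_mid, -1, idx)                   # (n_rays, n_fine) fine sample positions
```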

4. Experiments and Results

4.1. Dataset

The LEVIR-NVS dataset [20] is a recently proposed dataset for remote sensing novel view synthesis. It comprises 16 scenes, including mountains, cities, schools, stadiums, colleges, and more, with each scene containing 21 multi-view images of size 512 × 512. These images were captured using actual aerial photography with pose transformations such as wrapping and swinging, making the dataset closely approximate UAV imagery. We selected the 11 images with indices 0, 1, 3, 5, 7, 9, 11, 13, 15, 17, and 19 for training and the remaining 10 images for testing, as encoded below.
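For reference, the per-scene split can be expressed as follows (index generation only; file paths and data loaders are not shown):

```python
# Each LEVIR-NVS scene provides 21 views indexed 0-20.
train_ids = [0] + list(range(1, 21, 2))                  # 0, 1, 3, 5, ..., 19 (11 views)
test_ids = [i for i in range(21) if i not in train_ids]  # 2, 4, 6, ..., 20 (10 views)
```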

4.2. Quality Assessment Metrics

Novel view synthesis with NeRF and its variants is commonly benchmarked with visual quality assessment metrics such as the Peak Signal-to-Noise Ratio (PSNR), the Structural Similarity Index Measure (SSIM) [41], and the Learned Perceptual Image Patch Similarity (LPIPS) [42].
The PSNR is a full-reference quality assessment metric that evaluates image quality against a reference image. The PSNR between a novel view I and the ground truth G is given by
$$PSNR(I) = 10 \cdot \log_{10}\left(\frac{MAX(I)^2}{MSE(I, G)}\right),$$
$$MSE(I, G) = \frac{1}{n}\sum_{i=1}^{n}\left(G_i - I_i\right)^2,$$
where $MAX(I)$ is the maximum possible pixel value of image I, and $MSE(I, G)$ is the mean squared error between image I and image G.
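A short sketch of the PSNR computation above, assuming float image tensors in [0, max_val]:

```python
import torch

def psnr(img, gt, max_val=1.0):
    """PSNR between a rendered view and the ground truth, following the formulas above."""
    mse = torch.mean((img - gt) ** 2)            # mean squared error over all pixels
    return 10.0 * torch.log10(max_val ** 2 / mse)
```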
SSIM is a commonly used metric for measuring the structural similarity between two images. It takes into account the perception of local structural changes in an image and quantifies its properties in terms of luminance, contrast, and structure. The SSIM values range from 0 to 1, with larger values indicating greater similarity between the images. If two images are identical, the SSIM value is 1. When calculating SSIM for a novel view image (I) and a ground truth image (G), the formula is as follows:
$$SSIM(I, G) = \frac{\left(2\mu_I \mu_G + C_1\right)\left(2\sigma_{IG} + C_2\right)}{\left(\mu_I^2 + \mu_G^2 + C_1\right)\left(\sigma_I^2 + \sigma_G^2 + C_2\right)},$$
where $\mu_I$ and $\mu_G$ are the means of I and G used to estimate luminance, contrast is estimated from the variances $\sigma_I^2$ and $\sigma_G^2$, and $\sigma_{IG}$ is the covariance of I and G used to estimate structural similarity. $C_1$ and $C_2$ are two constants used to maintain stability, where $C_1 = (k_1 L)^2$ and $C_2 = (k_2 L)^2$; L is the dynamic range of the pixel values, and $k_1 = 0.01$ and $k_2 = 0.03$ following the original work [41].
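A simplified single-window sketch of the formula above; the standard implementation averages the same expression over local windows, whereas here it is computed once over the whole image for illustration.

```python
import torch

def global_ssim(img, gt, max_val=1.0, k1=0.01, k2=0.03):
    """Whole-image SSIM following the formula above (no local windowing)."""
    c1, c2 = (k1 * max_val) ** 2, (k2 * max_val) ** 2
    mu_i, mu_g = img.mean(), gt.mean()
    var_i, var_g = img.var(unbiased=False), gt.var(unbiased=False)
    cov = ((img - mu_i) * (gt - mu_g)).mean()     # covariance between the two images
    return ((2 * mu_i * mu_g + c1) * (2 * cov + c2)) / \
           ((mu_i ** 2 + mu_g ** 2 + c1) * (var_i + var_g + c2))
```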
LPIPS is a full reference quality assessment metric which uses learned neural network features. LPIPS is given by a weighted pixel-wise MSE of feature maps over multiple layers.
$$LPIPS(I, G) = \sum_{l} \frac{1}{H_l W_l} \sum_{h, w} \left\| w_l \odot \left( I^l_{hw} - G^l_{hw} \right) \right\|_2^2,$$
where $I^l_{hw}$ and $G^l_{hw}$ are the features of the novel view and the ground truth at pixel position (h, w) in layer l, and $H_l$ and $W_l$ are the height and width of the feature maps of layer l. The original LPIPS work uses several networks, such as VGG and AlexNet, as feature-extraction backbones.
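For completeness, the metric can be computed with the reference implementation of [42], assuming the `lpips` Python package is installed; inputs are expected as tensors scaled to [-1, 1].

```python
import torch
import lpips  # reference implementation of [42]

loss_fn = lpips.LPIPS(net='vgg')           # VGG backbone, as used in our evaluation

# novel view and ground truth as (1, 3, H, W) tensors in [-1, 1]; random placeholders here
img = torch.rand(1, 3, 512, 512) * 2 - 1
gt = torch.rand(1, 3, 512, 512) * 2 - 1
distance = loss_fn(img, gt)                # lower values mean higher perceptual similarity
```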

4.3. Implementation Details

Our implementation is based on PyTorch and is trained on a single GeForce RTX 3090 GPU. During training, the images keep their original size of 512 × 512. We used the Adam optimizer [43] with a learning rate starting at $5 \times 10^{-4}$ and decaying to $5 \times 10^{-5}$ over the course of optimization. Similar to the baseline NeRF and ImMPI, we applied PSNR, SSIM, and LPIPS to evaluate the accuracy of the novel views, with LPIPS using a VGG network to evaluate image similarity. The loss function used in the implementation is the total squared error between the rendered color $\hat{C}(\mathbf{r})$ and the ground-truth RGB $C(\mathbf{r})$ of each pixel along the rays $\mathcal{R}$:
$$\mathcal{L} = \sum_{\mathbf{r} \in \mathcal{R}} \left\| \hat{C}(\mathbf{r}) - C(\mathbf{r}) \right\|_2^2.$$
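A schematic training step under the setup above; `model`, `render`, and `train_loader` are placeholders for our pipeline, and the exponential decay factor is only one way to reach the stated final learning rate, not the exact schedule.

```python
import torch

def train(model, render, train_loader):
    """One optimization pass; `render` applies the volume rendering of Section 3.1."""
    optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
    # exponential decay as one way to move the learning rate toward 5e-5
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9999)
    for rays, target_rgb in train_loader:                  # batches of rays and true pixel colors
        rendered_rgb = render(model, rays)                 # (N_rays, 3) predicted pixel colors
        loss = ((rendered_rgb - target_rgb) ** 2).sum()    # total squared error over the batch
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()
```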

4.4. Results

4.4.1. Quantitative and Qualitative Evaluation Compared with Other Methods

We conducted a comparative analysis of our method against two other state-of-the-art view synthesis methods: the baseline NeRF and ImMPI. The baseline NeRF was also included in our ablation study to serve as a benchmark. The work on ImMPI introduced the LEVIR-NVS dataset, which we utilized for comparison in our experiments.
Table 1 presents the results of our quantitative comparison. Higher PSNR and SSIM values, along with lower LPIPS values, indicate a greater similarity between the synthesized novel view images and the ground truth. Among these metrics, SSIM and LPIPS are more reliable indicators of image similarity, aligning better with human visual characteristics. Particularly when PSNR values are relatively close, these two metrics can be used for assessment. Our method outperformed the other two methods in all three metrics, for both the training view and the test view. These results suggest that our novel method excels in accurately representing the 3D implicit model of remote sensing scenes and synthesizing superior novel view images.
Based on the aforementioned quantitative metrics, it is evident that baseline NeRF, ImMPI, and our method are all capable of generating high-quality new views, with our method demonstrating superior performance. Furthermore, we provide a qualitative evaluation by showcasing the new views generated by the three methods. Figure 5 and Figure 6 display the results of our qualitative evaluation, wherein we utilize colored boxes to highlight locations with notable differences across several scenes.
It is evident from these findings that although the novel views synthesized by ImMPI and baseline NeRF demonstrate satisfactory overall performance, they still exhibit certain blurriness in their surroundings. Specifically, the new views generated by ImMPI display streaky patterns of blurriness around the edges, whereas the edges of the new views produced by baseline NeRF may exhibit some missing details. Additionally, when considering the finer details in the new views, such as the shape, color, and location of small objects, our method consistently produces superior results. In order to highlight the advantages of our method, we have enlarged specific details or notable differences in the peripheral scenes, particularly in the even rows of Figure 5 and Figure 6. These enlarged images allow for a better observation of the strengths of our approach. For instance, in the second row of Figure 5, we present a small target within the observed scene. It is evident that our method accurately reproduces the target’s colors, maintains clear outlines, and renders a sharp appearance, thereby closely resembling the real image when compared to the other two methods. Another example can be found in the sixth row of Figure 5. Here, the result obtained from baseline NeRF loses details at the edges of the target, while the ImMPI result suffers from overall blurriness. These examples serve to illustrate the superior performance of our method in faithfully capturing the intricate details within the synthesized views.

4.4.2. Ablation Study

As described in Section 3, we have incorporated several enhancements into our network, including frequency-weighted position encoding, a point attention module, and a batch attention module. To validate the effectiveness of each added module, we conducted multiple experiments. We utilized all 16 scenarios in the LEVIR-NVS dataset to perform the ablation study, and the results are presented in Table 2. We systematically removed each of the three components individually from the full model and compared the performance metrics of the resulting incomplete models. It is worth noting that the metrics of the incomplete models surpass those of the baseline NeRF but fall short of the performance achieved by the complete model, which corresponds to the method proposed in this paper.
Similarly, we present qualitative visualization results of the ablation experiments in Figure 7, which provide compelling evidence for the effectiveness of the modules proposed in this paper. We selected a key image area from the school scene for observation and comparison. The outcomes obtained under the complete model demonstrate remarkable similarity to the ground truth, both in terms of the overall effect and the intricate details. Conversely, without each of the three modules, the generated results exhibit significant blurriness, leading to the loss of crucial information in the finer details. Moreover, when the frequency-weighted position encoding module is omitted, we observe a significant difficulty in generating clear details on the blue platform within the view, which strongly supports the effectiveness of our frequency-weighted position encoding approach. Furthermore, when the point attention and batch attention modules are excluded, certain discernible details remain on the blue platform, and the surrounding buildings and the positions of some small targets are generated with reasonable accuracy, albeit not as clearly as under the complete model. This finding confirms that the inclusion of frequency-weighted position encoding leads to clearer details and an overall improvement in image generation, further substantiating the effectiveness of this module. Considering that the primary objective of both the point attention and batch attention modules is to capture the relationships and radiance expressions among points within the neural radiance field, their presence results in the generation of clearer and more detailed information, as well as accurate target positioning.

5. Discussion

This paper addresses the challenge of synthesizing novel views for sparse high-resolution remote sensing images that have limited acquisition views and view perspective changes. To improve the quality of synthesized novel views, we introduce several novel modules: frequency-weighted position encoding, a point attention module, and a batch attention module. These modules enhance the network’s ability to represent the 3D implicit model and elevate the overall quality of the synthesized novel views.
We conducted comparative experiments on the LEVIR-NVS dataset. While the synthesized images produced by the baseline NeRF and ImMPI methods exhibit reasonable results, they still suffer from regional blurring, particularly around the image and in the edge regions, as well as unclear details within the images. In contrast, our synthesized results show superior performance in these aspects. The baseline NeRF method globally encodes positions to elevate features from a low-dimensional space to a high-dimensional space, but this approach may not be effective in capturing different frequency features across diverse images. To address this, we introduce different weights for frequency position encoding, allowing the neural network to learn the appropriate weights based on the frequency characteristics of each image, resulting in improved novel view synthesis. Furthermore, the introduction of the attention mechanism through the point attention module plays a crucial role in enhancing the learning of individual feature representations at each sampled point within the network. This aspect is not achieved in the baseline NeRF method. By strengthening the features of all sampled points themselves, we assist the neural network in better fitting and representing the 3D implicit model of remote sensing scenes, consequently improving the quality of novel view synthesis. Additionally, during training, we enhance the batch attention by incorporating multiple sampled points along the same ray in each training batch. Through the batch attention module, these sampled points can learn the relationships among each other and reinforce the constraints between points, leading to improved RGB color generation. Our ablation experiments further demonstrate that the introduction of these new modules helps improve various indicators and enhance visual effects. Overall, our method presents a promising solution for synthesizing high-quality novel views of sparse high-resolution remote sensing images. This advancement holds significant potential for various applications, including remote sensing 3D reconstruction and remote sensing object detection.
In future work, we will make further improvements in several aspects. The above experiments have shown that our method performs well in novel view synthesis for remote sensing images. We will continue to explore the combination of attention mechanisms and neural radiance fields, applying them to natural images and other types of images in order to synthesize novel views for more types of data. As for efficiency, our method is similar to the baseline NeRF in terms of running time and convergence rate, but its results and metrics are significantly better at a comparable time cost. Since we still employ the differentiable volume rendering technique used in the baseline method, the color calculation for each final pixel in image synthesis requires traversing all points along the corresponding ray, which incurs some time cost. We therefore plan to investigate and refine this aspect in future work.

6. Conclusions

In this paper, we propose a novel approach to address high-resolution remote sensing novel view synthesis. Our method incorporates two novel key components, namely the point attention module and the batch attention module, into the network architecture. These modules are designed to enhance the neural network’s ability to represent the 3D implicit model effectively. Moreover, we propose a frequency-weighted position encoding strategy, which enables the network to adapt to the image characteristics across different frequencies. By leveraging these advancements, our method is capable of synthesizing novel view images using a limited number of remote sensing scene images. Extensive experiments conducted on the LEVIR-NVS dataset demonstrate the superiority of our method over baseline NeRF and ImMPI. The results validate the effectiveness of our method, facilitating the generation of additional new views from a limited number of remote sensing images. This expands the available remote sensing data, thereby benefiting remote sensing target detection and 3D reconstruction tasks.

Author Contributions

Conceptualization, J.L., J.G. and Y.Z.; methodology, validation, writing—original draft preparation, J.L.; writing—review and editing, J.L., J.G., Y.Z. and X.Z.; supervision, J.G., Y.Z. and B.L. All authors have read and agreed to the published version of the manuscript.

Funding

The National Natural Science Foundation of China: 61991421; The National Natural Science Foundation of China: 61991420; Key Research and Development Program of Aerospace Information Research Institute Chinese Academy of Sciences: E1Z208010F.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Remondino, F. Heritage Recording and 3D Modeling with Photogrammetry and 3D Scanning. Remote Sens. 2011, 3, 1104–1138. [Google Scholar] [CrossRef] [Green Version]
  2. Schonberger, J.L.; Frahm, J.M. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4104–4113. [Google Scholar]
  3. Kanazawa, A.; Tulsiani, S.; Efros, A.A.; Malik, J. Learning category-specific mesh reconstruction from image collections. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 371–386. [Google Scholar]
  4. Wang, N.; Zhang, Y.; Li, Z.; Fu, Y.; Liu, W.; Jiang, Y.G. Pixel2mesh: Generating 3D mesh models from single rgb images. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 52–67. [Google Scholar]
  5. Groueix, T.; Fisher, M.; Kim, V.G.; Russell, B.C.; Aubry, M. A papier-mâché approach to learning 3D surface generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 216–224. [Google Scholar]
  6. Brock, A.; Lim, T.; Ritchie, J.M.; Weston, N. Generative and discriminative voxel modeling with convolutional neural networks. arXiv 2016, arXiv:1608.04236. [Google Scholar]
  7. Wu, Z.; Song, S.; Khosla, A.; Yu, F.; Zhang, L.; Tang, X.; Xiao, J. 3D shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1912–1920. [Google Scholar]
  8. Flynn, J.; Broxton, M.; Debevec, P.; DuVall, M.; Fyffe, G.; Overbeck, R.; Snavely, N.; Tucker, R. Deepview: View synthesis with learned gradient descent. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2367–2376. [Google Scholar]
  9. Chang, A.X.; Funkhouser, T.; Guibas, L.; Hanrahan, P.; Huang, Q.; Li, Z.; Savarese, S.; Savva, M.; Song, S.; Su, H.; et al. Shapenet: An information-rich 3D model repository. arXiv 2015, arXiv:1512.03012. [Google Scholar]
  10. Liu, S.; Chen, W.; Li, T.; Li, H. Soft rasterizer: Differentiable rendering for unsupervised single-view mesh reconstruction. arXiv 2019, arXiv:1901.05567. [Google Scholar]
  11. Kato, H.; Ushiku, Y.; Harada, T. Neural 3D mesh renderer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3907–3916. [Google Scholar]
  12. Mildenhall, B.; Srinivasan, P.P.; Tancik, M.; Barron, J.T.; Ramamoorthi, R.; Ng, R. Nerf: Representing scenes as neural radiance fields for view synthesis. Commun. ACM 2021, 65, 99–106. [Google Scholar] [CrossRef]
  13. Genova, K.; Cole, F.; Sud, A.; Sarna, A.; Funkhouser, T. Local Deep Implicit Functions for 3D Shape. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  14. Mescheder, L.; Oechsle, M.; Niemeyer, M.; Nowozin, S.; Geiger, A. Occupancy networks: Learning 3D reconstruction in function space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4460–4470. [Google Scholar]
  15. Jiang, C.; Sud, A.; Makadia, A.; Huang, J.; Nießner, M.; Funkhouser, T. Local implicit grid representations for 3D scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6001–6010. [Google Scholar]
  16. Park, J.J.; Florence, P.; Straub, J.; Newcombe, R.; Lovegrove, S. Deepsdf: Learning continuous signed distance functions for shape representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 165–174. [Google Scholar]
  17. Yao, H.; Qin, R.; Chen, X. Unmanned Aerial Vehicle for Remote Sensing Applications—A Review. Remote Sens. 2019, 11, 1443. [Google Scholar] [CrossRef] [Green Version]
  18. Neff, T.; Stadlbauer, P.; Parger, M.; Kurz, A.; Mueller, J.H.; Chaitanya, C.R.A.; Kaplanyan, A.; Steinberger, M. DONeRF: Towards Real-Time Rendering of Compact Neural Radiance Fields using Depth Oracle Networks. Comput. Graph. Forum 2021, 40, 45–59. [Google Scholar] [CrossRef]
  19. Yu, A.; Ye, V.; Tancik, M.; Kanazawa, A. pixelnerf: Neural radiance fields from one or few images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 4578–4587. [Google Scholar]
  20. Wu, Y.; Zou, Z.; Shi, Z. Remote Sensing Novel View Synthesis with Implicit Multiplane Representations. arXiv 2022, arXiv:2205.08908. [Google Scholar] [CrossRef]
  21. Barron, J.T.; Mildenhall, B.; Tancik, M.; Hedman, P.; Martin-Brualla, R.; Srinivasan, P.P. Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 5855–5864. [Google Scholar]
  22. Zhang, J.; Zhang, Y.; Fu, H.; Zhou, X.; Cai, B.; Huang, J.; Jia, R.; Zhao, B.; Tang, X. Ray Priors through Reprojection: Improving Neural Radiance Fields for Novel View Extrapolation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 18376–18386. [Google Scholar]
  23. Park, K.; Sinha, U.; Barron, J.T.; Bouaziz, S.; Goldman, D.B.; Seitz, S.M.; Martin-Brualla, R. Nerfies: Deformable neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 5865–5874. [Google Scholar]
  24. Deng, K.; Liu, A.; Zhu, J.Y.; Ramanan, D. Depth-supervised nerf: Fewer views and faster training for free. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12882–12891. [Google Scholar]
  25. Wei, Y.; Liu, S.; Rao, Y.; Zhao, W.; Lu, J.; Zhou, J. Nerfingmvs: Guided optimization of neural radiance fields for indoor multi-view stereo. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 5610–5619. [Google Scholar]
  26. Xu, Q.; Xu, Z.; Philip, J.; Bi, S.; Shu, Z.; Sunkavalli, K.; Neumann, U. Point-nerf: Point-based neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5438–5448. [Google Scholar]
  27. Liu, L.; Gu, J.; Zaw Lin, K.; Chua, T.S.; Theobalt, C. Neural sparse voxel fields. Adv. Neural Inf. Process. Syst. 2020, 33, 15651–15663. [Google Scholar]
  28. Garbin, S.J.; Kowalski, M.; Johnson, M.; Shotton, J.; Valentin, J. Fastnerf: High-fidelity neural rendering at 200fps. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 14346–14355. [Google Scholar]
  29. Reiser, C.; Peng, S.; Liao, Y.; Geiger, A. Kilonerf: Speeding up neural radiance fields with thousands of tiny mlps. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 14335–14345. [Google Scholar]
  30. Marí, R.; Facciolo, G.; Ehret, T. Sat-NeRF: Learning Multi-View Satellite Photogrammetry with Transient Objects and Shadow Modeling Using RPC Cameras. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1311–1321. [Google Scholar]
  31. Zhao, C.; Zhang, C.; Su, N.; Yan, Y.; Huang, B. A Novel Building Reconstruction Framework using Single-View Remote Sensing Images Based on Convolutional Neural Networks. In Proceedings of the IGARSS 2020—2020 IEEE International Geoscience and Remote Sensing Symposium, Waikoloa, HI, USA, 26 September–2 October 2020; pp. 4211–4214. [Google Scholar] [CrossRef]
  32. Matsunaga, R.; Hashimoto, M.; Kanazawa, Y.; Sonoda, J. Accurate 3-D reconstruction of sands from UAV image sequence. In Proceedings of the 2016 International Conference On Advanced Informatics: Concepts, Theory And Application (ICAICTA), Penang, Malaysia, 16–19 August 2016; pp. 1–6. [Google Scholar] [CrossRef]
  33. Fraundorfer, F. Building and site reconstruction from small scale unmanned aerial vehicles (UAV’s). In Proceedings of the 2015 Joint Urban Remote Sensing Event (JURSE), Lausanne, Switzerland, 30 March–1 April 2015; pp. 1–4. [Google Scholar] [CrossRef]
  34. Wu, S.; Liebel, L.; Körner, M. Derivation of Geometrically and Semantically Annotated UAV Datasets at Large Scales from 3D City Models. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 4712–4719. [Google Scholar] [CrossRef]
  35. Chen, H.; Chen, W.; Gao, T. Ground 3D Object Reconstruction Based on Multi-View 3D Occupancy Network using Satellite Remote Sensing Image. In Proceedings of the 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, Brussels, Belgium, 11–16 July 2021; pp. 4826–4829. [Google Scholar] [CrossRef]
  36. Kajiya, J.T.; Von Herzen, B.P. Ray tracing volume densities. ACM SIGGRAPH Comput. Graph. 1984, 18, 165–174. [Google Scholar] [CrossRef]
  37. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27 June–1 July 2016; pp. 770–778. [Google Scholar]
  38. Rahaman, N.; Baratin, A.; Arpit, D.; Draxler, F.; Lin, M.; Hamprecht, F.; Bengio, Y.; Courville, A. On the spectral bias of neural networks. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 10–15 June 2019; pp. 5301–5310. [Google Scholar]
  39. Sitzmann, V.; Thies, J.; Heide, F.; Nießner, M.; Wetzstein, G.; Zollhofer, M. Deepvoxels: Learning persistent 3D feature embeddings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2437–2446. [Google Scholar]
  40. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  41. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  42. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 586–595. [Google Scholar]
  43. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
Figure 1. A simple overview of NeRF.
Figure 2. The overall architecture of our method for high-resolution remote sensing novel view synthesis. Initially, we collected high-resolution remote sensing images, followed by ray and spatial sampled points on each image. Subsequently, a frequency-weighted position encoding strategy was employed to encode the spatial position $(x, y, z)$ and ray viewing direction $\mathbf{d}$ of the sampled points. Our network consists of a point module, a radiance module, and a density module, which leverage attention mechanisms in conjunction with neural networks to represent the 3D implicit model of remote sensing scenes. The output of the network comprises the radiance values in the neural radiance field, specifically the RGB values, as well as the volume density of each spatial point. To generate new views, we utilize differentiable volume rendering techniques, and the network is optimized by minimizing the discrepancy between synthesized views and real images.
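The caption above describes the pipeline but not the exact form of the frequency-weighted position encoding. As a rough, minimal sketch only (not the authors' implementation): the standard NeRF sinusoidal encoding extended with one weight per frequency band. The class name `FrequencyWeightedEncoding`, the learnable parameter `freq_weights`, and its initialization are illustrative assumptions; the paper's actual weighting scheme may be computed and applied differently.

```python
import torch
import torch.nn as nn

class FrequencyWeightedEncoding(nn.Module):
    """Sinusoidal position encoding with a weight per frequency band (illustrative sketch)."""

    def __init__(self, num_freqs: int = 10, include_input: bool = True):
        super().__init__()
        self.include_input = include_input
        # Frequency bands 2^0 ... 2^(L-1), as in the original NeRF encoding.
        self.register_buffer("freq_bands", 2.0 ** torch.arange(num_freqs))
        # Hypothetical learnable per-frequency weights (assumption, not the paper's scheme).
        self.freq_weights = nn.Parameter(torch.ones(num_freqs))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., 3) spatial positions or viewing directions.
        xb = x[..., None, :] * self.freq_bands[:, None]                     # (..., L, 3)
        w = self.freq_weights[:, None]                                      # (L, 1)
        feats = torch.cat([w * torch.sin(xb), w * torch.cos(xb)], dim=-1)   # (..., L, 6)
        feats = feats.flatten(start_dim=-2)                                 # (..., 6L)
        if self.include_input:
            feats = torch.cat([x, feats], dim=-1)
        return feats

# Usage: encode sampled 3D points before feeding them to the network.
enc = FrequencyWeightedEncoding(num_freqs=10)
points = torch.rand(1024, 3)   # N sampled points
encoded = enc(points)          # (1024, 3 + 60)
```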
Figure 3. (a) A visualization of the point module architecture. (b) Part of the point module architecture (left) and the point attention module architecture (right). N is the total number of spatial points, and c is the number of channels.
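The point attention module operates on the per-point feature tensor of shape (N, c) described in the caption above. A minimal single-head self-attention over point features, in the spirit of scaled dot-product attention [40], is sketched below; the projection layers, the residual connection, and the absence of normalization are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class PointAttention(nn.Module):
    """Minimal single-head self-attention over N point features of dimension c (sketch only)."""

    def __init__(self, channels: int):
        super().__init__()
        self.q = nn.Linear(channels, channels)
        self.k = nn.Linear(channels, channels)
        self.v = nn.Linear(channels, channels)
        self.scale = channels ** -0.5

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (N, c) features of the sampled spatial points.
        q, k, v = self.q(feats), self.k(feats), self.v(feats)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)  # (N, N)
        return feats + attn @ v  # residual connection keeps the original features

# Usage
feats = torch.rand(2048, 64)        # N = 2048 points, c = 64 channels
out = PointAttention(64)(feats)     # (2048, 64)
```

Note that attending over all N spatial points at once produces an N × N attention matrix, so in practice such a module would be applied to chunks of points rather than to the full set.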
Figure 4. A visualization of our radiance module architecture. The structure on the right is the batch attention module. N_p is the total number of spatial points, N_r is the number of rays, N_rp is the number of sampled points on each ray, and c is the number of channels.
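The batch attention module in the radiance module works on features regrouped by ray, i.e., a tensor of shape (N_r, N_rp, c). The sketch below applies generic multi-head self-attention across the N_rp samples of each ray; which axis the paper's batch attention actually attends over, and the choice of `nn.MultiheadAttention` with four heads, are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class BatchAttention(nn.Module):
    """Self-attention across the N_rp samples of every ray in a batch (illustrative sketch)."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (N_r, N_rp, c) point features regrouped per ray.
        out, _ = self.attn(feats, feats, feats)
        return feats + out  # residual connection

# Usage: regroup the flattened (N_p, c) point features into rays before attention.
n_r, n_rp, c = 1024, 64, 64            # rays, samples per ray, channels
feats = torch.rand(n_r * n_rp, c)      # N_p = N_r * N_rp flattened point features
per_ray = feats.view(n_r, n_rp, c)
out = BatchAttention(c)(per_ray)       # (N_r, N_rp, c)
```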
Figure 5. Qualitative comparison of novel view synthesis on the testing set of the Observation, Building1, and Building2 scenes in the LEVIR-NVS dataset.
Figure 6. Qualitative comparison of novel view synthesis on the testing set of the Church, College, and Town2 scenes in the LEVIR-NVS dataset.
Figure 7. Qualitative comparison of novel view synthesis in the ablation study.
Table 1. Quality assessment metrics (train view/test view) for the different methods.
| Scene | PSNR ↑ (NeRF) | PSNR ↑ (ImMPI) | PSNR ↑ (Ours) | SSIM ↑ (NeRF) | SSIM ↑ (ImMPI) | SSIM ↑ (Ours) | LPIPS ↓ (NeRF) | LPIPS ↓ (ImMPI) | LPIPS ↓ (Ours) |
|---|---|---|---|---|---|---|---|---|---|
| Building1 | 23.14/21.79 | 24.92/24.77 | **29.12/24.64** | 0.725/0.706 | 0.867/0.865 | **0.933/0.906** | 0.385/0.393 | 0.105/0.151 | **0.169/0.189** |
| Building2 | 23.08/22.20 | 23.31/22.73 | **29.36/26.25** | 0.655/0.638 | 0.783/0.776 | **0.885/0.855** | 0.420/0.423 | 0.217/0.218 | **0.195/0.213** |
| College | 25.20/24.06 | 26.17/25.71 | **35.01/28.55** | 0.713/0.696 | 0.820/0.817 | **0.954/0.917** | 0.381/0.393 | 0.201/0.203 | **0.104/0.131** |
| Mountain1 | 28.65/28.05 | 30.23/29.88 | **34.38/32.02** | 0.737/0.727 | 0.854/0.854 | **0.922/0.902** | 0.375/0.379 | 0.187/0.185 | **0.145/0.158** |
| Mountain2 | 27.42/26.89 | 29.56/29.37 | **33.14/30.64** | 0.679/0.666 | 0.844/0.843 | **0.911/0.888** | 0.430/0.437 | 0.172/0.173 | **0.174/0.188** |
| Mountain3 | 29.92/29.41 | 33.02/32.81 | **34.68/33.56** | 0.735/0.726 | 0.880/0.878 | **0.914/0.901** | 0.411/0.414 | 0.156/0.157 | **0.149/0.160** |
| Observation | 23.25/22.57 | 23.04/22.54 | **29.64/26.22** | 0.671/0.654 | 0.728/0.718 | **0.906/0.865** | 0.404/0.408 | 0.267/0.272 | **0.176/0.196** |
| Church | 22.71/21.65 | 21.60/21.04 | **28.72/25.43** | 0.679/0.658 | 0.729/0.720 | **0.891/0.858** | 0.405/0.413 | 0.254/0.258 | **0.183/0.205** |
| Town1 | 25.49/25.00 | 26.34/25.88 | **32.01/29.56** | 0.759/0.752 | 0.849/0.844 | **0.938/0.922** | 0.343/0.349 | 0.163/0.167 | **0.118/0.134** |
| Town2 | 23.37/22.41 | 25.89/25.31 | **32.95/25.09** | 0.691/0.667 | 0.855/0.850 | **0.942/0.874** | 0.385/0.402 | 0.156/0.158 | **0.127/0.168** |
| Town3 | 24.64/23.77 | 26.23/25.68 | **32.19/27.89** | 0.733/0.717 | 0.840/0.834 | **0.924/0.893** | 0.361/0.367 | 0.187/0.190 | **0.148/0.168** |
| Stadium | 25.64/24.96 | 26.69/26.50 | **32.97/30.44** | 0.735/0.727 | 0.878/0.876 | **0.936/0.923** | 0.362/0.364 | 0.123/0.125 | **0.115/0.124** |
| Factory | 25.34/24.67 | 28.15/28.08 | **31.85/28.16** | 0.777/0.762 | 0.908/0.907 | **0.929/0.890** | 0.338/0.342 | 0.109/0.109 | **0.135/0.153** |
| Park | 25.90/25.55 | 27.87/27.81 | **32.18/29.85** | 0.796/0.788 | 0.896/0.896 | **0.941/0.921** | 0.352/0.358 | 0.123/0.124 | **0.137/0.149** |
| School | 24.28/24.83 | 25.74/25.33 | **30.48/27.44** | 0.666/0.654 | 0.830/0.825 | **0.869/0.837** | 0.422/0.426 | 0.163/0.165 | **0.187/0.202** |
| Downtown | 23.42/22.52 | 24.99/24.24 | **28.46/25.03** | 0.685/0.668 | 0.825/0.816 | **0.872/0.838** | 0.444/0.449 | 0.201/0.205 | **0.204/0.211** |
| Average | 25.09/24.39 | 26.34/25.95 | **31.63/28.45** | 0.714/0.700 | 0.835/0.831 | **0.915/0.885** | 0.388/0.394 | 0.172/0.173 | **0.154/0.171** |
The results of our method are highlighted in bold in the table.
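For reference, the PSNR values in Table 1 and Table 2 follow the standard definition below, with MAX_I = 1 for images normalized to [0, 1]; SSIM is the structural similarity index [41] and LPIPS is the learned perceptual metric of Zhang et al. [42], so higher PSNR and SSIM and lower LPIPS indicate better quality.

```latex
\mathrm{PSNR}(I, \hat{I}) = 10 \log_{10} \frac{\mathrm{MAX}_I^{2}}{\mathrm{MSE}(I, \hat{I})},
\qquad
\mathrm{MSE}(I, \hat{I}) = \frac{1}{HW} \sum_{u=1}^{H} \sum_{v=1}^{W} \big( I(u,v) - \hat{I}(u,v) \big)^{2}.
```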
Table 2. Quality assessment metrics for the ablation study of our method on the LEVIR-NVS dataset.
| Scene | PSNR (No FWPE, PA, BA) | SSIM (No FWPE, PA, BA) | LPIPS (No FWPE, PA, BA) | PSNR (No FWPE) | SSIM (No FWPE) | LPIPS (No FWPE) | PSNR (No PA) | SSIM (No PA) | LPIPS (No PA) | PSNR (No BA) | SSIM (No BA) | LPIPS (No BA) | PSNR (Complete Model) | SSIM (Complete Model) | LPIPS (Complete Model) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Building1 | 21.79 | 0.706 | 0.393 | 23.62 | 0.829 | 0.255 | 23.56 | 0.822 | 0.270 | 23.64 | 0.826 | 0.260 | **24.64** | **0.876** | **0.189** |
| Building2 | 22.20 | 0.638 | 0.423 | 24.48 | 0.790 | 0.291 | 24.31 | 0.772 | 0.317 | 24.79 | 0.803 | 0.273 | **26.25** | **0.855** | **0.213** |
| College | 24.06 | 0.696 | 0.393 | 27.54 | 0.879 | 0.183 | 27.16 | 0.877 | 0.192 | 27.37 | 0.889 | 0.172 | **28.55** | **0.917** | **0.131** |
| Mountain1 | 28.05 | 0.727 | 0.379 | 30.46 | 0.852 | 0.229 | 30.48 | 0.853 | 0.234 | 30.68 | 0.860 | 0.225 | **32.02** | **0.902** | **0.158** |
| Mountain2 | 26.89 | 0.666 | 0.437 | 29.39 | 0.831 | 0.261 | 28.90 | 0.817 | 0.287 | 29.51 | 0.830 | 0.264 | **33.14** | **0.911** | **0.174** |
| Mountain3 | 29.41 | 0.726 | 0.414 | 31.88 | 0.856 | 0.229 | 31.87 | 0.848 | 0.246 | 31.71 | 0.851 | 0.235 | **33.56** | **0.901** | **0.160** |
| Observation | 22.57 | 0.654 | 0.408 | 25.17 | 0.813 | 0.258 | 25.04 | 0.799 | 0.284 | 25.02 | 0.809 | 0.261 | **26.22** | **0.865** | **0.196** |
| Church | 21.65 | 0.658 | 0.413 | 23.99 | 0.798 | 0.271 | 23.99 | 0.795 | 0.282 | 24.02 | 0.800 | 0.270 | **25.43** | **0.858** | **0.205** |
| Town1 | 25.00 | 0.752 | 0.349 | 27.82 | 0.874 | 0.198 | 27.65 | 0.865 | 0.217 | 27.66 | 0.873 | 0.198 | **29.56** | **0.922** | **0.134** |
| Town2 | 22.41 | 0.667 | 0.402 | 23.65 | 0.765 | 0.304 | 23.80 | 0.765 | 0.317 | 23.48 | 0.767 | 0.303 | **29.56** | **0.874** | **0.168** |
| Town3 | 23.77 | 0.717 | 0.367 | 25.01 | 0.809 | 0.270 | 25.09 | 0.809 | 0.270 | 24.98 | 0.807 | 0.297 | **27.89** | **0.893** | **0.168** |
| Stadium | 24.96 | 0.727 | 0.364 | 28.45 | 0.872 | 0.201 | 28.19 | 0.865 | 0.217 | 28.56 | 0.873 | 0.202 | **30.44** | **0.923** | **0.124** |
| Factory | 24.67 | 0.762 | 0.342 | 26.88 | 0.853 | 0.227 | 26.79 | 0.845 | 0.257 | 26.86 | 0.852 | 0.227 | **28.16** | **0.890** | **0.153** |
| Park | 25.55 | 0.788 | 0.358 | 27.94 | 0.881 | 0.221 | 28.16 | 0.881 | 0.223 | 28.13 | 0.882 | 0.220 | **29.85** | **0.921** | **0.140** |
| School | 24.83 | 0.654 | 0.426 | 26.04 | 0.764 | 0.301 | 25.86 | 0.754 | 0.321 | 26.18 | 0.765 | 0.296 | **27.44** | **0.837** | **0.202** |
| Downtown | 22.52 | 0.668 | 0.449 | 23.98 | 0.772 | 0.347 | 23.70 | 0.757 | 0.372 | 24.19 | 0.779 | 0.340 | **25.03** | **0.838** | **0.211** |
| Average | 24.39 | 0.700 | 0.394 | 26.64 | 0.827 | 0.253 | 26.53 | 0.820 | 0.271 | 26.67 | 0.829 | 0.253 | **28.45** | **0.885** | **0.171** |
FWPE: Frequency-weighted Position Encoding; PA: Point Attention Module; BA: Batch Attention Module. The results of the complete model are highlighted in bold in the table.