Article

Deep Learning-Based Digital Surface Model Reconstruction of ZY-3 Satellite Imagery

1 Innovation Academy for Microsatellites of Chinese Academy of Sciences, Shanghai 200120, China
2 University of Chinese Academy of Sciences, Beijing 101408, China
* Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(14), 2567; https://doi.org/10.3390/rs16142567
Submission received: 8 May 2024 / Revised: 19 June 2024 / Accepted: 1 July 2024 / Published: 12 July 2024

Abstract

This study introduces a novel satellite image digital surface model (DSM) reconstruction framework grounded in deep learning methodology. The proposed framework effectively utilizes a rational polynomial camera (RPC) model to establish the mapping relationship between image coordinates and geographic coordinates. Given the expansive coverage and abundant ground object data inherent in satellite images, we designed a lightweight deep network model. This model facilitates both coarse and fine estimation of a height map through two distinct stages. Our approach harnesses shallow and deep image information via a feature extraction module, subsequently employing RPC Warping to construct feature volumes for various angles. We employ variance as a similarity metric to achieve image matching and derive the fused cost volume. Following this, we aggregate cost information across different scales and height directions using a regularization module. This process yields the confidence level of the current height plane, which is then regressed to predict the height map. Once the height map from stage 1 is obtained, we gauge the prediction’s uncertainty based on the variance in the probability distribution in the height direction. This allows us to adjust the height estimation range according to this uncertainty, thereby enabling precise height value prediction in stage 2. After conducting geometric consistency detection filtering of fine height maps from diverse viewpoints, we generate 3D point clouds through the inverse projection of RPC models. Finally, we resample these 3D point clouds to produce high-precision DSM products. By analyzing the results of our method’s height map predictions and comparing them with existing deep learning-based reconstruction methods, we assess the DSM reconstruction performance of our proposed framework. The experimental findings underscore the robustness of our method against discontinuous regions, occlusions, uneven illumination areas in satellite imagery, and weak texture regions during height map generation. Furthermore, the reconstructed digital surface model (DSM) surpasses existing solutions in terms of completeness and root mean square error metrics while concurrently reducing the model parameters by 42.93%. This optimization markedly diminishes memory usage, thereby conserving both software and hardware resources as well as system overhead. Such savings pave the way for a more efficient system design and development process.

1. Introduction

The advancement of aerospace satellite platform technology has broadened the application scope of satellite remote sensing image processing. This technology is now integral to a variety of applications, including basic surveying, monitoring of natural resources and early warning systems for natural disasters, agricultural censuses, and military support [1]. China's ZY series observation satellites possess medium- to high-resolution capabilities, expansive coverage areas, multi-temporal characteristics, and superior positioning accuracy [2]. These features make them particularly suitable for satellite stereoscopic mapping. In satellite mapping products, the digital surface model (DSM) encapsulates ground elevation models of buildings, bridges, trees, and other terrestrial features. It captures the undulation of the earth's actual surface, offering not only geographic spatial information but also socio-economic insights. As remote sensing image resolution improves, research centered on the three-dimensional reconstruction of satellite images has garnered significant attention. The product applications of this three-dimensional reconstruction are increasingly diverse. For instance, DSM can identify vegetation coverage and terrain features in natural geography applications, aiding in analyzing natural resource utilization and assisting in monitoring natural disasters. In urban construction, DSM facilitates rapid acquisition of urban 3D models, enabling analysis of urban development trends and guiding urban planning. In military contexts, DSM with heightened accuracy enables precise strikes on specific ground targets. Consequently, generating high-resolution and high-accuracy DSM has profound implications [2,3,4].
Stereo matching constitutes a critical component of traditional 3D reconstruction tasks. In recent years, the industry has been actively seeking an algorithm for stereo matching that can simultaneously meet the demands for accuracy and speed. The fundamental process of stereo matching algorithms typically encompasses four stages as follows: calculation of matching cost, aggregation of matching cost, disparity calculation, and disparity optimization. Commonly used matching algorithms include the Global Matching Algorithm (GM), Block Matching Algorithm (BM), and Semi-Global Matching Algorithm (SGM). Notably, the SGM algorithm reduces computational complexity by employing mutual information (MI) as the method for calculating the matching cost; however, it is susceptible to noise generation [5,6,7]. High-resolution satellite remote sensing image reconstruction methods often utilize the Rational Polynomial Camera (RPC) model for scene reconstruction [8]. As a general sensor model, the RPC typically comprises 78 polynomial coefficients and 10 normalization constants. Resolving complex third-order polynomials concerning the RPC model necessitates at least 39 sets of ground control points and corresponding relationships of related images, thereby consuming substantial computing time during the reconstruction process [9,10]. In the 3D reconstruction methodology utilizing satellite imagery, the RPC Stereo Processor (RSP) [11] employs an enhanced hierarchical SGM as its stereo image matching strategy. Following the polar line correction of the initial stereo images via the RPC model, disparity estimation is achieved through the matching algorithm. Subsequently, this disparity is transformed into three-dimensional coordinates within the world coordinate system, facilitating both DSM and orthoimage generation for satellite images. Notably, this approach can efficiently process large-frame images. S2p, a software release, offers a fully automated and modular stereo pipeline designed for generating Digital Elevation Models (DEMs) from satellite images [12]. Initially, the pinhole model aligns with the RPC model, followed by disparity estimation using the dense matching algorithm More Global Matching (MGM) [13]. The final step involves obtaining a three-dimensional point cloud through triangulation methods, achieving superior reconstruction performance compared with traditional methods [14]. COLMAP, another computer vision 3D reconstruction technique, locally refines the RPC model into a simplified pinhole camera model. This adaptation allows for convenient adjustment of camera parameters using the Structure from Motion (SFM) [15] method and simultaneously executes Multiple View Stereo (MVS) structures [16]. When applied to satellite image scenes, COLMAP exhibits slightly reduced reconstruction accuracy compared with s2p but boasts higher efficiency [17]. Notably, challenges in obtaining disparity using traditional stereo matching algorithms in satellite remote sensing images are prevalent. These include issues such as occlusion and disparity discontinuity, weak texture area matching, uneven illumination, and noise introduced by cameras, all of which can compromise the final reconstruction outcome [18].
Within the realm of computer vision and deep learning, one study [19] leveraged the robust learning capabilities of neural networks to extract image features. This approach integrated Convolutional Neural Networks (CNNs) into the conventional 3D reconstruction stereo matching algorithm process, leading to marked enhancements in both matching accuracy and computational speed. Similarly, the research in [20] utilized deep learning techniques for the three-dimensional scene reconstruction of dense matching in aerial remote sensing images. This approach was compared with traditional methods such as SGM, and it was demonstrated that through transfer learning and parameter fine-tuning, deep learning can simultaneously improve efficiency and performance. In 2018, the authors of [21] introduced the concept of cost volume from binocular stereo matching into the field of three-dimensional reconstruction based on deep learning. They constructed an end-to-end multi-view depth estimation network, MVSNet [21]. This model, inspired by the plane scanning approach, initially utilized differentiable homography transformation to construct the cost volume within the reference view space before executing multi-view depth estimation. Cas-MVSNet [22] is a deep learning method that employs a 3D cascaded cost volume construction, producing depth estimates in three stages from coarse to fine. This addresses the issue of excessive GPU consumption in MVSNet during depth estimation and the low completeness of point clouds. UCS-Net [23] introduces an uncertainty-aware cascaded stereo network. It selects depth hypothesis samples using the confidence variance output by each pixel of the previous layer's probability volume, performs uncertainty estimation to adjust the depth estimation range in the subsequent stage, and estimates depth maps in three stages from coarse to fine. However, the 3D convolution regularization process incurs substantial memory overhead. RED-Net [24] proposes the use of a convolutional Gate Recurrent Unit (GRU) [25,26] in lieu of 3D convolution [27], employing a 2D recurrent encoder–decoder structure to regularize cost volume. This approach reduces memory consumption while enhancing computational efficiency, obtains multiscale neighborhood information with fewer parameters, and achieves superior reconstruction performance. Perspective camera models are typically utilized for close-range and airborne cameras, such as pinhole camera models. However, push-broom linear array satellite cameras often employ more complex RPC models than pinhole cameras [28]. The homography transformation currently used in deep learning-based MVS matching methods is only suitable for pinhole cameras and cannot be applied to 3D reconstruction tasks of satellite images. The authors of [29] converted rational polynomial camera parameters into a homography transformation matrix to construct an end-to-end height estimation network, SatMVS, based on satellite images. They directly estimated height maps by using ZY-3 satellite images and corresponding RPC parameters as input, converting them into three-dimensional points under the Universal Transverse Mercator Grid System (UTM) coordinate system and merging them into a complete DSM. This method achieved excellent DSM reconstruction performance.
This study integrates the Transformer self-attention mechanism with the CNN inductive bias to develop a lightweight deep learning model for the coarse-to-fine two-stage height estimation technique, specifically tailored for satellite remote sensing imagery. The model is devised in two phases, adhering to the plane scanning approach for image alignment and cost computation. The height map is then deduced by optimizing the aggregation of cost data. A point cloud is derived from the inverse projection of the RPC model, followed by the resampling of the three-dimensional point cloud to produce high-precision DSM survey products. Upon assessing the DSM reconstruction outcomes, a quantitative juxtaposition with prevailing satellite image reconstruction techniques reveals a reduction in root mean square error by 0.96 m and an enhancement in reconstruction completeness by 0.24%. While maintaining optimal reconstruction performance, there is a notable decrease in model parameter count by 42.93%, thereby diminishing software and hardware requirements and meeting the accuracy standards and deployment prerequisites for DSM reconstruction of remote sensing satellite images.

2. Methods

The foundational network structure delineated in this study encompasses four pivotal modules: feature extraction, cost volume construction, cost volume regularization, and height map prediction. The comprehensive framework is illustrated in Figure 1. The primary methodology involves predicting the height map through a two-stage process starting with an initial rough estimation stage followed by a refined estimation stage. In the first stage, the initial height map is derived from the assumed height plane based on the actual height range of the area. Subsequently, the second stage utilizes the rough estimation’s height map as its foundation. This stage employs a variance-based uncertainty estimation method to refine the height range, ultimately yielding more precise height prediction values. The principal innovations of the proposed network model are delineated below:
  • The integration of Transformer [30,31] and a CNN within the feature extraction module is employed to amalgamate both deep and shallow image information. This is achieved using the U-net [32] structure, resulting in a feature map enriched with contextual feature data.
  • The upsampling procedure within the feature extraction and regularization module substitutes transpose convolution with a PixelShuffle technique and an interpolation method. This approach effectively prevents information loss, thereby producing high-resolution images.
  • The network’s convolution component employs grouped convolution in lieu of standard convolution. This approach minimizes the quantity of parameters and reduces system overhead while concurrently ensuring the extraction of crucial information.
  • The height map, after the coarse estimation phase, employs a variance-based uncertainty estimation technique. This method adaptively modifies the height search range for the subsequent fine estimation stage, thereby enhancing the precision of the final predicted height value.
Figure 1. Workflow of the TC-SatMVS reconstruction network framework.

2.1. Feature Extraction Module

High-resolution optical satellite imagery, while offering expansive coverage and rich ground object data, is susceptible to factors such as illumination and occlusion. Directly inferring a height map from original images can lead to inaccuracies due to incomplete information, thereby impeding the reconstruction of DSM. Within the realm of image processing, feature extraction plays a pivotal role, influencing the ultimate outcomes of subsequent processing tasks. This process primarily encompasses feature engineering techniques and deep learning methodologies. As a subset of feature engineering, feature extraction functions as a data compression mechanism. However, the compressed feature data retain significant portions of the original image content, mitigating noise and superfluous data. This streamlined approach not only reduces memory overhead but also facilitates rapid processing. Depending on the specific application, emphasis may be placed on attributes such as color, texture, spatial positioning, and shape [33,34]. Deep learning-based feature extraction offers automated feature learning capabilities. Its end-to-end training obviates the need for manual feature extractor design, ensuring efficient extraction of image data. Given its robust learning prowess, it adeptly captures nuanced features while concurrently attending to diverse types of information [35]. Consequently, it has emerged as the preferred method for satellite image feature extraction. The feature extraction module proposed in this study integrates Transformer self-attention with a CNN. This integration employs a U-net encoder–decoder structure to extract both shallow and deep features, thereby enhancing the network’s representational capacity. Furthermore, it adopts PixelShuffle as an alternative to deconvolution for upsampling. This approach not only reduces the parameter count but also ensures the extraction of crucial feature information, as illustrated in Figure 2.
The model initially employs grouped convolution to extract features from satellite images, subsequently mapping these into higher dimensions to enrich the information. Following this, it encodes the high-dimensional features utilizing the Transformer self-attention mechanism. As depicted in Figure 3, the encoding procedure captures local feature data of neighboring pixels via convolution, serving as context input keys. These are then merged with the query feature of the current pixel to compute local attention. The resulting attention matrix is employed as a weight, which multiplies the input feature to derive a weighted feature. This weighted feature is then amalgamated with the local feature to produce the final feature map. This approach facilitates learning of inter-pixel and intra-dimensional information. As the network deepens, detail features may be lost because of over-smoothing denoising during subsequent cost volume construction. However, compared with traditional convolutional layer feature extraction techniques, this method enhances the representational capacity of features through the self-attention mechanism. It bolsters crucial feature information from input images and strengthens feature connections, preserves local texture details, reduces model size to a degree, and improves reconstruction completeness metrics.
The feature extraction network module is encoded via downsampling, followed by upsampled decoding through PixelShuffle. This process allows the low-level network to extract fine-grained local information and the deep network to extract robust semantic global information. These are then fused through concatenation. The downsampling encoding facilitates the extraction of features across different scales. The output from the final layer convolution is input into a fusion attention network, which produces the feature extraction results for the first stage. The image size during this stage is 1/4 of the original image, with a channel number of 32. General deconvolution often results in numerous zero-padding areas, which can hinder gradient optimization. However, the method of PixelShuffle enables reconstruction from low-resolution images to high-resolution images by integrating information from different channels, as illustrated in Figure 4. The decoding process initially fuses deep and shallow information, employs PixelShuffle instead of deconvolution to restore image size, and generates high-resolution feature maps equal in size to the original image. The channel number is 8, and the decoded feature map serves as the feature extraction results for the second stage. Ultimately, two stages of feature maps are outputted, each corresponding to sizes {1/4, 1} of the input image and channel numbers {32, 8}.
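To make the two-stage feature extraction concrete, the following is a minimal PyTorch sketch that combines grouped convolutions, a self-attention block, and PixelShuffle upsampling. The class names, channel widths, and the use of a standard global nn.MultiheadAttention (in place of the local attention described above) are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class GroupedConvBlock(nn.Module):
    """3x3 grouped convolution + BN + ReLU; grouping reduces the parameter count."""
    def __init__(self, in_ch, out_ch, stride=1, groups=4):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, groups=groups, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class FeatureExtractor(nn.Module):
    """Two-scale encoder-decoder: a 32-channel map at 1/4 resolution (stage 1)
    and an 8-channel map at full resolution (stage 2)."""
    def __init__(self):
        super().__init__()
        self.stem = GroupedConvBlock(1, 16, groups=1)        # panchromatic input, 1 channel
        self.down = nn.Sequential(
            GroupedConvBlock(16, 32, stride=2),              # 1/2 resolution
            GroupedConvBlock(32, 32, stride=2),              # 1/4 resolution
        )
        self.attn = nn.MultiheadAttention(32, num_heads=4, batch_first=True)
        self.up = nn.Sequential(                             # PixelShuffle x4 instead of transposed conv
            nn.Conv2d(32, 8 * 16, 3, padding=1),
            nn.PixelShuffle(4),
        )
        self.fuse = nn.Conv2d(8 + 16, 8, 3, padding=1)       # fuse deep and shallow information

    def forward(self, x):                                    # x: (B, 1, H, W), H and W divisible by 4
        shallow = self.stem(x)
        deep = self.down(shallow)                            # (B, 32, H/4, W/4)
        b, c, h, w = deep.shape
        tokens = deep.flatten(2).transpose(1, 2)             # (B, H*W/16, 32)
        attn_out, _ = self.attn(tokens, tokens, tokens)
        coarse = attn_out.transpose(1, 2).reshape(b, c, h, w) + deep    # stage-1 features
        fine = self.fuse(torch.cat([self.up(coarse), shallow], dim=1))  # stage-2 features
        return coarse, fine
```

Because PixelShuffle rearranges channel values into spatial positions rather than padding with zeros, the upsampling step in this sketch follows the motivation given above for avoiding transposed convolution.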

2.2. Cost Volume Construction

2.2.1. Affine Transformation Based on the RPC Model

The RPC model, a camera model extensively utilized in high-resolution satellite imagery, employs virtual control points uniformly distributed on the ground to calculate its parameters. A multitude of parameters are required to fit a stringent imaging model, comprising normalized translation parameters, normalized scale parameters, and polynomial coefficients. These parameters are provided to users in RPC files. Equation (1) represents the three-dimensional to two-dimensional forward projection formula. The image utilizes the pixel coordinate system, with coordinates denoted as image row and column coordinates (line, samp). Conversely, the ground adopts the World Geodetic System 1984 (WGS-84), with coordinates represented as geodetic coordinates (Lat, Long, Hei).
$$L_n = \frac{\mathrm{Num}_L(X, Y, Z)}{\mathrm{Den}_L(X, Y, Z)}, \qquad S_n = \frac{\mathrm{Num}_S(X, Y, Z)}{\mathrm{Den}_S(X, Y, Z)} \qquad (1)$$
In Equation (1), X, Y, and Z represent normalized coordinates. The cubic polynomial is delineated as follows:
$$\begin{aligned} \mathrm{Num}_L(X, Y, Z) ={}& a_1 + a_2 X + a_3 Y + a_4 Z + a_5 XY + a_6 XZ + a_7 YZ + a_8 X^2 + a_9 Y^2 + a_{10} Z^2 \\ &+ a_{11} XYZ + a_{12} X^3 + a_{13} XY^2 + a_{14} XZ^2 + a_{15} X^2 Y + a_{16} Y^3 + a_{17} YZ^2 + a_{18} X^2 Z + a_{19} Y^2 Z + a_{20} Z^3 \\ \mathrm{Den}_L(X, Y, Z) ={}& b_1 + b_2 X + \cdots + b_{19} Y^2 Z + b_{20} Z^3 \\ \mathrm{Num}_S(X, Y, Z) ={}& c_1 + c_2 X + \cdots + c_{19} Y^2 Z + c_{20} Z^3 \\ \mathrm{Den}_S(X, Y, Z) ={}& d_1 + d_2 X + \cdots + d_{19} Y^2 Z + d_{20} Z^3 \end{aligned} \qquad (2)$$
Given that the forward and inverse basic forms of the RPC model are identical, it is crucial to maintain generality. Therefore, we define the forward projection form of Equation (1) as Equation (3), while the inverse projection form is denoted as Equation (4). We employ a coefficient-cube (third-order tensor) representation and element-wise division for Equation (5), thereby facilitating efficient batch mapping between two-dimensional image coordinates and three-dimensional ground coordinates [29]. In this context, $P_1$ represents the projection parameter, $P_2$ the scaling parameter, $P_3$ the rotation parameter, and $P_4$ the normalization parameter; all of these are cubic polynomials. The value of $C_{ijk}$ also varies with position. Furthermore, the sum of the integers $m_1$, $m_2$, and $m_3$ must not exceed 3. Additionally, we adopt the Einstein summation convention in Equation (5).
$$S_n = \frac{P_1^{fwd}(X, Y, Z)}{P_2^{fwd}(X, Y, Z)}, \qquad L_n = \frac{P_3^{fwd}(X, Y, Z)}{P_4^{fwd}(X, Y, Z)} \qquad (3)$$
$$X = \frac{P_1^{inv}(S_n, L_n, Z)}{P_2^{inv}(S_n, L_n, Z)}, \qquad Y = \frac{P_3^{inv}(S_n, L_n, Z)}{P_4^{inv}(S_n, L_n, Z)} \qquad (4)$$
$$P(U, V, W) = \sum_{i=0}^{m_1} \sum_{j=0}^{m_2} \sum_{k=0}^{m_3} C_{ijk}\, U^{i} V^{j} W^{k} \qquad (5)$$
The RPC model delineates a transformation relationship between three-dimensional geographic coordinates and two-dimensional image coordinates. The first term of the polynomial captures systematic errors stemming from the optical satellite platform, while the second term represents errors attributed to the curvature of the earth and atmospheric refractive index. Any remaining unidentified systematic errors are captured by the third term. The affine transformation of the feature map based on the RPC model is ascertained using the plane scanning method. This method samples a series of height hypothesis planes within the height search range. Consequently, the depth estimation issue is converted into a multiclassification problem concerning height through plane division, which involves determining which height plane within the height search space a satellite image pixel falls on, as illustrated in Figure 5. In this study, we assume two stages with a set number of height layers {32, 8} and a height sampling interval {4, 1}. From a coarse to fine estimation of height, we calculate the maximum and minimum height values using the normalization parameters of the RPC model in the initial stage. These values are then sampled at equal intervals to form initial assumed height planes, according to the predetermined number of height layers. The subsequent stage involves a local refinement process for height, estimating the uncertainty in the height range. This is achieved by taking the height estimate predicted in the previous stage and combining it with the preset number of assumed height layers and the height sampling interval, which establishes a height search space composed of various height planes. Using the RPC model, one of the views is taken as the reference view (generally the nadir view) and the remaining views as source views; the source-view feature maps are warped onto the reference view at each hypothesized height plane to form feature volumes. The reference-view feature map is replicated according to the number of hypothesized height planes to yield its own feature volume. This results in three feature volumes, one for each input satellite image.
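As a concrete illustration of Equations (1) and (2), the NumPy sketch below evaluates the 20-term cubic polynomials and the forward projection. The dictionary keys (e.g., lat_off, num_l) stand for the offsets, scales, and coefficient arrays parsed from an RPC file and are hypothetical names; the monomial ordering follows Equation (2) and, like the assignment of latitude and longitude to X and Y, must be matched to the conventions of the actual RPC file format.

```python
import numpy as np

def rpc_poly(coeff, X, Y, Z):
    """Evaluate a 20-term cubic polynomial in the normalized coordinates X, Y, Z."""
    terms = np.stack([
        np.ones_like(X), X, Y, Z, X * Y, X * Z, Y * Z, X**2, Y**2, Z**2,
        X * Y * Z, X**3, X * Y**2, X * Z**2, X**2 * Y, Y**3, Y * Z**2,
        X**2 * Z, Y**2 * Z, Z**3,
    ])
    return np.tensordot(coeff, terms, axes=1)   # sum_i coeff[i] * terms[i]

def rpc_forward(rpc, lat, lon, hei):
    """Project WGS-84 geodetic coordinates to image (line, samp) via Equation (1)."""
    X = (lat - rpc["lat_off"]) / rpc["lat_scale"]     # normalization with the RPC constants
    Y = (lon - rpc["lon_off"]) / rpc["lon_scale"]
    Z = (hei - rpc["hei_off"]) / rpc["hei_scale"]
    line_n = rpc_poly(rpc["num_l"], X, Y, Z) / rpc_poly(rpc["den_l"], X, Y, Z)
    samp_n = rpc_poly(rpc["num_s"], X, Y, Z) / rpc_poly(rpc["den_s"], X, Y, Z)
    line = line_n * rpc["line_scale"] + rpc["line_off"]   # de-normalize to pixel coordinates
    samp = samp_n * rpc["samp_scale"] + rpc["samp_off"]
    return line, samp
```

In the plane-sweep warping described above, this forward projection (and its inverse counterpart, Equation (4)) is applied in batch to every pixel and every hypothesized height plane.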

2.2.2. Feature Volume Fusion

The cost calculation method fuses the feature volumes derived from the various views into a single cost volume. Traditionally, the matching cost is computed using the sum of squared differences (SSD) or the normalized cross-correlation (NCC) of all pixels in a neighborhood among the target views [36,37]. This approach effectively measures the similarity among the projection points across different views. Since variance characterizes the dissimilarity among the views, a lower variance corresponds to higher confidence at a given height. In this study, we leverage variance to link the three feature volumes to a cost map on a specific plane at an arbitrary height within the three-dimensional space. Subsequently, a cost map is established for each height plane. By computing the variance in pixel values at identical positions across the different feature volumes as a cost index, we can gauge the similarity among the different views. This measure is then integrated into a cost volume using Equation (6). The process of constructing a cost volume is illustrated in Figure 6.
$$\mathrm{Cost} = \frac{1}{3} \sum_{i=0}^{2} \left( V_i - \bar{V} \right)^2 \qquad (6)$$
In Equation (6), $V_i$ denotes the feature volume of the $i$th view, and $\bar{V}$ signifies the mean pixel value at equivalent positions across the three feature volumes.
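A minimal PyTorch sketch of this variance-based fusion is shown below; the tensor layout (views stacked along the first axis) is an illustrative assumption.

```python
import torch

def variance_cost_volume(feature_volumes):
    """Fuse warped feature volumes into a cost volume via Equation (6).

    feature_volumes: (V, B, C, D, H, W), one warped feature volume per view
    over D hypothesized height planes; returns a (B, C, D, H, W) cost volume.
    """
    mean = feature_volumes.mean(dim=0, keepdim=True)
    return ((feature_volumes - mean) ** 2).mean(dim=0)   # variance across the V views
```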
Figure 6. Cost volume construction process.

2.3. Cost Volume Regularization

Noise present in satellite remote sensing images, stemming from factors such as occlusion and illumination, compromises the precision of multi-view stereo matching. Consequently, it becomes imperative to smooth the cost volume. Traditionally, this is achieved through either 3D or 2D convolution, which, however, are associated with suboptimal computational efficiency and significant memory consumption. This paper introduces an enhancement to the GRU recurrent encoder–decoder structure by incorporating a regularization operation for each stage’s constructed cost volume. Subsequently, a softmax operation is executed along the height direction to yield the normalized probability volume. This volume signifies the likelihood of a pixel residing within the current height plane, thereby delineating the confidence level of the height estimate. When juxtaposed with the 3D convolution method [27], the GRU regularization approach coupled with the recursive encoder–decoder structure sequentially regularizes the cost map along the height dimension. This results in enhanced efficiency and reduced memory overhead, proving advantageous for processing expansive satellite images with extensive height search capabilities.
In the cost volume regularization module, as depicted in Figure 7, encoded feature maps at each scale are subject to regularization via convolutional GRU. This process introduces group convolution in lieu of regular convolution, aiming to minimize the parameter count while preserving crucial feature information. This information is subsequently incorporated into the corresponding feature map at an equivalent scale within the decoder. During the decoding phase, the PixelShuffle technique supplants transposed convolution to facilitate the upsampling of the cost map. This approach circumvents the issue of information loss associated with zero-padding inherent in transposed convolution. Furthermore, PixelShuffle amalgamates the values from multiple channels of a single pixel into fewer channels, thereby reducing the channel count while increasing the spatial resolution. This results in enhanced accuracy without an increase in parameter count. The method also upsamples the regularized cost map to match the input image's dimensions and further reduces the channel count to a single channel. Along the height direction, contextual information from previously processed cost maps is captured by the GRU and relayed to the current cost map, facilitating the consolidation of neighborhood information. The regularization module aggregates and refines contextual features across various spatial scales. It determines the probability of each pixel position in the height direction using a normalized spatial-direction cost map coupled with geometric and contextual data in that same direction. This information is then employed for height map inference.
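The sketch below shows, in simplified PyTorch form, how a convolutional GRU can sweep the cost maps plane by plane along the height axis and reduce them to per-plane logits; the cell design, channel counts, and grouping factor are illustrative assumptions rather than the exact regularization network used here.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Convolutional GRU cell; grouped 3x3 convolutions keep the parameter count low
    (input and hidden channel counts must be divisible by `groups`)."""
    def __init__(self, in_ch, hidden_ch, groups=4):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hidden_ch, 2 * hidden_ch, 3, padding=1, groups=groups)
        self.cand = nn.Conv2d(in_ch + hidden_ch, hidden_ch, 3, padding=1, groups=groups)
        self.hidden_ch = hidden_ch

    def forward(self, x, h):
        if h is None:
            h = x.new_zeros(x.size(0), self.hidden_ch, x.size(2), x.size(3))
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde

def regularize_cost_volume(cost_volume, cell, head):
    """Sequentially regularize the D cost maps and return (B, D, H, W) logits.

    cost_volume: (B, C, D, H, W); head: e.g. nn.Conv2d(hidden_ch, 1, 1).
    A softmax over dim=1 of the result gives the normalized probability volume.
    """
    h, logits = None, []
    for d in range(cost_volume.size(2)):       # sweep one cost map at a time along the height axis
        h = cell(cost_volume[:, :, d], h)
        logits.append(head(h))
    return torch.cat(logits, dim=1)
```

Because only one cost map is held in the recurrent state at a time, the memory footprint grows with the image size rather than with the number of height planes, which is the advantage over 3D convolution noted above.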

2.4. Height Map Prediction

The height map regression yields the final estimate as the probability-weighted sum of the hypothesized height values at each pixel position, as given in Equation (7). Here, $h_x^j$ and $P_x^j$ denote the height value and probability of pixel $x$ on the $j$th height hypothesis plane, respectively. This estimation strategy produces continuous height estimates, enabling network parameters to be learned through error back propagation and truly realizing end-to-end network model training. The range of heights in the first stage corresponds to the elevation range covered by the image, parsed from the RPC parameters. While the probability distribution of height values is ideally unimodal, occlusions and textureless areas in the image can cause the distribution of heights to exhibit multiple peaks, which often results in significant deviations in the computed height estimates. Therefore, in the second stage, for each pixel $x$, the uncertainty in the prediction is measured based on the variance $\tilde{V}_x$ of the probability distribution in the height direction. If there are multiple peaks within the height distribution, it suggests that there is no suitable height estimate within the current height range. Consequently, it becomes necessary to adjust the height estimation range according to this uncertainty, making it easier to estimate the height value accurately, as shown in Equation (8).
$$\tilde{h}_x = \sum_{j=1}^{N} h_x^{j}\, P_x^{j} \qquad (7)$$
$$\tilde{V}_x = \sum_{j=1}^{N} P_j(x) \left( H_j(x) - \tilde{h}_x \right)^2 \qquad (8)$$
In Equation (8), $P_j(x)$ and $H_j(x)$ denote the probability and the hypothesized height value of the $j$th height hypothesis plane at pixel $x$, respectively, and $\tilde{h}_x$ signifies the final height estimate at pixel $x$. The adjusted height estimation range $C(x)$ is computed using Equation (9), where $\lambda$ serves as a scalar parameter regulating the intensity of the height range adjustment, with a default value of 1.5.
$$C(x) = \left[ \tilde{h}_x - \lambda \tilde{V}_x,\; \tilde{h}_x + \lambda \tilde{V}_x \right] \qquad (9)$$
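Equations (7)-(9) amount to a soft-argmax over the height planes followed by a variance-driven interval update; a minimal PyTorch sketch is given below, with the tensor shapes chosen for illustration.

```python
import torch

def regress_height(prob, height_planes):
    """Equations (7) and (8): probability-weighted height and its variance.

    prob:          (B, D, H, W) softmax output of the regularization module.
    height_planes: (B, D, H, W) hypothesized height values (or broadcastable).
    """
    height = (prob * height_planes).sum(dim=1)                                   # Eq. (7)
    variance = (prob * (height_planes - height.unsqueeze(1)) ** 2).sum(dim=1)    # Eq. (8)
    return height, variance

def refined_height_range(height, variance, lam=1.5):
    """Equation (9): uncertainty-adjusted search interval for the second stage."""
    return height - lam * variance, height + lam * variance
```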
The loss function proposed in this study for height prediction is implemented in two stages, as delineated by Equation (10). The first stage employs the initial height map to compute the loss independently between the actual and predicted height maps, which are then summed. The second refinement stage modifies the search range of height based on uncertainty to determine the loss between the refined height map and the true height map. The final loss is derived by weighting the losses from both stages using a weighting factor of {0.5, 2}. The smooth L1 loss function [38] calculates the loss value near the origin using a square function, resulting in a smoother output. When there is a significant discrepancy between the predicted and actual height values, the gradient remains constant to prevent gradient explosion during training. Conversely, when the difference is minimal, the gradient dynamically decreases to avoid fluctuations around stable values, making it easier to converge to optimal parameters. This approach enhances robustness against abnormal loss values during training, maintains brightness and color without significant changes, and improves the prediction accuracy of height values for satellite remote sensing images that are susceptible to noise.
$$L_i = \begin{cases} \sum_{x \in \mathrm{Valid}} 0.5 \left( h_x^{i} - \tilde{h}_x^{i} \right)^2, & \left| h_x^{i} - \tilde{h}_x^{i} \right| < 1 \\ \sum_{x \in \mathrm{Valid}} \left( \left| h_x^{i} - \tilde{h}_x^{i} \right| - 0.5 \right), & \text{otherwise} \end{cases} \qquad (10)$$
In Equation (10), $i = 1, 2$ indexes the first and second stages, respectively; Valid denotes the set of pixels with a valid ground-truth height; and $\tilde{h}$ and $h$ represent the height estimate and the height ground truth, respectively. The final total loss is shown in Equation (11).
$$\mathrm{Loss} = 0.5\, L_1 + 2\, L_2 \qquad (11)$$
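Since Equation (10) is the standard smooth-L1 loss applied per stage, the two-stage objective of Equation (11) can be sketched with PyTorch's built-in loss as follows; the assumption here is that both predicted height maps have been brought to the resolution of the ground-truth map (or vice versa) before the comparison.

```python
import torch
import torch.nn.functional as F

def two_stage_loss(pred_coarse, pred_fine, gt_height, valid_mask, weights=(0.5, 2.0)):
    """Weighted sum of per-stage smooth-L1 losses (Equations (10) and (11)).

    valid_mask: boolean mask selecting pixels with a valid ground-truth height.
    """
    l1 = F.smooth_l1_loss(pred_coarse[valid_mask], gt_height[valid_mask])
    l2 = F.smooth_l1_loss(pred_fine[valid_mask], gt_height[valid_mask])
    return weights[0] * l1 + weights[1] * l2
```

Note that PyTorch's smooth_l1_loss averages over the valid pixels rather than summing, which only rescales the objective relative to Equation (10).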

2.5. DSM Reconstruction Methods and Evaluation Metrics

This study initially undertakes geometric consistency detection filtering on the predicted three-view height maps. Specifically, after inputting the two source views and the reference image into the model to infer the respective height maps, the height map of each source view is reprojected into the space of the reference-view height map based on the RPC mapping relationship and then compared with the height map derived from the reference image. During this process, two distinct pixel points in the two height maps may correspond to the same ground point; the Euclidean distance between these points is calculated, and if this distance falls below a predetermined threshold, the two height maps are deemed geometrically consistent at that pixel. If all height maps derived from both the source and reference views exhibit geometric consistency, the matching is considered valid, meaning that the estimated height value at this pixel is reliable. Conversely, if this condition is not met, the point is considered invalid and is excluded from accuracy evaluation. The RPC model then maps each filtered, grid-structured height map into object space, generating a point cloud whose three-dimensional coordinates are stored in the WGS-84 geodetic coordinate system. Following this, the point cloud is converted into plane coordinates using the UTM projection coordinate system. The coverage in the target ground space is then divided into a uniform grid at an interval of 5 m in both the x- and y-directions. Each point is orthogonally projected onto the grid, preserving the maximum height value of all points within each grid cell. Grid cells onto which no point projects are designated as invalid grids. Ultimately, the DSM is stored as a regular grid with a resolution of 5 m. The performance evaluation metrics for DSM reconstruction are as follows (a combined implementation sketch of these metrics is given after the list):
  • Mean absolute error (MAE): This metric represents the average of L1 distances of all grid cell height values between the ground truth DSM and the estimated DSM, as delineated in Equation (12).
$$\mathrm{MAE} = \frac{\sum_{x \in (D \cap \tilde{D})} \left| h_x - \tilde{h}_x \right|}{\sum_{x \in (D \cap \tilde{D})} F\left( x \in (D \cap \tilde{D}) \right)} \qquad (12)$$
In Equation (12), $D$ and $\tilde{D}$ denote the ground truth DSM and the estimated DSM, respectively. The function $F(X)$ counts valid grid cells: it returns 1 when $X$ is true and 0 otherwise. $\tilde{h}_x$ and $h_x$ represent the height estimation value and the height ground truth, respectively.
  • Root mean square error (RMSE): This metric quantifies the standard deviation of the height residuals over all grid cells between the ground truth DSM and the estimated DSM, as delineated in Equation (13).
$$\mathrm{RMSE} = \sqrt{ \frac{\sum_{x \in (D \cap \tilde{D})} \left( h_x - \tilde{h}_x \right)^2}{\sum_{x \in (D \cap \tilde{D})} F\left( x \in (D \cap \tilde{D}) \right)} } \qquad (13)$$
  • Values < 2.5 m and <7.5 m: The proportion of grid cells where the L1 distance (also known as Manhattan distance) between the estimated height value and the actual height value is less than 2.5 m and 7.5 m, respectively. The calculation is delineated in Equation (14).
$$\frac{\sum_{x \in (D \cap \tilde{D})} F\left( \left| h_x - \tilde{h}_x \right| < \alpha \right)}{\sum_{x \in (D \cap \tilde{D})} F\left( x \in (D \cap \tilde{D}) \right)}, \qquad \alpha = 2.5,\ 7.5 \qquad (14)$$
  • Comp: Percentage of grid cells with valid height values in the final DSM, as delineated in Equation (15).
$$\mathrm{Comp} = \frac{\sum_{x \in (D \cap \tilde{D})} F\left( x \in (D \cap \tilde{D}) \right)}{\sum_{x \in D} F\left( x \in D \right)} \qquad (15)$$
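Taken together, Equations (12)-(15) can be computed directly on the two aligned 5 m grids. The NumPy sketch below assumes that invalid grid cells are stored as NaN, which is an implementation choice rather than something fixed by the method.

```python
import numpy as np

def dsm_metrics(gt_dsm, est_dsm):
    """Compute MAE, RMSE, the <2.5 m / <7.5 m ratios, and Comp (Equations (12)-(15)).

    gt_dsm, est_dsm: 2-D arrays on the same 5 m grid; invalid cells are NaN.
    """
    gt_valid = ~np.isnan(gt_dsm)
    both_valid = gt_valid & ~np.isnan(est_dsm)          # cells valid in both DSMs
    diff = np.abs(gt_dsm[both_valid] - est_dsm[both_valid])
    return {
        "MAE": diff.mean(),                              # Eq. (12)
        "RMSE": np.sqrt((diff ** 2).mean()),             # Eq. (13)
        "<2.5m": (diff < 2.5).mean(),                    # Eq. (14), alpha = 2.5
        "<7.5m": (diff < 7.5).mean(),                    # Eq. (14), alpha = 7.5
        "Comp": both_valid.sum() / gt_valid.sum(),       # Eq. (15)
    }
```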

3. Experiments and Results

3.1. Experimental Environment and Dataset

The experimental requirements for the computer hardware in this study are stringent, and the open-source satellite image dataset is extensive. During parallel processing of each batch of images, numerous matrix operations necessitate high-performance video memory to accommodate the associated overheads. The configuration details and parameter settings for the experimental environment are presented in Table 1.
This paper utilizes the published WHU-TLC dataset, sourced from the three-line array camera (TLC) installed on China's ZY-3 satellite. The nadir-looking camera captures images with a ground resolution of 2.1 m, while those captured by the forward- and backward-looking cameras have a resolution of 2.5 m. The overlap rate of these three views exceeds 95%, making them suitable for DSM reconstruction. The parameters of the satellite camera are incorporated into the provided RPC model and have been pre-calibrated for subpixel reprojection accuracy, allowing for direct use after parsing. The model evaluation employs the first version of the dataset, consisting of large-scale satellite images in 173 groups. Each group consists of 16-bit panchromatic tri-view satellite images measuring 5120 × 5120 pixels, accompanied by the corresponding RPC parameters. Additionally, it includes a ground truth DSM generated by photogrammetry software, supported by high-precision lidar observations and ground control points. An example of the large-size dataset is shown in Figure 8, which displays satellite images from three different viewpoints and the ground truth DSM from left to right, covering an approximate area of 125 km2.
The training model utilizes the second iteration of the dataset, comprising 5011 sets of training data and 1791 sets of validation data. Each set includes a satellite image in PNG format and its corresponding RPC parameters captured from three distinct viewpoints. Additionally, each set contains a ground truth height map in pfm format, which serves as the true label. This ground truth is derived by projecting the initial version of the DSM onto the image with RPC parameters, thereby capturing the actual height information. Both the satellite images and the ground truth height maps are cropped to a uniform size of 384 × 768. The overlap rate between these cropped images in both horizontal and vertical dimensions is maintained at 5%. Such small-scale image data proves suitable for model training and validation, as illustrated in Figure 9.

3.2. DSM Reconstruction Process

In this experiment, we designed a comprehensive TC-SatMVS network framework. This framework comprises two stages, transitioning from coarse to fine, and consists of four modules: feature extraction, RPC-mapping-based cost volume construction, cost volume regularization, and height map prediction. The framework was implemented using the PyTorch deep learning framework and trained on a workstation equipped with an NVIDIA GeForce RTX 3060 GPU (12 GB). We utilized the WHU-TLC dataset for training and evaluating the proposed network, maintaining consistent hyperparameters across all experiments. We first compared different hyperparameter choices: the learning rate schedulers considered were StepLR, LambdaLR, and SequentialLR; the candidate optimizers were SGD, RMSprop, Adam, and AdamW [39]; and the commonly used weight initialization methods were Xavier and Kaiming [40]. We determined the best hyperparameters after five training epochs; when testing one hyperparameter, the other hyperparameters remained unchanged. The height prediction results of the hyperparameter sensitivity experiment are shown in Table 2.
During the training phase, the batch size was set to 1, and RMSprop [41] was chosen as the optimizer for weight parameter updates; this choice incorporates a decay factor that regulates the descent speed of the second-order gradient term. The network model training employed a straightforward cross-validation scheme, with model performance validated on the validation set after each training epoch. Based on these validation results, hyperparameters could be adjusted in real time, allowing for an initial estimation of model performance. All modules underwent iterative training for 25 epochs, starting with an initial learning rate of 0.001. A dynamic learning rate adjustment was implemented, reducing the learning rate by a factor of two after the 6th, 8th, and 12th epochs. The model training process was visualized using the TensorBoard tool [42], as depicted in Figure 10. As the number of iterations increased, saturation was observed in the various indicators, and the weight parameters converged rapidly towards their optimal values.
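The optimizer and learning-rate schedule described above can be set up as follows; the stand-in model is a placeholder, and the RMSprop smoothing constant (alpha) is an assumed value, since the exact decay factor is not stated in the text.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(1, 1, 3, padding=1)   # stand-in for the full two-stage network

# RMSprop with a decay factor on the running average of squared gradients.
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3, alpha=0.9)
# Halve the learning rate after the 6th, 8th, and 12th epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[6, 8, 12], gamma=0.5)

for epoch in range(25):                  # 25 training epochs, batch size 1
    # ... one pass over the training set, then one validation pass ...
    scheduler.step()
```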
After training the final model using existing datasets, it is essential to save both the model structure and weight parameters. Subsequently, these saved elements can be reloaded for inference of height maps. The ultimate objective is to reconstruct a high-precision DSM based on photogrammetry principles, which is achieved by generating height maps. This reconstruction process is illustrated in Figure 11. The detailed steps are as follows:
  • The open-source dataset comprises a large-scale resource, specifically the ZY-3 satellite image of dimensions 5120 × 5120 captured in three distinct views. Initially, this image is cropped based on the corresponding overlap rate to yield an output that aligns with the hardware specifications and the capacity of the model being utilized.
  • The network model is utilized to estimate the height map of small-scale images, with 768 × 384 small-scale satellite images being inputted.
  • The height map from various viewpoints undergoes threshold filtering and is subjected to a left–right consistency check. This is achieved by projecting the data back to the WGS-84 geodetic coordinate system within the object plane. Subsequently, the three-dimensional geographical coordinates, which include latitude, longitude, and altitude, are transformed into planar coordinates using the UTM projection system. These transformed coordinates are then stored as point clouds.
  • The grid is partitioned into cells, and the point cloud is resampled to generate a digital surface model (DSM); a minimal sketch of this UTM projection and gridding step is given after this list. The resulting DSM is subsequently compared with the ground truth DSM to assess the efficacy of the reconstruction process.
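Step 4 (together with the UTM conversion of step 3) can be sketched with pyproj and NumPy as follows; the EPSG code of the UTM zone depends on the scene longitude and is passed in here as an assumption, and the function name is ours.

```python
import numpy as np
from pyproj import Transformer

def rasterize_point_cloud(lat, lon, hei, utm_epsg, cell=5.0):
    """Project WGS-84 points into a UTM zone and keep the maximum height per 5 m cell.

    lat, lon, hei: 1-D NumPy arrays of point coordinates; utm_epsg: e.g. "EPSG:32650".
    Cells that receive no point remain NaN ("invalid grids").
    """
    to_utm = Transformer.from_crs("EPSG:4326", utm_epsg, always_xy=True)
    x, y = to_utm.transform(lon, lat)                    # planar coordinates in meters
    x, y = np.asarray(x), np.asarray(y)
    col = ((x - x.min()) / cell).astype(int)
    row = ((y.max() - y) / cell).astype(int)             # row 0 at the northern edge
    dsm = np.full((row.max() + 1, col.max() + 1), np.nan)
    for r, c, h in zip(row, col, hei):                   # max-height aggregation per cell
        if np.isnan(dsm[r, c]) or h > dsm[r, c]:
            dsm[r, c] = h
    return dsm
```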
Figure 11. DSM reconstruction process.

3.3. Analysis of Experimental Results

Satellite images, because of their expansive coverage and rich ground information, are often subject to various factors that can impact image quality. These include the conditions of the observation equipment, natural phenomena such as clouds and fog, and the transmission technology used, all of which can adversely affect image-processing tasks. In the context of 3D reconstruction tasks based on satellite images, three primary factors influence reconstruction performance including image area discontinuity, occlusion or illumination conditions, and weak texture. The efficacy of the model presented in this paper is evaluated through the visualization results of three sets of images and height maps utilized in the experiment. Figure 12a,b show the reference images of three sets of satellite images in a small-sized dataset and the corresponding height map estimation results, where the red box indicates a discontinuous area. It is evident that there is a significant pixel value difference between this area and its adjacent areas in the reference image, leading to matching difficulties. However, the model proposed herein employs a self-attention method, learning the correlation among adjacent pixels, thereby accurately predicting the height value of this area range. The purple box represents a darker area, which may cause mismatching due to insufficient illumination or terrain occlusion. This results in stripe pixel blocks in the height map, leading to reconstruction failure. The model presented in this paper, through its feature extraction and regularization module, filters noise while obtaining richer feature information from the image. This reduces the impact of dark areas, estimates the height value of these areas accurately, and describes the undulation changes in the terrain in these areas precisely. The blue box denotes an area with a weak texture. This area has similar adjacent pixel values and lacks distinct texture information, thereby increasing the difficulty of matching. The method proposed in this paper can fuse rich shallow and deep features, extract important detail information, and thus aid in predicting the height value of areas with weak texture.
The visualization outcomes of DSM reconstruction are depicted in Figure 13. These figures utilize large-scale satellite images as input, which are subsequently fed into a deep network to predict the height map post-cropping. Upon comparing the height maps inferred by four models, including the method proposed in this paper, it becomes evident that regions with significant height variation, as indicated by the red box in the first row of height maps in the figure, present challenges when uneven shading illumination is observed within the range of observation. This irregularity hinders the accurate matching of homonymous image points and compromises the prediction of ground target point heights. In contrast, the height maps generated using the method presented in this study exhibit a high degree of contrast and refinement when compared with those produced by SatMVS (RED-Net). This suggests that both methods can accurately predict height values through their deep network models. The superior performance of our method can be attributed to its feature extraction module’s robust feature learning capability for satellite images. This module captures global context feature information by integrating a self-attention mechanism and fusing both shallow and deep data. Additionally, the regularization module employs an enhanced GRU recursive recurrent encoder–decoder structure based on variance. This design reduces system memory overhead and retains more crucial feature information from satellite images compared with the cost volume regularization method of 3D convolution. It also mitigates accuracy degradation caused by height value prediction after context information loss during the smoothing noise process. Furthermore, based on preliminary height map predictions, our method adjusts the height estimation range using an uncertainty strategy grounded in variance, ensuring precise height value estimation.
The performance of DSM reconstruction is evaluated by comparing the proposed method with the SatMVS (RED-Net), SatMVS (CasMVSNet), and SatMVS (UCS-Net) reconstruction networks based on the RPC model. The comparison results are presented in Table 3. The proposed model outperforms the others in terms of the root mean square error and completeness indicators. While maintaining the best reconstruction performance, the proposed model reduces the number of parameters by 42.93% and the model size by 41.75% compared with SatMVS (RED-Net). In the visualization results of the DSM reconstruction, the cluster statistics of the ground truth DSM accurately reflect the number of pixels in each elevation range within the area. The invalid matching areas of all four reconstruction methods are represented by white marks, indicating pixels affected by occlusion, shadowing, and textureless surfaces, which lead to reconstruction failure. The experiment identifies the densest area of invalid matching through an image processing method that extracts the white points. The extraction results reveal that there are significantly fewer white points in the DSM reconstructed by the proposed TC-SatMVSNet network than in those of the other methods. This indicates that the proposed model exhibits superior robustness in challenging matching areas and achieves superior reconstruction results.

4. Discussion

This study presents a lightweight deep network model that integrates the Transformer self-attention mechanism and CNN inductive bias into deep learning methodologies. We employ two versions of the WHU-TLC dataset for training our model and evaluating DSM reconstruction performance. Our findings indicate that our model significantly reduces the parameter count while enhancing accuracy, with RMSE and Comp demonstrating superior performance. The model is capable of utilizing high-quality satellite images and RPC parameters from multiple perspectives as input, generating predicted height values and thereby creating an end-to-end prediction model. The experimental results demonstrate the robustness of our predicted height maps. In instances of discontinuous image regions, the model's self-attention mechanism adeptly captures neighborhood information, thereby facilitating precise prediction of the region's height value. In darker areas characterized by occlusions and inadequate lighting, the model's regularization module mitigates noise interference, enabling it to amalgamate neighborhood information in the height direction and geometric information across various scales. In areas with weak texture, the model enhances the depiction of terrain features by extracting both deep and shallow information, thereby elevating the prediction accuracy of height values. In reconstructing a DSM using these height maps, it is important to consider two scenarios that can impact the performance of DSM reconstruction. The first scenario involves multiple views of images that cannot be accurately matched during the reconstruction process because of factors such as occlusion or illumination; this results in an ineffective height value at a specific grid point, leading to a failure in the reconstruction. The second scenario involves a height value at a given image coordinate position exceeding the height range specified in the RPC file for that area, also resulting in a failure. The method proposed in this paper showed improved performance across the various evaluation indicators. However, when inputting satellite images, because of video memory limitations, the original large-sized images must be cropped. Despite this cropping operation being conducted at a 5% overlap rate, meeting the requirements of stereo measurement and small-size image stitching, it inevitably alters the pixel distribution characteristics of the original large-sized images, thereby impacting reconstruction performance. Therefore, under optimal hardware conditions, large-sized images of 5120 × 5120 pixels could be utilized as input to reconstruct a DSM, yielding superior reconstruction results. Furthermore, the open-source training dataset proved sufficient to support the training of both lightweight models and those with millions of parameters. In future research, constructing a DSM reconstruction dataset from higher-resolution satellite imagery would hold significant implications for the advancement of three-dimensional reconstruction methods for satellite images.

5. Conclusions

This paper introduces a streamlined framework for satellite image DSM reconstruction termed TC-SatMVSNet. This innovative approach amalgamates the Transformer self-attention mechanism with the inductive bias characteristics of CNNs. The design incorporates four distinct modules: feature extraction, cost volume construction, cost volume regularization, and height map prediction. These modules effectively mitigate the impact of factors such as illumination and occlusion on image matching, thereby yielding high-precision DSM products. The efficacy of the height prediction results and reconstruction effects is rigorously analyzed using open-source datasets for training and evaluation purposes. The findings indicate that the proposed method significantly diminishes the number of model parameters by 42.93% in comparison with existing techniques, while simultaneously achieving superior RMSE and Comp, the second-best MAE, and comprehensively enhancing reconstruction performance. Furthermore, the model size is a mere 4.91 MB, thereby reducing both software and hardware overhead associated with model development and streamlining the design process of the satellite image three-dimensional reconstruction system.

Author Contributions

Conceptualization, Y.Z.; Methodology, Y.Z. and Y.L.; Validation, Y.Z.; Formal analysis, S.G. and G.L.; Investigation, Y.Z.; Resources, Z.W.; Data curation, D.H.; Writing—original draft, Y.Z.; Writing—review & editing, Y.Z.; Project administration, D.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Scientific and Technological Innovation Project for the Protection and Utilization of Black Land with the grant number XDA28050100.

Data Availability Statement

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Kai, F.; Yi, W.; Rui, Z. Deconstruction of Related Technologies of Ground Image Processing Based on High-Resolution Satellite Remote Sensing Images. Mob. Inf. Syst. 2023, 2023, 2896471. [Google Scholar] [CrossRef]
  2. Xinming, T.; Qingxing, Y.; Xiaoming, G. China DSM Generation and Accuracy Assessment Using ZY3 Images. In Proceedings of the IGARSS 2018—2018 IEEE International Geoscience and Remote Sensing Symposium, Valencia, Spain, 22–27 July 2018; pp. 6757–6760. [Google Scholar] [CrossRef]
  3. Yanan, Z.; Fuguang, D.; Changqing, Z. DEM Extraction and Accuracy Assessment Based on ZY-3 Stereo Images. In Proceedings of the 2012 2nd International Conference on Computer Science and Network Technology, Changchun, China, 29–31 December 2012; pp. 1439–1442. [Google Scholar] [CrossRef]
  4. Yang, W.; Li, X.; Yang, B.; Yang, Y.; Yan, Y. Dense Matching for DSM Generation From ZY-3 Satellite Imagery. In Proceedings of the IGARSS 2019—2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019; pp. 3677–3680. [Google Scholar] [CrossRef]
  5. Hou, Y.; Liu, C.; An, B.; Liu, Y. Stereo Matching Algorithm Based on Improved Census Transform and Texture Filtering. Optik 2022, 249, 168186. [Google Scholar] [CrossRef]
  6. Lv, D.; Jiao, G. Experiment of Stereo Matching Algorithm Based on Binocular Vision. J. Phys. Conf. Ser. 2020, 1574, 012173. [Google Scholar] [CrossRef]
  7. Li, G.; Song, H.; Li, C. Matching Algorithm and Parallax Extraction Based on Binocular Stereo Vision. In Proceedings of the Smart Innovations in Communication and Computational Sciences; Panigrahi, B.K., Trivedi, M.C., Mishra, K.K., Tiwari, S., Singh, P.K., Eds.; Springer: Singapore, 2019; pp. 347–355. [Google Scholar] [CrossRef]
  8. Hartley, R.I.; Saxena, T. The Cubic Rational Polynomial Camera Model. In Proceedings of the Image Understanding Workshop, New Orleans, LA, USA, 11–14 May 1997; Volume 649, p. 653. [Google Scholar]
  9. Zhang, G.; Yuan, X. On RPC Model of Satellite Imagery. Geo-Spat. Inf. Sci. 2006, 9, 285–292. [Google Scholar] [CrossRef]
  10. Zhang, L.; Balz, T.; Liao, M. Satellite SAR Geocoding with Refined RPC Model. ISPRS J. Photogramm. Remote Sens. 2012, 69, 37–49. [Google Scholar] [CrossRef]
  11. Qin, R. RPC Stereo Processor (RSP)—A Software Package for Digital Surface Model and Orthophoto Generation from Satellite Stereo Imagery. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2016, 3, 77–82. [Google Scholar] [CrossRef]
  12. De Franchis, C.; Meinhardt-Llopis, E.; Michel, J.; Morel, J.-M.; Facciolo, G. An Automatic and Modular Stereo Pipeline for Pushbroom Images. In Proceedings of the ISPRS Annals of Photogrammetry, Remote Sensing and Spatial Information Sciences, Zürich, Switzerland, 5–7 September 2014; Volume II–3, pp. 49–56. [Google Scholar] [CrossRef]
  13. Facciolo, G.; de Franchis, C.; Meinhardt, E. MGM: A Significantly More Global Matching for Stereovision. In Proceedings of the BMVC 2015, Swansea, UK, 7–10 September 2015. [Google Scholar] [CrossRef]
  14. Mandun, Z.; Lichao, Q.; Guodong, C.; Ming, Y. A Triangulation Method in 3D Reconstruction from Image Sequences. In Proceedings of the 2009 Second International Conference on Intelligent Networks and Intelligent Systems, Tianjian, China, 1–3 November 2009; pp. 306–308. [Google Scholar] [CrossRef]
  15. Schönberger, J.L.; Frahm, J.-M. Structure-from-Motion Revisited. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 4104–4113. [Google Scholar] [CrossRef]
  16. Liu, Y.; Li, C.; Gong, J. An Object Reconstruction Method Based on Binocular Stereo Vision. In Proceedings of the Intelligent Robotics and Applications, Wuhan, China, 16–18 August 2017; Huang, Y., Wu, H., Liu, H., Yin, Z., Eds.; Springer International Publishing: Cham, Switzerland, 2017; pp. 486–495. [Google Scholar] [CrossRef]
  17. Zhang, K.; Snavely, N.; Sun, J. Leveraging Vision Reconstruction Pipelines for Satellite Imagery. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Republic of Korea, 27–28 October 2019; pp. 2139–2148. [Google Scholar] [CrossRef]
  18. Liu, J.; Ji, S. A Novel Recurrent Encoder-Decoder Structure for Large-Scale Multi-View Stereo Reconstruction from An Open Aerial Dataset. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar] [CrossRef]
  19. Žbontar, J.; Lecun, Y. Computing the Stereo Matching Cost with a Convolutional Neural Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015. [Google Scholar] [CrossRef]
  20. Ji, S.; Liu, J.; Lu, M. CNN-Based Dense Image Matching for Aerial Remote Sensing Images. Photogramm. Eng. Remote Sens. 2019, 85, 415–424. [Google Scholar] [CrossRef]
  21. Yao, Y.; Luo, Z.; Li, S.; Fang, T.; Quan, L. MVSNet: Depth Inference for Unstructured Multi-View Stereo. In Proceedings of the Computer Vision—ECCV 2018, Munich, Germany, 8–14 September 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 785–801. [Google Scholar] [CrossRef]
  22. Gu, X.; Fan, Z.; Zhu, S.; Dai, Z.; Tan, F.; Tan, P. Cascade Cost Volume for High-Resolution Multi-View Stereo and Stereo Matching. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 2492–2501. [Google Scholar] [CrossRef]
  23. Cheng, S.; Xu, Z.; Zhu, S.; Li, Z.; Li, L.E.; Ramamoorthi, R.; Su, H. Deep Stereo Using Adaptive Thin Volume Representation With Uncertainty Awareness. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 2521–2531. [Google Scholar] [CrossRef]
  24. Chen, K.; Zhou, Z.; Li, Y.; Ji, X.; Wu, J.; Coatrieux, J.-L.; Chen, Y.; Coatrieux, G. RED-Net: Residual and Enhanced Discriminative Network for Image Steganalysis in the Internet of Medical Things and Telemedicine. IEEE J. Biomed. Health Inform. 2024, 28, 1611–1622. [Google Scholar] [CrossRef] [PubMed]
  25. Shewalkar, A.; Nyavanandi, D.; Ludwig, S. Performance Evaluation of Deep Neural Networks Applied to Speech Recognition: RNN, LSTM and GRU. J. Artif. Intell. Soft Comput. Res. 2019, 9, 235–245. [Google Scholar] [CrossRef]
  26. Dey, R.; Salem, F.M. Gate-Variants of Gated Recurrent Unit (GRU) Neural Networks. In Proceedings of the 2017 IEEE 60th International Midwest Symposium on Circuits and Systems (MWSCAS), Boston, MA, USA, 6–9 August 2017; pp. 1597–1600. [Google Scholar] [CrossRef]
  27. Singh, R.D.; Mittal, A.; Bhatia, R.K. 3D Convolutional Neural Network for Object Recognition: A Review. Multimed. Tools Appl. 2019, 78, 15951–15995. [Google Scholar] [CrossRef]
  28. Juarez-Salazar, R.; Zheng, J.; Diaz-Ramirez, V. Distorted Pinhole Camera Modeling and Calibration. Appl. Opt. 2020, 59, 11310–11318. [Google Scholar] [CrossRef] [PubMed]
  29. Gao, J.; Liu, J.; Ji, S. Rational Polynomial Camera Model Warping for Deep Learning Based Satellite Multi-View Stereo Matching. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 6128–6137. [Google Scholar] [CrossRef]
  30. Bi, J.; Zhu, Z.; Meng, Q. Transformer in Computer Vision. In Proceedings of the 2021 IEEE International Conference on Computer Science, Electronic Information Engineering and Intelligent Control Technology (CEI), Fuzhou, China, 24–26 September 2021; pp. 178–188. [Google Scholar] [CrossRef]
  31. Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z.; Tang, Y.; Xiao, A.; Xu, C.; Xu, Y.; et al. A Survey on Vision Transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 87–110. [Google Scholar] [CrossRef] [PubMed]
  32. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015, Munich, Germany, 5–9 October 2015; Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F., Eds.; Springer International Publishing: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar] [CrossRef]
  33. Sarvamangala, D.R.; Kulkarni, R.V. Convolutional Neural Networks in Medical Image Understanding: A Survey. Evol. Intell. 2022, 15, 1–22. [Google Scholar] [CrossRef] [PubMed]
  34. Lu, H.; Zhang, Q. Applications of Deep Convolutional Neural Network in Computer Vision. J. Data Acquis. Process. 2016, 31, 1–17. [Google Scholar] [CrossRef]
  35. Alzubaidi, L.; Zhang, J.; Humaidi, A.J.; Al-Dujaili, A.; Duan, Y.; Al-Shamma, O.; Santamaría, J.; Fadhel, M.A.; Al-Amidie, M.; Farhan, L. Review of Deep Learning: Concepts, CNN Architectures, Challenges, Applications, Future Directions. J. Big Data 2021, 8, 53. [Google Scholar] [CrossRef]
  36. Hisham, M.B.; Yaakob, S.N.; Raof, R.A.A.; Nazren, A.B.A.; Wafi, N.M. Template Matching Using Sum of Squared Difference and Normalized Cross Correlation. In Proceedings of the 2015 IEEE Student Conference on Research and Development (SCOReD), Kuala Lumpur, Malaysia, 13–14 December 2015; pp. 100–104. [Google Scholar] [CrossRef]
  37. Bindu, N.S.; Sheshadri, H.S. A Comparative Study of Correlation Based Stereo Matching Algorithms: Illumination and Exposure. In Proceedings of the Intelligent Computing, Communication and Devices; Jain, L.C., Patnaik, S., Ichalkaranje, N., Eds.; Springer: New Delhi, India, 2015; pp. 191–201. [Google Scholar] [CrossRef]
  38. Wei, L.; Zheng, C.; Hu, Y. Oriented Object Detection in Aerial Images Based on the Scaled Smooth L1 Loss Function. Remote Sens. 2023, 15, 1350. [Google Scholar] [CrossRef]
  39. Feng, Y. An Overview of Deep Learning Optimization Methods and Learning Rate Attenuation Methods. Hans J. Data Min. 2018, 8, 186–200. [Google Scholar] [CrossRef]
  40. Ding, X.; Yang, H.; Chan, R.H.; Hu, H.; Peng, Y.; Zeng, T. A New Initialization Method for Neural Networks with Weight Sharing. In Proceedings of the Mathematical Methods in Image Processing and Inverse Problems, Beijing, China, 21–24 April 2021; Tai, X.-C., Wei, S., Liu, H., Eds.; Springer: Singapore, 2021; pp. 165–179. [Google Scholar] [CrossRef]
  41. Zou, F.; Shen, L.; Jie, Z.; Zhang, W.; Liu, W. A Sufficient Condition for Convergences of Adam and RMSProp. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 11119–11127. [Google Scholar] [CrossRef]
  42. Luus, F.P.S.; Khan, N.; Akhalwaya, I. Active Learning with TensorBoard Projector. arXiv 2019. [Google Scholar] [CrossRef]
Figure 2. Feature extraction module.
Figure 3. Coding structure based on self-attention mechanisms.
Figure 4. PixelShuffle diagram.
Figure 5. Affine transformation procedure of the RPC model. (a) The first stage uses a hypothetical height plane; (b) the second stage uses adaptive height planes.
Figure 7. Cost volume regularization module.
Figure 8. Large-sized satellite image and ground truth DSM dataset example.
Figure 9. Small-sized satellite image and real height map dataset example. (a) Satellite image. (b) True height map.
Figure 10. Model training visualization results. Orange represents the model training iteration process, and blue represents the validation process.
Figure 12. Height map inference results. (a) Small-sized satellite image. (b) Height map estimation results.
Figure 13. Visual results of DSM reconstruction.
Table 1. Experimental environment configuration and parameter settings.
Name | Version, Parameters, and Role
Operating system | Windows 11 (Microsoft, Redmond, WA, USA)
CPU configuration | Intel(R) Core(TM) i5-12500 @ 3.10 GHz (Intel, Chandler, AZ, USA)
RAM | 16.0 GB
GPU configuration | NVIDIA GeForce RTX 3060, 12.0 GB (NVIDIA, Santa Clara, CA, USA)
Deep learning framework | PyTorch 1.8.0 (Meta, San Francisco, CA, USA); provides GPU acceleration
Parallel computing architecture | CUDA 11.1 / cuDNN 7.6.5; improves GPU parallel computing capability
Programming language | Python 3.7.15
Management software | Anaconda3 (Anaconda, Austin, TX, USA); environment manager
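As a quick sanity check (a minimal illustrative sketch, not part of the paper's pipeline), the framework and GPU versions listed in Table 1 can be confirmed from Python before training; the expected values in the comments follow Table 1.

import torch

print(torch.__version__)                  # expected: 1.8.0 (CUDA build)
print(torch.version.cuda)                 # expected: 11.1
print(torch.backends.cudnn.version())     # expected: 7605, i.e., cuDNN 7.6.5
print(torch.cuda.is_available())          # True when the GPU is visible to PyTorch
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(props.name)                                       # e.g., NVIDIA GeForce RTX 3060
    print(round(props.total_memory / 1024**3, 1), "GB")     # approx. 12.0 GB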
Table 2. Experimental results of hyperparameter sensitivity of the model.
Parameter Name | Parameter Value | Loss (m) ⬇ | MAE (m) ⬇ | RMSE (m) ⬇ | <2.5 m (%) ⬆ | <7.5 m (%) ⬆
Initial learning rate | 0.002 | 3.205 | 2.431 | 3.751 | 79.91 | 96.09
Initial learning rate | 0.001 | 3.163 | 2.109 | 3.628 | 79.13 | 96.72
Initial learning rate | 0.0001 | 3.192 | 2.317 | 3.940 | 78.39 | 96.13
Learning rate scheduler | StepLR | 3.151 | 2.212 | 4.198 | 79.63 | 96.62
Learning rate scheduler | LambdaLR | 3.195 | 2.278 | 3.832 | 78.93 | 96.01
Learning rate scheduler | SequentialLR | 3.391 | 2.319 | 4.271 | 76.27 | 96.26
Optimizer | SGD | 3.209 | 2.391 | 3.821 | 78.95 | 96.31
Optimizer | RMSprop | 2.995 | 2.112 | 3.353 | 80.32 | 96.77
Optimizer | Adam | 3.281 | 2.302 | 3.611 | 79.26 | 96.39
Optimizer | AdamW | 3.271 | 2.293 | 3.297 | 79.79 | 96.64
Weight initialization | Xavier | 3.112 | 2.209 | 3.416 | 79.89 | 96.69
Weight initialization | Kaiming | 3.097 | 2.261 | 3.304 | 80.51 | 96.38
"⬆" indicates a positive indicator (the larger the value, the better); "⬇" indicates a negative indicator (the smaller the value, the better).
Table 3. Comparison of the experimental results using the reconstructed networks.
Method | MAE (m) ⬇ | RMSE (m) ⬇ | <2.5 m (%) ⬆ | <7.5 m (%) ⬆ | Comp (%) ⬆ | Params | Model Size (MB)
SatMVS (RED-Net) | 1.945 | 4.070 | 77.93 | 96.59 | 82.29 | 1,094,523 | 8.43
SatMVS (CasMVSNet) | 2.020 | 3.841 | 76.79 | 96.73 | 81.54 | 934,304 | 7.20
SatMVS (UCS-Net) | 2.026 | 3.921 | 77.01 | 96.54 | 82.21 | 938,496 | 7.24
TC-SatMVSnet | 1.963 | 3.811 | 77.21 | 96.58 | 82.53 | 624,546 | 4.91
"⬆" indicates a positive indicator (the larger the value, the better); "⬇" indicates a negative indicator (the smaller the value, the better).
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
