1. Introduction
In recent years, with the rapid development of aerospace technology, remote sensing imagery analysis for high-resolution images acquired by aerial or satellite sensors has received extensive attention. Learning height information from single aerial images, as one of the important tasks in remote sensing imagery analysis, provides geometric information for the 3D reconstruction of ground scenes and is widely used in a variety of applications, such as urban planning [1], change detection [2], and disaster monitoring [3]. Recently, thriving deep learning technology has made tremendous progress in the photogrammetry and remote sensing communities [4,5,6,7]. Similarly, height estimation from single aerial images mainly adopts deep-learning-based methods, including methods based on convolutional neural networks (CNNs), methods based on generative adversarial networks (GANs), and methods based on multi-task learning (MTL).
Compared with other images, remote sensing images have more complex spectral characteristics, and objects with different heights may have similar appearances due to similar materials, such as building roofs and roads. When a deep neural network extracts features from a single image, it may therefore learn mismatched height–appearance relationships, resulting in inaccurate height estimation. Generally, there is a geometric correlation between the height information and the semantic information of a remote sensing scene. Compared with treating height estimation and semantic segmentation as two independent tasks, multi-task learning methods can exploit the height features and semantic features extracted from the image to achieve information complementarity and leverage multi-source supervision to improve predictive performance. Therefore, this paper performs height estimation and semantic segmentation from single aerial images simultaneously in a unified framework.
Several recent works have shown that height estimation and semantic segmentation can benefit from each other, mainly based on the implicit assumption that changes in height generally correspond to changes in class [8,9]. However, although height cues and semantic cues are related, they are not completely consistent. For example, objects within the same class may have different heights, while objects with the same height may belong to different classes. Therefore, straightforward fusion (summation or concatenation) of height features and semantic features allows inconsistent features to negatively impact the shared features, leading to less accurate predictions. In addition, estimating height from single images is generally regarded as a pixel-level height regression task. However, the wide range of height values makes it challenging to regress an accurate height value directly. Under the direct regression paradigm, existing methods generally suffer from slow convergence or sub-optimal solutions.
In this paper, a self- and cross-enhancement network (SCE-Net) is proposed to jointly learn height information and semantic labels from single aerial images under the framework of multi-task learning. Specifically, the SCE-Net first exploits the backbone network to extract shared features for the two tasks from the input image. Then, a feature separation–fusion module (FSFM) is designed to separate task-aware features from the shared features and to fuse cross-task features based on an attention mechanism, achieving cross-enhancement of the task-related feature representations. In addition, to address the difficulty of regressing over a wide height range, the height range is discretized into several intervals, and a height-guided feature distance loss and a semantic-guided feature distance loss are designed to accomplish self-enhancement of the feature representations based on deep metric learning. To verify the effectiveness of the proposed method, extensive experiments are conducted on two public datasets, namely, the Vaihingen dataset and the Potsdam dataset. Experimental results demonstrate that the proposed method outperforms recent state-of-the-art height estimation methods and achieves performance comparable to that of the compared semantic segmentation methods.
The main contributions include the following:
A multi-task learning network, called self- and cross-enhancement network (SCE-Net), is proposed to simultaneously perform height estimation and semantic segmentation from single aerial images under a unified framework.
To effectively integrate the height and semantic cues of the scene, a feature separation–fusion module (FSFM) is constructed to separate the shared image features into task-aware features, and selectively fuse the cross-task features based on an attention mechanism.
A height-guided feature distance loss and a semantic-guided feature distance loss are designed to achieve task-guided representation enhancement using the deep metric learning method.
The paper is organized as follows: a brief review of related works, including height estimation, semantic segmentation, and multi-task learning, is given in Section 2. Section 3 introduces the proposed self- and cross-enhancement network (SCE-Net), the datasets, evaluation indicators, and implementation details. Extensive experimental results and evaluations are reported in Section 4. In Section 5, the effectiveness of each component proposed in this work is analyzed and discussed. Finally, Section 6 concludes this work.
3. Materials and Methods
In this section, an overview of the proposed self- and cross-enhancement network (SCE-Net) is first given. Then, the feature separation–fusion module (FSFM) and the task-guided representation enhancement are introduced, including the height-guided feature distance loss and the semantic-guided feature distance loss. After that, the multi-task objective function, datasets, evaluation indicators, and implementation details are described. The notation for the important symbols used in this work is shown in Table 1.
3.1. Overview
In this paper, pixel-level height maps and semantic labels are simultaneously predicted from single aerial images under the multi-task learning framework. The proposed SCE-Net employs an encoder–decoder architecture, and the whole network consists of three parts: a backbone for feature extraction, a feature separation–fusion module, and a multi-task predictor. Unlike single-task learning methods, the SCE-Net includes a shared encoder and two task-related decoders. The overall network architecture is shown in Figure 1.
Concretely, the network adopts ResNet-50 or ResNet-101 as the backbone. For a three-channel input image, the outputs of the network are a one-channel height map and a semantic segmentation result with the same number of channels as the number of classes. In the encoding process, the image is downsampled multiple times to obtain feature maps with sizes of 1/4, 1/8, and 1/16 of the input size in turn. Then, several upsampling operations are performed in the two decoder branches to obtain the height map and the semantic segmentation result at the same resolution as the original input, respectively. Furthermore, the network adopts skip connections to preserve detailed information lost during the multiple downsampling operations in the encoder.
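For concreteness, the following is a minimal PyTorch sketch of this shared-encoder/two-decoder layout (without the FSFM, which is introduced in Section 3.2). The layer choices, channel sizes, and module name are illustrative assumptions rather than the exact SCE-Net configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

class SharedEncoderTwoDecoders(nn.Module):
    """Sketch: shared ResNet-50 encoder with a height head and a semantic head.

    Skip features at 1/4 resolution are concatenated with the upsampled 1/16
    features before the two task heads; all sizes here are assumptions.
    """
    def __init__(self, num_classes: int):
        super().__init__()
        r = resnet50(weights=None)
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)          # 1/4
        self.layer1, self.layer2, self.layer3 = r.layer1, r.layer2, r.layer3  # 1/4, 1/8, 1/16
        self.height_head = nn.Conv2d(1024 + 256, 1, kernel_size=3, padding=1)
        self.sem_head = nn.Conv2d(1024 + 256, num_classes, kernel_size=3, padding=1)

    def forward(self, x):
        h, w = x.shape[-2:]
        s4 = self.layer1(self.stem(x))        # skip features at 1/4 resolution
        s16 = self.layer3(self.layer2(s4))    # deepest shared features at 1/16
        up = F.interpolate(s16, size=s4.shape[-2:], mode='bilinear', align_corners=False)
        fused = torch.cat([up, s4], dim=1)    # skip connection
        height = F.interpolate(self.height_head(fused), size=(h, w), mode='bilinear', align_corners=False)
        sem = F.interpolate(self.sem_head(fused), size=(h, w), mode='bilinear', align_corners=False)
        return height, sem
```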
Considering that height estimation and semantic segmentation are closely related but do not have a one-to-one relation, this paper constructs a feature separation–fusion module (FSFM), which first separates the features extracted from the input image into task-aware features and then selectively fuses the features from the other branch to obtain consistent task-related features. In addition, two task-guided feature distance losses are designed based on deep metric learning to enhance the representations of the two task-aware features. The network is trained in an end-to-end manner by optimizing a multi-task objective function. The feature separation–fusion module (FSFM), the task-guided representation enhancement, and the multi-task objective function are explained in the following sections.
3.2. Feature Separation–Fusion Module
In the existing multi-task learning methods, the features of the height and semantic branches are usually fused by direct summation or concatenation. However, we believe it is desirable to select relevant and consistent features from the two tasks for handling each task. To this end, this work constructs a feature separation–fusion module (FSFM) based on an attention mechanism. The FSFM consists of two components: a task-aware feature separation module (TFSM) and two cross-task feature fusion modules (CFFMs). The TFSM separates the shared features extracted from the input image into height-aware features and semantic-aware features, and each CFFM selects the beneficial features from the other branch for fusion. The topologies of the TFSM and CFFM are illustrated in Figure 2.
Specifically, the TFSM module employs a symmetric structure that contains two branches with the same architecture but different weights. As seen from Figure 2a, the upper branch of the TFSM module is the height estimation branch that outputs height features (red features), while the lower branch is the semantic segmentation branch that outputs semantic features (cyan features). Taking the height branch as an example, the shared features are first downsampled by a global average pooling (GAP) layer, and feature integration is then performed by a fully connected (FC) layer. Next, an attention map in the channel dimension is obtained through a sigmoid function. After that, the shared features are weighted by this attention map and added to the original shared features to obtain the height-aware features. The height-aware features are then integrated through three consecutive convolutional blocks (CBR), each of which is composed of a convolutional layer, a batch normalization layer, and a rectified linear unit (ReLU). Similarly, the semantic branch obtains semantic-aware features through the same operations as the height branch.
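The channel-attention separation described above can be sketched in PyTorch as follows; the module name TFSMBranch and the exact convolution settings are assumptions for illustration only, not taken from the paper's code.

```python
import torch
import torch.nn as nn

class TFSMBranch(nn.Module):
    """One branch of the task-aware feature separation module (sketch).

    Channel attention (GAP -> FC -> sigmoid) re-weights the shared features,
    the result is added back to the shared features, and three CBR blocks
    (Conv -> BatchNorm -> ReLU) integrate the task-aware features.
    """
    def __init__(self, channels: int):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)        # global average pooling
        self.fc = nn.Linear(channels, channels)   # feature integration
        self.cbr = nn.Sequential(*[
            nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3, padding=1),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            ) for _ in range(3)
        ])

    def forward(self, shared: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = shared.shape
        attn = torch.sigmoid(self.fc(self.gap(shared).view(b, c)))  # channel attention
        task_aware = shared * attn.view(b, c, 1, 1) + shared        # weight and add
        return self.cbr(task_aware)
```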
For the obtained task-aware features, the CFFM module fuses the features of one branch with the beneficial features of the other branch. In the CFFM module, the height-aware features and the semantic-aware features are first concatenated and passed through a convolutional layer, and then split into a height path and a semantic path. For the height branch, the features are fed into a convolutional layer and a sigmoid function to obtain an attention map, which is used for feature selection of the height-aware features in the spatial dimension. The height-aware features are multiplied by this attention map and added to the features of the semantic path to achieve cross-task feature fusion. After that, the size of the obtained features is increased by a factor of 2 using an upsampling operation. The semantic branch is similar to the height branch; the difference is that the attention map is used to weight the semantic-aware features, which are then added to the features of the height path to complete the cross-task feature fusion.
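A corresponding sketch of the cross-task fusion for the height branch is given below; the module name CFFM, the layer settings, and the interpretation of the split paths are assumptions based on the description above, not a definitive implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CFFM(nn.Module):
    """Cross-task feature fusion module (sketch of the height-branch variant).

    The two task-aware feature maps are concatenated and mixed by a convolution,
    then split back into a height path and a semantic path. A spatial attention
    map derived from the height path selects height-aware features, which are
    added to the semantic-path features and upsampled by a factor of 2.
    """
    def __init__(self, channels: int):
        super().__init__()
        self.mix = nn.Conv2d(2 * channels, 2 * channels, kernel_size=3, padding=1)
        self.attn_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, height_feat: torch.Tensor, sem_feat: torch.Tensor) -> torch.Tensor:
        mixed = self.mix(torch.cat([height_feat, sem_feat], dim=1))
        h_path, s_path = torch.chunk(mixed, 2, dim=1)     # split into two paths
        attn = torch.sigmoid(self.attn_conv(h_path))      # spatial attention map
        fused = height_feat * attn + s_path               # cross-task fusion
        return F.interpolate(fused, scale_factor=2, mode='bilinear', align_corners=False)
```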
The feature separation–fusion module can effectively aggregate the relevant features between height estimation and semantic segmentation. By using semantic features to constrain the representation of height features, the phenomenon of heights spreading across different classes is reduced.
3.3. Task-Guided Representation Enhancement
Based on the above feature separation–fusion, a novel task-guided representation enhancement method is designed to refine the height-aware features and the semantic-aware features. Considering the local geometric relationship of the scene, the height-aware features of objects with the same height should be similar, whereas the height-aware features of objects with large height differences should be significantly different. Similarly, the semantic-aware features within the same class should be as similar as possible, and the semantic-aware features across different classes should be largely different. Therefore, two task-guided feature distance losses are designed based on the deep metric learning method, including the height-guided feature distance loss and the semantic-guided feature distance loss, to accomplish the representation enhancement of height features and semantic features.
3.3.1. Height-Guided Feature Distance Loss
The wide range of height values usually leads to slow convergence or sub-optimal solutions when regressing pixel-level height from single images. Moreover, neighboring pixels usually have close height values, and the corresponding height features are similar. To facilitate the representation enhancement of the height-aware features, the entire height range is first discretized into multiple intervals. Then, features within the same height interval are constrained to be similar, while features from different height intervals are constrained to be distinct.
For general remote sensing images, most pixels have small height values, and only a few pixels have large height values. However, predictions for these large height values are often subject to large uncertainties. To avoid overfocusing on such pixels with large heights, the spacing-increasing discretization method in [58] is employed to uniformly discretize the height range in the log space. The height interval thresholds are computed as

$$ t_k = \exp\left(\log \alpha + \frac{k}{K} \log\frac{\beta}{\alpha}\right), \quad k = 0, 1, \dots, K, $$

where $\alpha$ and $\beta$ are the lower and upper bounds of the whole height range, $t_k$ are the discrete thresholds, and $K$ is the number of height intervals.
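As an illustration, a minimal NumPy implementation of this spacing-increasing discretization might look as follows; the lower-bound value in the example is an assumption, since heights of 0 m cannot be placed directly in log space.

```python
import numpy as np

def sid_thresholds(alpha: float, beta: float, K: int) -> np.ndarray:
    """Spacing-increasing discretization thresholds t_0..t_K in log space.

    alpha/beta are the lower/upper bounds of the height range and K is the
    number of intervals. alpha must be positive; in practice a small offset
    (an assumption here) can be used to handle heights of 0 m.
    """
    k = np.arange(K + 1)
    return np.exp(np.log(alpha) + k / K * np.log(beta / alpha))

def height_to_interval(h: np.ndarray, thresholds: np.ndarray) -> np.ndarray:
    """Map each height value to the index of the interval that contains it."""
    return np.clip(np.searchsorted(thresholds, h, side='right') - 1,
                   0, len(thresholds) - 2)

# Example: 30 intervals over an (assumed) 0.1-25.5 m range
t = sid_thresholds(0.1, 25.5, 30)
idx = height_to_interval(np.array([0.2, 3.0, 24.0]), t)
```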
The local geometric consistency of the image means that pixels within a small adjacent region usually have similar height values. Therefore, local patches are first cropped from the whole image in a left-to-right, top-to-bottom manner. For each local patch, pixels are divided into three groups, namely, the anchor pixel, positive pixels, and negative pixels. The central pixel of the local patch is regarded as the anchor pixel, pixels in the same height interval as the anchor pixel are positive pixels, and pixels in different height intervals from the anchor pixel are negative pixels. Correspondingly, the feature distance between the positive pixels and the anchor pixel is defined as $d_{\mathrm{pos}}^{h}$, and the feature distance between the negative pixels and the anchor pixel is defined as $d_{\mathrm{neg}}^{h}$; the formulas are as follows:

$$ d_{\mathrm{pos}}^{h} = \frac{1}{|\mathcal{P}_i|} \sum_{j \in \mathcal{P}_i} \left\| \hat{F}^{h}(i) - \hat{F}^{h}(j) \right\|_2, \qquad d_{\mathrm{neg}}^{h} = \frac{1}{|\mathcal{N}_i|} \sum_{j \in \mathcal{N}_i} \left\| \hat{F}^{h}(i) - \hat{F}^{h}(j) \right\|_2, $$

where $i$ represents the location of the anchor pixel, $|\cdot|$ represents the number of elements in the set, $\mathcal{P}_i$ is the set of positive pixels, $\mathcal{N}_i$ is the set of negative pixels, and $\hat{F}^{h}$ is the normalized height feature.
To make the features in the same height interval more similar and the features in different height intervals more distant, the feature distance between positive pixels and the anchor pixel should be reduced, while the feature distance between negative pixels and the anchor pixel should be increased. For this purpose, this work adopts the triplet loss [59,60,61] in deep metric learning, as follows:

$$ \ell_i = \max\left( d_{\mathrm{pos}}^{h} - d_{\mathrm{neg}}^{h} + m_h, \; 0 \right), $$

where the $\max(\cdot, 0)$ operation indicates that when the feature distance between negative pixels and the anchor pixel is larger than the distance between positive pixels and the anchor pixel by a threshold $m_h$, the loss term is no longer optimized.
To reduce the influence of noise, this work sets a condition for this loss: the loss term is calculated only when the number of positive pixels and the number of negative pixels are both greater than a threshold $\tau$. Therefore, the height-guided feature distance loss is defined as

$$ L_{hd} = \frac{1}{|\mathcal{A}_h|} \sum_{i \in \mathcal{A}_h} \max\left( d_{\mathrm{pos}}^{h} - d_{\mathrm{neg}}^{h} + m_h, \; 0 \right), $$

where $\mathcal{A}_h$ is the set of anchor pixels whose local patches satisfy the above condition.
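A simplified, single-patch sketch of this loss is given below; the margin and pixel-count threshold names (margin, min_count) and their values are illustrative placeholders, not the settings used in the experiments.

```python
import torch
import torch.nn.functional as F

def height_guided_distance_loss(feat_patch: torch.Tensor,
                                interval_patch: torch.Tensor,
                                margin: float = 0.3,
                                min_count: int = 4) -> torch.Tensor:
    """Triplet-style feature distance loss for one local patch (sketch).

    feat_patch: (C, P, P) height-aware features; interval_patch: (P, P)
    discretized height intervals. The patch center is the anchor; pixels in
    the same interval are positives, the rest are negatives.
    """
    C, P, _ = feat_patch.shape
    feat = F.normalize(feat_patch.reshape(C, -1), dim=0)   # L2-normalized features
    labels = interval_patch.reshape(-1)
    center = (P * P) // 2                                   # anchor = patch center
    anchor = feat[:, center:center + 1]

    pos = labels == labels[center]
    pos[center] = False                                     # exclude the anchor itself
    neg = labels != labels[center]
    if pos.sum() <= min_count or neg.sum() <= min_count:
        return feat_patch.new_zeros(())                     # condition not met

    dist = (feat - anchor).norm(dim=0)                      # distance to the anchor
    d_pos = dist[pos].mean()
    d_neg = dist[neg].mean()
    return torch.clamp(d_pos - d_neg + margin, min=0.0)
```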
3.3.2. Semantic-Guided Feature Distance Loss
In the same spirit, a semantic-guided feature distance loss is designed to refine the semantic features. Specifically, the center pixel of the local image patch is taken as the anchor pixel, pixels of the same class as the anchor pixel are taken as positive pixels, and pixels of different classes from the anchor pixel are taken as negative pixels. Intuitively, if the number of negative pixels is 0, all pixels in the patch belong to the same class; when the numbers of positive pixels and negative pixels are both greater than 0, the image patch contains objects from different classes.
The feature distance between positive pixels and the anchor pixel, $d_{\mathrm{pos}}^{s}$, and the feature distance between negative pixels and the anchor pixel, $d_{\mathrm{neg}}^{s}$, are defined as

$$ d_{\mathrm{pos}}^{s} = \frac{1}{|\mathcal{P}_i|} \sum_{j \in \mathcal{P}_i} \left\| \hat{F}^{s}(i) - \hat{F}^{s}(j) \right\|_2, \qquad d_{\mathrm{neg}}^{s} = \frac{1}{|\mathcal{N}_i|} \sum_{j \in \mathcal{N}_i} \left\| \hat{F}^{s}(i) - \hat{F}^{s}(j) \right\|_2, $$

where $\hat{F}^{s}$ is the normalized semantic feature. The corresponding semantic-guided feature distance loss is as follows:

$$ L_{sd} = \frac{1}{|\mathcal{A}_s|} \sum_{i \in \mathcal{A}_s} \max\left( d_{\mathrm{pos}}^{s} - d_{\mathrm{neg}}^{s} + m_s, \; 0 \right), $$

where $m_s$ is the feature distance threshold for semantic features and $\mathcal{A}_s$ is the set of valid anchor pixels.
3.4. Multi-Task Objective Function
In addition to the aforementioned height-guided feature distance loss and semantic-guided feature distance loss, the height ground truth and semantic labels are used as the supervision information for network training.
For the height estimation, following [55,62], this work adopts the L1 loss as the height loss term:

$$ L_{h} = \frac{1}{N} \sum_{i=1}^{N} \left| h_i - \hat{h}_i \right|, $$

where $h$ denotes the height ground truth, $\hat{h}$ denotes the predicted height value, $i$ is the pixel index in the image, and $N$ is the total number of valid pixels.
For the semantic segmentation, the multi-class cross-entropy loss is employed as the semantic loss term as follows:

$$ L_{s} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log\left( p_{i,c} \right), $$

where $y_{i,c}$ is 1 when the true class of pixel $i$ is $c$, and 0 otherwise; $p_{i,c}$ is the predicted probability that pixel $i$ belongs to class $c$; and $C$ is the number of semantic classes.
Finally, the overall multi-task objective function is formulated as follows:

$$ L = \lambda_1 L_{h} + \lambda_2 L_{s} + \lambda_3 L_{hd} + \lambda_4 L_{sd}, $$

where $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$ are the weights of each loss term, respectively.
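A compact sketch of how the four terms might be combined in PyTorch is shown below; the weight values in lambdas are placeholders, not the weights used in this work, and the feature distance losses are assumed to be computed beforehand.

```python
import torch
import torch.nn.functional as F

def multi_task_loss(pred_height, gt_height, pred_logits, gt_labels,
                    loss_hd, loss_sd,
                    lambdas=(1.0, 1.0, 0.1, 0.1)):
    """Weighted sum of the four loss terms (sketch; weights are placeholders).

    pred_height/gt_height: (B, 1, H, W); pred_logits: (B, C, H, W);
    gt_labels: (B, H, W) class indices; loss_hd/loss_sd are the pre-computed
    height- and semantic-guided feature distance losses.
    """
    l_height = F.l1_loss(pred_height, gt_height)      # L1 height loss
    l_sem = F.cross_entropy(pred_logits, gt_labels)   # multi-class cross-entropy loss
    w1, w2, w3, w4 = lambdas
    return w1 * l_height + w2 * l_sem + w3 * loss_hd + w4 * loss_sd
```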
3.5. Datasets
To verify the effectiveness of the proposed SCE-Net, extensive experiments are performed on two public datasets, namely, the Vaihingen dataset and the Potsdam dataset, provided by the ISPRS Working Group II/4. In the experiments, the normalized digital surface models (nDSMs) from [63] are used as the height ground truth.
Vaihingen: It consists of 33 tiles of different sizes; each tile contains the true orthophoto (TOP) and the corresponding nDSM. The ground sampling distance of the TOP is 9 cm. The TOP contains near-infrared, red, and green bands (IRRG), while the nDSM has one band. According to the official dataset partition, this work uses 16 tiles to construct the training set and the remaining 17 tiles to form the testing set.
Potsdam: It contains 38 tiles of the same size, including the true orthophoto (TOP) and the corresponding nDSM. The ground sampling distance of this dataset is 5 cm. The training and testing images in the experiments contain three bands: red, green, and blue (RGB). According to the official dataset partition, 24 tiles are used for training, and the remaining 14 tiles are used for testing.
Samples from the Vaihingen and Potsdam datasets are shown in Figure 3. Due to the large size of the original tiles, small patches are randomly cropped from the raw tiles as the input images for training and testing in the experiments. When comparing with other methods, the predictions of the image patches are stitched together and the results on the whole tiles are quantitatively evaluated.
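A minimal sketch of such patch-wise prediction and stitching is given below; the patch size, stride, and averaging of overlaps are assumptions, since the paper only states that patch predictions are stitched for whole-tile evaluation.

```python
import numpy as np

def predict_tile(tile: np.ndarray, model, patch: int = 256, stride: int = 256):
    """Predict a full tile by stitching patch-wise predictions (sketch).

    tile: (H, W, 3) image; model(patch_img) is assumed to return a (patch, patch)
    height map as a NumPy array.
    """
    H, W, _ = tile.shape
    out = np.zeros((H, W), dtype=np.float32)
    count = np.zeros((H, W), dtype=np.float32)
    for y in range(0, H - patch + 1, stride):
        for x in range(0, W - patch + 1, stride):
            out[y:y + patch, x:x + patch] += model(tile[y:y + patch, x:x + patch])
            count[y:y + patch, x:x + patch] += 1
    return out / np.maximum(count, 1)   # average overlapping predictions
```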
3.6. Evaluation Indicators
In this paper, following [16,57], six indicators are used to evaluate the performance of height estimation: absolute relative error (absRel), mean absolute error (MAE), root mean square error (RMSE), and accuracy with thresholds ($\delta_1$, $\delta_2$, $\delta_3$). The specific formulas are as follows:

$$ \mathrm{absRel} = \frac{1}{N} \sum_{i=1}^{N} \frac{\left| h_i - \hat{h}_i \right|}{h_i}, \qquad \mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| h_i - \hat{h}_i \right|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( h_i - \hat{h}_i \right)^2}, $$

$$ \delta_j: \ \text{the percentage of pixels satisfying} \ \max\left( \frac{h_i}{\hat{h}_i}, \frac{\hat{h}_i}{h_i} \right) < 1.25^{\,j}, \quad j = 1, 2, 3, $$

where $N$ represents the total number of pixels in the image, $i$ denotes the pixel index in the image, $h$ is the height ground truth, and $\hat{h}$ is the predicted height value.
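For reference, these indicators can be computed as in the following sketch; the validity mask and the eps guard are implementation assumptions rather than details from the paper.

```python
import numpy as np

def height_metrics(gt: np.ndarray, pred: np.ndarray, eps: float = 1e-6):
    """absRel, MAE, RMSE, and delta accuracies over valid pixels (sketch).

    gt/pred are flattened height arrays; eps guards against division by zero
    for near-ground pixels (an implementation assumption).
    """
    valid = gt > eps
    gt, pred = gt[valid], pred[valid]
    abs_rel = np.mean(np.abs(gt - pred) / gt)
    mae = np.mean(np.abs(gt - pred))
    rmse = np.sqrt(np.mean((gt - pred) ** 2))
    ratio = np.maximum(gt / np.maximum(pred, eps), pred / gt)
    deltas = [np.mean(ratio < 1.25 ** j) for j in (1, 2, 3)]
    return abs_rel, mae, rmse, deltas
```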
Referring to [8,25], five evaluation indicators for semantic segmentation are adopted: overall pixel accuracy (OA), the accuracy of the overall semantic segmentation; per-class pixel accuracy (AA), the average segmentation accuracy over the different classes; mean intersection over union (mIoU), the intersection over union between the ground truth and the predicted semantic labels; mean F1 score (mF1), the harmonic mean of precision and recall; and the kappa coefficient (Kappa), a coefficient for measuring segmentation accuracy.
3.7. Implementation Details
The proposed SCE-Net is implemented with the PyTorch framework on a single Tesla V100 GPU with 32 GB of memory. The network uses ResNet-50 or ResNet-101 pretrained on ImageNet as the backbone to extract shared features from the input image. During training, the input of the network is an image patch randomly cropped from the original tiles, and the predicted height map and semantic segmentation result have the same size as the input. The batch size is 4, and the total number of epochs is 50 for the network with ResNet-50 and 80 for the network with ResNet-101. The learning rate is decreased from its initial value using polynomial decay with power 0.9. Adam is adopted as the optimizer with β1 = 0.5 and β2 = 0.999. To prevent overfitting, three data augmentation methods are applied with probability 0.5: horizontal flipping, vertical flipping, and rotation by an angle in [−1.25, 1.25] degrees.
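The optimizer and learning-rate schedule described above can be set up roughly as follows; the model placeholder, initial learning rate, and iteration count in this sketch are assumptions, since the exact values are not reproduced here.

```python
import torch

# Adam with beta1 = 0.5, beta2 = 0.999 and polynomial learning-rate decay (power 0.9).
model = torch.nn.Conv2d(3, 1, 3)   # stands in for the SCE-Net
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.5, 0.999))
total_iters = 10000                 # placeholder for the total number of updates
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda it: max(0.0, 1 - it / total_iters) ** 0.9)  # polynomial decay
```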
5. Discussion
In this paper, the proposed SCE-Net adopts a multi-task learning framework and achieves satisfactory performance on both height estimation and semantic segmentation. Although there is a correlation between class semantics and height, evident differences also exist; clearly, objects from the same class can have substantial height variations. With this consideration in mind, this work proposes a feature separation–fusion module (FSFM) that selectively fuses height features and semantic features to prevent inconsistent features across tasks from negatively impacting the predictive ability of the network.
Some examples of the predicted height maps and error maps for the baseline network and the network with the FSFM module are shown in Figure 9. Compared with the baseline network, the results of the network with the FSFM module have smaller errors, which demonstrates that the FSFM module can effectively select features from the related task. Furthermore, in the Vaihingen dataset, most buildings have uneven roofs and trees show large height differences; this is a case of geometric inconsistency, in which objects of the same class have different heights. Therefore, this section also shows the attention maps learned by the network for selecting semantic features and fusing them with height features. The attention maps indicate that less attention is assigned to regions with the same semantics but large height variations, which shows that the FSFM module can effectively fuse the relevant features between the two tasks.
In addition, the FSFM module is compared with two other feature fusion methods, and the experimental results are shown in Table 7. The three methods in Table 7 are (a) the height and semantic features fused by direct summation (B+Sum); (b) the height and semantic features fused by direct concatenation (B+Cat); and (c) the proposed FSFM module (B+FSFM). As seen from the table, the FSFM module outperforms the other two feature fusion methods, which shows that the FSFM module can more effectively utilize the features of the related tasks and improve the prediction performance of the model.
The FSFM module extracts task-aware features from shared features and integrates features from related task branches based on an attention mechanism. On this basis, a task-guided representation enhancement method is employed to refine the task-aware features. In this method, interval discretization is first performed for the height range, and then a height-guided feature distance loss is designed for the height intervals and a semantic-guided feature distance loss is designed for the semantic classes.
Here, the influence of the number of height intervals on height estimation and semantic segmentation is assessed. The experimental results of discretizing the height range into different numbers of intervals (10, 20, 30, 40, 50, and 60) are reported in Table 8. It can be seen that, as the number of height intervals increases, the height estimation performance gradually improves. The results are best when the number of height intervals is 30 and then gradually degrade as the number of intervals increases further. This is because the height range of the Vaihingen and Potsdam datasets is 0–25.5 m. If the number of height intervals is too small, the same height interval contains a wide range of height values, so the height-guided feature distance loss introduces large errors into the consistency constraint on features in the same interval. If the number of height intervals is too large, the height features approach pixel-level features, resulting in inaccurate height prediction. Therefore, in future work, the number of height intervals could be adjusted adaptively according to the approximate height range of the dataset. It is worth noting that, when the number of height intervals changes, the semantic segmentation results remain basically unchanged. This indicates that the improvement in height estimation does not come at the expense of semantic segmentation performance and that height discretization with a proper number of intervals is necessary.
Since the task-guided representation enhancement method is performed on local patches, the impact of the local patch size on height estimation and semantic segmentation is also assessed here. Because the height variations of a scene are usually more pronounced than its semantic class variations, this work chooses different local patch sizes for the height branch and the semantic branch. For the height branch, three patch sizes are evaluated, and the experimental results are shown in Table 9. It can be seen that the intermediate patch size yields the best height estimation results. The experimental results show that the image patches should be neither too small nor too large for height estimation: small patches contain few pixels, whose height values may all fall into the same height interval, making the height-guided feature distance loss less effective, whereas overly large patches may contain pixels with substantial height differences, which blurs the feature representation of different height intervals. For the semantic branch, three patch sizes are likewise evaluated, and the experimental results are shown in Table 10. Again, the intermediate patch size performs best. Similar to the height features, pixels in a small image patch are more likely to belong to the same class, while a large image patch may contain different object classes, making it difficult to obtain an optimal feature representation. In the experiments, the proposed method therefore adopts the best-performing patch size for each branch.
Furthermore, the computational time of the SCE-Net on the Vaihingen dataset and the Potsdam dataset is analyzed, as shown in Table 11. As in Section 4.2, B represents the baseline network, FSFM represents the feature separation–fusion module, and TRE represents the task-guided representation enhancement method. For the Vaihingen dataset with a ResNet-50 backbone, the average inference time of the baseline network per image patch is 0.034 s, the total time for the testing dataset is 13.796 s, and the average inference time for each original tile is 0.811 s. When the FSFM module is added, the average inference time per image patch is 0.035 s, the total time for the testing images is 14.219 s, and each original tile takes 0.836 s on average. When TRE is further added, the inference time does not increase, because the feature representation enhancement of the height and semantic features is not involved in the testing stage. With a ResNet-101 backbone, the average inference time per image patch is 0.039 s, the total inference time for the testing images is 15.761 s, and the average inference time per original tile is 0.927 s. For the Potsdam dataset with a ResNet-50 backbone, the average inference time per image patch is 0.033 s, the total time is 68.205 s, and each original tile takes about 4.871 s. With a ResNet-101 backbone, the inference time per image patch is about 0.037 s, the total inference time for the testing images is 76.140 s, and the average inference time per original tile is 5.438 s, since ResNet-101 is more complex than ResNet-50. The experimental results demonstrate that, compared with the baseline network, the proposed SCE-Net effectively improves the height estimation and semantic segmentation performance with little increase in computational time.