1. Introduction
In recent years, with the rapid development of aerospace technology, remote sensing imagery analysis for high-resolution images acquired by aerial or satellite sensors has received extensive attention. Learning height information from single aerial images, as one of the important tasks in remote sensing imagery analysis, provides geometric information for the 3D reconstruction of ground scenes and is widely used in a variety of applications, such as urban planning [1], change detection [2], and disaster monitoring [3]. Recently, thriving deep learning technology has made tremendous progress in the photogrammetry and remote sensing communities [4,5,6,7]. Similarly, height estimation from single aerial images mainly adopts deep-learning-based methods, including methods based on convolutional neural networks (CNNs), methods based on generative adversarial networks (GANs), and methods based on multi-task learning (MTL).
Compared with other images, remote sensing images have more complex spectral characteristics, and objects with different heights may have similar appearances due to similar materials, such as building roofs and roads. When a deep neural network extracts features from a single image, it may therefore learn mismatched height–appearance relationships, resulting in inaccurate height estimation. Generally, there is a geometric correlation between the height information and the semantic information of a remote sensing scene. Compared with treating height estimation and semantic segmentation as two independent tasks, multi-task learning methods can exploit the height features and semantic features extracted from the image to achieve information complementarity and leverage multi-source supervision to improve predictive performance. Therefore, this paper performs height estimation and semantic segmentation from single aerial images simultaneously in a unified framework.
Several recent works have shown that height estimation and semantic segmentation can benefit from each other, mainly based on the implicit assumption that changes in height generally correspond to changes in class [8,9]. However, although height cues and semantic cues are related, they are not completely consistent. For example, objects within the same class may have different heights, while objects with the same height may belong to different classes. Therefore, straightforward fusion (summation or concatenation) of height features and semantic features allows inconsistent features to negatively impact the shared features, leading to less accurate predictions. In addition, estimating height from single images is generally regarded as a pixel-level height regression task. However, the wide range of height values makes it challenging to regress an accurate height value directly. Under the direct regression paradigm, existing methods generally suffer from slow convergence or sub-optimal solutions.
In this paper, a self- and cross-enhancement network (SCE-Net) is proposed to jointly learn height information and semantic labels from single aerial images under the framework of multi-task learning. Specifically, the SCE-Net first exploits the backbone network to extract shared features for the two tasks from the input image. Then, a feature separation–fusion module (FSFM) is designed to separate task-aware features from the shared features and to fuse cross-task features based on an attention mechanism, achieving cross-enhancement of the task-related feature representations. In addition, to address the difficulty of regressing over a wide height range, the height range is discretized into several intervals, and a height-guided feature distance loss and a semantic-guided feature distance loss are designed to accomplish self-enhancement of the feature representations based on deep metric learning. To verify the effectiveness of the proposed method, extensive experiments are conducted on two public datasets, namely, the Vaihingen dataset and the Potsdam dataset. Experimental results demonstrate that the proposed method outperforms recent state-of-the-art height estimation methods and achieves performance comparable to that of the compared semantic segmentation methods.
The main contributions include the following:
A multi-task learning network, called self- and cross-enhancement network (SCE-Net), is proposed to simultaneously perform height estimation and semantic segmentation from single aerial images under a unified framework.
To effectively integrate the height and semantic cues of the scene, a feature separation–fusion module (FSFM) is constructed to separate the shared image features into task-aware features, and selectively fuse the cross-task features based on an attention mechanism.
A height-guided feature distance loss and a semantic-guided feature distance loss are designed to achieve task-guided representation enhancement using the deep metric learning method.
The paper is organized as follows: a brief review of related works, including height estimation, semantic segmentation, and multi-task learning, is given in Section 2. Section 3 introduces the proposed self- and cross-enhancement network (SCE-Net), the datasets, evaluation indicators, and implementation details. Extensive experimental results and evaluations are reported in Section 4. In Section 5, the effectiveness of each component proposed in this work is analyzed and discussed. Finally, Section 6 concludes this work.
3. Materials and Methods
In this section, an overview of the proposed self- and cross-enhancement network (SCE-Net) is first given. Then, the feature separation–fusion module (FSFM) and the task-guided representation enhancement are introduced, including the height-guided feature distance loss and the semantic-guided feature distance loss. After that, the multi-task objective function, datasets, evaluation indicators, and implementation details are described. The notation for the important symbols used in this work is shown in Table 1.
3.1. Overview
In this paper, pixel-level height maps and semantic labels are simultaneously predicted from single aerial images under the multi-task learning framework. The proposed SCE-Net employs an encoder–decoder architecture, and the whole network consists of three parts: a backbone for feature extraction, a feature separation–fusion module, and a multi-task predictor. Unlike single-task learning methods, the SCE-Net includes a shared encoder and two task-related decoders. The overall network architecture is shown in Figure 1.
Concretely, the network adopts ResNet-50 or ResNet-101 as the backbone. For a three-channel input image, the outputs of the network are a one-channel height map and a semantic segmentation result with the same number of channels as the number of classes. In the encoding process, the image is downsampled multiple times to obtain feature maps with sizes of 1/4, 1/8, and 1/16 of the input size in turn. Then, several upsampling operations are performed in the two decoder branches to obtain the height map and the semantic segmentation result at the same resolution as the original input, respectively. Furthermore, the network adopts skip connections to preserve detailed information lost during the multiple downsampling operations in the encoder.
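For concreteness, the following is a minimal PyTorch sketch of this shared-encoder/two-decoder layout (without the FSFM, which is introduced in Section 3.2). The layer choices, channel sizes, and module name are illustrative assumptions rather than the exact SCE-Net configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

class SharedEncoderTwoDecoders(nn.Module):
    """Sketch: shared ResNet-50 encoder with a height head and a semantic head.

    Skip features at 1/4 resolution are concatenated with the upsampled 1/16
    features before the two task heads; all sizes here are assumptions.
    """
    def __init__(self, num_classes: int):
        super().__init__()
        r = resnet50(weights=None)
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)          # 1/4
        self.layer1, self.layer2, self.layer3 = r.layer1, r.layer2, r.layer3  # 1/4, 1/8, 1/16
        self.height_head = nn.Conv2d(1024 + 256, 1, kernel_size=3, padding=1)
        self.sem_head = nn.Conv2d(1024 + 256, num_classes, kernel_size=3, padding=1)

    def forward(self, x):
        h, w = x.shape[-2:]
        s4 = self.layer1(self.stem(x))        # skip features at 1/4 resolution
        s16 = self.layer3(self.layer2(s4))    # deepest shared features at 1/16
        up = F.interpolate(s16, size=s4.shape[-2:], mode='bilinear', align_corners=False)
        fused = torch.cat([up, s4], dim=1)    # skip connection
        height = F.interpolate(self.height_head(fused), size=(h, w), mode='bilinear', align_corners=False)
        sem = F.interpolate(self.sem_head(fused), size=(h, w), mode='bilinear', align_corners=False)
        return height, sem
```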
Considering that height estimation and semantic segmentation are closely related but do not have a one-to-one relation, this paper constructs a feature separation–fusion module (FSFM), which first separates the features extracted from the input image into task-aware features and then selectively fuses the features from the other branch to obtain consistent task-related features. In addition, two task-guided feature distance losses are designed based on deep metric learning to enhance the representations of the two task-aware features. The network is trained in an end-to-end manner by optimizing a multi-task objective function. The feature separation–fusion module (FSFM), the task-guided representation enhancement, and the multi-task objective function are explained in the following sections.
3.2. Feature Separation–Fusion Module
In the existing multi-task learning methods, the features of the height and semantic branches are usually fused by direct summation or concatenation. However, we believe it is desirable to select relevant and consistent features from the two tasks for handling each task. To this end, this work constructs a feature separation–fusion module (FSFM) based on an attention mechanism. The FSFM consists of two components: a task-aware feature separation module (TFSM) and two cross-task feature fusion modules (CFFMs). The TFSM separates the shared features extracted from the input image into height-aware features and semantic-aware features, and each CFFM selects the beneficial features from the other branch for fusion. The topologies of the TFSM and CFFM are illustrated in Figure 2.
Specifically, the TFSM module employs a symmetric structure that contains two branches with the same architecture but different weights. As seen from Figure 2a, the upper branch of the TFSM module is the height estimation branch that outputs height features (red features), while the lower branch is the semantic segmentation branch that outputs semantic features (cyan features). Taking the height branch as an example, the shared features are first downsampled by a global average pooling (GAP) layer, and feature integration is then performed by a fully connected (FC) layer. Next, an attention map in the channel dimension is obtained through a sigmoid function. After that, the shared features are weighted by this attention map and added to the original shared features to obtain the height-aware features. The height-aware features are then integrated through three consecutive convolutional blocks (CBR), each of which is composed of a convolutional layer, a batch normalization layer, and a rectified linear unit (ReLU). Similarly, the semantic branch obtains semantic-aware features through the same operations as the height branch.
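The channel-attention separation described above can be sketched in PyTorch as follows; the module name TFSMBranch and the exact convolution settings are assumptions for illustration only, not taken from the paper's code.

```python
import torch
import torch.nn as nn

class TFSMBranch(nn.Module):
    """One branch of the task-aware feature separation module (sketch).

    Channel attention (GAP -> FC -> sigmoid) re-weights the shared features,
    the result is added back to the shared features, and three CBR blocks
    (Conv -> BatchNorm -> ReLU) integrate the task-aware features.
    """
    def __init__(self, channels: int):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)        # global average pooling
        self.fc = nn.Linear(channels, channels)   # feature integration
        self.cbr = nn.Sequential(*[
            nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3, padding=1),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            ) for _ in range(3)
        ])

    def forward(self, shared: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = shared.shape
        attn = torch.sigmoid(self.fc(self.gap(shared).view(b, c)))  # channel attention
        task_aware = shared * attn.view(b, c, 1, 1) + shared        # weight and add
        return self.cbr(task_aware)
```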
For the obtained task-aware features, the CFFM module fuses the features of one branch with the beneficial features of the other branch. In the CFFM module, the height-aware features and the semantic-aware features are first concatenated and passed through a convolutional layer, and then split into a height path and a semantic path. For the height branch, the features are fed into a convolutional layer and a sigmoid function to obtain an attention map, which is used for feature selection of the height-aware features in the spatial dimension. The height-aware features are multiplied by this attention map and added to the features of the semantic path to achieve cross-task feature fusion. After that, the size of the obtained features is increased by a factor of 2 using an upsampling operation. The semantic branch is similar to the height branch; the difference is that the attention map is used to weight the semantic-aware features, which are then added to the features of the height path to complete the cross-task feature fusion.
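A corresponding sketch of the cross-task fusion for the height branch is given below; the module name CFFM, the layer settings, and the interpretation of the split paths are assumptions based on the description above, not a definitive implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CFFM(nn.Module):
    """Cross-task feature fusion module (sketch of the height-branch variant).

    The two task-aware feature maps are concatenated and mixed by a convolution,
    then split back into a height path and a semantic path. A spatial attention
    map derived from the height path selects height-aware features, which are
    added to the semantic-path features and upsampled by a factor of 2.
    """
    def __init__(self, channels: int):
        super().__init__()
        self.mix = nn.Conv2d(2 * channels, 2 * channels, kernel_size=3, padding=1)
        self.attn_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, height_feat: torch.Tensor, sem_feat: torch.Tensor) -> torch.Tensor:
        mixed = self.mix(torch.cat([height_feat, sem_feat], dim=1))
        h_path, s_path = torch.chunk(mixed, 2, dim=1)     # split into two paths
        attn = torch.sigmoid(self.attn_conv(h_path))      # spatial attention map
        fused = height_feat * attn + s_path               # cross-task fusion
        return F.interpolate(fused, scale_factor=2, mode='bilinear', align_corners=False)
```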
The feature separation–fusion module can effectively aggregate the relevant features between height estimation and semantic segmentation. By using semantic features to constrain the representation of height features, the phenomenon of heights spreading across different classes is reduced.
3.3. Task-Guided Representation Enhancement
Based on the above feature separation–fusion, a novel task-guided representation enhancement method is designed to refine the height-aware features and the semantic-aware features. Considering the local geometric relationship of the scene, the height-aware features of objects with the same height should be similar, whereas the height-aware features of objects with large height differences should be significantly different. Similarly, the semantic-aware features within the same class should be as similar as possible, and the semantic-aware features across different classes should be largely different. Therefore, two task-guided feature distance losses are designed based on the deep metric learning method, including the height-guided feature distance loss and the semantic-guided feature distance loss, to accomplish the representation enhancement of height features and semantic features.
3.3.1. Height-Guided Feature Distance Loss
The wide range of height values usually leads to slow convergence or sub-optimal solutions when regressing pixel-level height from single images. Moreover, neighboring pixels usually have close height values, and the corresponding height features are similar. To facilitate the representation enhancement of the height-aware features, the entire height range is first discretized into multiple intervals. Then, features within the same height interval are constrained to be similar, while features from different height intervals are constrained to be distinct.
For general remote sensing images, most pixels have small height values, and only a few pixels have large height values. However, predictions for these large height values are often subject to large uncertainties. To avoid overfocusing on such pixels with large heights, the spacing-increasing discretization method in [58] is employed to uniformly discretize the height range in the log space. The height interval thresholds are computed as

$$ t_k = \exp\left(\log \alpha + \frac{k}{K} \log\frac{\beta}{\alpha}\right), \quad k = 0, 1, \dots, K, $$

where $\alpha$ and $\beta$ are the lower and upper bounds of the whole height range, $t_k$ are the discrete thresholds, and $K$ is the number of height intervals.
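As an illustration, a minimal NumPy implementation of this spacing-increasing discretization might look as follows; the lower-bound value in the example is an assumption, since heights of 0 m cannot be placed directly in log space.

```python
import numpy as np

def sid_thresholds(alpha: float, beta: float, K: int) -> np.ndarray:
    """Spacing-increasing discretization thresholds t_0..t_K in log space.

    alpha/beta are the lower/upper bounds of the height range and K is the
    number of intervals. alpha must be positive; in practice a small offset
    (an assumption here) can be used to handle heights of 0 m.
    """
    k = np.arange(K + 1)
    return np.exp(np.log(alpha) + k / K * np.log(beta / alpha))

def height_to_interval(h: np.ndarray, thresholds: np.ndarray) -> np.ndarray:
    """Map each height value to the index of the interval that contains it."""
    return np.clip(np.searchsorted(thresholds, h, side='right') - 1,
                   0, len(thresholds) - 2)

# Example: 30 intervals over an (assumed) 0.1-25.5 m range
t = sid_thresholds(0.1, 25.5, 30)
idx = height_to_interval(np.array([0.2, 3.0, 24.0]), t)
```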
The local geometric consistency of the image means that pixels within a small adjacent region usually have similar height values. Therefore, local patches are first cropped from the whole image in a left-to-right, top-to-bottom manner. For each local patch, pixels are divided into three groups, namely, the anchor pixel, positive pixels, and negative pixels. The central pixel of the local patch is regarded as the anchor pixel, pixels in the same height interval as the anchor pixel are positive pixels, and pixels in different height intervals from the anchor pixel are negative pixels. Correspondingly, the feature distance between the positive pixels and the anchor pixel is defined as $d_{\mathrm{pos}}^{h}$, and the feature distance between the negative pixels and the anchor pixel is defined as $d_{\mathrm{neg}}^{h}$; the formulas are as follows:

$$ d_{\mathrm{pos}}^{h} = \frac{1}{|\mathcal{P}_i|} \sum_{j \in \mathcal{P}_i} \left\| \hat{F}^{h}(i) - \hat{F}^{h}(j) \right\|_2, \qquad d_{\mathrm{neg}}^{h} = \frac{1}{|\mathcal{N}_i|} \sum_{j \in \mathcal{N}_i} \left\| \hat{F}^{h}(i) - \hat{F}^{h}(j) \right\|_2, $$

where $i$ represents the location of the anchor pixel, $|\cdot|$ represents the number of elements in the set, $\mathcal{P}_i$ is the set of positive pixels, $\mathcal{N}_i$ is the set of negative pixels, and $\hat{F}^{h}$ is the normalized height feature.
To make the features in the same height interval more similar and the features in different height intervals more distant, the feature distance between positive pixels and the anchor pixel should be reduced, while the feature distance between negative pixels and the anchor pixel should be increased. For this purpose, this work adopts the triplet loss [59,60,61] in deep metric learning, as follows:

$$ \ell_i = \max\left( d_{\mathrm{pos}}^{h} - d_{\mathrm{neg}}^{h} + m_h, \; 0 \right), $$

where the $\max(\cdot, 0)$ operation indicates that when the feature distance between negative pixels and the anchor pixel is larger than the distance between positive pixels and the anchor pixel by a threshold $m_h$, the loss term is no longer optimized.
To reduce the influence of noise, this work sets a condition for this loss: the loss term is calculated only when the number of positive pixels and the number of negative pixels are both greater than a threshold $\tau$. Therefore, the height-guided feature distance loss is defined as

$$ L_{hd} = \frac{1}{|\mathcal{A}_h|} \sum_{i \in \mathcal{A}_h} \max\left( d_{\mathrm{pos}}^{h} - d_{\mathrm{neg}}^{h} + m_h, \; 0 \right), $$

where $\mathcal{A}_h$ is the set of anchor pixels whose local patches satisfy the above condition.
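A simplified, single-patch sketch of this loss is given below; the margin and pixel-count threshold names (margin, min_count) and their values are illustrative placeholders, not the settings used in the experiments.

```python
import torch
import torch.nn.functional as F

def height_guided_distance_loss(feat_patch: torch.Tensor,
                                interval_patch: torch.Tensor,
                                margin: float = 0.3,
                                min_count: int = 4) -> torch.Tensor:
    """Triplet-style feature distance loss for one local patch (sketch).

    feat_patch: (C, P, P) height-aware features; interval_patch: (P, P)
    discretized height intervals. The patch center is the anchor; pixels in
    the same interval are positives, the rest are negatives.
    """
    C, P, _ = feat_patch.shape
    feat = F.normalize(feat_patch.reshape(C, -1), dim=0)   # L2-normalized features
    labels = interval_patch.reshape(-1)
    center = (P * P) // 2                                   # anchor = patch center
    anchor = feat[:, center:center + 1]

    pos = labels == labels[center]
    pos[center] = False                                     # exclude the anchor itself
    neg = labels != labels[center]
    if pos.sum() <= min_count or neg.sum() <= min_count:
        return feat_patch.new_zeros(())                     # condition not met

    dist = (feat - anchor).norm(dim=0)                      # distance to the anchor
    d_pos = dist[pos].mean()
    d_neg = dist[neg].mean()
    return torch.clamp(d_pos - d_neg + margin, min=0.0)
```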
3.3.2. Semantic-Guided Feature Distance Loss
In the same spirit, a semantic-guided feature distance loss is designed to refine the semantic features. Specifically, the center pixel of the local image patch is taken as the anchor pixel, pixels of the same class as the anchor pixel are taken as positive pixels, and pixels of different classes from the anchor pixel are taken as negative pixels. Intuitively, if the number of negative pixels is 0, all pixels in the patch belong to the same class; when the numbers of positive pixels and negative pixels are both greater than 0, the image patch contains objects from different classes.
The feature distance between positive pixels and the anchor pixel, $d_{\mathrm{pos}}^{s}$, and the feature distance between negative pixels and the anchor pixel, $d_{\mathrm{neg}}^{s}$, are defined as

$$ d_{\mathrm{pos}}^{s} = \frac{1}{|\mathcal{P}_i|} \sum_{j \in \mathcal{P}_i} \left\| \hat{F}^{s}(i) - \hat{F}^{s}(j) \right\|_2, \qquad d_{\mathrm{neg}}^{s} = \frac{1}{|\mathcal{N}_i|} \sum_{j \in \mathcal{N}_i} \left\| \hat{F}^{s}(i) - \hat{F}^{s}(j) \right\|_2, $$

where $\hat{F}^{s}$ is the normalized semantic feature. The corresponding semantic-guided feature distance loss is as follows:

$$ L_{sd} = \frac{1}{|\mathcal{A}_s|} \sum_{i \in \mathcal{A}_s} \max\left( d_{\mathrm{pos}}^{s} - d_{\mathrm{neg}}^{s} + m_s, \; 0 \right), $$

where $m_s$ is the feature distance threshold for semantic features and $\mathcal{A}_s$ is the set of valid anchor pixels.
3.4. Multi-Task Objective Function
In addition to the aforementioned height-guided feature distance loss and semantic-guided feature distance loss, the height ground truth and semantic labels are used as the supervision information for network training.
For the height estimation, following [55,62], this work adopts the L1 loss as the height loss term:

$$ L_{h} = \frac{1}{N} \sum_{i=1}^{N} \left| h_i - \hat{h}_i \right|, $$

where $h$ denotes the height ground truth, $\hat{h}$ denotes the predicted height value, $i$ is the pixel index in the image, and $N$ is the total number of valid pixels.
For the semantic segmentation, the multi-class cross-entropy loss is employed as the semantic loss term as follows:

$$ L_{s} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log\left( p_{i,c} \right), $$

where $y_{i,c}$ is 1 when the true class of pixel $i$ is $c$, and 0 otherwise; $p_{i,c}$ is the predicted probability that pixel $i$ belongs to class $c$; and $C$ is the number of semantic classes.
Finally, the overall multi-task objective function is formulated as follows:

$$ L = \lambda_1 L_{h} + \lambda_2 L_{s} + \lambda_3 L_{hd} + \lambda_4 L_{sd}, $$

where $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$ are the weights of each loss term, respectively.
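A compact sketch of how the four terms might be combined in PyTorch is shown below; the weight values in lambdas are placeholders, not the weights used in this work, and the feature distance losses are assumed to be computed beforehand.

```python
import torch
import torch.nn.functional as F

def multi_task_loss(pred_height, gt_height, pred_logits, gt_labels,
                    loss_hd, loss_sd,
                    lambdas=(1.0, 1.0, 0.1, 0.1)):
    """Weighted sum of the four loss terms (sketch; weights are placeholders).

    pred_height/gt_height: (B, 1, H, W); pred_logits: (B, C, H, W);
    gt_labels: (B, H, W) class indices; loss_hd/loss_sd are the pre-computed
    height- and semantic-guided feature distance losses.
    """
    l_height = F.l1_loss(pred_height, gt_height)      # L1 height loss
    l_sem = F.cross_entropy(pred_logits, gt_labels)   # multi-class cross-entropy loss
    w1, w2, w3, w4 = lambdas
    return w1 * l_height + w2 * l_sem + w3 * loss_hd + w4 * loss_sd
```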
3.5. Datasets
To verify the effectiveness of the proposed SCE-Net, extensive experiments are performed on two public datasets, namely, the Vaihingen dataset and the Potsdam dataset, provided by the ISPRS Working Group II/4. In the experiments, the normalized digital surface models (nDSMs) from [63] are used as the height ground truth.
Vaihingen: It consists of 33 tiles of different sizes; each tile contains the true orthophoto (TOP) and the corresponding nDSM. The ground sampling distance of the TOP is 9 cm. The TOP contains near-infrared, red, and green bands (IRRG), while the nDSM has one band. According to the official dataset partition, this work uses 16 tiles to construct the training set and the remaining 17 tiles to form the testing set.
Potsdam: It contains 38 tiles of the same size, including the true orthophoto (TOP) and the corresponding nDSM. The ground sampling distance of this dataset is 5 cm. The training and testing images in the experiments contain three bands: red, green, and blue (RGB). According to the official dataset partition, 24 tiles are used for training, and the remaining 14 tiles are used for testing.
Samples from the Vaihingen and Potsdam datasets are shown in Figure 3. Due to the large size of the original tiles, small patches are randomly cropped from the raw tiles as the input images for training and testing in the experiments. When comparing with other methods, the predictions of the image patches are stitched together and the results on the whole tiles are quantitatively evaluated.
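A minimal sketch of such patch-wise prediction and stitching is given below; the patch size, stride, and averaging of overlaps are assumptions, since the paper only states that patch predictions are stitched for whole-tile evaluation.

```python
import numpy as np

def predict_tile(tile: np.ndarray, model, patch: int = 256, stride: int = 256):
    """Predict a full tile by stitching patch-wise predictions (sketch).

    tile: (H, W, 3) image; model(patch_img) is assumed to return a (patch, patch)
    height map as a NumPy array.
    """
    H, W, _ = tile.shape
    out = np.zeros((H, W), dtype=np.float32)
    count = np.zeros((H, W), dtype=np.float32)
    for y in range(0, H - patch + 1, stride):
        for x in range(0, W - patch + 1, stride):
            out[y:y + patch, x:x + patch] += model(tile[y:y + patch, x:x + patch])
            count[y:y + patch, x:x + patch] += 1
    return out / np.maximum(count, 1)   # average overlapping predictions
```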
3.6. Evaluation Indicators
In this paper, following [16,57], six indicators are used to evaluate the performance of height estimation: absolute relative error (absRel), mean absolute error (MAE), root mean square error (RMSE), and accuracy with thresholds ($\delta_1$, $\delta_2$, $\delta_3$). The specific formulas are as follows:

$$ \mathrm{absRel} = \frac{1}{N} \sum_{i=1}^{N} \frac{\left| h_i - \hat{h}_i \right|}{h_i}, \qquad \mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| h_i - \hat{h}_i \right|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( h_i - \hat{h}_i \right)^2}, $$

$$ \delta_j: \ \text{the percentage of pixels satisfying} \ \max\left( \frac{h_i}{\hat{h}_i}, \frac{\hat{h}_i}{h_i} \right) < 1.25^{\,j}, \quad j = 1, 2, 3, $$

where $N$ represents the total number of pixels in the image, $i$ denotes the pixel index in the image, $h$ is the height ground truth, and $\hat{h}$ is the predicted height value.
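For reference, these indicators can be computed as in the following sketch; the validity mask and the eps guard are implementation assumptions rather than details from the paper.

```python
import numpy as np

def height_metrics(gt: np.ndarray, pred: np.ndarray, eps: float = 1e-6):
    """absRel, MAE, RMSE, and delta accuracies over valid pixels (sketch).

    gt/pred are flattened height arrays; eps guards against division by zero
    for near-ground pixels (an implementation assumption).
    """
    valid = gt > eps
    gt, pred = gt[valid], pred[valid]
    abs_rel = np.mean(np.abs(gt - pred) / gt)
    mae = np.mean(np.abs(gt - pred))
    rmse = np.sqrt(np.mean((gt - pred) ** 2))
    ratio = np.maximum(gt / np.maximum(pred, eps), pred / gt)
    deltas = [np.mean(ratio < 1.25 ** j) for j in (1, 2, 3)]
    return abs_rel, mae, rmse, deltas
```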
Referring to [8,25], five evaluation indicators for semantic segmentation are adopted: overall pixel accuracy (OA), the accuracy of the overall semantic segmentation; per-class pixel accuracy (AA), the average segmentation accuracy over the different classes; mean intersection over union (mIoU), the intersection over union between the ground truth and the predicted semantic labels; mean F1 score (mF1), the harmonic mean of precision and recall; and the kappa coefficient (Kappa), a coefficient for measuring segmentation accuracy.
3.7. Implementation Details
The proposed SCE-Net is implemented with the PyTorch framework on a single Tesla V100 GPU with 32 GB of memory. The network uses ResNet-50 or ResNet-101 pretrained on ImageNet as the backbone to extract shared features from the input image. During training, the input of the network is an image patch randomly cropped from the original tiles, and the predicted height map and semantic segmentation result have the same size as the input. The batch size is 4, and the total number of epochs is 50 for the network with ResNet-50 and 80 for the network with ResNet-101. The learning rate is decreased from its initial value using polynomial decay with power 0.9. Adam is adopted as the optimizer with β1 = 0.5 and β2 = 0.999. To prevent overfitting, three data augmentation methods are applied with probability 0.5: horizontal flipping, vertical flipping, and rotation by an angle in [−1.25, 1.25] degrees.
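The optimizer and learning-rate schedule described above can be set up roughly as follows; the model placeholder, initial learning rate, and iteration count in this sketch are assumptions, since the exact values are not reproduced here.

```python
import torch

# Adam with beta1 = 0.5, beta2 = 0.999 and polynomial learning-rate decay (power 0.9).
model = torch.nn.Conv2d(3, 1, 3)   # stands in for the SCE-Net
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.5, 0.999))
total_iters = 10000                 # placeholder for the total number of updates
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda it: max(0.0, 1 - it / total_iters) ** 0.9)  # polynomial decay
```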
5. Discussion
In this paper, the proposed SCE-Net adopts a multi-task learning framework and achieves satisfactory performance on both height estimation and semantic segmentation. Although there is a correlation between class semantics and height, evident differences also exist; clearly, objects from the same class can have substantial height variations. With this consideration in mind, this work proposes a feature separation–fusion module (FSFM) that selectively fuses height features and semantic features to prevent inconsistent features across tasks from negatively impacting the predictive ability of the network.
Some examples of the predicted height maps and error maps for the baseline network and the network with the FSFM module are shown in Figure 9. Compared with the baseline network, the results of the network with the FSFM module have smaller errors, which demonstrates that the FSFM module can effectively select features from the related task. Furthermore, in the Vaihingen dataset, most buildings have uneven roofs and trees show large height differences; this is a case of geometric inconsistency, in which objects of the same class have different heights. Therefore, this section also shows the attention maps learned by the network for selecting semantic features and fusing them with height features. The attention maps indicate that less attention is assigned to regions with the same semantics but large height variations, which shows that the FSFM module can effectively fuse the relevant features between the two tasks.
In addition, the FSFM module is compared with two other feature fusion methods, and the experimental results are shown in Table 7. The three methods in Table 7 are (a) the height and semantic features fused by direct summation (B+Sum); (b) the height and semantic features fused by direct concatenation (B+Cat); and (c) the proposed FSFM module (B+FSFM). As seen from the table, the FSFM module outperforms the other two feature fusion methods, which shows that the FSFM module can more effectively utilize the features of the related tasks and improve the prediction performance of the model.
The FSFM module extracts task-aware features from shared features and integrates features from related task branches based on an attention mechanism. On this basis, a task-guided representation enhancement method is employed to refine the task-aware features. In this method, interval discretization is first performed for the height range, and then a height-guided feature distance loss is designed for the height intervals and a semantic-guided feature distance loss is designed for the semantic classes.
Here, the influence of the number of height intervals on height estimation and semantic segmentation is assessed. The experimental results of discretizing the height range into different numbers of intervals (10, 20, 30, 40, 50, and 60) are reported in Table 8. It can be seen that, as the number of height intervals increases, the height estimation performance gradually improves. The results are best when the number of height intervals is 30 and then gradually degrade as the number of intervals increases further. This is because the height range of the Vaihingen and Potsdam datasets is 0–25.5 m. If the number of height intervals is too small, the same height interval contains a wide range of height values, so the height-guided feature distance loss introduces large errors into the consistency constraint on features in the same interval. If the number of height intervals is too large, the height features approach pixel-level features, resulting in inaccurate height prediction. Therefore, in future work, the number of height intervals could be adjusted adaptively according to the approximate height range of the dataset. It is worth noting that, when the number of height intervals changes, the semantic segmentation results remain basically unchanged. This indicates that the improvement in height estimation does not come at the expense of semantic segmentation performance and that height discretization with a proper number of intervals is necessary.
Since the task-guided representation enhancement method is performed on local patches, the impact of the local patch size on height estimation and semantic segmentation is also assessed here. Because the height variations of a scene are usually more pronounced than its semantic class variations, this work chooses different local patch sizes for the height branch and the semantic branch. For the height branch, three patch sizes are evaluated, and the experimental results are shown in Table 9. It can be seen that the intermediate patch size yields the best height estimation results. The experimental results show that the image patches should be neither too small nor too large for height estimation: small patches contain few pixels, whose height values may all fall into the same height interval, making the height-guided feature distance loss less effective, whereas overly large patches may contain pixels with substantial height differences, which blurs the feature representation of different height intervals. For the semantic branch, three patch sizes are likewise evaluated, and the experimental results are shown in Table 10. Again, the intermediate patch size performs best. Similar to the height features, pixels in a small image patch are more likely to belong to the same class, while a large image patch may contain different object classes, making it difficult to obtain an optimal feature representation. In the experiments, the proposed method therefore adopts the best-performing patch size for each branch.
Furthermore, the computational time of the SCE-Net on the Vaihingen dataset and the Potsdam dataset is analyzed, as shown in Table 11. As in Section 4.2, B represents the baseline network, FSFM represents the feature separation–fusion module, and TRE represents the task-guided representation enhancement method. For the Vaihingen dataset with a ResNet-50 backbone, the average inference time of the baseline network per image patch is 0.034 s, the total time for the testing dataset is 13.796 s, and the average inference time for each original tile is 0.811 s. When the FSFM module is added, the average inference time per image patch is 0.035 s, the total time for the testing images is 14.219 s, and each original tile takes 0.836 s on average. When TRE is further added, the inference time does not increase, because the feature representation enhancement of the height and semantic features is not involved in the testing stage. With a ResNet-101 backbone, the average inference time per image patch is 0.039 s, the total inference time for the testing images is 15.761 s, and the average inference time per original tile is 0.927 s. For the Potsdam dataset with a ResNet-50 backbone, the average inference time per image patch is 0.033 s, the total time is 68.205 s, and each original tile takes about 4.871 s. With a ResNet-101 backbone, the inference time per image patch is about 0.037 s, the total inference time for the testing images is 76.140 s, and the average inference time per original tile is 5.438 s, since ResNet-101 is more complex than ResNet-50. The experimental results demonstrate that, compared with the baseline network, the proposed SCE-Net effectively improves the height estimation and semantic segmentation performance with little increase in computational time.