GeoBoost: An Incremental Deep Learning Approach toward Global Mapping of Buildings from VHR Remote Sensing Images

Yang, Naisen; Tang, Hong

doi:10.3390/rs12111794


by Naisen Yang and Hong Tang *

State Key Laboratory of Remote Sensing Science, Faculty of Geographical Science, Beijing Normal University, Beijing 100875, China

* Author to whom correspondence should be addressed.

Remote Sens. 2020, 12(11), 1794; https://doi.org/10.3390/rs12111794

Submission received: 21 April 2020 / Revised: 22 May 2020 / Accepted: 28 May 2020 / Published: 2 June 2020


Abstract: Modern convolutional neural networks (CNNs) are usually trained on predefined data sets of fixed size. In large-scale applications of satellite imagery, such as global or regional mapping, however, images are generally collected incrementally over multiple stages. In other words, the training data set for a mapping task may grow over time rather than being fixed beforehand. In this paper, we present a novel algorithm, called GeoBoost, for incremental learning of semantic segmentation with convolutional neural networks. Specifically, GeoBoost is trained in an end-to-end manner on newly available data, and it does not decrease the performance of previously trained models. The effectiveness of GeoBoost is verified on the large-scale DREAM-B data set. The method avoids training on the enlarged data set from scratch and is expected to become more effective as more data become available.

Keywords:

building extraction; data-incremental learning; GeoBoost; convolutional neural networks; semantic segmentation


1. Introduction

In recent years, satellite image data sets have grown considerably compared with a decade ago. In real-world applications, such as global or regional mapping, large-scale data sets are built in multiple stages. For instance, the widely used WHU-RS data set [1] was built through three versions: it was expanded from 12 classes of aerial scenes [1] to 19 [2] and then 20 [3] classes, and the number of samples in each class was also increased. In practice, it is difficult to decide whether a data set is large enough, or whether its configuration of semantic classes is reasonable, before the data set has been thoroughly verified. Inevitably, large-scale data sets are built in a multi-stage manner. However, most current models applied to such growing data sets have weak continual-learning ability. This creates a need for models that are capable of continually fitting growing data sets.

Incremental learning approaches tackle the problem of learning from sequentially acquired data without forgetting. This type of method is also referred to as continual learning [4,5,6,7] or lifelong learning [8,9,10]. Incremental learning approaches based on neural networks can be divided into the following categories [6]: regularization approaches, dynamic architectures, and memory replay. Regularization approaches add constraints to the weight updates to alleviate forgetting [11,12,13]. Dynamic architectures expand the networks dynamically by creating new weights or layers [14,15,16,17]. Memory replay stores a subset of samples or long-term information from the previous stage [5,18,19]. Because they do not constrain the capacity of the model, dynamic architectures are a more appropriate choice for achieving better performance [7]. Therefore, we focus on this type of method for the incremental learning of satellite images in this paper.

The task of incremental learning in this paper refers to data-incremental learning, in which the classes stay the same for the data at different stages. Progressive neural networks [16] reuse the intermediate layers of previous models. This strengthens the representations of the final model but discards the supervision information of the output layers. Network architecture evolving [17] employs neural architecture search (NAS) to optimize the entire model for newly arriving data by choosing from three options: creating new neurons, reusing the existing parameters, and tuning the existing parameters. It keeps the model size reasonable, but the NAS procedure is time-consuming. Deep adaptation networks [20] attach new convolutional filters to the existing base network. These new filters are linear combinations of the trained convolutional filters of the base network, which allows the augmented network to adapt representations from previous tasks. The weights of the linear combinations are determined by controller modules. Expert gate [21] is a network of experts that employs an auto-encoder gate to decide which expert to use at test time. The relatedness between the test sample and the relevant experts is measured by the reconstruction error of the auto-encoder gate. Error-driven incremental learning [22] grows deep convolutional networks in a hierarchical manner. The increased capacity only benefits the top layers, and the feature extractor is not expanded. The self-organizing incremental neural network [23] combines a variant of the Growing When Required (GWR) network [24] with pre-trained convolutional neural networks (CNNs) to hierarchically learn human actions. The pre-trained CNNs limit the representational power of the model. Incremental feature learning [25] adds neurons to the hidden layers of a denoising autoencoder to enhance the capacity of the network. These newly added features are trained on the collected hard examples. Then, similar features are merged to reduce redundancy.

Incremental boosting [14] trains a base learner for each batch of newly arriving data to construct an ensemble model. Essentially, it is an additive model, so the performance of the trained base learners is not affected by subsequent models. Thus, we develop our new algorithm on the basis of incremental boosting in this paper. Previous studies that use boosting for incremental learning [14,15,26] employ the AdaBoost algorithm [27] to adjust the weights of samples. In recent years, the gradient boosting algorithm has produced state-of-the-art results on many application benchmarks [28,29], and it can flexibly adopt diverse loss functions. Therefore, we employ gradient boosting [30] for incremental learning in this paper.

For large-scale applications of satellite images, geospatial distribution is one of the most distinguishing characteristics. As shown in Figure 1, satellite images from different regions have their own regularities and are diverse in terms of color, texture, and morphological structure. In addition, the available training data for large-scale satellite images are not evenly distributed. Figure 2 illustrates the geospatial distribution of the training labels used in this paper, which are derived from Open Street Map (OSM) [31]. It can be seen clearly that labels are more readily available for Europe, North America, and East Asia than for other areas. The geospatial information of satellite images can be utilized to guide the continual-learning process of models and thereby improve semantic segmentation results. Additionally, for reasons of data privacy, researchers cannot always obtain the training data from previous research, whereas the corresponding trained models are generally publicly available. If we have a trained global model without access to the original data set, training a new model from scratch is not the best choice when new regional data are collected. Fine-tuning the existing model is a common way to use the information in the trained model, but it decreases the performance of the model on the original data set. Therefore, we propose a novel algorithm, called GeoBoost, for geographically incremental learning that exploits the geospatial distribution of the data to improve the prediction results. The GeoBoost method simplifies the optimization of gradient boosting for semantic segmentation with convolutional neural networks (CNNs) and attaches the geospatial distribution information of large-scale satellite images to the base learners of gradient boosting.

The remainder of this paper is organized as follows. Section 2 describes the essential idea of GeoBoost. In Section 3, the experimental design is presented in detail. We present and discuss the experimental results in Section 4. Finally, we draw some conclusions in Section 5.

2. Methods

Boosting of neural networks has been adopted for incremental learning [14,15,26], where each base learner is trained in a manner that is similar to AdaBoost [27]. Different from previous studies, the GeoBoost algorithm employs gradient boosting [30] for incremental learning. In this section, we illustrate how gradient boosting, composed of neural networks, can be trained in an end-to-end way for geographically incremental learning. Furthermore, when gradient boosting is applied to large-scale satellite images, we show how the proposed algorithm, GeoBoost, assembles base learners of the ensemble model according to the corresponding geospatial distribution of data to improve the prediction results.

2.1. Gradient Boosting

The gradient boosting method [30] is a special type of ensemble learning approach that constructs additive models in a stage-wise manner. More formally, with the training data

x = \{x_{1}, \dots, x_{N}\}

and the corresponding labels

y = \{y_{1}, \dots, y_{N}\}

, gradient boosting optimizes the ensemble model

F_{M} (x) = \sum_{m = 0}^{M} f_{m} (x)

(1)

to minimize the loss function

L (y, F (x))

, where

f_{m} (x)

is a single base learner. The whole optimization procedure is divided into multiple stages. In the m-th stage, based on the ensemble model from the previous stage

F_{m - 1} (x) = \sum_{i = 0}^{m - 1} f_{i} (x),

(2)

the base learner

f_{m} (x)

of the current stage can be derived from

F_{m - 1} (x)

and the loss function

L (y, F_{m - 1} (x))

as

f_{m} (x) = - ρ_{m} g_{m} (x) .

(3)

Here,

g_{m} (x)

is the gradient direction

g_{m} (x) = \frac{\partial L (y, F_{m - 1} (x))}{\partial F_{m - 1} (x)},

(4)

and

ρ_{m}

is the step size obtained by a line search along the negative gradient direction

ρ_{m} = \underset{ρ_{m}}{arg min} L (y, F_{m - 1} (x) - ρ_{m} g_{m} (x)) .

(5)

In summary, the essential idea of gradient boosting is to optimize the base learner

f_{m} (x)

along the negative gradient direction of the loss function. Since

g_{m} (x)

is the gradient direction with respect to

F_{m - 1} (x)

, parameters

θ_{m}

of the base learner

f_{m} (x; θ_{m})

cannot be trained directly with Equation (3). In practice,

θ_{m}

is obtained in an equivalent form

θ_{m} = \underset{θ_{m}}{arg min} {[g_{m} (x) - h_{m} (x; θ_{m})]}^{2}

(6)

and

f_{m} (x) = - ρ_{m} h_{m} (x),

(7)

where

h_{m} (x; θ_{m})

is a parameterized function. Taking into account all the mentioned factors, the generic procedure of the gradient boosting method is shown in Algorithm 1.

Algorithm 1. The algorithm of gradient boosting [30].

Input: The training data,

x

, and the corresponding labels,

y

; the parameterized function,

h (x, θ)

.

1: $F_{0} (x) = {\arg\min}_{ρ_{0}, h_{0}} L (y, ρ_{0} h_{0} (x))$
2: for $m = 1$ to $M$ do
3:     $g_{m} (x) = \frac{\partial L (y, F_{m - 1} (x))}{\partial F_{m - 1} (x)}$
4:     $θ_{m} = {\arg\min}_{θ_{m}} {[g_{m} (x) - h_{m} (x; θ_{m})]}^{2}$
5:     $ρ_{m} = {\arg\min}_{ρ_{m}} L (y, F_{m - 1} (x) - ρ_{m} h_{m} (x))$
6:     $f_{m} (x) = - ρ_{m} h_{m} (x)$
7:     $F_{m} (x) = F_{m - 1} (x) + f_{m} (x)$
8: end for

Output:

F_{M} (x) = \sum_{m = 0}^{M} f_{m} (x)
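The steps of Algorithm 1 can be illustrated with a minimal sketch in plain Python. It rests on several assumptions that are not in the original text: the loss is the squared loss, the base learner h is a one-split regression stump on scalar inputs, the initial model is the mean of the labels, and all function names are hypothetical. It is meant only to make the roles of g, h, and ρ concrete.

```python
def fit_stump(xs, targets):
    """Least-squares regression stump on 1-D inputs (illustrative base learner h)."""
    best = None
    order = sorted(set(xs))
    for lo, hi in zip(order, order[1:]):
        t = (lo + hi) / 2.0
        left = [g for x, g in zip(xs, targets) if x <= t]
        right = [g for x, g in zip(xs, targets) if x > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = sum((g - (lv if x <= t else rv)) ** 2 for x, g in zip(xs, targets))
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    return best[1:]  # (threshold, left value, right value)


def stump_predict(stump, x):
    t, lv, rv = stump
    return lv if x <= t else rv


def squared_loss(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds))


def gradient_boost(xs, ys, n_stages):
    f0 = sum(ys) / len(ys)  # assumed initial model: the constant minimizing squared loss
    preds = [f0] * len(xs)
    stages, losses = [], [squared_loss(ys, preds)]
    for _ in range(n_stages):
        # Line 3: gradient of the squared loss with respect to F_{m-1}(x).
        g = [-2.0 * (y - p) for y, p in zip(ys, preds)]
        # Line 4: fit the parameterized function h to the gradient.
        stump = fit_stump(xs, g)
        h = [stump_predict(stump, x) for x in xs]
        denom = sum(v * v for v in h)
        if denom == 0.0:
            break  # nothing left to fit
        # Line 5: closed-form line search for rho under the squared loss.
        rho = -sum((y - p) * v for y, p, v in zip(ys, preds, h)) / denom
        # Lines 6-7: f_m = -rho * h, F_m = F_{m-1} + f_m.
        preds = [p - rho * v for p, v in zip(preds, h)]
        stages.append((rho, stump))
        losses.append(squared_loss(ys, preds))
    return f0, stages, losses


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.1, 0.0, 0.2, 1.9, 2.1, 2.0]
f0, stages, losses = gradient_boost(xs, ys, n_stages=3)
print(losses)
```

Because the line search is exact for the squared loss, the training loss in this sketch cannot increase from one stage to the next.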

2.2. End-to-End Gradient Boosting

When neural networks are adopted as base learners of the gradient boosting algorithm, the existing studies [32,33] just replace

g_{m} (x)

and

ρ_{m}

with the corresponding versions of neural networks. Actually, the optimization of gradient boosting and the training of a single neural network can be combined together to further simplify the training process.

In practice, previous studies treat the softmax scores as the output of neural networks for classification. In contrast, we can assume that the activation values

z

of the layer before the softmax function

σ

are the output of the neural network. Here,

z

is a K-dimensional vector of real values with respect to the K classes. Then,

z

is normalized by the softmax function

σ

to obtain the final normalized score:

σ (z)_{j} = \frac{e^{z_{j}}}{\sum_{k = 1}^{K} e^{z_{k}}}, \quad j = 1, \dots, K .

(8)

σ : R^{K} \to {[0, 1]}^{K} .

(9)

With the above definition of model, the base learner of gradient boosting can be expressed as

f_{m} (x) = z_{m}

, and the ensemble model as

F_{m} (x) = σ (f_{m} (x; θ_{m}) + \sum_{i = 0}^{m - 1} f_{i} (x))

(10)

= σ (\sum_{i = 0}^{m} f_{i} (x))

(11)

where

θ_{m}

are the parameters of the neural network

f_{m}

. In this form, the output of

F_{m} (x)

is a vector of normalized scores.
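As a small illustration of Equations (8) to (11), the sketch below sums the pre-softmax outputs (logits) of several base learners and normalizes the sum once. The function names and the example logits are hypothetical, and each learner is assumed to return one K-dimensional vector; in segmentation the same operation would be applied per pixel.

```python
import math


def softmax(z):
    """Equation (8), computed in a numerically stable way."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_scores(logit_list):
    """Equation (11): softmax of the element-wise sum of the learners' logits."""
    summed = [sum(vals) for vals in zip(*logit_list)]
    return softmax(summed)


z0 = [2.0, 0.0]   # hypothetical output of f_0 for K = 2 classes
z1 = [-0.5, 1.0]  # hypothetical output of f_1
scores = ensemble_scores([z0, z1])
print(scores)
```

Summing logits before the softmax, rather than averaging normalized scores, is what lets a later learner add a correction on top of the earlier ones.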

Since Algorithm 1 is generic, any suitable optimization method for

h_{m} (x; θ_{m})

can be utilized to find the solution of Equation (6). For instance, if

h_{m} (x; θ_{m})

is a decision tree, it can be optimized by the CART algorithm [30,34]. Thus, Equations (4) and (6) are two separate steps. Generally, neural networks are trained by gradient descent. Likewise, gradient boosting is optimized along the negative gradient direction of the loss function. Instead of calculating

g_{m} (x)

directly, we can incorporate this step into the process of gradient descent. In other words, the loss function can be optimized directly during the backpropagation of neural networks without the extra step of obtaining

g_{m} (x)

explicitly. Therefore, this type of gradient boosting can be called end-to-end gradient boosting.

Suppose that we have a regression task with the squared loss

L (y, F_{m} (x)) = {(y - F_{m} (x))}^{2}

and that the base models

f_{0}

and

f_{1}

have already been learned, so that we have the trained model

F_{1} (x) = f_{0} (x) + f_{1} (x)

and want to train the base model

f_{2} (x)

in the new iteration.

If

f_{2}

is a decision tree, the gradient direction is

g_{2} (x) = \frac{\partial L (y, F_{1} (x))}{\partial F_{1} (x)} .

(12)

The decision tree

f_{2}

has not been constructed yet, thus we can only get the gradients with respect to

F_{1} (x)

rather than

F_{2} (x) = f_{0} (x) + f_{1} (x) + f_{2} (x)

. After calculating

g_{2} (x)

, the decision tree

f_{2}

can be built with

θ_{2} = {arg min}_{θ_{2}} {[g_{2} (x) - f_{2} (x; θ_{2})]}^{2}

using the CART algorithm [34]. In short, the gradient boosting method utilizes

g_{2} (x) = \partial L / \partial F_{1}

to guide the learning procedure of the parameter

θ_{2}

.
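As a worked instance under the squared loss assumed above, the gradient in Equation (12) has the closed form

g_{2} (x) = \frac{\partial {(y - F_{1} (x))}^{2}}{\partial F_{1} (x)} = - 2 (y - F_{1} (x)) .

That is, the gradient is the residual of the current ensemble scaled by -2, so subtracting a positive multiple of the fitted gradient, as in Equation (7), moves the prediction toward y.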

If

f_{2}

is a simple neural network

f_{2} (x) = θ_{2} x + b

, the gradient direction is

g_{2} (x) = \frac{\partial L (y, F_{2} (x))}{\partial F_{2} (x)} .

(13)

Different from decision trees, neural networks can be randomly initialized before the training procedure, so the gradient with respect to

F_{2} (x)

can be obtained. The backpropagation algorithm computes the gradients of parameters

θ_{2}

as

\frac{\partial L (y, F_{2} (x))}{\partial θ_{2}} = \frac{\partial L (y, F_{2} (x))}{\partial F_{2} (x)} \frac{\partial F_{2} (x)}{\partial θ_{2}} = g_{2} (x) \frac{\partial F_{2} (x)}{\partial θ_{2}}

(14)

to guide the learning procedure of the parameters

θ_{2}^{(t)} = θ_{2}^{(t - 1)} - η \frac{\partial L (y, F_{2} (x))}{\partial θ_{2}},

(15)

where

η

is the learning rate.
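Equations (13) to (15) can be traced numerically with a minimal sketch. The frozen ensemble F_1, the data, the step count, and the function names below are hypothetical choices made for illustration; the loss is averaged over samples, which only rescales the learning rate.

```python
def frozen_f1(x):
    """Stands in for the already trained, frozen ensemble F_1 = f_0 + f_1 (hypothetical)."""
    return 0.5 * x


def train_f2(xs, ys, lr=0.1, steps=2000):
    """Train f_2(x) = theta * x + b by gradient descent on L(y, F_2(x))."""
    theta, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_theta = grad_b = 0.0
        for x, y in zip(xs, ys):
            f2_total = frozen_f1(x) + theta * x + b  # F_2(x)
            g2 = -2.0 * (y - f2_total)               # Eq. (13) for the squared loss
            grad_theta += g2 * x / n                 # Eq. (14): dF_2/dtheta = x
            grad_b += g2 / n                         # dF_2/db = 1
        theta -= lr * grad_theta                     # Eq. (15)
        b -= lr * grad_b
    return theta, b


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated by y = 2x + 1
theta2, b2 = train_f2(xs, ys)
print(theta2, b2)
```

In this toy setting f_2 only has to learn what F_1 leaves unexplained, y - F_1(x) = 1.5x + 1, and the gradient g_2 never has to be stored as a separate training target.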

Using decision trees as base learners leads to two separate steps: calculate

\partial L / \partial F_{1}

, then use it to guide the learning procedure of the parameters. The advantage of gradient boosting is that the type of the base learner is not limited. Using neural networks as base learners does not need to calculate

\partial L / \partial F_{2}

manually. It can be calculated by the backpropagation algorithm. Therefore, the end-to-end gradient boosting algorithm can only employ neural networks as base learners.

Unlike line 3 of Algorithm 1, where

g_{m} (x)

is the gradient direction with respect to

F_{m - 1} (x)

, the base learner

f_{m} (x; θ_{m})

can be optimized with respect to

F_{m} (x)

(Equation (10)) by gradient descent:

f_{m} (x; θ_{m}) = \underset{θ_{m}}{arg min} L (y, F_{m} (x)) .

(16)

During the training of

f_{m} (x)

, the base learners from previous stages are frozen and non-trainable, and

θ_{m}

are the only trainable parameters. Additionally, the gradient descent used to train neural networks already searches for an optimal point on the loss surface. Therefore, the line search

ρ_{m}

is redundant and can be omitted. Finally, we obtain a concise process of training gradient boosting in an end-to-end manner shown in Algorithm 2.

Algorithm 2. The algorithm of end-to-end gradient boosting.

Input: The training data,

x

, and the corresponding labels,

y

; the neural network,

f (x; θ)

; and the softmax function,

σ

.

1: $F_{0} (x) = σ (f_{0} (x; θ_{0}))$
2: $f_{0} (x) = {\arg\min}_{θ_{0}} L (y, F_{0} (x))$
3: for $m = 1$ to $M$ do
4:     $F_{m} (x) = σ (f_{m} (x; θ_{m}) + \sum_{i = 0}^{m - 1} f_{i} (x))$
5:     $f_{m} (x; θ_{m}) = {\arg\min}_{θ_{m}} L (y, F_{m} (x))$
6: end for

Output:

F_{m} (x) = σ (\sum_{i = 0}^{m} f_{i} (x))
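The stage-wise structure of Algorithm 2 can be sketched in plain Python. This is not the authors' implementation: the CNN base learner is replaced by a tiny linear model that maps a scalar feature to K = 2 logits, the loss is assumed to be the mean cross-entropy, and the data and function names are hypothetical. The point is only that earlier logits stay frozen while the new learner is trained through the softmax.

```python
import math


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def mean_cross_entropy(labels, logits):
    return -sum(math.log(softmax(z)[y]) for y, z in zip(labels, logits)) / len(labels)


def train_stage(xs, labels, frozen_logits, lr=0.5, steps=200):
    """Fit one linear learner f_m(x) = (w[k] * x + b[k])_k; earlier logits are frozen."""
    k = 2
    w, b = [0.0] * k, [0.0] * k
    n = len(xs)
    for _ in range(steps):
        gw, gb = [0.0] * k, [0.0] * k
        for x, y, z_prev in zip(xs, labels, frozen_logits):
            z = [z_prev[j] + w[j] * x + b[j] for j in range(k)]  # line 4, before softmax
            p = softmax(z)
            for j in range(k):
                d = p[j] - (1.0 if j == y else 0.0)  # d(cross-entropy)/d(z_j)
                gw[j] += d * x / n
                gb[j] += d / n
        w = [w[j] - lr * gw[j] for j in range(k)]
        b = [b[j] - lr * gb[j] for j in range(k)]
    return w, b


def end_to_end_boost(xs, labels, n_learners):
    logits = [[0.0, 0.0] for _ in xs]  # running sum of frozen learners' logits
    learners, losses = [], []
    for _ in range(n_learners):
        w, b = train_stage(xs, labels, logits)
        logits = [[z[j] + w[j] * x + b[j] for j in range(2)] for x, z in zip(xs, logits)]
        learners.append((w, b))
        losses.append(mean_cross_entropy(labels, logits))
    return learners, losses


xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
labels = [0, 0, 1, 0, 1, 1]  # deliberately not linearly separable
learners, losses = end_to_end_boost(xs, labels, n_learners=2)
print(losses)
```

Each new learner starts from zero logits, so the ensemble initially reproduces the previous stage and the new parameters are the only ones that change.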

2.3. GeoBoost

For the purpose of incremental learning, each base learner of boosting is trained on a newly arriving set of data rather than on the same data set. For instance, the training collection

(X, Y)

consists of a group of image sets

X = {x_{0}, \dots, x_{m}, \dots, x_{M}}

and the corresponding group of label sets

Y = {y_{0}, \dots, y_{m}, \dots, y_{M}}

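, where each image set

x_{m} = \{x_{1}, \dots, x_{N}\}

is a collection of N images and

y_{m} = \{y_{1}, \dots, y_{N}\}

is the corresponding label set. Note that a subscript on the left-hand side indexes a set, whereas the subscripts inside the braces index individual images or label matrices. If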

x_{1}

is an image with a size of

512 \times 512

, then

y_{1}

is the label matrix of the semantic segmentation task with a size of

512 \times 512

. Thus, the image set

x_{1} = {x_{1}, x_{2}, x_{3}}

contains three images, and the corresponding label set

y_{1} = {y_{1}, y_{2}, y_{3}}

contains three label matrices. With two image sets

x_{1} = {x_{1}, x_{2}, x_{3}}

,

x_{2} = {x_{4}, x_{5}}

, their label sets are

y_{1} = {y_{1}, y_{2}, y_{3}}

and

y_{2} = {y_{4}, y_{5}}

. Hence, the group of image sets

X = {x_{1}, x_{2}}

contains five images in total, and the group of label sets

Y = {y_{1}, y_{2}}

contains five label matrices in total.

Since each base learner

f_{m} (x_{m}; θ_{m})

is trained on the data set

(x_{m}, y_{m})

, Equation (10) becomes

F_{m} (X) = σ (f_{m} (x_{m}; θ_{m}) + \sum_{i = 0}^{m - 1} f_{i} (x_{i})) .

(17)

By utilizing boosting for incremental learning [14,15,26], the performance of existing base learners is not affected by the newly added component, and the base learner in the current stage is trained without touching the data involved in previous stages. Additionally, the ensemble model can also reuse the classification capability of existing base learners.

For the large-scale satellite images, geospatial distribution is their most notable characteristic. Generally, the collected satellite images are clustered in certain areas (as shown in Figure 3), and the objects in these images from different areas are diverse in terms of color, size, and density. When gradient boosting is applied to the large-scale satellite images, the geospatial distribution information of data can be considered to improve the prediction results.

For instance, suppose that we separately collect satellite images from Europe and America, and then train a model

F_{a}

with the data from Europe and train a model

F_{b}

with the data from America. The model

F_{a}

will perform well on the data from Europe but poorly on the data from America. Images from these two areas are very diverse, and

F_{a}

has never seen the data from America, so it fails to predict other areas well. One solution is to dedicate each model to a certain area. In this manner,

F_{a}

will be trained on the data from Europe and predict results for the data from Europe, and

F_{b}

will be trained on the data from America and predict results for the data from America.

Specifically, in the m-th stage of gradient boosting, the coverage area of the training data set

x_{m}

can be expressed as geographic coordinates in the form of bounding box

B_{m} = (x_{min}, y_{min}, x_{max}, y_{max})

. Since the base learner

f_{m}

is trained on

x_{m}

, we can set

B_{m}

as the coverage area of the base learner

f_{m} (x_{m}; θ_{m}, B_{m})

. For a certain image

x_{j}

, it is solely classified by the base learners that cover the geographic location

p_{j}

of the image. Consequently, the ensemble model becomes

F_{m} (x_{j}) = σ (\sum_{i = 0}^{m} r (p_{j}, B_{i}) \cdot f_{i} (x_{j}, p_{j}; B_{i})),

(18)

where

r (p_{j}, B_{i})

is an indicator function

r (p_{j}, B_{i}) = \begin{cases} 1, & \text{if } p_{j} \text{ is in } B_{i}, \\ 0, & \text{otherwise} . \end{cases}

(19)
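Equations (18) and (19) can be sketched as a gated sum of logits. Everything concrete in this example is an assumption for illustration: locations are (longitude, latitude) pairs, the bounding boxes and logits are made up, boundaries are treated as inclusive, and each learner returns one K-dimensional vector per image rather than a per-pixel map.

```python
import math


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def in_bbox(p, box):
    """Indicator r(p, B) of Eq. (19) for B = (x_min, y_min, x_max, y_max)."""
    x, y = p
    x_min, y_min, x_max, y_max = box
    return 1 if (x_min <= x <= x_max and y_min <= y <= y_max) else 0


def geoboost_scores(image, p, learners, num_classes):
    """Eq. (18): softmax over the logits of the learners whose box covers p."""
    total = [0.0] * num_classes
    for predict_logits, box in learners:
        r = in_bbox(p, box)
        if r:  # skip the forward pass of learners that do not cover p
            z = predict_logits(image)
            total = [t + r * v for t, v in zip(total, z)]
    return softmax(total)


# Hypothetical learners: a global one and one restricted to a box around Europe.
global_learner = (lambda image: [1.0, 0.0], (-180.0, -90.0, 180.0, 90.0))
europe_learner = (lambda image: [-2.0, 1.0], (-10.0, 35.0, 40.0, 70.0))
learners = [global_learner, europe_learner]

chicago = geoboost_scores(None, (-87.63, 41.88), learners, num_classes=2)
vienna = geoboost_scores(None, (16.37, 48.21), learners, num_classes=2)
print(chicago, vienna)
```

In this toy setup the image located in Chicago is scored by the global learner alone, while the image in Vienna also receives the regional correction.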

Because each base learner is tied to a certain area, we name this algorithm GeoBoost. In the rest of this paper, the indicator function

r (p_{j}, B_{i})

is simply denoted as

r_{i}

, and

f_{i} (x_{j}, p_{j}; B_{i})

as

f_{i} (x_{j})

. The entire process of GeoBoost is shown in Algorithm 3. The base learners of the original gradient boosting method do not take any geospatial information into consideration, which means that they can be applied to any area. In other words, they can be treated as a special case of GeoBoost that covers the entire range of the geographic coordinates system.

Boosting methods assemble a number of base learners to form the final ensemble model. Because of its large number of parameters, the ensemble model can easily over-fit. The gradient boosting method applies a learning rate to each base learner as regularization to prevent over-fitting [30]. The learning rate

ν

and the number of base learners M jointly determine the performance of the boosting model. Small learning rates can reduce the dominance of a single base learner. With the presence of the learning rate

ν

, we can reach the final algorithm as shown in Algorithm 4.

Algorithm 3. The algorithm of GeoBoost.

Input: The training collection of data,

X = {x_{0}, \dots, x_{M}}

, and the corresponding labels,

Y = {y_{0}, \dots, y_{M}}

; the neural network,

f (x; θ, B)

; the softmax function,

σ

; and the indicator function,

r_{i}

.

1: $F_{0} (x_{0}) = σ (f_{0} (x_{0}; θ_{0}, B_{0}))$
2: $f_{0} (x_{0}) = {\arg\min}_{θ_{0}} L (y_{0}, F_{0} (x_{0}))$
3: for $m = 1$ to $M$ do
4:     $F_{m} (x_{j}) = σ (\sum_{i = 0}^{m} r_{i} \cdot f_{i} (x_{j}))$
5:     $f_{m} (x_{m}; θ_{m}, B_{m}) = \underset{θ_{m}}{\arg\min} E_{\underset{y_{j} \in y_{m}}{x_{j} \in x_{m}}} [L (y_{j}, F_{m} (x_{j}))]$
6: end for

Output:

F_{m} (x) = σ (\sum_{i = 0}^{m} r_{i} \cdot f_{i} (x))

Algorithm 4. The algorithm of GeoBoost with learning rates.

Input: The training collection of data,

X = {x_{0}, \dots, x_{M}}

, and the corresponding labels,

Y = {y_{0}, \dots, y_{M}}

; the neural network,

f (x; θ, B)

; the softmax function,

σ

; the learning rate of GeoBoost,

ν

; and the indicator function,

r_{i}

.

1: $F_{0} (x_{0}) = σ (f_{0} (x_{0}; θ_{0}, B_{0}))$
2: $f_{0} (x_{0}) = {\arg\min}_{θ_{0}} L (y_{0}, F_{0} (x_{0}))$
3: for $m = 1$ to $M$ do
4:     $F_{m} (x_{j}) = σ (r_{m} \cdot f_{m} (x_{j}) + \sum_{i = 0}^{m - 1} ν \cdot r_{i} \cdot f_{i} (x_{j}))$
5:     $f_{m} (x_{m}; θ_{m}, B_{m}) = \underset{θ_{m}}{\arg\min} E_{\underset{y_{j} \in y_{m}}{x_{j} \in x_{m}}} [L (y_{j}, F_{m} (x_{j}))]$
6: end for

Output:

F_{m} (x) = σ (\sum_{i = 0}^{m} ν \cdot r_{i} \cdot f_{i} (x))

3. Experiment

3.1. Data Set

For semantic segmentation of satellite images, the commonly used data sets, such as the INRIA building data set [35] and the ISPRS 2D Semantic Labeling Benchmark [36], are built from just a few cities and are therefore not distributed widely enough. Therefore, we create a new worldwide building data set from 100 different cities, as shown in Figure 2, to simulate a real application. This new data set is named the Building data set for Disaster Reduction and Emergency Management (DREAM-B). The training images are collected from Google Earth Engine [37], and the corresponding labels are obtained from the open-source map, Open Street Map (OSM) [31]. DREAM-B contains 626 image tiles with a size of

4096 \times 4096

. We use 250 tiles for training, 63 for validation, and 313 for testing. The data set contains only two classes: building and non-building. Each image tile of the DREAM-B data set is composed of red (R), green (G), and blue (B) bands, and its spatial resolution is 30 cm.

For the task of incremental learning, the 250 training image tiles are divided into four groups based on their geospatial location rather than sampled uniformly. As shown in Figure 3, we build a tiny global data set for the first group and add local data sets for the other groups. The numbers of image tiles in the four groups are roughly equal. This division simulates the real situation in which a global data set is built incrementally, e.g., from coarse to fine and from one region to another.

3.2. Implementation Details

We combine the U-Net model [38] with the NASNet-Mobile model [39] to form the architecture of the base learners of the GeoBoost algorithm. Specifically, the convolutional modules in U-Net are replaced by the neural cell obtained via neural architecture search [39], which is more computationally efficient. This model is called U-NASNetMobile in this paper. The input size of U-NASNetMobile is

512 \times 512

, and the original image tiles of the DREAM-B data set are split into small patches with a size of

512 \times 512

to match the model. Recent studies [40,41] have found that a larger training sample size produces better performance for neural networks. However, if the GPU memory is fixed, a larger training sample size leads to a smaller batch size for gradient descent. According to the linear scaling rule for learning rates [40], a small batch size increases the noise in the gradient, so the learning rate may need to be decreased. To balance the batch size against the training sample size, the input size of U-NASNetMobile is set to

512 \times 512

. The architecture of U-NASNetMobile is shown in Figure 4. Data augmentations are employed to avoid over-fitting, including random flipping horizontally and vertically, random rotation, and random brightness jittering.
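The tile and patch sizes stated above imply a simple count, sketched below. The text does not say whether patches overlap; the sketch assumes a non-overlapping grid, and the function name is hypothetical.

```python
def patch_origins(tile_size, patch_size):
    """Top-left (row, col) of each patch in a non-overlapping grid (assumption)."""
    if tile_size % patch_size != 0:
        raise ValueError("tile size must be a multiple of the patch size")
    steps = range(0, tile_size, patch_size)
    return [(r, c) for r in steps for c in steps]


origins = patch_origins(4096, 512)
patches_per_tile = len(origins)            # 8 x 8 grid
training_patches = 250 * patches_per_tile  # 250 training tiles
print(patches_per_tile, training_patches)
```

Under this assumption each tile yields 64 patches, so the 250 training tiles would give 16,000 training patches before augmentation.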

The Adam optimizer [42] is used for optimization. We train models with the cosine learning rate annealing schedule [43] with a maximum learning rate of

3 \times 10^{- 4}

and a minimum learning rate of

1 \times 10^{- 6}

. Unless otherwise specified, all models are trained for 200 epochs with a mini-batch size of 16. In addition, the Intersection over Union (IoU) is employed as the evaluation metric [44]. More specifically, the IoU accuracy is defined as

IoU = \frac{| Prediction \cap GroundTruth |}{| Prediction \cup GroundTruth |} .

(20)

The metrics of Precision, Recall, Overall Accuracy, and Kappa Coefficient are also provided for reference.
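The metrics named above can all be derived from the pixel-wise confusion counts of the building class. The sketch below uses the standard textbook definitions of these metrics, which the text does not spell out, so they are an assumption here; the masks and function names are hypothetical, and 1 marks a building pixel.

```python
def confusion_counts(pred, truth):
    tp = fp = fn = tn = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            if p and t:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def segmentation_metrics(pred, truth):
    tp, fp, fn, tn = confusion_counts(pred, truth)
    n = tp + fp + fn + tn
    oa = (tp + tn) / n
    # Chance agreement used by Cohen's kappa.
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    return {
        "iou": tp / (tp + fp + fn) if (tp + fp + fn) else 0.0,  # Eq. (20)
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "overall_accuracy": oa,
        "kappa": (oa - pe) / (1.0 - pe) if pe != 1.0 else 0.0,
    }


pred = [[1, 1, 0], [0, 1, 0]]
truth = [[1, 0, 0], [0, 1, 1]]
metrics = segmentation_metrics(pred, truth)
print(metrics)
```

For this toy pair of masks there are two true positives, one false positive, and one false negative, so the IoU is 2 / 4 = 0.5.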

4. Results and Discussion

In this section, we first analyze the impacts of technical factors in Section 4.1. Then, the quantitative and qualitative evaluations of different models are presented in Section 4.2 and Section 4.3, respectively. Finally, we examine the regional impacts on prediction in Section 4.4.

4.1. Technical Analysis

4.1.1. Pre-Training

Base learners of the boosting algorithm are trained sequentially. After the base model

f_{0}

is trained, the weights of other base learners can be initialized by the trained weights of

f_{0}

instead of random initialization. This is an application of the pre-training approach [45]. As illustrated in Figure 5, the base model

f_{1}

with pre-training on

f_{0}

converges faster than the model trained from scratch and also leads to better performance. Therefore, we adopt the strategy of pre-training in the rest of the experiments.

4.1.2. Learning Rates

Because training ensemble models is time-consuming, all the base learners in Table 1 are trained for only 100 epochs to search for the optimum learning rate. Table 1 clearly shows that the learning rates of base learners significantly affect the performance of the GeoBoost algorithm. With four base learners, the learning rate

ν

achieves the best result around

0.1

. This result is consistent with that of the gradient boosting algorithm [30] with many more base learners. Thus, the learning rate of

0.1

is a stable value in practice and is the default configuration in the rest of the boosting experiments.

4.1.3. Complexity Analysis

The GeoBoost algorithm trains M base learners to form an ensemble model, thus its model complexity is

O (M)

. Therefore, it contains more parameters than a single model. However, when more data become available, GeoBoost can be applied without tuning the hyper-parameters of the base models, and the architectures of these base models can also differ. Since each base learner is assigned only a portion of the data, the training time of GeoBoost does not increase and is comparable to that of a single model. At test time, the ensemble model slows down inference, with a time complexity of

O (M)

. Nevertheless, all the base learners are independent of each other, so they can be evaluated in parallel to reduce the inference time.

4.2. Quantitative Evaluation

As shown in Table 2, the result of a single net is obtained by training a U-NASNetMobile model with all the available training data

x_{0}

,

x_{1}

,

x_{2}

, and

x_{3}

from four areas. This is the ideal case, in which all of the data sets are available when the model is trained. It may not be a realistic scenario for the practical application of global mapping with VHR satellite images. The single net model has the highest accuracy of

0.6458

. This result is the baseline result for comparison.

The end-to-end gradient boosting algorithm for incremental learning obtains the worst performance of

0.5465

(IoU) for U-net and

0.5887

(IoU) for U-NASNetMobile in all the experiments. It can be seen from Figure 6a that the first base learner of gradient boosting

f_{0}

already has an IoU accuracy of

0.5998

, and the successive results are only

0.5564

,

0.5747

, and

0.5887

. It means that the performance of the model is decreased by gradient boosting without considering the geospatial distribution of data.

Compared with a single net, GeoBoost obtains a comparable accuracy of

0.6372

, and it also outperforms the end-to-end gradient boosting algorithm by a large margin. Random sampling of training data for base learners can improve the performance of gradient boosting [46], whereas collecting satellite images from certain areas is equivalent to biased subsampling without replacement rather than uniform sampling. This may be why the end-to-end gradient boosting algorithm fails on the DREAM-B data set while GeoBoost achieves satisfactory performance.

The experimental results demonstrate the effectiveness of the GeoBoost algorithm, which progressively utilizes the geospatial information of satellite images. Additionally, the result of the single net again shows the importance of large-scale data.

4.3. Qualitative Analysis

Figure 7 shows some semantic segmentation results of GeoBoost in different training stages from the cities of Chicago, Vienna, and Shanghai. Compared with the model $F_{0} (x) = \sigma (r_{0} f_{0} (x))$, the model $F_{3} (x) = \sigma (r_{0} f_{0} (x) + r_{1} f_{1} (x) + r_{2} f_{2} (x) + r_{3} f_{3} (x))$ produces better predictions in terms of both accuracy and visual quality. According to the location of the image from Chicago, $r_{2}$ and $r_{3}$ are 0, and $r_{0}$ and $r_{1}$ are 1 for this image based on Equations (18) and (19). Therefore, the prediction for the image from Chicago can be simplified as $F_{3} (x) = \sigma (f_{0} (x) + f_{1} (x))$. Similarly, predictions of images from Vienna and Shanghai can also be simplified in this way. As indicated by the yellow circles in Figure 7, some pixels misclassified by $F_{0} (x)$ are corrected by the model $F_{3} (x)$, and the others are not affected. Taken together, Figure 6 and Figure 7 suggest that the GeoBoost method is capable of continual learning, both quantitatively and qualitatively.
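
The region-gated sum described above can be sketched in a few lines. This is only an illustration under stated assumptions: the bounding-box coordinates in `HYPOTHETICAL_BOXES` are invented placeholders (the real boxes $B_{0}$ to $B_{3}$ are those drawn in Figure 3), $B_{0}$ is assumed to cover every location, the indicator $r_{i}$ is assumed to be 1 when the image location falls inside $B_{i}$ and 0 otherwise, and the base-learner outputs are plain arrays standing in for the raw outputs $f_{i}(x)$ of the trained networks.

```python
import numpy as np

# Hypothetical (west, south, east, north) boxes in degrees; NOT the paper's values.
HYPOTHETICAL_BOXES = [
    (-180.0, -90.0, 180.0, 90.0),  # B_0: assumed to cover all locations
    (-170.0, -60.0, -30.0, 75.0),  # B_1: America (placeholder extent)
    (-15.0, 35.0, 45.0, 72.0),     # B_2: Europe (placeholder extent)
    (100.0, 15.0, 150.0, 50.0),    # B_3: East Asia (placeholder extent)
]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def in_box(lon, lat, box):
    west, south, east, north = box
    return west <= lon <= east and south <= lat <= north

def geoboost_predict(learner_outputs, boxes, lon, lat):
    """Compute sigma(sum_i r_i * f_i(x)) for one image tile.

    learner_outputs: list of arrays, the raw output f_i(x) of each base learner.
    boxes:           list of bounding boxes B_i, one per base learner.
    lon, lat:        location of the image tile.
    """
    total = np.zeros_like(learner_outputs[0], dtype=float)
    for f_i, box in zip(learner_outputs, boxes):
        r_i = 1.0 if in_box(lon, lat, box) else 0.0  # assumed region indicator
        total += r_i * f_i
    return sigmoid(total)
```

For a tile located in Chicago, only the first two indicators are nonzero under these placeholder boxes, so the call reduces to $\sigma (f_{0} (x) + f_{1} (x))$, matching the simplification in the text. Because learners outside the tile's region contribute nothing, adding a new regional learner leaves predictions for other regions unchanged.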

Figure 8 presents some semantic segmentation results of different models from the cities of Chicago, Berlin, and Shanghai. Yellow circles indicate some notable discrepancies among the prediction results. The three models produce quite similar results. The prediction of GeoBoost agrees more closely with that of the single model, whereas the end-to-end gradient boosting model misclassifies some tiny buildings. It should be noted that these visualization results are merely for reference. The tiny differences among these models can also be caused by the randomness of the training process and weight initialization. The quantitative evaluations in Section 4.2 are more reliable.

4.4. Discussion on the Regional Impacts

Some typical samples from the areas $B_{1}$, $B_{2}$, and $B_{3}$ in Figure 3b are shown in Figure 9, Figure 10 and Figure 11, respectively. As shown in Figure 9, buildings in the area $B_{1}$ have uniform distributions. Most buildings in this area are quite small, and their orientation is consistent within each image tile. Gaps between individual buildings are obvious. Thus, images in this area are easy to predict correctly.

It can be seen from Figure 10 that buildings in the area $B_{2}$ are quite different. These buildings are not square in shape, and adjacent ones are connected to each other to form big chunks.

Figure 11 shows that buildings in the area $B_{3}$ are more cluttered, and this area contains more high-rise buildings. Some pixels in the bottom-left corner of Figure 11i are mislabeled, which suggests that there may be some label noise in this area.

The accuracy of GeoBoost in Table 2 measures the overall performance of the model. For comparison, we can examine the trained model in more detail in different areas around the world. As shown in Table 3, the performance of GeoBoost fluctuates from one area to another. The model achieves the worst IoU of $0.4758$ in East Asia, whereas the results for America and Europe are much better.

The divergence of the regional performance may have two causes. First, it may be caused by the IoU accuracy measure, which favors big objects over small ones. For simplicity, we assume that the buildings are squares with a side length of $s$ pixels. Thus, the total number of pixels of a building is $s \times s = s^{2}$. Most misclassifications of the model occur near the border of buildings. Therefore, the total number of misclassified pixels is roughly $4 \times s$, and the proportion of misclassified pixels is $\frac{4 s}{s^{2}} = \frac{4}{s}$. As the building size becomes bigger, the proportion of misclassified pixels becomes smaller, and the IoU becomes higher. This inference is a rough estimation rather than a strict analysis. With all the available labels of the training set, we can take a glance at the distribution of building size in the DREAM-B data set. From Figure 12, it can be clearly seen that buildings in Europe are considerably bigger than those in the other areas. Figure 10 presents samples from three European cities. Buildings in the city of Vienna are closely connected to each other and form big building groups. In addition, big buildings provide more context for inferring the prediction. This is why the prediction for Europe achieves the best result.
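
The rough border argument above can be checked numerically. The sketch below is an idealized toy case, not an experiment from the paper: the ground truth is a single $s \times s$ square, and the hypothetical prediction misses exactly the outermost one-pixel ring (that is, $4 s - 4 \approx 4 s$ pixels), so the IoU equals $(s - 2)^{2} / s^{2}$.

```python
import numpy as np

def border_error_iou(s):
    """IoU for an s-by-s square building whose outermost pixel ring is missed.

    Toy setting assumed for illustration; requires s >= 3.
    """
    truth = np.ones((s, s), dtype=bool)
    pred = np.zeros((s, s), dtype=bool)
    pred[1:-1, 1:-1] = True  # everything except the one-pixel border ring

    intersection = np.sum(truth & pred)
    union = np.sum(truth | pred)
    return intersection / union

for side in (10, 20, 40):
    missed = 4 * side - 4  # pixels in the border ring
    print(side, missed, round(float(border_error_iou(side)), 4))
```

With the same one-pixel border error, the IoU rises from 0.64 for a 10-pixel square to 0.9025 for a 40-pixel square, which illustrates why regions with larger buildings can score a higher IoU even when the boundary quality is identical.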

Second, the data of East Asia contain more high-rise buildings. The image in Figure 11f is a sample from the city of Shanghai. The upper left corner of this sample contains a considerable number of high-rise buildings. This is very common in East Asia and does not appear very often in the other areas. Figure 13 better illustrates the effect of high-rise buildings. The labels of high-rise buildings are located at their bases. Though the model finds the positions of these high-rise buildings, there is no visual clue, such as edges and contrast, for precise segmentation. It is a tough task for CNNs to recognize 3D structures from a single image, whereas the segmentation of lower buildings is much easier and produces sharp building edges, as shown in Figure 7. Because the image is not acquired with orthographic projection, the low accuracy on high-rise buildings is inevitable.

The above-mentioned two reasons may explain the divergence of the regional performance. Surprisingly, CNNs can capture 3D spatial information to some extent. These results qualitatively verify the necessity of utilizing the geospatial information of satellite images for semantic segmentation.

5. Conclusions

In this paper, we propose a novel approach, GeoBoost, for geographically incremental learning, which is trained in an end-to-end way. It enables models of satellite images to learn continually based on the geographical information of data without forgetting previous knowledge. The effectiveness of the GeoBoost algorithm is verified on the large-scale DREAM-B data set. The proposed method with U-NASNetMobile as the base learner outperforms end-to-end gradient boosting by a large margin of 4.85 percentage points of IoU ($0.6372$ versus $0.5887$). Experiments with different base learners confirm that GeoBoost consistently surpasses end-to-end gradient boosting. At present, the algorithm is validated on the semantic segmentation task with high-resolution satellite images. With an appropriate base learner, the method may also be beneficial for lower-resolution satellite images. The current GeoBoost algorithm focuses on the task of data-incremental learning. Based on its flexible framework, we will adapt it to the task of class-incremental learning in the future.

Author Contributions

Conceptualization, formal analysis, H.T. and N.Y.; methodology, software, validation, investigation, resources, data curation, writing—original draft preparation, and visualization, N.Y.; writing—review and editing, supervision, project administration, and funding acquisition, H.T. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grant No. 41971280 and in part by the National Key R&D Program of China under Grant No. 2017YFB0504104.

Acknowledgments

We would like to thank the high-performance computing support from the Center for Geodata and Analysis, Faculty of Geographical Science, Beijing Normal University (https://gda.bnu.edu.cn/).

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

References

1. Xia, G.S.; Yang, W.; Delon, J.; Gousseau, Y.; Sun, H.; Maître, H. Structural high-resolution satellite image indexing. In Proceedings of the ISPRS TC VII Symposium-100 Years ISPRS, Vienna, Austria, 5–7 July 2010; Volume 38, pp. 298–303.
2. Sheng, G.; Yang, W.; Xu, T.; Sun, H. High-resolution satellite scene classification using a sparse coding based multiple feature combination. Int. J. Remote Sens. 2012, 33, 2395–2412.
3. Hu, J.; Jiang, T.; Tong, X.; Xia, G.S.; Zhang, L. A benchmark for scene classification of high spatial resolution remote sensing imagery. In Proceedings of the 2015 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Milan, Italy, 26–31 July 2015; pp. 5003–5006.
4. Ring, M.B. Continual Learning in Reinforcement Environments. Ph.D. Thesis, University of Texas at Austin, Austin, TX, USA, 1994.
5. Shin, H.; Lee, J.K.; Kim, J.; Kim, J. Continual learning with deep generative replay. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 2990–2999.
6. Parisi, G.I.; Kemker, R.; Part, J.L.; Kanan, C.; Wermter, S. Continual lifelong learning with neural networks: A review. Neural Netw. 2019, 113, 54–71.
7. De Lange, M.; Aljundi, R.; Masana, M.; Parisot, S.; Jia, X.; Leonardis, A.; Slabaugh, G.; Tuytelaars, T. Continual learning: A comparative study on how to defy forgetting in classification tasks. arXiv 2019, arXiv:1909.08383.
8. Thrun, S.; Mitchell, T.M. Lifelong robot learning. Robot. Auton. Syst. 1995, 15, 25–46.
9. Silver, D.L.; Mercer, R.E. The task rehearsal method of life-long learning: Overcoming impoverished data. In Proceedings of the Conference of the Canadian Society for Computational Studies of Intelligence, Calgary, AB, Canada, 27–29 May 2002; pp. 90–101.
10. Chen, Z.; Liu, B. Lifelong machine learning. Synth. Lect. Artif. Intell. Mach. Learn. 2016, 10, 1–145.
11. Li, Z.; Hoiem, D. Learning without forgetting. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 2935–2947.
12. Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A.A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; et al. Overcoming catastrophic forgetting in neural networks. Proc. Natl. Acad. Sci. USA 2017, 114, 3521–3526.
13. Zenke, F.; Poole, B.; Ganguli, S. Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, Sydney, Australia, 6–11 August 2017; pp. 3987–3995.
14. Polikar, R.; Upda, L.; Upda, S.S.; Honavar, V. Learn++: An incremental learning algorithm for supervised neural networks. IEEE Trans. Syst. Man, Cybern. Part C Appl. Rev. 2001, 31, 497–508.
15. Han, S.; Meng, Z.; Khan, A.S.; Tong, Y. Incremental boosting convolutional neural network for facial action unit recognition. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 109–117.
16. Rusu, A.A.; Rabinowitz, N.C.; Desjardins, G.; Soyer, H.; Kirkpatrick, J.; Kavukcuoglu, K.; Pascanu, R.; Hadsell, R. Progressive neural networks. arXiv 2016, arXiv:1606.04671.
17. Li, X.; Zhou, Y.; Wu, T.; Socher, R.; Xiong, C. Learn to Grow: A Continual Structure Learning Framework for Overcoming Catastrophic Forgetting. arXiv 2019, arXiv:1904.00310.
18. Gepperth, A.; Karaoguz, C. A bio-inspired incremental learning architecture for applied perceptual problems. Cogn. Comput. 2016, 8, 924–934.
19. Lopez-Paz, D.; Ranzato, M. Gradient episodic memory for continual learning. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 6467–6476.
20. Rosenfeld, A.; Tsotsos, J.K. Incremental Learning Through Deep Adaptation. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 651–663.
21. Aljundi, R.; Chakravarty, P.; Tuytelaars, T. Expert Gate: Lifelong Learning with a Network of Experts. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 7120–7129.
22. Xiao, T.; Zhang, J.; Yang, K.; Peng, Y.; Zhang, Z. Error-Driven Incremental Learning in Deep Convolutional Neural Network for Large-Scale Image Classification. In Proceedings of the 22nd ACM International Conference on Multimedia, Orlando, FL, USA, 3–7 November 2014.
23. Parisi, G.I.; Tani, J.; Weber, C.; Wermter, S. Lifelong learning of human actions with deep neural network self-organization. Neural Netw. 2017, 96, 137–149.
24. Marsland, S.; Shapiro, J.; Nehmzow, U. A self-organising network that grows when required. Neural Netw. 2002, 15, 1041–1058.
25. Zhou, G.; Sohn, K.; Lee, H. Online Incremental Feature Learning with Denoising Autoencoders. In Proceedings of the AISTATS, Canary Islands, Spain, 21–23 April 2012.
26. Medera, D.; Babinec, S. Incremental Learning of Convolutional Neural Networks. In Proceedings of the IJCCI, Madeira, Portugal, 5–7 October 2009; pp. 547–550.
27. Freund, Y.; Schapire, R.; Abe, N. A short introduction to boosting. J. Jpn. Soc. Artif. Intell. 1999, 14, 1612.
28. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016.
29. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.Y. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017.
30. Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232.
31. Haklay, M.; Weber, P. Openstreetmap: User-generated street maps. IEEE Pervasive Comput. 2008, 7, 12–18.
32. Zhang, F.; Du, B.; Zhang, L. Scene classification via a gradient boosting random convolutional network framework. IEEE Trans. Geosci. Remote Sens. 2015, 54, 1793–1802.
33. Dong, M.; Yao, L.; Wang, X.; Benatallah, B.; Zhang, S. GrCAN: Gradient Boost Convolutional Autoencoder with Neural Decision Forest. arXiv 2018, arXiv:1806.08079.
34. Gordon, A.; Breiman, L.; Friedman, J.H.; Olshen, R.A.; Stone, C.J. Classification and Regression Trees. Biometrics 1984, 40, 874.
35. Maggiori, E.; Tarabalka, Y.; Charpiat, G.; Alliez, P. Can semantic labeling methods generalize to any city? The Inria aerial image labeling benchmark. In Proceedings of the 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Fort Worth, TX, USA, 23–28 July 2017; pp. 3226–3229.
36. Wegner, J.D.; Rottensteiner, F.; Gerke, M.; Sohn, G. The ISPRS 2d Labelling Challenge. 2016. Available online: http://www2.isprs.org/commissions/comm3/wg4/semantic-labeling.html (accessed on 15 January 2018).
37. Gorelick, N.; Hancher, M.; Dixon, M.; Ilyushchenko, S.; Thau, D.; Moore, R. Google Earth Engine: Planetary-scale geospatial analysis for everyone. Remote Sens. Environ. 2017, 202, 18–27.
38. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; pp. 234–241.
39. Zoph, B.; Vasudevan, V.; Shlens, J.; Le, Q.V. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8697–8710.
40. He, T.; Zhang, Z.; Zhang, H.; Zhang, Z.; Xie, J.; Li, M. Bag of Tricks for Image Classification with Convolutional Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018.
41. Zhang, H.; Wu, C.; Zhang, Z.; Zhu, Y.; Zhang, Z.; Lin, H.; Sun, Y.; He, T.; Mueller, J.; Manmatha, R.; et al. ResNeSt: Split-Attention Networks. arXiv 2020, arXiv:2004.08955.
42. Kingma, D.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
43. Loshchilov, I.; Hutter, F. Sgdr: Stochastic gradient descent with warm restarts. arXiv 2016, arXiv:1608.03983.
44. Everingham, M.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The pascal visual object classes (voc) challenge. Int. J. Comput. Vis. 2010, 88, 303–338.
45. Hinton, G.E.; Salakhutdinov, R.R. Reducing the dimensionality of data with neural networks. Science 2006, 313, 504–507.
46. Friedman, J.H. Stochastic gradient boosting. Comput. Stat. Data Anal. 2002, 38, 367–378.

Figure 1. Images from the DREAM-B data set. Panels (a–d) are sample images from the cities of Paris, Vienna, Tokyo, and Los Angeles. All the samples have a size of $800 \times 800$.

Figure 2. Geospatial distribution of the DREAM-B data set. Each point in this figure represents an image tile. The DREAM-B data set contains 626 tiles in total and covers 100 cities across several continents.

Figure 3. Geospatial distribution of data sets in different areas. Panel (a) shows the bounding box of the data group $x_{0}$, and panel (b) shows the bounding boxes of the data groups $x_{1}$, $x_{2}$, and $x_{3}$. Each group is divided equally in terms of the number of image tiles, and the corresponding colored boxes are their coverage range. The bounding boxes of data groups $x_{0}$, $x_{1}$, $x_{2}$, and $x_{3}$ are $B_{0}$, $B_{1}$, $B_{2}$, and $B_{3}$, colored red, green, purple, and blue, respectively.

Figure 4. The architecture of U-NASNetMobile. Normal Cells and Reduction Cells are the structures obtained via neural architecture searching [39]. The yellow circles are concatenation layers.

Figure 5. Comparison of the model with pre-training and with random initialization. The figure plots the learning curves of the base model $f_{1}$. We can observe that $f_{1}$ with pre-training on $f_{0}$ converges faster and leads to better performance.

Figure 6. Accuracy variation of the GeoBoost models. The base learner of boosting is the U-NASNetMobile model for (a) and the U-Net model for (b).

Figure 7. Semantic segmentation results of GeoBoost from the cities of Chicago, Vienna, and Shanghai. Panels (a,d,g), colored in red, are the ground truth. Panels (b,e,h), colored in green, are the results of $F_{0} (x)$, and panels (c,f,i) are the results of $F_{3} (x)$. The yellow circles indicate the notable learning progress of GeoBoost. Each image in this figure has a size of $1024 \times 1024$.

Figure 8. Semantic segmentation results of different models from the cities of Chicago, Berlin, and Shanghai. Panels (a,e,i), colored in red, are the ground truth, and panels (b,f,j), colored in green, are the predictions of a single model. Panels (c,g,k) are the predictions of end-to-end gradient boosting, and panels (d,h,l) are the predictions of GeoBoost. Yellow circles highlight some notable discrepancies among the prediction results. Each image in this figure has a size of $1024 \times 1024$.

Figure 9. Samples from the cities of Toronto, Chicago, and Los Angeles in area $B_{1}$. Panels (a,f,k) are the sample images, and panels (b,g,l), colored in red, are the ground truth. Panels (c,h,m), colored in green, are the predictions of GeoBoost. Panels (d,i,n) are the center crops of the labels, and panels (e,j,o) are the center crops of GeoBoost’s predictions. The first three columns have a size of $4096 \times 4096$, and the last two columns are center crops of the big images, with a size of $1024 \times 1024$, for observing details.

Figure 10. Samples from the cities of Berlin, Paris, and Vienna in area $B_{2}$. Panels (a,f,k) are the sample images, and panels (b,g,l), colored in red, are the ground truth. Panels (c,h,m), colored in green, are the predictions of GeoBoost. Panels (d,i,n) are the center crops of the labels, and panels (e,j,o) are the center crops of GeoBoost’s predictions. The first three columns have a size of $4096 \times 4096$, and the last two columns are center crops of the big images, with a size of $1024 \times 1024$, for observing details.

Figure 11. Samples from the cities of Seoul, Shanghai, and Tokyo in area $B_{3}$. Panels (a,f,k) are the sample images, and panels (b,g,l), colored in red, are the ground truth. Panels (c,h,m), colored in green, are the predictions of GeoBoost. Panels (d,i,n) are the center crops of the labels, and panels (e,j,o) are the center crops of GeoBoost’s predictions. The first three columns have a size of $4096 \times 4096$, and the last two columns are center crops of the big images, with a size of $1024 \times 1024$, for observing details.

Figure 12. Distributions of building size for the DREAM-B data set.

Figure 13. Samples from the city of Shanghai, cropped from Figure 11g,h. Panel (a) is the label of the sample, and panel (b) is the prediction result. All the samples have a size of $1500 \times 1500$.

Table 1. Results for GeoBoost with different learning rates.

Learning Rate	Validation IoU
$1.0$	$0.6041$
$0.5$	$0.6183$
$0.2$	$0.6272$
$0.1$	$0.6298$
$0.05$	$0.6259$

Table 2. Results for the comparison of different methods. The result of a single model is obtained by training a net with all the available training data $x_{0}$, $x_{1}$, $x_{2}$, and $x_{3}$ from four areas. EGB is short for the end-to-end gradient boosting method.

Method	IoU	Precision	Recall	Overall Accuracy	Kappa
A single model (U-Net)	$0.6295$	$0.8049$	$0.7843$	$0.8877$	$0.7101$
A single model (U-NASNetMobile)	$0.6458$	$0.8283$	$0.8001$	$0.8985$	$0.7375$
EGB (U-Net)	$0.5465$	$0.7915$	$0.7443$	$0.8753$	$0.6736$
GeoBoost (U-Net)	$0.6164$	$0.8019$	$0.7646$	$0.8831$	$0.6953$
EGB (U-NASNetMobile)	$0.5887$	$0.8199$	$0.7276$	$0.8815$	$0.6784$
GeoBoost (U-NASNetMobile)	$0.6372$	$0.8112$	$0.7969$	$0.8905$	$0.7213$

Table 3. Results for the comparison of different areas on the test set. The coverage of the different areas is the same as the bounding boxes $B_{1}$, $B_{2}$, and $B_{3}$ in Figure 3.

Area	Testing IoU
America ( $B_{1}$ )	$0.6575$
Europe ( $B_{2}$ )	$0.7100$
East Asia ( $B_{3}$ )	$0.4758$

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
