1. Introduction
Since the reform and opening-up, China has undergone rapid urbanization. In stark contrast, rural development has not kept pace with urban development and remains lagging and constrained. Mass migration from rural to urban areas has had a succession of impacts on rural areas, including population decline, industrial recession, and land abandonment [1,2]. In 2018, China stepped up its efforts to revitalize rural regions. Building new-style rural communities with better infrastructure is one of the important measures to improve the wellbeing of rural residents. Thus, a spatially explicit understanding of the distribution of rural settlements is essential to effective land management and policy making.
Satellite-based earth observation is a key enabler for capturing spatial information on buildings in rural areas. High spatial resolution (HSR) images open new opportunities for slum and informal settlement detection and rural land cover mapping [3,4]. Compared with medium-resolution images, which mainly offer spectral information (in terms of a single image) [5], HSR images allow both spectral and spatial information to be leveraged. HSR image analysis basically relies on image classification (e.g., pixel-based) and segmentation (e.g., Object-Based Image Analysis (OBIA)) techniques [6,7], with the help of handcrafted features extracted from the spectral domain (e.g., reflectance and spectral indices such as the Normalized Difference Vegetation Index (NDVI)) and the spatial domain (e.g., texture statistics, morphological profiles, and oriented gradients) [8,9]. With an ever-increasing focus on rural areas, satellite images have been extensively used for rural settlement mapping [10,11]. Nevertheless, applying HSR images to rural settlement detection remains challenging for the following reasons. First, the size and spatial distribution of rural settlements vary significantly (e.g., clustered or scattered) because rural planning changes over time. Second, intra-class variation makes it difficult to distinguish rural settlements from construction materials when using spectral information alone. Third, over large spatial areas, the spectral and spatial responses of ground objects present an extremely complex pattern [8]. To discriminate rural settlements, more contextual information is required in the classification. In previous studies, such as [12,13], landscape metrics were used as spatial contextual information to identify rural settlements from HSR satellite imagery. These methods exploit tailored segment-based features and have achieved acceptable performance. However, parameter optimization and handcrafted feature selection are laborious tasks that hinge heavily on expert experience and trial-and-error tests.
Deep learning methods, such as convolutional neural networks (CNNs), have shown great potential for automatic feature learning without human intervention. CNNs are able to generate robust feature representations hierarchically and have become increasingly popular in image classification and semantic segmentation [14]. Semantic segmentation of remote sensing data usually refers to extracting terrestrial objects from earth observation images using CNN models; that is, each pixel is assigned a semantic label in a pixel-based classification [15]. The fully convolutional network (FCN) [16] extends CNNs to segmentation and has emerged as the preferred scheme for semantic labeling tasks. An FCN feeds images of arbitrary size into a standard CNN, extracts feature maps using layer-wise activation and abstraction, and then outputs high-resolution predictions in an end-to-end fashion. The essential advantages of FCNs are their intrinsic ability to enhance feature representation and their flexibility to accept input images of any size. Previous studies have applied FCN and its variants to detect buildings and settlements [17,18,19]. It has further been found that incorporating contextual relations in CNNs can improve classification accuracy [20,21]. Nevertheless, most of the above-mentioned approaches are designed to extract target objects in urban areas from standard datasets [22]. In rural areas, built-up areas tend to be sparse and can be easily omitted [23]. Because urban and rural buildings differ significantly in appearance, directly employing existing deep approaches to map rural settlements does not guarantee good performance. In addition, the difficulty of image interpretation increases sharply as the spatial resolution increases. Therefore, we wish to exploit the advantages of deep learning to contribute to rural settlement identification in HSR images. So far, only a few studies have applied FCNs to extract rural residential areas [24,25], and most of them were limited by the spatial resolution of the images or the extent of application. The effectiveness of FCNs in rural settlement mapping using HSR images requires further in-depth examination. In short, it is imperative to develop an effective method to support automatic extraction of rural settlements from HSR images.
The overall objective of this paper is to develop a deep learning-based framework for automatically identifying rural settlements in HSR satellite images. Our main contributions are as follows: (1) We introduce a deep FCN method to recognize rural settlements. Specifically, dilated convolutions are used to extract deep features at high spatial resolution. (2) A multi-scale context subnetwork, which adopts the popular squeeze-and-excitation (SE) module [26] to aggregate multi-scale context, is exploited to generate discriminative representations. The proposed deep learning-based rural settlement extraction scheme can flexibly take multi-spectral HSR images as input to distinguish different types of rural settlements.
2. Study Area and Data
In this research, eleven towns of Tongxiang County, a typical rural region undergoing rapid rural development and transformation in the Yangtze River Delta of China (120°30′13″E, 30°41′10″N), were selected as the study area. Tongxiang, located in the Hangjiahu Plain, has a temperate climate with distinct seasonality. Since 2000, several land consolidation projects have been carried out to promote the construction of the new countryside. Construction and renovation of the countryside are still ongoing in Tongxiang, so old, scattered low-rise houses are mixed with uniformly planned residential buildings. This makes the area well suited to examining our proposed method. We preliminarily divide these settlements into two categories.
Figure 1 shows examples of the two types of rural settlements in the study area: low-density settlements and high-density settlements.
Low-density settlements (LDS): most LDS are old-style rural settlements that are scattered, irregularly distributed, and variously oriented. They are mainly located close to rivers and streams, which support smallholders' farming and transportation. The boundaries of low-density settlements are obscured by the surrounding vegetation.
High-density settlements (HDS): newly built residential areas where multi-story buildings accommodate several families. Such settlements have a higher building density than low-density settlements, and the buildings inside them are evenly spaced and have the same surface. High-density settlements are mainly distributed along newly built roads, providing easy access to nearby towns.
China’s GaoFen-2 (GF-2) HSR images were used, comprising four multispectral (MSS) bands with a spatial resolution of 4 m and a panchromatic (PAN) band with a spatial resolution of 1 m. The two images were acquired in July 2016. We also collected 2015 land use data for the study area (provided by the Bureau of Land and Resources, Tongxiang, China) to generate ground truth data.
4. Results and Discussions
4.1. Rural Settlements Identification
Figure 5 shows the rural settlements identified in our study area. Table 2 and Table 3 present the confusion matrices on the test sets. The proposed method achieved an OA of 98.31% with a Kappa coefficient of 0.9724 on the point test set, and the UA and PA of the two settlement classes reached about 98%. On the polygon-based test samples, the accuracies were lower: those of the low-density class (UA of 88.00% and PA of 84.97%) were higher than those of the high-density class (UA of 85.22% and PA of 84.68%). In terms of overall classification, the Kappa coefficient of 0.8591 on the polygon-based test set was lower than that on the point-based test set, because the polygon-based test method imposes strict requirements on object boundaries. Visual interpretation indicates that the proposed method can effectively distinguish rural residential areas from other man-made structures (white circle in Figure 5). The footprints of HDS were smoother than those of LDS, which tended to be obscured by their surroundings, e.g., trees and shadows. The introduction of multi-scale context made it easier to detect HDS, which have relatively uniform scales, as reflected by the PA. In addition, a few LDS houses on the edge of HDS were misclassified as isolated houses within HDS, owing to similar roofs and ambient vegetation (red circle in Figure 5). This further suggests that the polygon-based testing method is necessary. Previous studies considered recognition accuracy but sometimes did not include the area accuracy of rural settlements.
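The accuracy measures quoted above (OA, Kappa, UA, PA) can all be derived from a confusion matrix. The following is a minimal sketch using the standard definitions; the row/column convention (rows as reference classes, columns as predicted classes), the function name, and the toy matrix are illustrative assumptions and are not taken from Table 2 or Table 3.

```python
def accuracy_metrics(cm):
    """Compute OA, Kappa, PA and UA from a square confusion matrix.

    Assumed convention: cm[i][j] counts samples whose reference class is i
    and whose predicted class is j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    diag = [cm[i][i] for i in range(n)]
    ref_totals = [sum(cm[i]) for i in range(n)]                        # row sums
    pred_totals = [sum(cm[i][j] for i in range(n)) for j in range(n)]  # column sums

    oa = sum(diag) / total                                  # overall accuracy
    # Agreement expected by chance, from the marginal totals.
    pe = sum(r * c for r, c in zip(ref_totals, pred_totals)) / (total * total)
    kappa = (oa - pe) / (1.0 - pe)
    pa = [diag[i] / ref_totals[i] for i in range(n)]        # producer's accuracy
    ua = [diag[i] / pred_totals[i] for i in range(n)]       # user's accuracy
    return {"OA": oa, "Kappa": kappa, "PA": pa, "UA": ua}


# Toy 3-class matrix (hypothetical counts, e.g. LDS, HDS, background).
toy_cm = [
    [40, 10, 0],
    [0, 45, 5],
    [0, 5, 45],
]
metrics = accuracy_metrics(toy_cm)
```

Because Kappa discounts chance agreement, it reacts more strongly than OA to errors in the minority settlement classes when the background class dominates, which is in line with the gap between OA and Kappa reported on the polygon test set.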
4.2. Ablation Experiments of Model
This study proposed a deep learning-based approach to extract rural settlements from HSR images. Experiments were carried out to explore the contribution of each part of the proposed deep method. Table 4 compares the performance of models with different settings on the polygon test set. As shown in Table 4, when the original ResNet50 was applied for segmentation, the accuracies of the low-density class (UA of 82.50% and PA of 83.30%) were higher than those of the high-density class (UA of 80.45% and PA of 67.75%). The low PA of the high-density class indicates that extracting HDS is more challenging than extracting LDS. When the last two stages of the baseline network were replaced by dilated convolutions, the PA of the high-density class increased significantly, by about 9%, while the UA of the high-density class and the PA of the low-density class decreased moderately. This indicates that the sub-module (+Dilation) was still insufficient. A possible reason for the inconsistent changes in accuracy is the tension between the improvements brought by dilated convolutions and the limitations of using single-scale features. Compared with the sub-module (+Dilation), the sub-module (+Dilation+Multiscale) yielded better accuracy on the high-density class (UA of 84.88% and PA of 83.19%), with a slight increase in the PA of the low-density class, indicating that multi-scale context information enhanced the recognition power of the model. Table 4 also shows that the proposed model achieved the highest OA of 98.68% with a Kappa coefficient of 0.8591. At the top of the aggregation layer, the SE block captured feature dependencies in the channel dimension, and this feature selection process further improved model performance. Figure 6 visualizes the test set samples before and after recalibration with the SE block, using the t-SNE [40] technique. After the SE block, some samples of the rural settlement classes clustered together and moved away from the background group, implying that the output of the channel relation module is more helpful for this classification task.
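To make the channel recalibration step concrete, the sketch below follows the generic squeeze-and-excitation design: global average pooling per channel, a two-layer bottleneck with ReLU and sigmoid, then channel-wise rescaling. It is a plain-Python illustration only; the function name, the tiny feature maps, the reduction to a single hidden unit, and the weight values are hypothetical and do not reproduce the trained network described here.

```python
import math


def se_recalibrate(features, w1, w2):
    """Squeeze-and-excitation on a list of C feature maps (each H x W).

    w1: (C // r) x C weights of the reduction layer (bias omitted).
    w2: C x (C // r) weights of the expansion layer (bias omitted).
    Returns the per-channel gates and the rescaled feature maps.
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in features]
    # Excitation: reduction layer + ReLU, then expansion layer + sigmoid.
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: each channel is multiplied by its gate in (0, 1).
    scaled = [[[g * v for v in row] for row in ch] for g, ch in zip(gates, features)]
    return gates, scaled


# Two 2 x 2 channels with hypothetical weights (C = 2, one hidden unit).
features = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[3.0, 3.0], [3.0, 3.0]],
]
w1 = [[1.0, 1.0]]
w2 = [[0.5], [-0.5]]
gates, scaled = se_recalibrate(features, w1, w2)
```

With these toy weights the first channel receives a gate above 0.5 and the second a gate below 0.5, so the block emphasizes one channel and suppresses the other. This is the kind of feature selection in the channel dimension that the t-SNE visualization is meant to reveal.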
4.3. Data Input Strategies
Further experiments on two data input strategies, i.e., four channels and three channels, were conducted on the polygon test set. The classification accuracy of NIR-R-G-B composite images was slightly better than that of R-G-B images, but no significant difference was observed (Figure 7). This indicates that the additional information in the NIR band has a positive effect on rural settlement extraction, while the powerful ability of CNNs to extract texture information from R-G-B images narrows the gap between the two input strategies. Although the NIR band did not provide as great an improvement in accuracy as DSM information [34], the strategy of using weights pre-trained on RGB data to initialize models for multispectral remote sensing images could be extended in the future.
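One common way to reuse RGB pre-trained weights for a four-channel input is to extend the first convolution kernel with an extra input channel. The sketch below initializes the NIR channel with the mean of the R, G, and B kernels. This particular initialization, the function name, and the NIR-R-G-B channel order are assumptions for illustration; the text does not specify how the fourth channel was initialized in this study.

```python
def expand_rgb_kernel(rgb_kernel):
    """Extend first-layer conv weights from 3 to 4 input channels.

    rgb_kernel[o][c][i][j]: weight of output filter o, input channel c
    (ordered R, G, B), at kernel position (i, j).
    Returns weights with input channels ordered (NIR, R, G, B), where the
    NIR kernel is the mean of the three RGB kernels (assumed scheme).
    """
    expanded = []
    for filt in rgb_kernel:
        kh, kw = len(filt[0]), len(filt[0][0])
        nir = [
            [sum(filt[c][i][j] for c in range(3)) / 3.0 for j in range(kw)]
            for i in range(kh)
        ]
        # Copy the RGB kernels so the pre-trained weights are left untouched.
        expanded.append([nir] + [[row[:] for row in ch] for ch in filt])
    return expanded


# One hypothetical filter with 1 x 1 kernels for R, G and B.
pretrained = [[[[1.0]], [[2.0]], [[3.0]]]]
four_channel = expand_rgb_kernel(pretrained)
```

All deeper layers can keep their pre-trained weights unchanged, since only the first convolution depends on the number of input bands.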
4.4. Comparative Studies with Different Methods
Five state-of-the-art methods were compared with the proposed method: an object-based image analysis (OBIA) method and four FCN-based deep models. These methods have been proven effective in delineating settlements and/or detecting objects in satellite images. Detailed information on each method can be found in the corresponding publication; here we briefly summarize their key techniques.
OBIA [12]: a novel object-based image classification method that integrates hierarchical multi-scale segmentation and landscape analysis. This method makes use of spatial contextual information and subdivides different types of rural settlements with high accuracy.
FCN [25]: a fully convolutional network comprising an encoder based on the VGG-16 network and a decoder consisting of three stacked deconvolution layers. As far as we know, it was the first deep learning FCN model used for extracting rural residential areas.
UNet [41]: a robust CNN architecture consisting of symmetric contracting and expansive paths, each made up of successive convolution layers. UNet is often applied in the remote sensing field due to its efficiency and simplicity [42].
SegNet [43]: an encoder-decoder architecture that uses pooling indices to perform upsampling. It is a classic and efficient model that is often used as a baseline for semantic segmentation. Persello et al. [44] successfully delineated agricultural fields in smallholder farms from satellite images using SegNet.
DeeplabV3+ [20]: a state-of-the-art semantic segmentation model combining a spatial pyramid pooling module with an encoder-decoder structure. It has achieved a performance of 89% on the PASCAL VOC 2012 semantic segmentation dataset.
Figure 8 shows samples selected from the classification results of all six methods on the polygon test set. Quantitative results are presented in Table 5. In terms of overall performance, all six methods exhibited high accuracy (OA > 0.97), and the Kappa coefficients were consistent with OA. However, there were obvious differences among the methods in class-specific measures. With regard to UA and PA, the proposed method achieved the best accuracies, slightly better than those of DeeplabV3+. The UA and PA of SegNet and UNet were relatively close, but not as good as those of the proposed method. The PA of FCN was lower than that of the other methods, indicating that FCN is not the best choice for distinguishing settlement pixels. Finally, the OBIA results indicate that, for the high-density class, the object-based method performs significantly better than SegNet and UNet in PA and slightly worse in UA, but lags far behind in Kappa.
For the low-density class, all deep techniques except FCN achieved satisfactory performance because the number of low-density pixels in the training data was relatively large, which is an advantage for data-driven deep learning methods. The FCN model used only deep features for classification, and the loss of spatial information led to blurred building boundaries. In contrast, the object-based method performed better for HDS identification. Unlike the end-to-end deep methods, the performance of the object-based method depended heavily on the scale parameter of segmentation. The scale of the new-style HDS was relatively uniform, so HDS could be effectively extracted using the OBIA method, even with a small sample size. By comparison, LDS varied widely in size and were more sensitive to the choice of segmentation scale. Although the multi-context OBIA method exploited multiple segmentation scales to obtain the objects to be classified, it was still insufficient to separate LDS of different sizes from the surrounding vegetation.
Figure 8b shows that the OBIA method tends to merge adjacent houses with vegetation or ground owing to improper segmentation scale selection. Moreover, manually designed features reduce the generalizability of the method over a large region. SegNet and UNet struggled in scenes where LDS and HDS coexisted and were mixed (Figure 8d,e). Compared with SegNet and UNet, the use of multi-scale context information helped the proposed method and DeeplabV3+ to reduce the misclassification of HDS. However, it inevitably introduced some ambiguity at polygon boundaries (Figure 8f,g).
Table 6 lists the computing time of the proposed method and the other methods. For the OBIA method, segmentation and classification were conducted separately, and it showed the least time consumption. The deep learning methods, by contrast, were end-to-end approaches. Among them, FCN consumed the fewest computing resources and had the shortest inference time because it had abandoned the fully connected layers, which contain many parameters. However, its limited feature representation capability restricted the performance of FCN in this task. The proposed model was similar to SegNet in model size and inference time, but it took less training time to converge. UNet and DeeplabV3+ have more parameters and take longer to converge. Overall, the proposed method is more efficient.
4.5. Analysis and Potential Improvements
In our analysis, we found that all selected deep methods, except the proposed method and DeeplabV3+, were less effective for the high-density category than for the low-density category. One possible reason is that the downsampling in the comparative methods was aggressive. In contrast, the dilated residual convolutional network retained the spatial resolution of the features. Given an input image patch of 256 × 256, the deepest feature map of the proposed network maintains an appropriate size (32 × 32), which helps to restore the geometry of settlements. In this way, the accuracy of HDS increased greatly. However, the problem of scale selection remained. The differing scales of the two settlement types made it difficult to determine a single optimal scale. The proposed multi-scale context subnetwork involved multiple scales, thereby reducing the dependence on a single optimal scale to a certain extent. However, the minimum scale (32 × 32) of representations suitable for the Tongxiang dataset may not match other HSR data. Thus, if the proposed method is applied to other data, the appropriate scale range will depend on the size of the settlement objects and the input images.
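The feature map sizes mentioned above follow from simple stride arithmetic. The sketch below assumes the standard ResNet50 stride layout (stem convolution, max pooling, then four stages) and the usual rule that a removed stride of 2 is compensated by doubling the dilation rate of the following convolutions. The helper names and the exact dilation rates are illustrative assumptions rather than the authors' configuration.

```python
def feature_map_size(input_size, strides):
    """Spatial size after applying a sequence of strided layers.

    Assumes padding is chosen so each layer divides the size exactly.
    """
    size = input_size
    for s in strides:
        size //= s
    return size


def dilate_last_stages(strides, n_dilated):
    """Replace the strides of the last n_dilated stages with dilation.

    Each removed stride multiplies the dilation rate used from that stage
    onward, so the receptive field is preserved without downsampling.
    """
    new_strides = list(strides)
    dilations = [1] * len(strides)
    rate = 1
    for i in range(len(strides) - n_dilated, len(strides)):
        rate *= strides[i]
        new_strides[i] = 1
        dilations[i] = rate
    return new_strides, dilations


# Assumed standard ResNet50 layout: stem conv, max pool, stages 1 to 4.
RESNET50_STRIDES = [2, 2, 1, 2, 2, 2]

baseline_size = feature_map_size(256, RESNET50_STRIDES)
dilated_strides, dilation_rates = dilate_last_stages(RESNET50_STRIDES, 2)
dilated_size = feature_map_size(256, dilated_strides)
```

Under these assumptions, a 256 × 256 patch shrinks to 8 × 8 in the plain backbone but stays at 32 × 32 once the last two stages are dilated, which matches the deepest feature map size quoted above.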
In some areas, HDS and LDS could not be easily distinguished because they had similar shapes and structures. Deep features at multiple scales could handle such complex patterns of settlement objects of different sizes, and the SE block modeled the global contextual relations of the fused features, enabling feature selection in the channel dimension. The multi-scale context subnetwork gave more confident predictions at the pixel level. DeeplabV3+, which uses a spatial pyramid module to encode multi-scale context information, achieved effects similar to those of our context subnetwork. The experimental results demonstrated that the proposed multi-scale network distinguishes the two types of settlement objects effectively. Nevertheless, the contours of rural settlements need further refinement. Blurred object boundaries are an inherent and common defect of CNN-based semantic segmentation models. The downsampling process in a CNN inevitably loses spatial details, which is detrimental to the preservation of edge information. This reflects a trade-off between spatial resolution and semantic feature representation in segmentation models. Our results showed that, in this application, using dilated convolution instead of downsampling alleviated the loss of boundary details.
In the OBIA method, segmentation and classification are conducted separately, so the classification result is strongly affected by the performance of the segmentation algorithm. Moreover, the handcrafted features used in OBIA struggle to achieve an optimal balance between discriminability and robustness, since they cannot easily capture the details of real data, especially for HSR images, which can vary considerably over a large extent [45]. Deep learning methods, by contrast, conduct segmentation and classification at the same time, and the classification results in Table 5 demonstrate the superiority of the proposed method. Although deep learning methods take longer to train, a trained network needs only a few seconds to classify images. From an application perspective, this makes them better suited to large volumes of HSR images. Moreover, the OBIA results show that image segments can preserve precise edges at an appropriate segmentation scale. Based on this observation, combining the segmentation of OBIA with the feature representation of deep learning is a promising way to classify rural settlements. This also leaves open the question of whether a non-differentiable segmentation algorithm can be integrated into CNNs. In the future, we hope to find a way to integrate the advantages of OBIA segmentation into the proposed deep network framework for rural settlement mapping.
5. Conclusions
Rural settlement classification using HSR remotely sensed images remains a challenging task due to intra-class spectral variation and spatial scale variation. This paper presents an effective method for extracting rural settlements from HSR satellite images based on a deep fully convolutional network (FCN). In the proposed multi-scale FCN model, dilated convolution was used to extract feature representations with high spatial resolution. A subnetwork improved the discriminative power of the network by aggregating and re-weighting multi-scale context information across layers. High spatial resolution representations and multi-scale context information helped to locate and further subdivide rural settlements. Experimental results on GF-2 images acquired over a typical rural area in Tongxiang, China, showed that the proposed method produced the most accurate classification of rural settlements compared with other state-of-the-art methods and the sub-modules. In summary, the proposed method is promising for rural settlement extraction from HSR images. From a rural management perspective, this work describes a scheme for rapid identification of rural settlements over a large region using HSR images. The classification method presented here could be extended to the identification of rural settlements in a larger area, and the results can be used as a guide for on-site verification or enforcement in cadastral inventory.
In future work, further improvements could be made by integrating multi-temporal HSR images and multi-modal data, so that the dynamics of rural settlements can be characterized.