#### *4.2. Ablation Experiments of the Model*

This study proposed a deep learning-based approach to extract rural settlements from HSR images. Ablation experiments were carried out to quantify the contribution of each part of the proposed method. Table 3 compares the performance of models with different settings on the polygon test set. As shown in Table 4, when the original ResNet50 was applied for segmentation, the accuracies of the low-density class (UA of 82.50% and PA of 83.30%) were higher than those of the high-density class (UA of 80.45% and PA of 67.75%). The low PA indicates that extracting HDS is more challenging than extracting LDS. When the last two stages of the baseline network were replaced by dilated convolutions, the PA of the high-density class increased significantly, by about 9%, while the UA of the high-density class and the PA of the low-density class decreased moderately. This indicates that the sub-module (+Dilation) alone was still insufficient. A possible reason for the inconsistent changes in accuracy is the tension between the improvement brought by dilated convolutions and the limitation of using single-scale features. Compared with the sub-module (+Dilation), the sub-module (+Dilation+Multiscale) yielded better accuracy on the high-density class (UA of 84.88% and PA of 83.19%), with a slight increase in the PA of the low-density class, indicating that multi-scale context information enhanced the recognition power of the model. From Table 4, it can be seen that the proposed model achieved the highest OA of 98.68% with a Kappa coefficient of 0.8591. On top of the aggregation layer, the SE block captured feature dependencies in the channel dimension, and this feature selection process further improved model performance. Figure 6 shows the visualization of test set samples before and after recalibration with the SE block, produced with the t-SNE [40] technique.
After the SE block, some samples of the rural settlement classes clustered together and moved away from the background group, implying that the output of the channel relation module is more discriminative for this classification task.
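The channel recalibration performed by the SE block can be sketched in a few lines of NumPy. This is a minimal illustration of the squeeze (global average pooling) and excitation (two fully connected layers with ReLU and sigmoid) steps, not the authors' implementation; the toy channel count, reduction ratio, and random weights are assumptions for demonstration only.

```python
import numpy as np

def se_recalibrate(features, w1, b1, w2, b2):
    """Squeeze-and-Excitation recalibration of a (C, H, W) feature map.

    Squeeze: global average pooling over the spatial dimensions.
    Excitation: two fully connected layers (ReLU, then sigmoid) that
    produce one weight per channel, used to rescale the input.
    """
    squeezed = features.mean(axis=(1, 2))                 # (C,)
    hidden = np.maximum(0.0, w1 @ squeezed + b1)          # ReLU, (C // r,)
    weights = 1.0 / (1.0 + np.exp(-(w2 @ hidden + b2)))   # sigmoid, (C,)
    return features * weights[:, None, None], weights

# Toy example: 8 channels, reduction ratio r = 4.
rng = np.random.default_rng(0)
c, r = 8, 4
x = rng.standard_normal((c, 16, 16))
w1, b1 = rng.standard_normal((c // r, c)), np.zeros(c // r)
w2, b2 = rng.standard_normal((c, c // r)), np.zeros(c)
y, weights = se_recalibrate(x, w1, b1, w2, b2)
```

Because the sigmoid keeps each channel weight in (0, 1), the block can only attenuate channels relative to one another, which is what makes it act as a soft feature selector.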



**Figure 6.** Visualization of test set samples before (**A**) and after recalibration (**B**) with SE block. Different colors represent different categories.

#### *4.3. Data Input Strategies*

Further experiments on two data input strategies, i.e., four channels and three channels, were conducted on the polygon test set. The classification accuracy of the NIR-R-G-B composite images was slightly better than that of the R-G-B images, but no significant difference was observed (Figure 7). This indicates that the additional information of the NIR band has a positive effect on rural settlement extraction, while the powerful ability of CNNs to extract texture information from R-G-B images offsets the gap between the two input strategies. Although the NIR band did not provide as great an improvement in accuracy as DSM information [34], the strategy of using pre-trained weights from RGB data to initialize networks for multispectral remote sensing images could be extended in the future.
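One common way to reuse RGB pre-trained weights for a four-band (NIR-R-G-B) input, as suggested above, is to expand the first convolution layer and initialize the extra NIR channel from the existing kernels. The mean-of-RGB initialization below is a widely used heuristic, not the strategy the paper necessarily used, and the kernel shapes are assumptions:

```python
import numpy as np

def expand_rgb_weights_to_nir(rgb_kernels):
    """Expand pretrained first-layer kernels from 3 (R-G-B) to 4
    (NIR-R-G-B) input channels.

    The new NIR channel is initialized with the mean of the RGB
    kernels, so the pretrained filters remain usable while the
    network fine-tunes on multispectral data.
    """
    out_ch, in_ch, kh, kw = rgb_kernels.shape
    assert in_ch == 3, "expected RGB kernels"
    nir = rgb_kernels.mean(axis=1, keepdims=True)      # (out, 1, kh, kw)
    return np.concatenate([nir, rgb_kernels], axis=1)  # (out, 4, kh, kw)

# Toy first-layer kernels: 64 filters of size 7x7 over 3 bands.
rgb = np.random.default_rng(1).standard_normal((64, 3, 7, 7))
four_band = expand_rgb_weights_to_nir(rgb)
```

All other layers are unaffected, since only the first convolution sees the raw input bands.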

**Figure 7.** Accuracy assessment of different data input strategies.

#### *4.4. Comparative Studies with Different Methods*

Five state-of-the-art methods were compared, including an object-based image analysis (OBIA) method and four FCN-based deep models. These methods have proven effective in the delineation of settlements and/or object detection in satellite images. Detailed information on each method can be found in the corresponding publications; here we briefly summarize their key techniques.

1. OBIA [12]: a novel object-based image classification method that integrates hierarchical multi-scale segmentation and landscape analysis. This method makes use of spatial contextual information and subdivides different types of rural settlements with high accuracy.


Figure 8 shows samples selected from the classification results of all six methods on the polygon test set. Quantitative results are presented in Table 5. In terms of overall performance, all six methods exhibited high accuracy (OA > 0.97), and the Kappa coefficients were consistent with the OA values. However, there were obvious differences in class-specific measures among the methods. With regard to UA and PA, the proposed method achieved the best accuracies, slightly better than those of DeeplabV3+. The UA and PA of SegNet and UNet were relatively close, but not as good as those of the proposed method. The PA of FCN was lower than that of the other methods, indicating that FCN is not the best choice for distinguishing settlement pixels. Finally, the results of OBIA indicate that, for the high-density class, the object-based method performs significantly better than SegNet and UNet in PA and slightly worse in UA, but lags far behind in Kappa.

**Figure 8.** Example of results on Tongxiang polygon test set. (**a**) Original images, (**b**) OBIA, (**c**) FCN, (**d**) UNet, (**e**) SegNet, (**f**) DeeplabV3+, (**g**) The proposed method.



**Table 5.** Accuracy assessment of different methods, where values in bold are the best.

For the low-density class, all deep techniques except FCN achieved satisfying performance, because the number of low-density pixels was relatively large in the training data, an advantage for data-driven deep learning methods. The FCN model used only deep features for classification, and the loss of spatial information led to blurred building boundaries. In contrast, the object-based method performed better for HDS identification. Unlike the end-to-end deep methods, the performance of the object-based method depended heavily on the scale parameter of segmentation. The scale of new-style HDS was relatively uniform, so they could be effectively extracted using the OBIA method even with a small sample size. Comparatively, LDS varied greatly in size and were more sensitive to the choice of segmentation scale. Although the multi-context OBIA method exploited multiple segmentation scales to obtain the objects to be classified, it was still insufficient to separate LDS of different sizes from the surrounding vegetation. Figure 8b shows that the OBIA method tends to intermingle adjacent houses with vegetation or ground due to improper segmentation scale selection. Moreover, manually designed features reduced the generalizability of the method over a large region. SegNet and UNet struggled in scenes where LDS and HDS coexisted and were mixed (Figure 8d,e). Compared with SegNet and UNet, the use of multi-scale context information helped the proposed method and DeeplabV3+ reduce the misclassification of HDS. However, it inevitably induced some ambiguity on the boundaries of polygons (Figure 8f,g).

Table 6 lists the computing time of the proposed method and the other methods. For the OBIA method, segmentation and classification were conducted separately, and it showed the least time consumption. In contrast, the deep learning methods were end-to-end approaches. Among them, FCN consumed the fewest computing resources and had the shortest inference time, because it abandoned the fully connected layers with their many parameters; however, the resulting lack of feature representation capability limited its performance in this task. The proposed model showed a model size and inference time similar to those of SegNet, but took less training time to reach convergence. UNet and DeeplabV3+ had more parameters and took longer to converge. Overall, the proposed method is more efficient.


**Table 6.** The efficiency of different methods.

#### *4.5. Analysis and Potential Improvements*

In our analysis, we found that all selected deep methods, except the proposed method and DeeplabV3+, were less effective in the high-density category than in the low-density category. One possible reason is that the downsampling operations of the comparative methods were too aggressive. In contrast, the dilated residual convolutional network retained the spatial resolution of features. Given a 256 × 256 input image patch, the deepest feature map of the proposed network maintains an appropriate size (32 × 32), which helps to restore the geometry of settlements. In this way, the accuracy of HDS increased greatly. However, the problem of scale selection remained: the differing scales of different types of settlements made it difficult to determine a single optimal scale. The proposed multi-scale context subnetwork involved multiple scales, thereby reducing the dependence on a single optimal scale to a certain extent. However, the minimum scale (32 × 32) of representations applicable to the Tongxiang dataset may not match other HSR data. Thus, if the proposed method is applied to other data, an appropriate scale range should be determined from the size of the settlement objects and the input images.
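The resolution argument above can be checked with simple output-stride arithmetic. The stage strides below follow the standard ResNet50 design (stem stride 4, then stride 2 in each later stage); treating a dilated stage as stride 1 follows the usual dilated-residual-network construction, and is an illustrative assumption rather than the paper's exact configuration:

```python
def output_size(input_size, stage_strides):
    """Spatial size of the deepest feature map given per-stage strides."""
    size = input_size
    for s in stage_strides:
        size //= s
    return size

# Standard ResNet50: overall output stride 32, so a 256x256 patch
# shrinks to 8x8 at the deepest stage.
baseline = output_size(256, [4, 2, 2, 2])   # 8

# Replacing the stride-2 downsampling of the last two stages with
# dilated convolutions keeps their spatial resolution, giving an
# overall output stride of 8 and the 32x32 map cited in the text.
dilated = output_size(256, [4, 2, 1, 1])    # 32
```

The dilation rate is doubled in each converted stage so the receptive field still grows as if the strides were present, which is why accuracy benefits without shrinking the feature map.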

In some areas, HDS and LDS could not be easily distinguished because of their similar shapes and structures. Deep features at multiple scales could handle such complex patterns of settlement objects of different sizes, and the SE block modeled the global contextual relations of the fused features, enabling feature selection in the channel dimension. The multi-scale context subnetwork thus gave more confident predictions at the pixel level. DeeplabV3+, which uses a spatial pyramid module to encode multi-scale context information, achieved effects similar to those of our context subnetwork. The experimental results demonstrated that the proposed multi-scale network distinguishes the two types of settlement objects effectively. Nevertheless, the contours of rural settlements need to be further refined. Blurred object boundaries are an inherent and common defect of CNN-based semantic segmentation models: the downsampling process inevitably loses spatial details, which is detrimental to the preservation of edge information. This is a trade-off between spatial resolution and semantic feature representation in segmentation models. Our results showed that, in this application, the use of dilated convolution instead of downsampling alleviated the loss of boundary details.

Segmentation and classification are conducted separately in the OBIA method, so the classification result is greatly affected by the performance of the segmentation algorithm. Besides, it is difficult for the handcrafted features used in OBIA to achieve an optimal balance between discriminability and robustness, since such features cannot easily capture the details of real data, especially for HSR images, whose appearance can vary greatly over a large extent [45]. In contrast, deep learning methods conduct segmentation and classification simultaneously, and the classification results in Table 5 prove the superiority of the proposed method. Although deep learning methods take longer to train, a trained network needs only a few seconds to classify images; from the perspective of application, this is better suited to the big-data scale of HSR imagery. Moreover, as observed from the OBIA results, image segments can preserve precise edges under an appropriate segmentation scale. Accordingly, it is promising to combine the segmentation of OBIA with the feature representation of deep learning to classify rural settlements, although it remains an open question whether a non-differentiable segmentation algorithm can be integrated into CNNs. In future work, we hope to find a way to integrate the advantages of OBIA segmentation into the proposed deep network framework for rural settlement mapping.
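One simple post-hoc way to realize the combination suggested above is a per-segment majority vote: the CNN provides pixel-wise class predictions, and each OBIA segment adopts the majority class within it, so segment edges are preserved while deep features drive the labels. This is a sketch of the general idea, not the method proposed in the paper:

```python
import numpy as np

def segment_majority_vote(pixel_pred, segments):
    """Assign each OBIA segment the majority class of the CNN's
    pixel-wise predictions inside it."""
    refined = np.empty_like(pixel_pred)
    for seg_id in np.unique(segments):
        mask = segments == seg_id
        classes, counts = np.unique(pixel_pred[mask], return_counts=True)
        refined[mask] = classes[np.argmax(counts)]
    return refined

# Toy 4x4 example: two segments (left and right halves); the CNN
# mislabels one pixel in each half, and the vote snaps each pixel
# back to its segment's majority class.
pred = np.array([[1, 1, 2, 2],
                 [1, 2, 2, 2],
                 [1, 1, 2, 1],
                 [1, 1, 2, 2]])
segs = np.array([[0, 0, 1, 1]] * 4)
refined = segment_majority_vote(pred, segs)
```

Because the vote is applied after inference, it sidesteps the non-differentiability of the segmentation algorithm at the cost of not training the two stages jointly.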

#### **5. Conclusions**

Rural settlement classification using HSR remotely sensed images remains a challenging task due to intra-class spectral variation and spatial scale variation. This paper presented an effective rural settlement extraction method based on a deep fully convolutional network (FCN) for HSR satellite images. In the proposed multi-scale FCN model, dilated convolution was utilized to extract feature representations with high spatial resolution, and a subnetwork improved the discrimination power of the network by aggregating and re-weighting multi-scale context information across layers. High-spatial-resolution representations and multi-scale context information helped to locate and further subdivide rural settlements. Experimental results on GF-2 images acquired over a typical rural area in Tongxiang, China, showed that the proposed method produced the most accurate classification results of rural settlements compared with other state-of-the-art methods and the sub-modules. In summary, the proposed method is promising for rural settlement extraction from HSR images. From a rural management perspective, this work describes a scheme for rapid identification of rural settlements over a large region using HSR images. The classification method presented here could be extended to the identification of rural settlements in a larger area, and the results can serve as a guide for on-site verification or enforcement in cadastral inventories.

In future works, further improvements could be made by integrating multi-temporal HSR images and multi-modal data, so that the dynamics of rural settlements can be characterized.

**Supplementary Materials:** The following are available online at http://www.mdpi.com/1424-8220/20/21/6062/s1, Table S1: Specification of our network architecture.

**Author Contributions:** Conceptualization, Z.Y.; methodology, Z.Y.; software, Z.Y.; validation, R.Z.; investigation, B.S.; resources, B.S.; data curation, R.Z.; writing—original draft, Z.Y.; writing—review and editing, Y.L. and Q.Z.; visualization, Y.L.; supervision, L.H. and K.W.; funding acquisition, K.W. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by the National Key Research and Development Program of China, grant no. 2016YFC0502704, the National Natural Science Foundation of China, grant no. 41701638, the National Natural Science Foundation of China, grant no. 41971236, the Basic Public Welfare Research Program of Zhejiang Province, grant no. LGJ19D010001, and Zhejiang University Student Research Training Program (2019).

**Acknowledgments:** The authors appreciate the reviewers and the editor for their constructive comments and suggestions. We would especially like to thank Dr. Xinyu Zheng for his help.

**Conflicts of Interest:** The authors declare no conflict of interest.

## **References**


**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

